Hibernate Search / HSEARCH-3019

Provide ability to customize parser in TikaBridge

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.10.0.Beta1
    • Component/s: mapping
    • Labels:
      None

      Description

      As already discussed on GitHub (https://github.com/hibernate/hibernate-search/pull/1634):

      After updating to a more recent version of Tika (1.17), we can no longer rely on the default AutoDetectParser. We need to define a custom Tika instance with a special tika-config.xml that drops a specific parser (the Grobid-based JournalParser, https://wiki.apache.org/tika/GrobidJournalParser) which, at least after my library update, now seems to be the default for indexing PDFs. This approach is effectively what is suggested in

      https://issues.apache.org/jira/browse/TIKA-2243

      In my case, I had to provide a custom tika-config.xml:

      <?xml version="1.0" encoding="UTF-8"?>
      <properties>
          <parsers>
              <parser class="org.apache.tika.parser.DefaultParser">
                  <mime-exclude>application/pdf</mime-exclude>
                  <parser-exclude class="org.apache.tika.parser.journal.JournalParser"/>
              </parser>
              <parser class="org.apache.tika.parser.pdf.PDFParser">
                  <mime>application/pdf</mime>
              </parser>
          </parsers>
      </properties>
      

      To apply this customization in TikaBridge as well, an additional hook is required, because AutoDetectParser is currently hard-wired in TikaBridge.
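      The kind of hook requested here could look roughly like the sketch below. It is only an illustration under stated assumptions: TextParser, ParserProvider and Bridge are hypothetical stand-ins written for this example, not the real Tika or Hibernate Search types, and the sketch deliberately avoids any Tika dependency so it runs on a plain JDK. With Tika on the classpath, a custom provider would presumably build its parser from the configuration above, for example via new AutoDetectParser(new TikaConfig(inputStream)).

```java
import java.nio.charset.StandardCharsets;

class ParserProviderSketch {

    /** Stand-in for a Tika parser; NOT the real org.apache.tika.parser.Parser. */
    interface TextParser {
        String parse(byte[] content);
    }

    /** Hypothetical hook: the bridge asks a provider for its parser. */
    interface ParserProvider {
        TextParser createParser();
    }

    /** Stand-in for the bridge; it no longer constructs a parser itself. */
    static final class Bridge {
        private final TextParser parser;

        Bridge(ParserProvider provider) {
            this.parser = provider.createParser();
        }

        String toIndexedString(byte[] content) {
            return parser.parse(content);
        }
    }

    /** Mimics the current hard-wired behaviour (auto-detecting parser). */
    static final ParserProvider DEFAULT =
            () -> content -> "auto-detect:" + new String(content, StandardCharsets.UTF_8);

    /** Mimics a parser built from a custom tika-config.xml. */
    static final ParserProvider CUSTOM =
            () -> content -> "custom-config:" + new String(content, StandardCharsets.UTF_8);

    public static void main(String[] args) {
        byte[] doc = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(new Bridge(DEFAULT).toIndexedString(doc));
        System.out.println(new Bridge(CUSTOM).toIndexedString(doc));
    }
}
```

      With a provider like this, the default behaviour stays unchanged for existing users, while anyone who needs a custom tika-config.xml can plug in their own parser.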

            People

            • Assignee: Yoann Rodière (yrodiere)
            • Reporter: Niko Wittenbeck (nikowitt)
            • Votes: 0
            • Watchers: 3
