Provide ability to customize parser in TikaBridge

Description

As already discussed on GitHub (https://github.com/hibernate/hibernate-search/pull/1634):

After updating to a more recent version of Tika (1.17), we can no longer rely on the default AutoDetectParser. We need to define a custom Tika instance with a special tika-config.xml that drops a specific parser (GrobIdParser, https://wiki.apache.org/tika/GrobidJournalParser) which, at least after my library update, now seems to be the default for indexing PDFs. This workaround is effectively what is suggested in https://issues.apache.org/jira/browse/TIKA-2243.

In my case, I had to provide a custom tika-config.xml:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.journal.JournalParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>

To apply the same customization in TikaBridge, an additional hook is required, because the AutoDetectParser is currently hard-wired in TikaBridge.
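
One possible shape for such a hook, as a minimal sketch only: the bridge receives its parser from a supplier instead of constructing it itself. Everything named here (DocumentParser, ConfigurableBridge, the "auto:" and "pdf:" prefixes) is hypothetical and exists purely for illustration; these are stand-ins, not Hibernate Search or Tika types, and the sketch does not describe how TikaBridge is actually implemented.

```java
import java.util.function.Supplier;

// Stand-in for a Tika parser; NOT the real org.apache.tika.parser.Parser interface.
interface DocumentParser {
    String parse(String content);
}

// Hypothetical bridge: the parser is supplied from outside instead of being hard-wired.
final class ConfigurableBridge {
    private final Supplier<DocumentParser> parserSupplier;

    // The no-arg constructor keeps a default parser (stand-in for AutoDetectParser).
    ConfigurableBridge() {
        this(() -> content -> "auto:" + content);
    }

    // Callers that need a customized parser pass their own supplier.
    ConfigurableBridge(Supplier<DocumentParser> parserSupplier) {
        this.parserSupplier = parserSupplier;
    }

    // Delegates text extraction to whichever parser the supplier returns.
    String objectToString(String content) {
        return parserSupplier.get().parse(content);
    }
}
```

With the real libraries, the supplier would presumably return something like an AutoDetectParser built from a TikaConfig that was loaded from the custom tika-config.xml above; how the supplier gets registered (annotation attribute, configuration property, or otherwise) is left open here.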

Environment

None

Assignee

Yoann Rodière

Reporter

Niko Wittenbeck

Labels

None

Suitable for new contributors

None

Feedback Requested

None

Priority

Minor