Document parse failures need graceful recovery

Sanne Grinovero July 30, 2013 at 4:53 PM

Hardy Ferentschik July 30, 2013 at 2:25 PM
The underlying problem is that Document creation, and any bridge errors raised as part of it, are not handled by the ErrorHandler. At least for the mass indexing case this could be fixed by plugging the ErrorHandler into the right spot in EntityConsumerLuceneWorkProducer.
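A minimal sketch of the idea, not the actual EntityConsumerLuceneWorkProducer code: each entity is converted inside its own try/catch, and a failure is reported to a handler instead of aborting the rest of the batch. `FailureHandler` and `TolerantBatchConverter` are hypothetical names used only for illustration; the real ErrorHandler interface in Hibernate Search has a different signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for an error handler; NOT the Hibernate Search ErrorHandler interface. */
interface FailureHandler {
    void handle(Object entity, Exception cause);
}

/** Converts entities one at a time so a single bad entity cannot discard the whole batch. */
class TolerantBatchConverter<E, D> {
    private final Function<E, D> documentBuilder;
    private final FailureHandler failureHandler;

    TolerantBatchConverter(Function<E, D> documentBuilder, FailureHandler failureHandler) {
        this.documentBuilder = documentBuilder;
        this.failureHandler = failureHandler;
    }

    /** Returns the documents that could be built; failures are reported and skipped. */
    List<D> convert(List<E> batch) {
        List<D> documents = new ArrayList<>();
        for (E entity : batch) {
            try {
                documents.add(documentBuilder.apply(entity));
            } catch (RuntimeException e) {
                // Report the failure, then continue with the next entity.
                failureHandler.handle(entity, e);
            }
        }
        return documents;
    }
}
```

With this shape, a batch of 20 entities where the 8th fails would still yield 19 documents, and the handler decides whether the failure is logged, counted, or escalated.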

Haywood J. B. July 2, 2013 at 8:54 PM
To be clear, I was able to implement the ErrorHandler and get it hooked into the workflow. I just wasn't sure how to tell it to ignore the error and continue indexing as if nothing had happened.
Thanks!

Hardy Ferentschik June 21, 2013 at 6:17 PM
I think using your own custom Tika bridge is a good workaround for now. I'll have a closer look at this asap. It seems reasonable to hook the ErrorHandler in, but I need to have a closer look at the code.

Haywood J. B. June 21, 2013 at 2:13 PM (edited)
I was unable to make the custom ErrorHandler solution work, so I ended up copying and pasting the org.hibernate.search.bridge.builtin.TikaBridge into my codebase and modifying it to log parse errors but keep moving.
I ended up using the following annotations to bring it all together.
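For readers who want the general shape of a "log and keep moving" bridge, here is a rough sketch. It is neither the modified TikaBridge nor the annotations mentioned in this comment. `ContentBridge` is a hypothetical, simplified contract with the document reduced to a plain map; the real FieldBridge signature in Hibernate Search is different.

```java
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical, simplified bridge contract; the real FieldBridge signature differs. */
interface ContentBridge {
    void set(String fieldName, Object value, Map<String, String> document);
}

/** Delegates to another bridge and logs failures instead of propagating them. */
class LenientBridge implements ContentBridge {
    private static final Logger LOG = Logger.getLogger(LenientBridge.class.getName());

    private final ContentBridge delegate;

    LenientBridge(ContentBridge delegate) {
        this.delegate = delegate;
    }

    @Override
    public void set(String fieldName, Object value, Map<String, String> document) {
        try {
            delegate.set(fieldName, value, document);
        } catch (RuntimeException e) {
            // Swallow the parse failure: the field is left out, the document is still indexed.
            LOG.log(Level.WARNING, "Skipping field '" + fieldName + "': content could not be parsed", e);
        }
    }
}
```

The trade-off is that an unparseable attachment silently produces a document without that field, so the warning log is the only trace of the problem.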

When using the mass indexer, if a document fails to parse, either the whole block of indexed documents gets thrown out or everything after the exception gets thrown out. I'm still trying to figure out whether any documents before the exception are indexed... I suspect not.
Example:
I start indexing with the mass indexer, which grabs 20 classes. The first 7 classes are fine and everything indexes properly. On the 8th class, Tika is unable to parse a document and throws an exception. None of the 20 documents are indexed?
It would be much more helpful to not throw a runtime exception when a document fails to parse and instead log a warning or something less halting.