Document parse failures need graceful recovery

Description

When using the mass indexer, if a document fails to parse, either the whole block of indexed documents gets thrown out or everything after the exception gets thrown out. I'm still trying to figure out whether any documents before the exception are indexed; I suspect not.

Example:
I start the mass indexer, which grabs 20 classes and starts indexing them. The first 7 classes are fine and everything indexes properly. On the 8th class, Tika cannot parse a document and throws an exception. As far as I can tell, none of the 20 documents end up indexed?

It would be much more helpful not to throw a runtime exception when a document fails to parse, and instead to log a warning or do something else that doesn't halt indexing.

Environment

Hibernate 4.2.2.Final, MySQL 5.5, Hibernate Search 4.3.Final

Activity

Haywood J. B.
June 21, 2013, 2:13 PM

I was unable to make the custom ErrorHandler solution work, so I ended up copying org.hibernate.search.bridge.builtin.TikaBridge into my codebase and modifying it to log parse errors and keep going.
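
The modified bridge itself is not shown in this thread. As a rough sketch of the "log the parse error and keep going" idea, something like the following might apply. It deliberately uses hypothetical stand-in types (ContentParser, LenientParseBridge) rather than the real TikaBridge or Tika parser API, and returning an empty string on failure is an assumption made for illustration only.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a content parser such as Tika; not a real API.
interface ContentParser {
    String parse(byte[] raw) throws Exception;
}

// Sketch of a lenient bridge: a parse failure is logged as a warning and the
// field is left empty, instead of letting the exception abort the indexing run.
class LenientParseBridge {
    private static final Logger LOG =
            Logger.getLogger(LenientParseBridge.class.getName());

    private final ContentParser parser;

    LenientParseBridge(ContentParser parser) {
        this.parser = parser;
    }

    // Returns the extracted text, or an empty string if parsing failed.
    String extractText(String documentId, byte[] raw) {
        try {
            return parser.parse(raw);
        } catch (Exception e) {
            LOG.log(Level.WARNING,
                    "Skipping unparseable content for document " + documentId, e);
            return "";
        }
    }
}
```

With this shape, the entity that owns the unparseable content is still indexed, just without the extracted text, and the warning in the log identifies which document needs attention.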

I ended up using the following annotations to bring it all together.

Hardy Ferentschik
June 21, 2013, 6:17 PM

I think using your own custom Tika bridge is a good workaround for now. I'll have a closer look at this as soon as possible. It seems reasonable to hook the ErrorHandler in, but I need to check the code first.

Haywood J. B.
July 2, 2013, 8:54 PM

To be clear, I was able to implement the ErrorHandler and get it hooked into the workflow. I just wasn't sure how to tell it to ignore the error and continue indexing as if nothing had happened.

Thanks!

Hardy Ferentschik
July 30, 2013, 2:25 PM

The underlying problem is that Document creation, and with it any bridge error, is not handled by the ErrorHandler. At least for the mass indexing case, this could be fixed by plugging the ErrorHandler into the right spot in EntityConsumerLuceneWorkProducer.
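
To illustrate the idea, here is a minimal sketch of a producer loop that routes document-creation failures to a handler and carries on with the rest of the batch. All types here (IndexingErrorHandler, WorkProducerSketch) are hypothetical and only mimic the shape of the problem; they are not the actual ErrorHandler interface or the EntityConsumerLuceneWorkProducer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical callback; the real ErrorHandler interface differs.
interface IndexingErrorHandler {
    void handleException(String message, Throwable cause);
}

// Sketch of a work producer that confines a failure to a single entity.
class WorkProducerSketch<E> {
    // Stands in for bridge-driven document creation (assumption).
    private final Function<E, String> documentBuilder;
    private final IndexingErrorHandler errorHandler;

    WorkProducerSketch(Function<E, String> documentBuilder,
                       IndexingErrorHandler errorHandler) {
        this.documentBuilder = documentBuilder;
        this.errorHandler = errorHandler;
    }

    // Builds one document per entity. A failure is reported to the handler
    // and the remaining entities in the batch are still processed.
    List<String> buildDocuments(List<E> batch) {
        List<String> documents = new ArrayList<>();
        for (E entity : batch) {
            try {
                documents.add(documentBuilder.apply(entity));
            } catch (RuntimeException e) {
                errorHandler.handleException(
                        "Unable to build document for entity " + entity, e);
            }
        }
        return documents;
    }
}
```

Applied to the example in the description, a batch of 20 entities where the 8th fails to parse would yield 19 documents and one reported error, rather than losing the whole batch.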

Sanne Grinovero
July 30, 2013, 4:53 PM

Assignee

Hardy Ferentschik

Reporter

Haywood J. B.

Suitable for new contributors

None

Pull Request

None

Feedback Requested

None

Components

Fix versions

Affects versions

Priority

Major