When using the mass indexer, if a document fails to parse, either the whole block of indexed documents gets thrown out or everything after the exception gets thrown out. I'm still trying to figure out whether any documents before the exception are indexed; I suspect not.
I start indexing with the mass indexer, which grabs 20 classes and starts indexing them. The first 7 are fine and everything indexes properly. On the 8th, Tika hits a document it cannot parse and throws an exception. As far as I can tell, none of the 20 documents end up indexed.
It would be much more helpful not to throw a runtime exception when a document fails to parse, and instead to log a warning or do something less halting.
Hibernate 4.2.2.Final, MySQL 5.5, Hibernate Search 4.3.Final
I was unable to make the custom ErrorHandler solution work, so I ended up copying org.hibernate.search.bridge.builtin.TikaBridge into my codebase and modifying it to log parse errors and keep going.
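For readers who want the general shape of such a change, here is a minimal sketch of the "log and keep going" idea. It is not the modified TikaBridge itself: TextExtractor is a hypothetical stand-in for the Tika parsing step, and LenientBridgeSketch and extractOrEmpty are made-up names. It only assumes that an unparseable document should be indexed with empty text rather than abort the run.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LenientBridgeSketch {

    /** Hypothetical stand-in for the parsing step (Tika in the real bridge). */
    interface TextExtractor {
        String extract(byte[] content) throws Exception;
    }

    private static final Logger LOG =
            Logger.getLogger(LenientBridgeSketch.class.getName());

    private final TextExtractor extractor;

    LenientBridgeSketch(TextExtractor extractor) {
        this.extractor = extractor;
    }

    /**
     * Returns the extracted text, or an empty string if extraction fails.
     * The failure is logged as a warning instead of being rethrown, so the
     * caller can carry on with the next document.
     */
    String extractOrEmpty(String fieldName, byte[] content) {
        try {
            return extractor.extract(content);
        } catch (Exception e) {
            LOG.log(Level.WARNING,
                    "Could not parse content for field '" + fieldName
                            + "', indexing it as empty", e);
            return "";
        }
    }
}
```

The trade-off is that a failed document still shows up in the index, just without its extracted text, so the log is the only record that something went wrong.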
I ended up using the following annotations to bring it all together.
I think using your own custom Tika bridge is a good workaround for now. I'll have a closer look at this ASAP. It seems reasonable to hook the ErrorHandler in, but I need to have a closer look at the code.
To be clear, I was able to get the ErrorHandler implemented in terms of getting it hooked into the workflow. I just wasn't sure how to tell it to ignore the error and continue indexing as if nothing happened.
The underlying problem is that Document creation, and with it any bridge error, is not handled by the ErrorHandler. At least for the mass indexing case, this could be fixed by plugging the ErrorHandler into the right spot in EntityConsumerLuceneWorkProducer.
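To illustrate the idea in isolation, the sketch below wraps document creation per entity and hands any failure to a handler callback, so one bad entity does not take the rest of the batch down with it. This is not the EntityConsumerLuceneWorkProducer code: BatchIndexingSketch, buildDocuments, documentBuilder and errorHandler are hypothetical names, and the BiConsumer only stands in for whatever the real ErrorHandler hook would look like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Function;

class BatchIndexingSketch {

    /**
     * Builds one document per entity. If building a document throws, the
     * failure is passed to errorHandler and that entity is skipped; the
     * remaining entities in the batch are still processed.
     */
    static <E, D> List<D> buildDocuments(List<E> entities,
                                         Function<E, D> documentBuilder,
                                         BiConsumer<E, RuntimeException> errorHandler) {
        List<D> documents = new ArrayList<>();
        for (E entity : entities) {
            try {
                documents.add(documentBuilder.apply(entity));
            } catch (RuntimeException e) {
                // Report the failure and move on to the next entity.
                errorHandler.accept(entity, e);
            }
        }
        return documents;
    }
}
```

With this shape, the handler decides what "ignore and continue" means (log, count, collect IDs for a retry), which is the part that was unclear when hooking a custom ErrorHandler into the existing workflow.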