Tika StringBridge
Description
Activity

Hardy Ferentschik July 18, 2012 at 5:16 PM
> it's not a pull request?
I guess I might as well make it a pull request. We can discuss potential changes there.

Sanne Grinovero July 18, 2012 at 5:04 PM
it's not a pull request?

Hardy Ferentschik July 18, 2012 at 4:42 PM
Here is a potential @TikaBridge implementation - https://github.com/hferentschik/hibernate-search/tree/HSEARCH-1171
Comments welcome.

Hardy Ferentschik July 11, 2012 at 7:09 PM
I experimented with your example project - https://github.com/hferentschik/hibernate-search-tika/tree/tika-blob-based - and switched to a stream-based approach throughout. This way you don't have to materialize the byte arrays. The Book entity now uses java.sql.Blob, and I am using the LobHelper to create the Blob (I have to resort to Hibernate-specific APIs for that, though).
Another side effect of using Blobs is that, at the moment, I cannot use the mass indexer, but have to use either automatic indexing or the indexing API of FullTextSession.
I also tested this approach against PostgreSQL and MySQL, and in both cases the tests run much faster (6 to 8 seconds for me).
What do you think about this approach?
Another idea regarding Tika integration - we could add a TikaBridge to the Search code base. When used, it would dynamically try to discover/load the Tika classes (e.g., it could look for AutoDetectParser). The bridge could handle multiple types (Blob, byte[], and whatever else we could come up with). WDYT? Is this a good approach to integrate Tika into Search? Any better ideas or suggestions?
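
To make the two ideas above concrete (discovering Tika at runtime, and one bridge accepting several value types), here is a minimal sketch using only the JDK. The class name TikaBridgeSketch and its methods are hypothetical and are not the implementation proposed for Search. The only Tika-specific detail is the fully qualified class name org.apache.tika.parser.AutoDetectParser, which is looked up reflectively, so the code compiles and runs whether or not Tika is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.sql.Blob;
import java.sql.SQLException;

// Hypothetical sketch, not the actual Hibernate Search bridge.
final class TikaBridgeSketch {

    // Fully qualified name of the Tika parser mentioned in the comment above.
    static final String TIKA_PARSER = "org.apache.tika.parser.AutoDetectParser";

    // Returns true if the named class can be located, without initializing it.
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, TikaBridgeSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Normalizes the supported value types to an InputStream so that a
    // parser could consume any of them in the same way.
    static InputStream openStream(Object value) {
        if (value instanceof byte[]) {
            return new ByteArrayInputStream((byte[]) value);
        }
        if (value instanceof Blob) {
            try {
                return ((Blob) value).getBinaryStream();
            } catch (SQLException e) {
                throw new IllegalStateException("Unable to open Blob stream", e);
            }
        }
        String type = (value == null) ? "null" : value.getClass().getName();
        throw new IllegalArgumentException("Unsupported type: " + type);
    }

    public static void main(String[] args) {
        System.out.println("Tika available: " + isClassPresent(TIKA_PARSER));
    }
}
```

A real bridge would presumably check for Tika once at startup and fail with a clear message if it is missing, rather than fail on the first indexed document; that choice is left open here.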

Hardy Ferentschik July 9, 2012 at 3:34 PM (edited)
Hi Florent,
> thank you for taking the time to review this behavior.
No worries. We are in fact looking for some good Tika integration code. That's something we are very interested in, and any help is welcome.
> To answer your question, I need to save the binary in the database, that's part of a requirement.
Fair enough then.
> What is really puzzling me is that the same document can be converted in a few seconds in a unit test (ByteArrayBridgeTest), which is the expected behavior... but can either throw an OutOfMemoryError or take minutes within a Hibernate Search context.
Well, in ByteArrayBridgeTest you don't really do much at all. You just read the input stream, pipe it through Tika and create a string. There is no indexing involved and, more importantly, no database access. When I run the tests and step through them to see where most of the time is spent, it is em.flush() that is the main bottleneck. HSQLDB is probably not so well suited for this type of test. Have you experimented with other databases? Also, you might consider working with java.sql.Blob. This way you might not have to load the whole data into memory. Have a look at the org.hibernate.LobHelper class.
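
As a rough illustration of the java.sql.Blob suggestion, the sketch below reads a Blob through getBinaryStream() with a small fixed buffer instead of pulling the whole content into one byte[]. It is only a sketch under assumptions: BlobStreamingSketch and countBytes are hypothetical names, and the JDK's SerialBlob stands in for a database-backed Blob so the example runs without Hibernate. SerialBlob itself keeps its data in memory, so the memory benefit only applies to a Blob backed by the database, such as one created through LobHelper.

```java
import java.io.IOException;
import java.io.InputStream;
import java.sql.Blob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialBlob;

// Hypothetical sketch: consume a Blob chunk by chunk.
final class BlobStreamingSketch {

    // Reads the Blob through a fixed-size buffer and returns the number of
    // bytes seen. A real bridge would hand the stream to a parser instead.
    static long countBytes(Blob blob, int bufferSize) throws SQLException, IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try (InputStream in = blob.getBinaryStream()) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a Blob loaded from the database.
        Blob blob = new SerialBlob(new byte[10_000]);
        System.out.println(countBytes(blob, 4096)); // prints 10000
    }
}
```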

I created a Tika-based StringBridge to convert any byte[] to its string representation.
When unit-testing this StringBridge, everything works fine.
When integration-testing this StringBridge, some parts do not work.
I created a project that shows this behavior here:
https://github.com/framiere/hibernate-search-tika