input stream support

Description

The current Hibernate Search functionality is not optimized for dealing with large text content. Two use cases:

1. indexing an external PDF that's 100MB where an @Field is set on a getter
2. indexing a @Lob field

In both cases, the method must return a String, or a base class. That might mean you have an InputStream that's 50MB, which gets concatenated into a String and then passed to an analyzer wrapped in a Reader object. I'm unclear what Hibernate Search is doing when the getter for the @Field annotation is called, but it would be ideal if it could use a Reader instead of a String.

Activity

adamb August 29, 2011 at 11:26 PM

I've been testing on a server. Moving the reader and stream initialization directly into the LazyField, and passing in a List of URIs, does appear to solve the "too many files open" issue, at least in this case. I haven't profiled this yet, but it seems like this is going to be the "best" performance possible.

thanks
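
For readers landing here, the approach described above might look roughly like the following. This is only a sketch using plain JDK classes, not the code from the comment: the class name LazyMultiFileReader and the method open are hypothetical, and it assumes every URI is a file: URI. The streams are handed to SequenceInputStream through a lazy Enumeration, so each file is opened only when the previous one has been fully read. As far as I understand the JDK implementation, the first file is opened as soon as the SequenceInputStream is constructed, so open() should be called at the point where the field is actually consumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: builds one Reader over several files without
// holding all of them open at once.
final class LazyMultiFileReader {

    static Reader open(List<URI> uris, Charset charset) {
        final Iterator<URI> it = uris.iterator();

        // Each stream is created only when SequenceInputStream asks for it.
        Enumeration<InputStream> lazyStreams = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return it.hasNext();
            }

            @Override
            public InputStream nextElement() {
                try {
                    // Assumes a file: URI; other schemes need different handling.
                    return Files.newInputStream(Paths.get(it.next()));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };

        // SequenceInputStream closes the current stream before moving on
        // to the next one, and closing the Reader closes the chain.
        return new InputStreamReader(new SequenceInputStream(lazyStreams), charset);
    }
}
```

Note that the file contents are joined byte for byte with no separator, so the last token of one file and the first token of the next can run together unless the files end with whitespace.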

Sanne Grinovero August 28, 2011 at 6:40 PM

Even in general, Lucene will use a lot of file handles, and web servers do too. I wonder if you have already raised the limits? Server/enterprise Linux distributions come preconfigured with a generous limit, but desktop/developer-oriented Linux distributions usually have an insufficient one.

Even if you already have a generous kernel limit, you make a good point that this design does not allow you to control the number of open streams. I think you should not open the stream initially when creating the LazyField. Instead, pass enough information (file path?) to the LazyField implementation so that it opens the reader only when it's needed, and closes it too. That way you avoid opening a resource in one thread and closing it in another, which is generally a bad idea, and there won't be more readers open than the number of workers in the thread_pool.
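
A minimal sketch of that suggestion, using only JDK classes: a Reader that stores the file path and opens the underlying file on the first read, so the open and the close both happen in whichever thread consumes it. The class name LazyFileReader and its isOpen() helper are hypothetical; how such a reader is wired into a LazyField is not shown here.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical Reader that holds only a path until someone reads from it.
final class LazyFileReader extends Reader {

    private final Path path;
    private final Charset charset;
    private Reader delegate;   // stays null until the first read
    private boolean closed;

    LazyFileReader(Path path, Charset charset) {
        this.path = path;
        this.charset = charset;
    }

    // True only while the underlying file is actually open.
    boolean isOpen() {
        return delegate != null;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (closed) {
            throw new IOException("reader already closed");
        }
        if (delegate == null) {
            // The file handle is acquired here, in the consuming thread.
            delegate = Files.newBufferedReader(path, charset);
        }
        return delegate.read(buf, off, len);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        if (delegate != null) {
            delegate.close();
            delegate = null;
        }
    }
}
```

With this shape, creating many fields up front costs no file handles; a handle exists only between the first read and the close.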

adamb August 28, 2011 at 6:11 PM

Revisiting this (and happy to open a separate ticket): there does seem to be another, deeper issue here with the LazyField model. If you pass in the readers, there's no control over the number of readers that may be open at a given moment. If I limit the following:

  • hibernate.search.worker.batch_size

  • hibernate.search.worker.thread_pool.size

  • hibernate.search.worker.buffer_queue.max

I still get errors from Lucene that I have too many files open, even though:

  • if I add a reader.close() immediately after adding the field to the document, I get closed stream errors

  • if I add a finalize() method that closes the stream I still get a too many files open error.

  • changing modes from lazy=true to lazy=false does nothing

  • disabling async mode does nothing

Is there another parameter that can be used?

adamb August 25, 2011 at 12:10 AM

Sanne,
Thanks for your comments. I'm trying to optimize an issue locally where we will be pulling in multiple files (sometimes large), hence trying to avoid string concatenation because of the memory cost. What I've found is that the Fieldable class (according to the documentation) should happily work with a reader if it's given one instead of the String. Documenting what I did (as this gets indexed by Google):

  1. changing the stringValue to return null and implementing a reader works well

  2. using a SequenceInputStream allows me to wrap the FileInputStreams into a single reader

  3. changing the FieldBridge to process a reader and pass it to the LazyField
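
Step 2 can be illustrated with plain JDK classes. The sketch below is not the code from this comment: the class name CombinedReaderSketch and its methods are hypothetical, and UTF-8 is an assumed encoding. It chains several already-open streams into one Reader without building an intermediate String.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

// Hypothetical helper showing how several streams become a single Reader.
final class CombinedReaderSketch {

    // Chains the streams in list order; bytes are decoded only as they are read.
    static Reader combine(List<InputStream> streams) {
        InputStream chained = new SequenceInputStream(Collections.enumeration(streams));
        // UTF-8 is an assumption; use the real encoding of the files.
        return new InputStreamReader(chained, StandardCharsets.UTF_8);
    }

    // Small utility for demonstration only; a real analyzer would consume
    // the Reader incrementally instead of collecting it into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        reader.close();
        return sb.toString();
    }
}
```

In real use the list would hold FileInputStream instances. Because they are all opened before the Reader is built, this form keeps one file handle per file open at the same time.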

Sanne Grinovero August 24, 2011 at 4:55 PM

Hi Adam,
I agree on opening this issue; no doubt better support for this should be explicit. I only linked to the original blog post in case you were looking for something to have it working with a current release.

Yes, stringValue() is invoked at some point, but the trick in that case is that it's invoked by another thread: as the backend is configured async, there is no performance hit on the main application thread. Or do you need it to never invoke stringValue()?

Details

Created August 24, 2011 at 2:55 AM
Updated October 11, 2023 at 12:17 PM