For the DirectoryBasedIndexManager, the performStreamOperation method will add the work to the index without committing or flushing, but for the ElasticsearchIndexManager it adds the work to a queue that will be applied to the index eventually.
When using batching backends to re-index data, this causes issues for Infinispan. Consider the usual sequence of operations for this scenario:
Purge indexes (once)
BatchBackend.enqueueAsyncWork (several times)
When using the DirectoryBasedIndexManager, a call to BatchBackend.flush is safe in the sense that the previous calls to enqueueAsyncWork have already been applied to the index (but not committed). For the ElasticsearchIndexManager, however, calling flush after an enqueueAsyncWork can cause the flush to happen before the work is added, causing some documents to be "missed" (not visible to searches).
Ideally the behaviour of those two index managers should be aligned. Also, it's important to have a way to detect whether there is pending work not yet submitted to the backend index, so that a flush can be called safely.
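To make the difference concrete, here is a toy model in Java. It is not Hibernate Search code: every class name below is hypothetical, and the queue is drained by an explicit drain() call that stands in for the asynchronous processor, so that the ordering is deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model only: none of these classes exist in Hibernate Search.
class FlushRaceDemo {
    public static void main(String[] args) {
        DirectToyIndexManager direct = new DirectToyIndexManager();
        direct.performStreamOperation("doc-1");
        direct.flush();
        System.out.println("direct sees " + direct.visible().size() + " document(s)");

        QueuedToyIndexManager queued = new QueuedToyIndexManager();
        queued.performStreamOperation("doc-1");
        queued.flush(); // flush runs before the queue has been drained
        System.out.println("queued sees " + queued.visible().size() + " document(s)");
    }
}

// Models the DirectoryBasedIndexManager behaviour described above:
// work is applied immediately, flush only makes it visible.
class DirectToyIndexManager {
    private final List<String> applied = new ArrayList<>();
    private final List<String> visible = new ArrayList<>();

    public void performStreamOperation(String doc) { applied.add(doc); }
    public void flush() { visible.clear(); visible.addAll(applied); }
    public List<String> visible() { return new ArrayList<>(visible); }
}

// Models the ElasticsearchIndexManager behaviour described above:
// work waits in a queue until the processor (here: drain()) applies it.
class QueuedToyIndexManager {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> applied = new ArrayList<>();
    private final List<String> visible = new ArrayList<>();

    public void performStreamOperation(String doc) { pending.add(doc); }
    public void drain() { while (!pending.isEmpty()) { applied.add(pending.poll()); } }
    public void flush() { visible.clear(); visible.addAll(applied); }
    public List<String> visible() { return new ArrayList<>(visible); }
    public int pendingCount() { return pending.size(); }
}
```

In this model the queued variant reports zero visible documents after the flush, and the document only shows up once drain() and a second flush() have run.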
I agree, let's not go that route, especially because for Elasticsearch submitting a batch of changes is much better than submitting them one by one. But the DirectoryBasedIndexManager could have an async option as well, so that both offer similar capabilities.
That's right. I don't know how complex that would be; maybe this could be tackled as part of this ticket.
This does not work well for the DirectoryBasedIndexManager: it could lead to cascading failures, since a flush can be very expensive after inserting a large amount of data. That's why the flush is carefully done once. I haven't tested it for Elasticsearch, but I suppose it could go better since they prevent multiple flushes in parallel.
Interesting. So, works submitted through org.hibernate.search.indexes.spi.DirectoryBasedIndexManager.performStreamOperation(LuceneWork, IndexingMonitor, boolean) eventually end up being executed in org.hibernate.search.backend.impl.lucene.LuceneBackendTaskStreamer.doWork(LuceneWork, IndexingMonitor), which uses a non-exclusive lock when applying changes (org.hibernate.search.backend.impl.lucene.LuceneBackendResources.getParallelModificationLock()).
But there also is an exclusive lock available (org.hibernate.search.backend.impl.lucene.LuceneBackendResources.getExclusiveModificationLock()). Maybe we should use that exclusive lock when flushing, similarly to what Elasticsearch probably does? That way, pending synchronous works would wait while we flush, and resume later.
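The locking idea can be sketched with a plain java.util.concurrent.locks.ReentrantReadWriteLock, assuming the read lock plays the role of the parallel modification lock and the write lock that of the exclusive one. This is only an illustration under that assumption, not the actual LuceneBackendResources implementation; ToyLockedIndex and ExclusiveFlushDemo are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo driver: several writer threads, then one flush.
class ExclusiveFlushDemo {
    static int run(int threads, int docsPerThread) {
        ToyLockedIndex index = new ToyLockedIndex();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    index.applyWork("doc-" + id + "-" + i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return index.flush();
    }

    public static void main(String[] args) {
        System.out.println("flushed " + run(4, 100) + " documents");
    }
}

// Hypothetical stand-in for an index guarded by a parallel and an exclusive lock.
class ToyLockedIndex {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Lock parallelModificationLock = lock.readLock();   // many writers at once
    private final Lock exclusiveModificationLock = lock.writeLock(); // flush runs alone

    private final List<String> applied = new ArrayList<>();
    private int flushedCount = 0;

    void applyWork(String doc) {
        parallelModificationLock.lock();
        try {
            // The list itself still needs its own guard, since read-lock holders run concurrently.
            synchronized (applied) { applied.add(doc); }
        } finally {
            parallelModificationLock.unlock();
        }
    }

    int flush() {
        // Blocks until no applyWork call is in progress, and holds new ones back while flushing.
        exclusiveModificationLock.lock();
        try {
            synchronized (applied) { flushedCount = applied.size(); }
            return flushedCount;
        } finally {
            exclusiveModificationLock.unlock();
        }
    }
}
```

Note that a lock like this only serializes a flush against work that is currently being applied. Work that is still sitting in a queue and has not reached applyWork yet is not covered by it.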
Anyway... I admittedly don't know much about this part of Hibernate Search. I think we'll have to wait for 's opinion before we go further, be it breaking the SPI or using an exclusive lock for flushes.
So one proposal for 5.6.x, as signaled would be to add the following method to org.hibernate.search.backend.spi.BatchBackend:
Implementation-wise, it's a no-op for the DirectoryBasedIndexManager until async is supported in the future (no pun intended), and for the ElasticsearchIndexManager it delegates to the RequestProcessor's waiting method.
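As a rough sketch of that split, the following toy code shows a no-op wait for a synchronous backend next to a blocking wait for a queue-based one. The method name awaitPendingWork is a placeholder and not the signature proposed above; the interface, both classes, and the single-threaded executor standing in for the request processor are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: wait for pending work, then flush.
class AwaitBeforeFlushDemo {
    public static void main(String[] args) {
        AsyncToyBackend backend = new AsyncToyBackend();
        for (int i = 0; i < 1000; i++) {
            backend.enqueueAsyncWork("doc-" + i);
        }
        backend.awaitPendingWork(); // without this, flush could run too early
        backend.flush();
        System.out.println("visible after flush: " + backend.visibleCount());
        backend.close();
    }
}

// Hypothetical interface; awaitPendingWork is a placeholder name.
interface ToyBatchBackend {
    void enqueueAsyncWork(String doc);
    void awaitPendingWork();
    void flush();
    int visibleCount();
}

// Synchronous variant: work is applied on enqueue, so there is nothing to wait for.
class SyncToyBackend implements ToyBatchBackend {
    private int applied;
    private int visible;

    public void enqueueAsyncWork(String doc) { applied++; }
    public void awaitPendingWork() { /* no-op: nothing is ever pending */ }
    public void flush() { visible = applied; }
    public int visibleCount() { return visible; }
}

// Queue-based variant: a background thread applies the work later.
class AsyncToyBackend implements ToyBatchBackend {
    private final ExecutorService processor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "toy-processor");
        t.setDaemon(true);
        return t;
    });
    private final List<Future<?>> inFlight = new ArrayList<>(); // touched by the caller thread only
    private final AtomicInteger applied = new AtomicInteger();
    private volatile int visible;

    public void enqueueAsyncWork(String doc) {
        inFlight.add(processor.submit(() -> { applied.incrementAndGet(); }));
    }

    public void awaitPendingWork() {
        for (Future<?> work : inFlight) {
            try {
                work.get(); // block until this piece of work has been applied
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        inFlight.clear();
    }

    public void flush() { visible = applied.get(); }
    public int visibleCount() { return visible; }
    public void close() { processor.shutdown(); }
}
```

A caller written against ToyBatchBackend can then always call awaitPendingWork() before flush(), whichever variant it is talking to.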
This proposal solves the MassIndexer issue for Infinispan.
Thanks . Moving this to 5.6.CR1 as I'd rather not make those changes after the CR (if we do).
, what's your opinion on this?
I also proposed a change that might not require adding methods to SPIs above, but since I'm not sure it's a valid use case, I'd like a second opinion. I made the changes on a branch to at least see if it breaks any test: https://github.com/yrodiere/hibernate-search/tree/HSEARCH-2492