Meaning an entity model with 10, maybe 20 entities or more, lots of associations, and with data involving big entity graphs (because of long chains of associations, cycles, ...).
This would be useful for multiple reasons:
To keep bootstrap performance in check (be sure that we don't have some code running in O(e^n))
To compare bootstrap performance between Search 5 and 6
To compare indexing performance between Search 5 and 6, in particular in mappings with a lot of @IndexedEmbedded annotations (which means a lot of reindexing of other entities when one entity changes) and a lot of non-indexed properties (which should not trigger reindexing when they change)
Here are some sore points and possible solutions (if performance tests confirm that these are indeed a problem):
BatchingExecutor (created in by extracting code from the Elasticsearch backend)
If the same executor is used from multiple indexes (as it is in Elasticsearch), the wait when closing may be delayed indefinitely; see
By using one executor per index, we spawn one thread per index. Maybe we could allow users to limit the number of threads by using a single executor with a thread pool and multiple queues? Each thread would have to put a lock on the queue it uses to ensure indexes are not updated concurrently. (Sanne proposed this a while ago because it was important for Infinispan, if I remember correctly)
If the same executor is used from multiple indexes (as it is in Elasticsearch), we may end up with batches consisting only of requests to a single index. That is a shame, considering that by targeting different indexes we would be more likely to rely on different machines for execution and thus to improve parallelism. To avoid that, we could set up one queue per index and make the executor take works from multiple queues for a single batch. This would only make sense for Elasticsearch, where works are executed asynchronously; for Lucene we probably want multiple threads, each serving one index at a time, as explained above.
The executor waits for all works to finish before starting another batch. While this may not be a problem for Lucene (where we execute works synchronously anyway), for Elasticsearch this means at the end of each batch all the connections to the ES cluster are completely idle. We might want to switch to a more "continuous" algorithm that starts more requests as soon as one connection frees up, especially when the work queue is full or nearly full. Problem: do we have a way to "listen" to available connections? Related:
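The "one queue per index, batches drawn from several queues" idea above could look roughly like the following. This is only a sketch to make the proposal concrete: the class name RoundRobinBatcher, its methods, and the use of plain strings as works are all hypothetical and do not correspond to anything in the existing BatchingExecutor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: none of these names exist in Hibernate Search.
final class RoundRobinBatcher {
    // One FIFO queue per index name; LinkedHashMap gives a stable iteration order.
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();

    // Enqueue a work (represented here as a plain String) for the given index.
    synchronized void submit(String indexName, String work) {
        queues.computeIfAbsent(indexName, k -> new ArrayDeque<>()).add(work);
    }

    // Build one batch of at most maxSize works by taking one work per index in turn,
    // so that a busy index cannot fill the whole batch while other indexes have pending works.
    // Works of a given index keep their submission order.
    synchronized List<String> nextBatch(int maxSize) {
        List<String> batch = new ArrayList<>();
        boolean tookSomething = true;
        while (batch.size() < maxSize && tookSomething) {
            tookSomething = false;
            for (Queue<String> queue : queues.values()) {
                if (batch.size() >= maxSize) {
                    break;
                }
                String work = queue.poll();
                if (work != null) {
                    batch.add(work);
                    tookSomething = true;
                }
            }
        }
        return batch;
    }
}
```

Note that this sketch always starts a batch from the first registered index, so a real implementation would probably want to rotate the starting point between batches. It also says nothing about the "continuous" algorithm, which is a separate concern.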
Elasticsearch serial orchestrator
Serial execution may be completely unnecessary, because:
I'm not sure works in a single workset are sensitive to ordering
We know for sure that we already have problems with preserving relative ordering of worksets (which should be the same as the relative ordering of their source transactions), so we're probably already failing at preserving ordering. That should be more or less solved once we switch to fully asynchronous indexing, however: see and "Approach 2: entity change events" in
I don't think bulk requests execute items in order anyway, so we're probably already failing at preserving ordering.
Thus, we might want to experiment with parallel execution that only preserves relative ordering of works affecting the same document, for example by removing all but the last work affecting the same document in a given execution batch.
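To illustrate the "remove all but the last work affecting the same document" idea, here is a minimal sketch. The LastWorkWins and Work types are hypothetical and only stand in for whatever the orchestrator actually manipulates. The sketch also assumes that the last work for a document fully determines its final state (e.g. a full-document index or delete) and that document IDs are unique within the batch; if works were partial updates, or if the batch spanned several indexes, this would need more thought.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these types do not exist in Hibernate Search.
final class LastWorkWins {

    // Minimal stand-in for a work targeting a single document.
    static final class Work {
        final String documentId;
        final String operation;

        Work(String documentId, String operation) {
            this.documentId = documentId;
            this.operation = operation;
        }
    }

    // Keep only the last work for each document ID.
    // Surviving works stay in the relative order of their last occurrence in the batch.
    static List<Work> keepLastPerDocument(List<Work> batch) {
        Map<String, Work> lastByDocument = new LinkedHashMap<>();
        for (Work work : batch) {
            // Remove then put, so the entry moves to the end of the insertion order.
            lastByDocument.remove(work.documentId);
            lastByDocument.put(work.documentId, work);
        }
        return new ArrayList<>(lastByDocument.values());
    }
}
```

Once a batch is reduced this way, each remaining work targets a distinct document, so the works could in principle be executed in parallel without any ordering constraint inside the batch.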