Maximize utilization of database connections during mass indexing

Description

The mass indexer threads responsible for loading entities from the database currently have a loop that looks like this:

  • load X entities from the database

  • submit indexing requests for each entity

  • wait for indexing to finish for these entities

  • repeat

The "wait" step means that, while the backend is busy indexing, the thread will not load anything from the database. In effect, the thread holds on to a database connection that sits unused during that time.

This means that indexing doesn't execute in parallel with database loading, and as a result the execution time of mass indexing is probably close to the sum of the time spent loading entities and the time spent indexing. Ideally, we'd want those two operations to happen in parallel, so that the execution time of mass indexing is close to the maximum of the two instead of their sum.
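As a rough sketch, the current loop amounts to the following. The `Loader` and `Indexer` interfaces, and the use of `String` as the entity type, are hypothetical stand-ins for illustration, not the actual mass indexer internals:

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only: names and types are hypothetical, not the
// actual Hibernate Search mass indexer implementation.
public class SequentialLoadingLoop {

    /** Loads the next batch of entities; returns an empty list when exhausted. */
    interface Loader {
        List<String> loadNextBatch();
    }

    /** Submits a batch for indexing; the future completes when indexing is done. */
    interface Indexer {
        CompletableFuture<?> index(List<String> batch);
    }

    /** The current algorithm: load, submit, wait, repeat. */
    static int run(Loader loader, Indexer indexer) {
        int indexed = 0;
        while (true) {
            // The loading thread holds a database connection for the whole loop body...
            List<String> batch = loader.loadNextBatch();
            if (batch.isEmpty()) {
                break;
            }
            CompletableFuture<?> indexing = indexer.index(batch);
            // ...including this "wait" step, during which nothing is loaded
            // and the connection sits idle.
            indexing.join();
            indexed += batch.size();
        }
        return indexed;
    }
}
```

The `indexing.join()` call is the "wait" step described above: until it returns, the thread neither loads entities nor releases its connection.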

You can see the effect quite clearly in the attached Gantt chart of the indexing tasks in an Elasticsearch backend during mass indexing. From time to time, almost all executors are idle, because entities have not been submitted yet. If entity loading happened in parallel, the indexing executors would be less likely to stay idle.

One solution to this problem would be to move to a loop like this:

  • load X entities from the database (load #1)

  • submit indexing requests for each entity

  • load X entities from the database (load #2)

  • submit indexing requests for each entity

  • wait for indexing to finish for load #1

  • load X entities from the database (load #3)

  • submit indexing requests for each entity

  • wait for indexing to finish for load #2

  • load X entities from the database (load #4)

  • submit indexing requests for each entity

  • wait for indexing to finish for load #3

  • ...

This should greatly reduce the amount of waiting in the loading thread, since we will be loading the next batch of entities while the previous batch is being indexed. It also preserves a key characteristic of the previous algorithm: if database loading turns out to be much faster than indexing (who knows...), we won't flood the indexing queues with requests, since at most two batches of entities are pending at any given time.

This would probably be rather easy to implement, since the completion of indexing is already modelled with CompletableFutures: we just have to store the future of the previous batch somewhere, and wait for it to finish before waiting on the next one.
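The proposed one-batch lookahead could be sketched as below. As before, `Loader` and `Indexer` are hypothetical placeholders, not the real internals; the point is the order of submit and wait:

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of the proposed pipelined loop; names are hypothetical,
// not the actual Hibernate Search mass indexer implementation.
public class PipelinedLoadingLoop {

    interface Loader {
        List<String> loadNextBatch();
    }

    interface Indexer {
        CompletableFuture<?> index(List<String> batch);
    }

    static int run(Loader loader, Indexer indexer) {
        int indexed = 0;
        // Future of the previously submitted batch; starts out already completed.
        CompletableFuture<?> previous = CompletableFuture.completedFuture(null);
        while (true) {
            List<String> batch = loader.loadNextBatch();
            if (batch.isEmpty()) {
                break;
            }
            // Submit batch N *before* waiting for batch N-1, so that loading
            // batch N+1 overlaps with indexing batch N.
            CompletableFuture<?> current = indexer.index(batch);
            previous.join(); // at most two batches are ever in flight
            previous = current;
            indexed += batch.size();
        }
        previous.join(); // drain the last in-flight batch
        return indexed;
    }
}
```

Tracing this loop reproduces the sequence above: load #1, submit #1, load #2, submit #2, wait for #1, load #3, submit #3, wait for #2, and so on.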

Environment

None

Assignee

Unassigned

Reporter

Yoann Rodière

Labels

None

Suitable for new contributors

None

Pull Request

None

Feedback Requested

None

Components

Fix versions

Priority

Major