Maximize utilization of database connections during mass indexing

Description

The mass indexer threads responsible for loading entities from the database currently have a loop that looks like this:

  • load X entities from the database

  • submit indexing requests for each entity

  • wait for indexing to finish for these entities

  • repeat

The "wait" step means that, while the backend is busy indexing, the thread will not load anything from the database. In effect, the thread holds on to a database connection that sits unused during that time.

This means that indexing doesn't execute in parallel with database loading, and as a result the execution time of mass indexing is probably close to the sum of the time spent loading entities and the time spent indexing. Ideally, we'd want those two operations to happen in parallel, so that the execution time of mass indexing is close to the maximum of the two instead of their sum.
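As a rough sketch, the current loop amounts to the following. The `Loader` and `Indexer` interfaces, and the use of `String` as the entity type, are hypothetical stand-ins for illustration, not the actual mass indexer internals:

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only: names and types are hypothetical, not the
// actual Hibernate Search mass indexer implementation.
public class SequentialLoadingLoop {

    /** Loads the next batch of entities; returns an empty list when exhausted. */
    interface Loader {
        List<String> loadNextBatch();
    }

    /** Submits a batch for indexing; the future completes when indexing is done. */
    interface Indexer {
        CompletableFuture<?> index(List<String> batch);
    }

    /** The current algorithm: load, submit, wait, repeat. */
    static int run(Loader loader, Indexer indexer) {
        int indexed = 0;
        while (true) {
            // The loading thread holds a database connection for the whole loop body...
            List<String> batch = loader.loadNextBatch();
            if (batch.isEmpty()) {
                break;
            }
            CompletableFuture<?> indexing = indexer.index(batch);
            // ...including this "wait" step, during which nothing is loaded
            // and the connection sits idle.
            indexing.join();
            indexed += batch.size();
        }
        return indexed;
    }
}
```

The `indexing.join()` call is the "wait" step described above: until it returns, the thread neither loads entities nor releases its connection.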

You can see the effect quite clearly in the attached Gantt chart of the indexing tasks in an Elasticsearch backend during mass indexing. From time to time, almost all executors are idle, because entities have not been submitted yet. If entity loading happened in parallel, the indexing executors would be less likely to stay idle.

One solution to this problem would be to move to a loop like this:

  • load X entities from the database (load #1)

  • submit indexing requests for each entity

  • load X entities from the database (load #2)

  • submit indexing requests for each entity

  • wait for indexing to finish for load #1

  • load X entities from the database (load #3)

  • submit indexing requests for each entity

  • wait for indexing to finish for load #2

  • load X entities from the database (load #4)

  • submit indexing requests for each entity

  • wait for indexing to finish for load #3

  • ...

This should greatly reduce the amount of waiting in the loading thread, since we will be loading the next batch of entities while the previous batch is being indexed. It also preserves a key characteristic of the previous algorithm: if database loading turns out to be much faster than indexing (who knows...), we won't flood the indexing queues with requests, since at most two batches of entities are pending at any given time.

This would probably be rather easy to implement, since the completion of indexing is already modelled with CompletableFutures: we just have to store the future of the previous batch somewhere, and wait for it to finish before waiting on the next one.
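The proposed one-batch lookahead could be sketched as below. As before, `Loader` and `Indexer` are hypothetical placeholders, not the real internals; the point is the order of submit and wait:

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of the proposed pipelined loop; names are hypothetical,
// not the actual Hibernate Search mass indexer implementation.
public class PipelinedLoadingLoop {

    interface Loader {
        List<String> loadNextBatch();
    }

    interface Indexer {
        CompletableFuture<?> index(List<String> batch);
    }

    static int run(Loader loader, Indexer indexer) {
        int indexed = 0;
        // Future of the previously submitted batch; starts out already completed.
        CompletableFuture<?> previous = CompletableFuture.completedFuture(null);
        while (true) {
            List<String> batch = loader.loadNextBatch();
            if (batch.isEmpty()) {
                break;
            }
            // Submit batch N *before* waiting for batch N-1, so that loading
            // batch N+1 overlaps with indexing batch N.
            CompletableFuture<?> current = indexer.index(batch);
            previous.join(); // at most two batches are ever in flight
            previous = current;
            indexed += batch.size();
        }
        previous.join(); // drain the last in-flight batch
        return indexed;
    }
}
```

Tracing this loop reproduces the sequence above: load #1, submit #1, load #2, submit #2, wait for #1, load #3, submit #3, wait for #2, and so on.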

Environment

None

Assignee

Unassigned

Reporter

Yoann Rodière

Labels

None

Suitable for new contributors

None

Pull Request

None

Feedback Requested

None

Components

Fix versions

Priority

Major