Batching IndexWriter commits for multiple synchronous worksets

Description

An NRT based backend is orders of magnitude more efficient than a non-NRT one as it avoids a full flush cycle to the index for each LuceneBackendQueueTask which is applied to the index, and only triggers such flush to the Directory on specific needs (like when the in-memory buffers are running out of space).

But since changes are buffered in memory rather than being pushed to the store right away, an application requiring synchronous changes but using the Infinispan Directory to replicate changes to other application nodes can't use NRT as these changes would not be propagated synchronously, they would only be visible eventually when a flush event is triggered.

Consider that the Infinispan Directory can provide exceptional read efficiency but is not efficient at all at handling:

read locks as they need to be emulated with multiple RPCs doing optimistic CAS operations
index metadata replacements are also based on distributed CAS operations
a single segment write might trigger up to 8 different index metadata writes
one out of 10 (by default, or as configured the merging factor) writes on segment will trigger a rewrite of all segments
load on any IndexReader refresh will increase contention on the above CAS operations
and IndexReader refresh is triggered by visibility of a flush event

If you combine the notion of each of these bullet points, it's clear that each increase of IndexReader refresh operations, and each IndexWriter commit have an exponential effect on worsening the mount of network traffic and make the CAS operations unlikely to ever spin out.
The most important observation is that the IndexReader refresh operations are affected linearly by the frequency of commits: a refresh is directly caused by a previous commit. So simply controlling the frequency of IndexWriter commit events will directly control the frequency of IndexReader refresh operations, and so has the potential to dramatically increase the write throughput on the InfinispanDirectoryProvider.

So the core of the idea here is that for two incoming changesets A and B, incoming on behalf of two parallel user transactions TXA and TXB, normally we would apply this sequence of operations:

TXA enqueues A, and blocks waiting for it.
TXB enqueues B, and blocks waiting for it.
IndexWriter thread takes A from the Queue
IndexWriter thread applies A changes to the index
IndexWriter thread commits and flushes the index
TXA is notified and is unblocked
IndexWriter thread takes B from the Queue
IndexWriter thread applies B changes to the index
IndexWriter thread commits and flushes the index
TXB is notified and is unblocked

The above is the current strategy, and releases the waiting threads as soon as possible.
Let's assume that each write takes some nanoseconds, and each commit takes 200ms; actual figures will vary across systems and configurations but won't be too far from these orders of magnitude; so for the sake of this design we can approximate the cost of each write to nothing, while a commit time is significant.

The latency of the above described schema will then be approximately:

TXA will return in 1 commit (plus a nanosecond)
TXB will return in 2 commits (plus a couple nanoseconds)

If we now reorder the commit sequence to actually perform something like this:

TXA enqueues A, and blocks waiting for it.
TXB enqueues B, and blocks waiting for it.
IndexWriter thread takes A&B from the Queue
IndexWriter thread applies A&B changes to the index (maintaining order)
IndexWriter thread commits and flushes the index
TXA is notified and is unblocked
TXB is notified and is unblocked

at this point the measured latency will be:

TXA will return in 1 commit (plus a couple nanoseconds)
TXB will return in 1 commit (plus a couple nanoseconds)

Compared to the previous results, TXA was delayed by some nanoseconds but the order of magnitude of its response time is not affected.
TXB however did cut its waiting time in half.. more importantly, it's speed is increased by a factor N where N in this case is 2 as we have only two transactions writing, but N could be 100, you'd have a 100X performance improvement.

So what's the throughput that we get out of this pattern?

What matters most is to cap the target frequency of commits to a specific upper bound, and we want to cap this frequency at exactly the maximum throughput that the Infinispan storage can actually take. So if the IndexWriter thread is designed as a busy loop which always takes all of N waiting changesets and applies them, if the commit is slow there will be more changes in the queue waiting for the next iteration of the write loop, making N larger and so scaling up the throughput: the more the storage is slow, the larger the batches.

The upper bound is of course determined by the speed of writes - the ones that we previously considered negligible as they operate in the nanoseconds range; they will stack up and eventually become the limiting factor. The implementation needs also to take into account at least two exceptional cases:

what's the IndexWriter loop going to do when no changes are scheduled
do we need to cap the maximum size of batches

Ultimately the latency of each user thread is still going to be close to the performance that we can get out of the store to perform a single commit operation, as at least one commit needs to happen before the synchronous user thread is unblocked. The benefit will be that different threads won't "pile up" changes creating larger wait times but the response time is expected to be quite predictable, whatever the load.

Systems with an high degree of parallelism will benefit from this, and the performance should converge to the performance you would have without every doing a commit; however if the frequency of commits is apparoching to zero, it also means that the average latency of each operation will get significantly higher. Still, in such situations assuming we are for example stacking up a million changesets between each commit, that implies this solution would be approximately a million times faster than the existing design (A million would not be realistic of course as it implies a million of parallel requests).

Activity

Show:

Sanne GrinoveroNovember 13, 2014 at 5:28 PM

https://github.com/hibernate/hibernate-search/pull/700

Fixed

Details
Assignee
Gustavo Fernandes
Reporter
Sanne Grinovero
Components
infinispan
Fix versions
4.4.6.Final
4.5.2.Final
5.0.0.Beta2
Priority
Critical

Created October 17, 2014 at 3:48 PM

Updated November 19, 2014 at 12:05 AM

Resolved November 19, 2014 at 12:05 AM

Configure

Batching IndexWriter commits for multiple synchronous worksets

Description

So what's the throughput that we get out of this pattern?

Activity

Sanne GrinoveroNovember 13, 2014 at 5:28 PM

DetailsAssigneeGustavo FernandesGustavo FernandesReporterSanne GrinoveroSanne GrinoveroComponentsinfinispanFix versions4.4.6.Final4.5.2.Final5.0.0.Beta2PriorityCritical

Details

Assignee

Reporter

Components

Fix versions

Priority

Details
Assignee
Gustavo Fernandes
Reporter
Sanne Grinovero
Components
infinispan
Fix versions
4.4.6.Final
4.5.2.Final
5.0.0.Beta2
Priority
Critical