
Near-zero-downtime mass indexing

Description

Splitting part of the discussion from HSEARCH-2861: even knowing this does not address zero-downtime application updates completely, a near-zero-downtime mass indexing could address at least some use cases, so it's worth considering independently from HSEARCH-2861.

Currently, when we run the mass indexer, we generally need to execute a purge before reindexing: we just drop all the content from the index, and start over from an empty index.

This means that search queries will return very incomplete results during mass indexing: at first they won't return anything, then they will return more and more results as mass indexing progresses towards 100%.

Another strategy exists, though: when we start reindexing, we could create a new (empty) version of the index, alongside the existing one. Then we would redirect all writes to that new index, both the mass indexing and other writes. When mass indexing is over, we would just redirect reads to the new index, then remove the old one.

The main advantage is that, in lots of cases, search queries would still return usable results during mass indexing. The results would not be perfect, since they would be out of date, but depending on the situation they could still be significantly better.

The main disadvantage is that we would need roughly twice the storage space for the index, for the duration of mass indexing.
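The read/write routing described above can be sketched as a small state machine. Everything here (class and method names) is illustrative, not actual Hibernate Search API:

```python
# Illustrative sketch: the routing states an index manager would go through
# during near-zero-downtime mass indexing. Names are hypothetical.

class IndexRouter:
    """Tracks which physical index serves reads and which receives writes."""

    def __init__(self, current_index):
        self.read_index = current_index
        self.write_index = current_index

    def start_mass_indexing(self, new_index):
        # Reads keep hitting the old index; all writes go to the new one.
        self.write_index = new_index

    def finish_mass_indexing(self):
        # Flip reads over to the freshly built index; the old one can be dropped.
        self.read_index = self.write_index

    def rollback(self):
        # Mass indexing failed: send writes back to the old index.
        self.write_index = self.read_index


router = IndexRouter("myindex-0")
router.start_mass_indexing("myindex-1")
# During mass indexing: reads -> myindex-0, writes -> myindex-1
router.finish_mass_indexing()
# Afterwards: both reads and writes -> myindex-1
```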

Use cases

Periodic reindexing

The main, most appropriate use case would be periodic reindexing. When some entity changes affect far too many documents, you may decide to instruct Hibernate Search to ignore such changes, leaving the index in a partially outdated state. Then you would periodically reindex in order to bring the index back up to date.

Hot application updates

Another use case would be "hot" application updates. If a new version of an application has to be deployed, and that new version uses a slightly different Hibernate Search mapping, there's a good chance that data has to be reindexed (to add new fields, update existing fields that are now stored differently, ...).

There is, however, a big downside to this solution for this use case: if the index structure changed in an incompatible way between the old and new mapping (fields were removed or had their type changed), some search queries could fail with an exception. For example, if a field was a text field in the old index but new search queries expect it to be numeric, some predicates may fail, and projections definitely won't work well.

Implementation

Elasticsearch

For this to work, the index needs to be using aliases.
Things get a bit complex if the name of the index configured in Hibernate Search points directly to the index (not to an alias),
because there doesn't seem to be an atomic operation for "rename index A to B and create an alias A pointing to B".
So, maybe we should make sure that Hibernate Search always uses aliases when creating indexes automatically?

Before mass indexing, for each index affected by mass indexing:

  1. Retrieve the name of the index pointed to by the alias myindex. Let's assume the actual name is myindex-0. If myindex is not an alias, fail.

  2. Generate a name for the new index. For example we could use the name of the index in Hibernate Search with an incrementing suffix (myindex-0, myindex-1, ...). We would pick the suffix that follows the one used in the name of the old index, in this case -1, naming the new index myindex-1.

  3. Create the new index and its mapping.

  4. Change the internal state of the index manager so that subsequent writes go to myindex-1 (not reads, just writes). Note that we need some try/finally block to roll back this change if mass indexing fails: the index will end up out of date, but at least it will be usable.
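Steps 1 and 2 above (deriving the new index name from the current one) could be sketched like this; the function itself is hypothetical, only the myindex-N naming convention comes from the steps above:

```python
import re

def next_index_name(current_name):
    """Given the physical index name behind the alias (e.g. 'myindex-0'),
    derive the name for the new index by incrementing the numeric suffix.
    The naming scheme is an assumption, not an existing API."""
    match = re.fullmatch(r"(.+)-(\d+)", current_name)
    if match is None:
        raise ValueError(f"Unexpected index name: {current_name!r}")
    base, suffix = match.group(1), int(match.group(2))
    return f"{base}-{suffix + 1}"
```

For example, `next_index_name("myindex-0")` yields `myindex-1`, which then becomes the target of all subsequent writes.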

After mass indexing, for each index affected by mass indexing:

  1. Change the alias myindex to point to myindex-1.

  2. Remove myindex-0.

  3. Change the internal state of the index manager so that subsequent writes go to myindex.
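The alias flip in step 1 can be done in a single call to Elasticsearch's `_aliases` endpoint, which applies its actions atomically, so there is no window where the alias points nowhere. A sketch of the request body, built in plain Python for illustration:

```python
def alias_swap_actions(alias, old_index, new_index):
    """Body for a POST to Elasticsearch's /_aliases endpoint. All actions
    in the list are applied atomically: the alias moves from the old index
    to the new one in one step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap_actions("myindex", "myindex-0", "myindex-1")
```

Deleting `myindex-0` (step 2) would still be a separate `DELETE` request afterwards.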

Embedded Lucene

Before mass indexing, for each index affected by mass indexing:

  1. Create the new index on disk; for example use the same name as the old index, with a _new suffix.

  2. Change the internal state of the index manager so that subsequent writes go to the new index (not reads, just writes). Note that we need some try/finally block to roll back this change if mass indexing fails: the index will end up out of date, but at least it will be usable.

After mass indexing, for each index affected by mass indexing:

  1. Acquire a lock on the index to be sure that no reads or writes are currently in progress.

  2. Remove the old index.

  3. Move the new index to the exact place where the old index was; for example remove the _new suffix.

  4. Change the internal state of the index manager so that subsequent writes go to the appropriate index.

  5. Release the lock on the index.

Depending on the Lucene Directory, we might be able to avoid the lock, but that should be investigated.
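The post-mass-indexing swap (steps 2 and 3) can be sketched with plain filesystem operations; a real implementation would go through Lucene's Directory and locking abstractions, so everything here is a simplified stand-in:

```python
# Simplified stand-in for the Lucene index swap: remove the old index
# directory and move the "_new" one into its place, under a lock.
import shutil
from pathlib import Path
from threading import Lock

index_lock = Lock()  # stand-in for the per-index lock discussed above

def swap_indexes(index_dir: Path) -> None:
    """Replace the old index directory with the '_new' directory
    that mass indexing wrote into."""
    new_dir = index_dir.with_name(index_dir.name + "_new")
    with index_lock:  # no reads or writes may be in progress
        shutil.rmtree(index_dir)   # remove the old index
        new_dir.rename(index_dir)  # move the new index into place
```

Note that between the removal and the rename there is a brief window with no index on disk at all, which is exactly why the lock (or a Directory-level equivalent) is needed.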

Variations

We could do everything as described above, except that instead of being redirected to the new index, non-mass-indexer writes would be duplicated: applied to both the old index and the new one. That way, search queries could continue to return up-to-date results.

Of course, we would put extra pressure on the backend (twice the writes, on top of the mass indexing), so this is not appropriate for all situations.

Note this may not work very well in the case of a mapping change: we would write into the old index documents that are meant for the new index.
Worse, in that case we could corrupt the whole index (by inserting fields that did not make sense in the old schema) and make an application rollback much harder, because the index would have to be rebuilt.
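A minimal sketch of this dual-write routing, with a hypothetical interface: the mass indexer writes only to the new index, while live application writes go to both.

```python
# Illustrative dual-write variant. The interface is hypothetical;
# only the routing rule comes from the description above.

class DualWriteRouter:
    def __init__(self, old_index, new_index):
        self.old_index = old_index
        self.new_index = new_index

    def write_targets(self, from_mass_indexer):
        # The mass indexer only ever writes to the new index;
        # live application writes are duplicated to both.
        if from_mass_indexer:
            return [self.new_index]
        return [self.old_index, self.new_index]
```

This makes the doubled write load explicit: every live write now costs two backend operations, on top of the mass indexing traffic.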

Environment

None

Status

Assignee

Unassigned

Reporter

Yoann Rodière

Fix versions

Priority

Major