Near-zero-downtime mass indexing when the schema did not change

Description

Splitting part of the discussion from HSEARCH-2861 (Zero-downtime/hot schema updates for the Elasticsearch backend): even though this does not fully address zero-downtime application updates, near-zero-downtime mass indexing could address at least some use cases, so it's worth considering independently from HSEARCH-2861.

Currently, when we run the mass indexer, we generally need to execute a purge before reindexing: we just drop all the content from the index, and start over from an empty index.

This means that search queries will return very incomplete results during mass indexing: at first they won't return anything, then they will return more and more results as mass indexing progresses towards 100%.

Another strategy exists, though: when we start reindexing, we could create a new (empty) version of the index, alongside the existing one. Then we would redirect all writes to that new index, both the mass indexing and other writes. When mass indexing is over, we would just redirect reads to the new index, then remove the old one.

The main advantage is that, in lots of cases, search queries would still return usable results during mass indexing. The results would still not be perfect, since they would be out of date, but still, depending on the situation they could be significantly better.

The main disadvantage is that we would require about two times the storage space for the index, for the duration of mass indexing.

Use cases

Periodic reindexing

The main, most appropriate use case would be periodic reindexing. When some entity changes affect way too many documents, you may decide to instruct Hibernate Search to ignore such changes, leaving the index in a partially outdated state. Then you will periodically reindex in order to

Hot application updates

Another use case would be "hot" application updates. If a new version of an application has to be deployed, and that new version uses a slightly different Hibernate Search mapping, there's a good chance that data has to be reindexed (to add new fields, update existing fields that are now stored differently, ...).

There is, however, a big downside to this solution for this use case: if the index structure changed in an incompatible way (fields were removed or had their type changed) between the old and new mapping, some search queries could fail with an exception. For example if a field was a text field in the old index, but new search queries expect it to be numeric: some predicates may fail, projections definitely won't work well, ...

Implementation

Elasticsearch

For this to work, the index needs to be using aliases.
Things get a bit complex if the name of the index configured in Hibernate Search points directly to the index (not to an alias),
because there doesn't seem to be an atomic operation for "rename index A to B and create an alias A pointing to B".
So, maybe we should make sure that Hibernate Search always uses aliases when creating indexes automatically?

Before mass indexing, for each index affected by mass indexing:

  1. Retrieve the name of the index pointed to by the alias myindex. Let's assume the actual name is myindex-0. If myindex is not an alias, fail.

  2. Generate a name for the new index. For example we could use the name of the index in Hibernate Search with an incrementing suffix (myindex-0, myindex-1, ...). We would pick the suffix that follows the one used in the name of the old index, in this case -1, naming the new index myindex-1.

  3. Create the new index and its mapping.

  4. Change the internal state of the index manager so that subsequent writes go to myindex-1 (not reads, just writes). Note that we need some try/finally block to roll back this change if mass indexing fails: the index will end up out of date, but at least it will be usable.
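
Steps 2 and 4 above can be sketched as follows. This is only an illustration: WriteTargetSwitch and its methods are hypothetical names, not part of Hibernate Search, and the sketch assumes index names always end with a numeric suffix such as -0. Creating the index and its mapping (step 3) is not shown.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Hibernate Search. */
final class WriteTargetSwitch {
    // Assumption: physical index names look like "<name>-<number>".
    private static final Pattern SUFFIXED = Pattern.compile("^(.*)-(\\d+)$");

    // Name of the index that currently receives writes.
    private final AtomicReference<String> writeTarget;

    WriteTargetSwitch(String currentIndexName) {
        this.writeTarget = new AtomicReference<>(currentIndexName);
    }

    /** Step 2: derive the next name, e.g. "myindex-0" gives "myindex-1". */
    static String nextIndexName(String currentIndexName) {
        Matcher m = SUFFIXED.matcher(currentIndexName);
        if (!m.matches()) {
            throw new IllegalArgumentException("No numeric suffix in: " + currentIndexName);
        }
        long next = Long.parseLong(m.group(2)) + 1;
        return m.group(1) + "-" + next;
    }

    String writeTarget() {
        return writeTarget.get();
    }

    /**
     * Step 4: redirect writes to the new index while mass indexing runs,
     * and roll back to the previous target if mass indexing fails.
     */
    void massIndexInto(String newIndexName, Runnable massIndexing) {
        String previous = writeTarget.getAndSet(newIndexName);
        boolean success = false;
        try {
            massIndexing.run();
            success = true;
        } finally {
            if (!success) {
                // The old index is out of date, but still usable.
                writeTarget.set(previous);
            }
        }
    }
}
```

On success the write target stays on the new index; pointing it back at the alias belongs to the "after mass indexing" steps.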

After mass indexing, for each index affected by mass indexing:

  1. Change the alias myindex to point to myindex-1.

  2. Remove myindex-0.

  3. Change the internal state of the index manager so that subsequent writes go to myindex.
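
Steps 1 and 2 above could probably be combined into a single call to the Elasticsearch aliases API (POST /_aliases), which applies a list of actions atomically and supports a remove_index action. The sketch below only builds the request body; AliasSwap is a hypothetical name, sending the request is left out, and the string concatenation assumes the names contain no characters that need JSON escaping (Elasticsearch index names cannot contain quotes or backslashes).

```java
/** Hypothetical helper, not part of Hibernate Search. */
final class AliasSwap {
    /**
     * Builds the body of a POST /_aliases request that moves the alias
     * from the old index to the new one, then deletes the old index.
     * Assumption: names need no JSON escaping.
     */
    static String requestBody(String alias, String oldIndex, String newIndex) {
        return "{\"actions\":["
                + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
                + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}},"
                + "{\"remove_index\":{\"index\":\"" + oldIndex + "\"}}"
                + "]}";
    }
}
```

For example, AliasSwap.requestBody("myindex", "myindex-0", "myindex-1") produces the actions for the scenario described above.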

Embedded Lucene

Before mass indexing, for each index affected by mass indexing:

  1. Create the new index on disk; for example use the same name as the old index, with a _new suffix.

  2. Change the internal state of the index manager so that subsequent writes go to the new index (not reads, just writes). Note that we need some try/finally block to roll back this change if mass indexing fails: the index will end up out of date, but at least it will be usable.

After mass indexing, for each index affected by mass indexing:

  1. Acquire a lock on the index to be sure that no read/writes are currently in progress.

  2. Remove the old index.

  3. Move the new index to the exact place where the old index was; for example remove the _new suffix.

  4. Change the internal state of the index manager so that subsequent writes go to the appropriate index.

  5. Release the lock on the index.

Depending on the Lucene Directory, we might be able to avoid the lock, but that should be investigated.
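
A minimal sketch of the locked swap for a filesystem-based index, using only the JDK. DirectorySwap is a hypothetical name; the sketch assumes both directories live on the same file store (so the move is a plain rename) and that every reader and writer goes through runShared. Closing and reopening Lucene readers and writers is not shown.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Hibernate Search or Lucene. */
final class DirectorySwap {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Regular reads and writes run under the shared lock. */
    void runShared(Runnable indexOperation) {
        lock.readLock().lock();
        try {
            indexOperation.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Steps 1 to 5: lock, remove the old index, move the new one in place, unlock. */
    void swap(Path oldIndex, Path newIndex) throws IOException {
        lock.writeLock().lock(); // waits until no read/write is in progress
        try {
            deleteRecursively(oldIndex);
            // Assumption: same file store, so this is a simple rename.
            Files.move(newIndex, oldIndex);
        } finally {
            lock.writeLock().unlock();
        }
    }

    private static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so that children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }
}
```

The exclusive lock is the simplest way to guarantee that nothing touches the directory during the swap, at the cost of briefly blocking searches.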

Variations

We could do everything as described above, except that instead of being redirected to the new index, non-massindexer writes are duplicated to the new index. That way, search queries could continue to get up-to-date results.

Of course, we would put extra pressure on the backend (twice the writes, on top of the mass indexing), so this is not appropriate for all situations.

Note this may not work very well in the case of a mapping change: we would write into the old index documents that are meant for the new index.
Worse, in that case we could corrupt the whole index (by inserting fields that did not make sense in the old schema) and make an application rollback much harder, because the index would have to be rebuilt.
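
The duplicated-write variation can be sketched as a thin wrapper that fans each non-mass-indexer write out to every active index. DuplicatingWriter and its nested IndexWriter interface are hypothetical names used for illustration only; error handling (for example, what to do when only one of the two writes succeeds) is not addressed.

```java
import java.util.List;

/** Hypothetical helper, not part of Hibernate Search. */
final class DuplicatingWriter {
    /** Hypothetical abstraction over "write this document to that index". */
    interface IndexWriter {
        void write(String indexName, String documentId, String json);
    }

    private final IndexWriter delegate;
    // Indexes that currently receive regular (non-mass-indexer) writes.
    private volatile List<String> targets;

    DuplicatingWriter(IndexWriter delegate, String currentIndex) {
        this.delegate = delegate;
        this.targets = List.of(currentIndex);
    }

    /** From now on, write to both the old and the new index. */
    void startMassIndexing(String newIndex) {
        targets = List.of(targets.get(0), newIndex);
    }

    /** Mass indexing is over: only the new index remains. */
    void finishMassIndexing() {
        targets = List.of(targets.get(targets.size() - 1));
    }

    void write(String documentId, String json) {
        for (String target : targets) {
            delegate.write(target, documentId, json);
        }
    }
}
```

Every regular write costs two backend calls while mass indexing runs, which is the extra pressure mentioned above.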

Activity

Yoann Rodière June 4, 2024 at 8:00 AM

Nice blog post explaining the problem and solution in great detail: https://quarkus.io/blog/search-indexing-rollover/

Yoann Rodière June 18, 2020 at 7:12 AM
Edited

Is there a way to access the (expected) mapping of a given index?

At the moment, no, there isn't a way to access it programmatically. The best you can do is to run the schema creation in a development environment and get the resulting mapping from Elasticsearch.

*EDIT*: Related: https://hibernate.atlassian.net/browse/HSEARCH-2366#icft=HSEARCH-2366

But when creating the new index (by calling the ES API directly) I don’t see how to post the index mappings along with the new index.

I think you mean you don't see how to guess the new index mapping that you must create in the new index? But just in case, here is how to post the mapping while creating the index.

Am I taking the wrong approach or just missing how to access the updated IndexMetadata?

To be honest there is something wrong: zero-downtime reindexing as described in the documentation will only work if your mapping did not change. If your mapping did not change, you can use the rollover API to create the new index without knowing anything about the mapping.

This ticket, and the methodology described in the documentation, will work for periodic reindexing, but definitely are not enough for application updates. I suggest you have a look at https://hibernate.atlassian.net/browse/HSEARCH-2861#icft=HSEARCH-2861 and its comments, where we discussed a few of the problems involved in zero-downtime application updates. The main problem is that old-gen instances of your application should not be allowed to write anything after you created the new index, because they will generate wrong or incomplete documents. Conversely, new instances of your application may not be able to read from the old index, since they have a different metamodel and may assume fields are present while they are not, or assume fields have a different type than they have in the old index.

So, the only "generic", guaranteed-to-work solution here would be to completely separate your applications:

  • Make old-gen applications read-only, and I mean really read-only. They must not write anything, not even to the database, or your indexes will become out-of-sync at best, or will contain invalid data in the worst case.

  • Make new-gen applications write-only. They must not handle search requests, because they don't understand the old mapping anymore.

As to how you will do that... I'll leave the routing of user requests to you, because Hibernate Search simply can't handle that: when it becomes involved in the read or write process, it's already too late. I'll just point out that if you go to such lengths, you may as well assign a different name to your index in the new version of your application (@Indexed(index = "myindex-v2")), so you will have completely separate aliases and indexes for your two applications. You may be able to take advantage of a custom layout strategy that automatically appends the version of your application to your index name and aliases (myindex-appV2-000001, myindex-appV2-read, myindex-appV2-write), but that's about as far as Hibernate Search can help you.

*EDIT*: Created https://hibernate.atlassian.net/browse/HSEARCH-3953#icft=HSEARCH-3953 to clarify the documentation.

Matt Howard June 17, 2020 at 5:14 PM

Is there a way to access the (expected) mapping of a given index? I am attempting to manually implement a zero-downtime reindexing (ES backend) as outlined in the docs: https://docs.jboss.org/hibernate/search/6.0/reference/en-US/html_single/#backend-elasticsearch-indexlayout

But when creating the new index (by calling the ES API directly) I don’t see how to post the index mappings along with the new index.

Create a new index, myindex-000002.

I’m attempting to do this when our mappings change so I can’t just get the existing JSON from the ES API and copy them. I believe I need to use the ES API directly rather than using the SchemaManager in hsearch, because schema management doesn’t like the read/write aliases pointing to different indexes:

HSEARCH400593: Index aliases [test_indexed_entity-write, test_indexed_entity-read] are assigned to a single Hibernate Search index, but they are already defined in Elasticsearch and point to multiple distinct indexes: [test_indexed_entity-000002, test_indexed_entity-000001]

Am I taking the wrong approach or just missing how to access the updated IndexMetadata?

Assignee

Unassigned

Created February 25, 2019 at 10:58 AM
Updated June 4, 2024 at 8:00 AM