Reintroduce support for clustered applications

Description

This is delayed until 6.1 on purpose.

See for early experimentation.

There are multiple ways to introduce clustering, but we discussed two main approaches so far:

As in Search 5, clustering at the backend level: process "entity changes" locally, then send "document change" events to a message queue, to be processed by other nodes, each nodes having an assigned index or shard that only this node will handle.
Clustering at the mapper level: directly send "entity changes" events to the message queue, to be processed by other nodes, each nodes having an assigned entity or shard that only this node will handle.

The second approach is probably what we will end up doing, but I'm mentioning both for the sake of completeness.

Messaging technology

A few technologies have been mentioned, but using Kafka would give us a great starting point to add Debezium integration (see ), since Debezium uses Kafka.

Approach 1 (dismissed): document change events

See Search 5. Essentially: build documents where the entity change happens, and then route the document to the correct node.

One problem common to all these solutions is that two concurrent transactions might impact the same document, because of @IndexedEmbedded... So when we get two "document changes" event, the acceptable behavior might not be to "pick a winner", but just to "merge the events", which may be impossible to do at that point (lacking the information about entity changes).

See

Approach 2: entity change events

Instead of sharing indexing events, e.g. “add this document that I already generated”, share entity events, e.g. “indexed entity MyType with ID 1 should be added”.
On the other end of the queue, consume the events by loading the entities, processing them, and indexing the resulting document.

Compared to the “document change event” solution, this makes more of the indexing process asynchronous, potentially reducing the overhead of Hibernate Search for the user. In a web application for example, this could shave off some time from the processing of each HTTP request.

In particular, a single entity change might trigger lots of reindexing works in these cases:

When the entity is embedded in lots of other indexed entities
When the entity is embedded in a few very heavy indexed entities, which will trigger the loading of many other entities during their reindexing.
When we process entities within the same transaction that made the changes, having to perform all these additional loads might end up delaying the transaction (or the post-transaction synchronization) for quite a long time. Reducing the loads to just what’s necessary to determine which entities must be reindexed might provide a big improvement in these cases.

Pros:

Solves the concurrency issues mentioned in solution 1 in a very simple way: if we get two "entity changes" events in a row, even in the wrong order, the second one will be processed and trigger reindexing based on the current database state, and as a result it will always produce an up-to-date document taking into account both changes.
Reduces the latency in user applications (not indexing latency, but actual HTTP request processing latency): we don't bloat user transactions with additional database loading for indexing.
Reduces the risk of OOM errors in big reindexing events: we can control when to clear and flush a session, which we cannot do when we’re using the user’s Session. We could flush and clear between each indexing, or every 10 of them, … We’re free!
Might be improved even further by moving the processing of “contained” types to the queue, too.
In a “hot-update” scenario (see ), where we want to process each event twice (once with the old mapping, once with the new one), we would only have to duplicate the queue (send events to both the old and new nodes); everything else would be business as usual.

Cons:

Requires asynchronous processing to make sense: waiting for the queued event to be consumed would make no sense.
Requires DB access from the “processing” nodes. This might not be a big deal in most architectures, though, because users are likely to use their application nodes as processing nodes, and to not have dedicated “processing only” nodes.
Probably increases database usage, as we cannot benefit from the first-level session cache anymore.

Attachments

Linked issues

duplicates

HSEARCH-3280

Simple, asynchronous, distributed reindexing relying on the database only (without any extra infrastructure)

relates to

HSEARCH-2364

Fully asynchronous automatic indexing

Activity

Yoann RodièreMay 31, 2021 at 8:34 AM

Closing as duplicate of to simplify tracking. We will track everything related to async, distributed automatic indexing on from now on.

Duplicate

Details
Assignee
Yoann Rodière
Reporter
Yoann Rodière
Priority
Major

Created August 28, 2018 at 6:35 AM

Updated May 31, 2021 at 8:34 AM

Resolved May 31, 2021 at 8:34 AM