Query based approach for reindexing resolution
Description
Activity

Yoann Rodière March 28, 2024 at 3:42 PM
Actually this would still be useful when a given entity is indexed-embedded in many (say, 500k) other entities. Then:
Modeling the inverse side of the indexed-embedded association just doesn’t make sense: it cannot be loaded in memory anyway.
A change to the embedded entity leads to 500k entities being reindexed; the only way we could process this would be through a query that loads the ids of the “containing entities” and reindexes them in batches, taking great care to flush/clear any sessions involved between batches.
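The batched processing described above could look roughly like the following sketch. It is plain Java with no Hibernate dependency, and every name in it (`ChunkedReindexSketch`, `IdPageLoader`, `reindexInBatches`, `rangeLoader`) is hypothetical: the page loader stands in for a query returning ids of containing entities in ascending order, and the `flushAndClear` callback stands in for whatever flush/clear calls the real session would need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ChunkedReindexSketch {

    /** Hypothetical stand-in for a query: up to 'limit' ids greater than 'afterId', ascending. */
    interface IdPageLoader {
        List<Long> loadIdsAfter(long afterId, int limit);
    }

    /**
     * Loads ids page by page (keyset pagination), hands each page to the indexer,
     * then runs flushAndClear so that memory use stays bounded by the batch size.
     * Returns the total number of ids processed.
     */
    static long reindexInBatches(IdPageLoader loader, int batchSize,
            Consumer<List<Long>> indexer, Runnable flushAndClear) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long lastId = Long.MIN_VALUE;
        long total = 0;
        while (true) {
            List<Long> page = loader.loadIdsAfter(lastId, batchSize);
            if (page.isEmpty()) {
                return total;
            }
            indexer.accept(page);   // reindex this batch only
            flushAndClear.run();    // assumption: releases whatever the batch loaded
            total += page.size();
            lastId = page.get(page.size() - 1);
        }
    }

    /** In-memory loader over ids 1..max, used only to exercise the loop. */
    static IdPageLoader rangeLoader(long max) {
        return (afterId, limit) -> {
            List<Long> page = new ArrayList<>();
            long start = Math.max(afterId, 0) + 1;
            for (long id = start; id <= max && page.size() < limit; id++) {
                page.add(id);
            }
            return page;
        };
    }
}
```

Paging by "ids greater than the last one seen" rather than by offset is a design choice of this sketch, not something the ticket prescribes; it keeps each page query cheap even deep into a 500k-row result.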
See the conversation we had with :
Yoann Rodiere: Note that HSEARCH-1937 alone won't fix the problem, as we'll still be loading 500k entities in memory.
However, if we coupled this to the outbox-polling strategy... then, when we execute the query to resolve entities to reindex, we would be able to clear/flush the session periodically! And it becomes feasible memory-wise.
We'd be creating 500k events in the outbox table though xD
Yoann Rodiere: We could consider marking such associations as "too big" [EDIT: e.g. ReindexOnUpdate.BATCH]. We'd have a separate table where we store the type and ID of indexed-embedded entities that changed (ObjectB), we'd have some sort of periodic, automatic mass indexing that would reindex only the corresponding containing entities (ObjectA). That could do the trick, since outbox-polling event processors nicely stop processing events when we start mass indexing.
Yoann Rodiere: Or we leave mass indexing out of this and let outbox event processors acquire an "exclusive lock" on a given entity type, forbidding other processors to process a given entity type (ObjectA) just the time to go through all results of that query and to reindex all affected instances, using flush/clear as necessary to not use too much memory. That could work too.
Marko Bekhta: hmm, yeah, with such huge lists, the massindexing batching is nice and is already there :see_no_evil: so it's like massindexer-with-a-condition
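
The separate table of changed indexed-embedded entities mentioned in the conversation above could be pictured as follows. This is only an in-memory sketch under stated assumptions: `PendingChangesSketch`, `recordChange` and `drain` are hypothetical names, and a real implementation would presumably use a database table rather than a map. The point it illustrates is that one row per changed embedded entity (ObjectB) is recorded, instead of one outbox event per containing entity (ObjectA).

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class PendingChangesSketch {

    // Hypothetical stand-in for the separate table: changed ids grouped by entity type name.
    private final Map<String, Set<Long>> changedIdsByType = new LinkedHashMap<>();

    /** Records a change; repeated changes to the same entity occupy a single entry. */
    synchronized void recordChange(String entityType, long id) {
        changedIdsByType.computeIfAbsent(entityType, k -> new LinkedHashSet<>()).add(id);
    }

    /** Removes and returns everything recorded so far, for a periodic job to process. */
    synchronized Map<String, Set<Long>> drain() {
        Map<String, Set<Long>> snapshot = new LinkedHashMap<>(changedIdsByType);
        changedIdsByType.clear();
        return snapshot;
    }
}
```

A periodic job would call `drain()` and then run a conditional, batched reindexing of the containing entities for each drained id.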

Yoann Rodière March 15, 2023 at 2:23 PM
Note that, with , use cases that actually require explicitly writing down the query would become quite rare.
Also, in Hibernate Search 6 we don't have @ContainedIn anymore, so I think the most obvious place to fit that feature would be @AssociationInverseSide (see here). Maybe @AssociationInverseSide(query = "...")?
A likely use case would be associations with @Where/@WhereJoinTable (see here)… Maybe?
Emmanuel Bernard July 28, 2015 at 6:33 AM
Comes from this discussion https://forum.hibernate.org/viewtopic.php?f=9&t=1040726&p=2486062#p2486062
Today, people need to create an association to go from an embedded entity B to the containing entity A. This is necessary for us to know which instance of A contains the instance of B.
This is sometimes undesirable as the association is not necessary for the application. This is particularly true of ToMany associations.
An alternative approach would be to let the user express a query instead of materializing an association.
I think that's a not-too-complicated feature that could be done by a community member.
Yoann: it is complicated, because the query approach makes little sense if we don't provide a way to "chunk" reindexing, i.e. a way to retrieve a batch of entities from the query, reindex them, flush and clear the session, continue to the next batch, etc. If we don't do that, this query mechanism will only make sense for associations with a small cardinality (an association with 8000 linked entities is a no-no). But such a chunked reindexing is hard (impossible?) to implement properly:
we would ideally want it to read from the user session's cache, but never, ever write to the user session's cache (in particular the clear() calls should not remove entities that the user expects to be in the session).
depending on the number of entities to reindex, the reindexing process could take a lot of time. We may want to make reindexing happen in a background process instead of the user session.
In conclusion, it's likely that such a feature only makes sense if we implement asynchronous processing of entity change events (). Since integrating the two features would likely require API changes, I'd rather work on this ticket after is solved.
We might want to abstract away from a query string and use an interface / implementation so that Hibernate Search working in non ORM environment can still benefit from this approach. I haven't thought much about this abstraction.
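
One possible shape for such an abstraction is sketched below. Nothing here is an existing Hibernate Search API: `ContainingEntityResolver`, `resolveContaining` and `fromMap` are hypothetical names, and the paging parameters are an assumption added so that large result sets can be consumed in chunks. An ORM-based implementation could run a query, while a non-ORM one could use any other lookup, as the map-backed example does.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ResolverSketch {

    /**
     * Hypothetical abstraction: given the id of a changed contained entity,
     * return one page of ids of the entities that contain it.
     */
    interface ContainingEntityResolver<C, I> {
        List<I> resolveContaining(C containedId, int offset, int limit);
    }

    /** Example implementation backed by an in-memory map instead of a query. */
    static ContainingEntityResolver<Long, Long> fromMap(Map<Long, List<Long>> containingIdsByContainedId) {
        return (containedId, offset, limit) ->
                containingIdsByContainedId.getOrDefault(containedId, List.of()).stream()
                        .skip(offset)
                        .limit(limit)
                        .collect(Collectors.toList());
    }
}
```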