MassIndexer with an update mechanism

Description

This feature was already discussed in https://forum.hibernate.org/viewtopic.php?f=9&t=1014063:
It would be great to have an update mechanism instead of an index wipe/rebuild. I have a lot of data (> 17 million rows), which takes a long time (> 2 hours) to index. Reindexing is needed because I don't manipulate the data only through Hibernate. While the MassIndexer is rebuilding the index, searches miss the rows that are not indexed yet, which is not acceptable for me.

Instead of wiping the index and re-adding all rows, update only the changed ones (new, updated, deleted).

The current process is:
1) Wipe out the index
2) Add all entities from the database again, loading and processing them with multiple threads

Step 2) could be replaced by an update operation instead of an add. A new step 3) would then look for entries whose rows have been deleted from the database and remove them from the index too.

Phase 3) is not top priority for me, but it could lead other people to use this approach instead of the wipe/reindex procedure for large datasets. It could perhaps be split into two operations: one that only updates the index (without deletes), and a second that removes from the index the data already deleted from the database.

The whole operation doesn't need to be as fast as the wipe/reindex operation.
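
To make the proposed phases concrete, here is a minimal sketch in plain Java. It is not Hibernate Search code: the "database" and the "index" are both modelled as in-memory maps from entity id to content, and the class and method names (IndexReconciler, updatePhase, purgePhase) are hypothetical. It only illustrates that the update and the delete detection can be two separate operations.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (assumption): database and index are maps from entity id to indexed content.
class IndexReconciler {

    // Phase 2 as an update: write every database row into the index without wiping it first.
    // Returns the number of entries that were new or whose content changed.
    static int updatePhase(Map<Long, String> database, Map<Long, String> index) {
        int changed = 0;
        for (Map.Entry<Long, String> row : database.entrySet()) {
            String previous = index.put(row.getKey(), row.getValue());
            if (!row.getValue().equals(previous)) {
                changed++;
            }
        }
        return changed;
    }

    // Phase 3: remove index entries whose id no longer exists in the database.
    // Returns the number of entries removed.
    static int purgePhase(Map<Long, String> database, Map<Long, String> index) {
        Set<Long> stale = new HashSet<>(index.keySet());
        stale.removeAll(database.keySet());
        index.keySet().removeAll(stale);
        return stale.size();
    }

    public static void main(String[] args) {
        Map<Long, String> database = new HashMap<>();
        database.put(1L, "changed row");
        database.put(2L, "unchanged row");
        database.put(4L, "new row");

        Map<Long, String> index = new HashMap<>();
        index.put(1L, "old content");
        index.put(2L, "unchanged row");
        index.put(3L, "row deleted from the database");

        System.out.println("new or changed: " + updatePhase(database, index));
        System.out.println("purged: " + purgePhase(database, index));
    }
}
```

Between the two phases the index still contains the stale entry for id 3, but every row that exists in the database stays searchable throughout, which is the point of the request.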

Environment

None

Activity

Chris Cranford
July 10, 2013, 5:17 PM

That initial version would certainly be a step in the right direction.

I might have to work around the deletion limitation by treating deletes as updates in the eyes of Hibernate Search, using a status field, and eliminating the deleted entries via application rules applied to the queries, but I expect that might not be very efficient.

Another alternative: after the mass indexer has run in update mode, I could use a ScrollableResults to iterate over the entries with deleted status and remove them from the database. That would trigger the index removal in Hibernate Search, updating the Lucene index too. This would lead to the index bloat you mentioned.

I don't have concrete numbers for the records that would be deleted on a daily basis, so it's hard to say specifically. But I can't imagine that the deletions would exceed several thousand entries, compared to the hundreds of thousands that would be updated or inserted daily. Do you have any recommendations on how to deal with deletions with the least impact on performance?

Chris Cranford
July 11, 2013, 10:00 PM

Would it not be viable to do the rebuild in another directory (className.tmp) and do some file system lock/rename magic once the rebuild is finished? Hibernate Search would continue to use the class name (non-temp) directory until the rebuild is completed. You'd probably need some lock mechanism so that searches block while the MassIndexer renames the directories once the index process has finished, but this should ultimately be a lock that isn't held for more than a few milliseconds.
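
A rough sketch of what I mean, in plain Java NIO rather than anything from Hibernate Search: SwappableIndexDirectory and its methods are hypothetical names, the ".old" suffix is my own choice, and it assumes the live and temporary directories sit on the same file system so that the renames can be atomic. Searches take a shared read lock, and the swap holds the write lock only for the two renames.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hibernate Search.
class SwappableIndexDirectory {
    private final Path live;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    SwappableIndexDirectory(Path live) {
        this.live = live;
    }

    // Searches share the read lock, so they only block while a swap is in progress.
    <T> T search(Function<Path, T> query) {
        lock.readLock().lock();
        try {
            return query.apply(live);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Swaps a fully rebuilt directory into place. The write lock covers only the two renames.
    void swapIn(Path rebuilt) throws IOException {
        Path retired = live.resolveSibling(live.getFileName() + ".old");
        lock.writeLock().lock();
        try {
            if (Files.exists(live)) {
                Files.move(live, retired, StandardCopyOption.ATOMIC_MOVE);
            }
            Files.move(rebuilt, live, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            lock.writeLock().unlock();
        }
        deleteRecursively(retired); // cleanup happens outside the lock
    }

    private static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order lists children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }

    // Small demo: returns the content a search sees after the swap.
    static String demo() {
        try {
            Path base = Files.createTempDirectory("index-swap");
            Path live = Files.createDirectory(base.resolve("Book"));
            Path tmp = Files.createDirectory(base.resolve("Book.tmp"));
            Files.write(live.resolve("segment"), "old".getBytes(StandardCharsets.UTF_8));
            Files.write(tmp.resolve("segment"), "new".getBytes(StandardCharsets.UTF_8));

            SwappableIndexDirectory directory = new SwappableIndexDirectory(live);
            directory.swapIn(tmp);
            return directory.search(dir -> {
                try {
                    return new String(Files.readAllBytes(dir.resolve("segment")), StandardCharsets.UTF_8);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This ignores open index readers and writers, which a real implementation would have to close and reopen around the swap.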

Sanne Grinovero
July 11, 2013, 10:57 PM

"I might have to work around the deletion limitation by treating deletes as updates in the eyes of Hibernate Search, using a status field, and eliminating the deleted entries via application rules applied to the queries, but I expect that might not be very efficient."

Hibernate Search already does something similar: it removes from the result set any elements that do not match a valid entry in the database. That's what I was referring to when I said above "deletions don't affect the actual results but could slow down the queries by bloating the index". You're right that it is not ideal from a performance point of view, but at least it doesn't impact correctness.
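
In principle the filtering amounts to something like the sketch below. This is a simplified illustration and not the actual Hibernate Search implementation: the database is modelled as a map, and StaleHitFilter and resolve are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only (assumption): the database is a map from entity id to entity.
class StaleHitFilter {

    // Resolves index hits against the database, dropping ids that no longer have a row.
    static <T> List<T> resolve(List<Long> hitIds, Map<Long, T> database) {
        List<T> results = new ArrayList<>();
        for (Long id : hitIds) {
            T entity = database.get(id);
            if (entity != null) {
                results.add(entity); // stale index entries are skipped silently
            }
        }
        return results;
    }
}
```

The stale entries still cost index space and a failed lookup per hit, which is the performance concern mentioned above.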

On the rebuild in another Directory: yes, we discussed similar approaches and I'm inclined to agree that it might be the more effective solution. The downside is that it would require specific support in multiple subsystems and will break user extension points like custom IndexManager or DirectoryProvider implementations. I'll need to explore some to see the impact, but indeed it's hard to think of a better strategy.

Sanne Grinovero
July 12, 2013, 1:55 PM

Some thoughts sent to the mailing list:
http://search.jboss.org/#!q=%22HSEARCH-1032%22&project=hibernate&mailList=dev

Yoann Rodière
July 2, 2019, 9:01 AM

Note a PR was submitted for Search 5, but it relied on the native Criteria that are now deprecated and (I think) removed in ORM 6, so another approach is necessary. We should definitely take inspiration from that PR, though: https://github.com/hibernate/hibernate-search/pull/1114

Assignee

Unassigned

Reporter

Marcel

Suitable for new contributors

None

Pull Request

None

Feedback Requested

None

Components

Fix versions

Priority

Major