FileNotFoundException during slave synchronization with source

Description

The current file synchronization mechanism in the FSSlaveDirectoryProvider has the potential to throw a FileNotFoundException if the source file it's copying is deleted before the copy operation completes.

https://forum.hibernate.org/viewtopic.php?f=9&t=1007801&p=2437680#p2437680

Activity

Yoann Rodière September 26, 2023 at 9:37 AM

Hello,

This issue was reported against Hibernate Search 5.x or earlier, and doesn't seem to be relevant to newer Hibernate Search versions (6.x+).

In order to focus on newer versions, we are going to close this issue.

If you are still affected by this issue on Hibernate Search 6.0 or later, and have a reproducer, please comment here or reach out to us so we can reopen the issue.

If you are still affected by this issue on Hibernate Search 5.x, and want to provide a fix, please comment here or reach out to us so we can work out the next steps with you.

Cheers,
Yoann

Michael Mogley October 29, 2010 at 8:02 PM

I would love to contribute a patch. Unfortunately, I don't have bandwidth for this at the moment. But here's what I'll say. The solution needs to be blackboxed. I agree any good application needs monitoring, but I don't want to have to understand the details of the Hibernate Search syncing algorithm. I just want it to work. I propose a couple of solutions.

1) Implement directory locking. To get around the issue of a failed slave not giving up the lock, I would implement leased locking, whereby the slave leases the lock for a time it specifies (could be configurable). The lease expiration time could be embedded in the lock file. A thread on the master would periodically check existing locks and forcibly release them if expired.

2) Instead of refreshing on an interval, refresh on a cron schedule. I've actually already implemented a modified FSMaster/SlaveDirectoryProvider to do this. This allows me to guarantee that both master/slave refresh times are scheduled around each other, and does not depend on when either one was started.
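
The leased locking idea in 1) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class LeasedLock, its method names, and the choice to store the expiry as epoch milliseconds in the lock file are all hypothetical and not part of Hibernate Search.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Hibernate Search.
final class LeasedLock {

    private LeasedLock() {
    }

    // Slave side: try to take the lock. The lease expiry (epoch millis) is the file content.
    static boolean tryAcquire(Path lockFile, Duration lease, Instant now) throws IOException {
        byte[] expiry = Long.toString(now.plus(lease).toEpochMilli())
                .getBytes(StandardCharsets.UTF_8);
        try {
            // CREATE_NEW fails if the file already exists, so only one caller can win.
            Files.write(lockFile, expiry, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    // Master side: delete the lock if its lease has expired. Returns true if it was released.
    static boolean releaseIfExpired(Path lockFile, Instant now) throws IOException {
        final String content;
        try {
            content = new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8).trim();
        } catch (NoSuchFileException e) {
            return false; // no lock to release
        }
        final long expiry;
        try {
            expiry = Long.parseLong(content);
        } catch (NumberFormatException e) {
            // Empty or half-written lock file: leave it alone and check again later.
            return false;
        }
        if (now.toEpochMilli() >= expiry) {
            return Files.deleteIfExists(lockFile);
        }
        return false;
    }
}
```

A periodic task on the master would call releaseIfExpired for each known lock file, so a crashed slave can block the master for at most one lease period. Passing the current time in as a parameter keeps the sketch easy to test; clock skew between master and slave is not handled here.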

Hardy Ferentschik October 29, 2010 at 5:25 PM

Since I haven't seen this mentioned - the problem only occurs on start(). If this situation occurs after the directory is started the error is logged and the slave index directory does not get switched - skipping one synchronization. On start() the assumption is that the slave has to be able to successfully get an initial copy of the index. Depending on when master and slave are getting started this could even happen with "reasonable" refresh properties.
One solution would be to add some retry operation for start(). Is it worth it? Maybe it is better to fail early. In a proper application setup you would (hopefully) have some sort of monitoring/startup harness around your apps.
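
For illustration, a bounded retry around the initial copy could look something like the sketch below. The names InitialCopyRetry, IoAction and runWithRetry are made up for this example, and the initial copy itself is passed in as a callback because the actual copy code in FSSlaveDirectoryProvider is not shown in this issue.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.NoSuchFileException;

// Hypothetical helper, not part of Hibernate Search.
final class InitialCopyRetry {

    // Stand-in for whatever performs the initial index copy.
    @FunctionalInterface
    interface IoAction {
        void run() throws IOException;
    }

    private InitialCopyRetry() {
    }

    // Runs the copy up to maxAttempts times, pausing between attempts,
    // and retries only when a source file has gone missing.
    static void runWithRetry(IoAction initialCopy, int maxAttempts, long pauseMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                initialCopy.run();
                return; // copy succeeded
            } catch (FileNotFoundException | NoSuchFileException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        // Still fail early, but only after the configured number of attempts.
        throw new IOException(
                "Initial index copy failed after " + maxAttempts + " attempts", last);
    }
}
```

Keeping maxAttempts small preserves the fail-early behaviour while tolerating a single unlucky overlap with a master refresh.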

Sanne Grinovero October 29, 2010 at 10:54 AM

The workaround seems easy enough to give this a very low priority, and it is actually also good practice to have a master refresh time lower than the slave (pointless to create indexes which aren't consumed). I agree that postponing this shouldn't be a problem.
Michael? If you need this fixed we can give pointers and advice, but you should propose a patch.
In any case I'd avoid using locks or marker files from the slave, as a crashing slave would prevent the whole cluster from making any progress.
An easy improvement that could be applied is to improve the error only: make sure to catch this exception (or check for file existence before copying) and have the slave retry - but even then a very loud error should be logged, as there's a very high likelihood that in such a case the client will loop in the error. Actually I think it should retry already, so the patch should just throw a more explicit error message.
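
As a rough sketch of what a more explicit error could look like (ExplicitCopyError and copyIndexFile are hypothetical names, and the use of java.nio.file.Files.copy is an assumption made for the example, not the provider's actual copy code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Hibernate Search.
final class ExplicitCopyError {

    private ExplicitCopyError() {
    }

    // Copies one index file and turns a vanished source into a descriptive error.
    static void copyIndexFile(Path source, Path target) throws IOException {
        try {
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (NoSuchFileException e) {
            if (Files.notExists(source)) {
                // The source was deleted between listing and copying.
                throw new IOException("Index file " + source
                        + " disappeared from the source directory during synchronization;"
                        + " check the master and slave refresh periods", e);
            }
            throw e; // some other path is missing, e.g. the target's parent directory
        }
    }
}
```

The caller can then log this message loudly and skip or retry the synchronization, rather than surfacing a bare FileNotFoundException.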

Also keep in mind that new clustering strategies are coming soon:

Emmanuel Bernard October 29, 2010 at 10:22 AM

In all transparency, if nobody steps up and proposes a pull request ( http://github.com/hibernate/hibernate-search ) or a patch, this one is unlikely to make it into 3.3.0.
The window of possibilities is a bit too small to make this bug fix a priority (needs two sync periods before being doable).

Out of Date

Created October 28, 2010 at 10:33 PM
Updated September 26, 2023 at 9:37 AM
Resolved September 26, 2023 at 9:37 AM