Pass the number of entities to index to the monitor right from the start

Description

We currently pass the number of entities to index to the monitor in several steps: each time we start processing another entity type, we perform another call to MassIndexerProgressMonitor#addToTotalCount, and we expect users to update the total.

The problem is that, depending on the settings and the workload, we could theoretically end up updating the total a few hours after indexing started. In such a situation, any attempt to use the total to give an idea of progress is doomed. Think about it: progress would dive from 90% to 50% as soon as we start processing the next entity type.
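To make the arithmetic concrete, here is a minimal sketch of the drop. The class below is a hypothetical stand-in for a monitor, not the Hibernate Search interface, and the entity counts (1000 and 800) are made-up numbers chosen to reproduce the 90% to 50% dive.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a progress monitor; only the addToTotalCount
// idea is borrowed from MassIndexerProgressMonitor.
class IncrementalTotalDemo {
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong done = new AtomicLong();

    // Called each time processing of another entity type starts.
    void addToTotalCount(long count) {
        total.addAndGet(count);
    }

    // Called as entities get indexed (assumed callback name).
    void entitiesIndexed(long count) {
        done.addAndGet(count);
    }

    // Integer percentage of done/total; 0 while no total is known.
    long percent() {
        long t = total.get();
        return t == 0 ? 0 : done.get() * 100 / t;
    }

    public static void main(String[] args) {
        IncrementalTotalDemo monitor = new IncrementalTotalDemo();
        monitor.addToTotalCount(1000);  // first entity type starts: total = 1000
        monitor.entitiesIndexed(900);
        System.out.println(monitor.percent() + "%"); // 90%
        monitor.addToTotalCount(800);   // second entity type starts: total = 1800
        System.out.println(monitor.percent() + "%"); // 50%
    }
}
```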

And yes, the problem actually affected at least one person: https://stackoverflow.com/questions/46936481/how-to-get-the-statistics-of-hibernate-lucene-index-creation-in-gui/46953465#comment81088171_46953465

I think that, at the very least, we should make sure to compute the total numbers when we start processing, regardless of how many types we were instructed to index in parallel.

Sure, we would run into consistency issues, because we would necessarily compute the total in a different transaction than the one in which we would retrieve the entities (so the final number of entities indexed could differ from the initially advertised number), but I think this would still be better than the current situation. Users can deal with a progress bar at 97% when we call MassIndexerProgressMonitor#indexingCompleted, or with progress going over 100%, especially if the errors are small. But the progress going from 90% to 50% is really an issue.

Maybe we could add an option to let the user choose between two strategies? Or even better: we could somehow amend the total during indexing, calling for example monitor.addToTotalCount( -1 ) if, after we open the "indexing" transaction, we realize we will index one less entity than expected.
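A rough sketch of the "announce everything upfront, then amend" idea follows. Everything here is an assumption for illustration: the class, the correctEstimate helper and the counts are hypothetical, and only the notion of calling addToTotalCount with a negative delta comes from the description above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical monitor sketch; not the Hibernate Search implementation.
class UpfrontTotalDemo {
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong done = new AtomicLong();

    // Accepts negative values so the total can be amended downwards.
    void addToTotalCount(long delta) {
        total.addAndGet(delta);
    }

    void entitiesIndexed(long count) {
        done.addAndGet(count);
    }

    long total() {
        return total.get();
    }

    long percent() {
        long t = total.get();
        return t == 0 ? 0 : done.get() * 100 / t;
    }

    // Called once the "indexing" transaction for a type is open and the
    // real count is known; pushes only the difference to the total.
    void correctEstimate(long estimated, long actual) {
        long delta = actual - estimated;
        if (delta != 0) {
            addToTotalCount(delta);
        }
    }

    public static void main(String[] args) {
        UpfrontTotalDemo monitor = new UpfrontTotalDemo();
        long estimateA = 1000;
        long estimateB = 800;
        // Both totals are announced before any entity is processed.
        monitor.addToTotalCount(estimateA + estimateB);

        // Type A: one entity was deleted between the count and the indexing transaction.
        monitor.correctEstimate(estimateA, 999); // same effect as addToTotalCount(-1)
        monitor.entitiesIndexed(999);

        // Type B: the estimate was accurate, so nothing is amended.
        monitor.correctEstimate(estimateB, 800);
        monitor.entitiesIndexed(800);

        System.out.println(monitor.total());         // 1799
        System.out.println(monitor.percent() + "%"); // 100%
    }
}
```

With this approach the total only moves by the size of the estimation error, so progress can wobble slightly but never drops by a whole entity type's worth.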

Environment

None

Activity

Sanne Grinovero
November 6, 2017, 12:40 PM

One problem with these "progress bar" approaches, and with our current implementation, is that in many databases the full SQL COUNT we have to perform in advance will often take as long as the whole process itself.
The reason is that, to implement the isolation semantics expected of a standard COUNT, the database needs to somehow iterate over all the data; fetching the data at the same time only adds a small additional cost, which is often negligible. So we do that first, and then repeat it for the iteration, doubling the total time just to have a reasonable estimate halfway through; we could be done by that time.

When thinking about these evolutions, I think we need to take into consideration an alternative approach which skips the count altogether. For many of my personal experiments I regularly find myself commenting out some code; obviously there could be better alternatives.

Yoann Rodière
November 15, 2017, 12:34 PM

The reason is that, to implement the isolation semantics expected of a standard COUNT, the database needs to somehow iterate over all the data; fetching the data at the same time only adds a small additional cost, which is often negligible. So we do that first, and then repeat it for the iteration, doubling the total time just to have a reasonable estimate halfway through; we could be done by that time.

I doubt finding and fetching a million rows over the network would cost the same as finding them and just returning the count... Especially when not using any WHERE clause, or when relying on database indexes in the WHERE clause.
But I agree the extra cost may bother some users, so let's not force it on them.

When thinking about these evolutions, I think we need to take into consideration an alternative approach which skips the count altogether. For many of my personal experiments I regularly find myself commenting out some code; obviously there could be better alternatives.

Right... Two ideas (non-exclusive):

  • we could just skip unnecessary counts when no monitor is provided by the user

  • we could, as I suggested in the ticket description, add switches so that users select the "monitoring strategy" explicitly.
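The two ideas above could be combined along these lines. This is only a sketch under stated assumptions: neither the CountStrategy enum nor the initialTotal method exists in Hibernate Search, and the suppliers stand in for the actual count queries.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: none of these names exist in Hibernate Search.
class MonitoringStrategyDemo {

    enum CountStrategy {
        UPFRONT,   // count every type before indexing starts
        PER_TYPE,  // current behavior: count each type when its processing starts
        NONE       // never count; the monitor gets no total at all
    }

    // Total to announce before indexing starts. The count queries are only
    // executed when a monitor is present and the strategy asks for upfront counts.
    static long initialTotal(CountStrategy strategy, boolean monitorProvided,
            LongSupplier... countQueries) {
        if (!monitorProvided || strategy != CountStrategy.UPFRONT) {
            return 0L;
        }
        long total = 0L;
        for (LongSupplier query : countQueries) {
            total += query.getAsLong();
        }
        return total;
    }

    public static void main(String[] args) {
        LongSupplier countA = () -> 1000L; // stands in for a COUNT query on type A
        LongSupplier countB = () -> 800L;  // stands in for a COUNT query on type B
        System.out.println(initialTotal(CountStrategy.UPFRONT, true, countA, countB));  // 1800
        System.out.println(initialTotal(CountStrategy.NONE, true, countA, countB));     // 0
        System.out.println(initialTotal(CountStrategy.UPFRONT, false, countA, countB)); // 0
    }
}
```

Passing the counts as suppliers means the potentially expensive queries never run when the user opted out or provided no monitor.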

Assignee

Unassigned

Reporter

Yoann Rodière

Labels

None

Suitable for new contributors

None

Pull Request

None

Feedback Requested

None

Components

Fix versions

Priority

Major