JSR-352: Allow selecting the entities to be re-indexed through an HQL/JPQL query

Description

https://github.com/mincong-h/gsoc-hsearch/issues/146
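
For context, here is a usage sketch of the feature. The builder method names follow the hibernate-search-jsr352 API as later integrated (MassIndexingJob.parameters(), forEntity(...), restrictedBy(...)) and may differ in this prototype; Company is a placeholder entity type.

{code:java}
import java.util.Properties;

import javax.batch.operations.JobOperator;
import javax.batch.runtime.BatchRuntime;

import org.hibernate.search.jsr352.massindexing.MassIndexingJob;

public class RestrictedMassIndexingExample {

    public static void main(String[] args) {
        // Re-index only the entities matching a user-provided HQL/JPQL query.
        // Company is a placeholder entity type; restrictedBy(String) is the
        // assumed name of the HQL restriction option discussed in this issue.
        Properties parameters = MassIndexingJob.parameters()
                .forEntity( Company.class )
                .restrictedBy( "select c from Company c where c.name like 'Google%'" )
                .build();

        // Start the mass indexing job through the standard JSR-352 job operator.
        JobOperator jobOperator = BatchRuntime.getJobOperator();
        long executionId = jobOperator.start( MassIndexingJob.NAME, parameters );
        System.out.println( "Mass indexing job started, execution ID: " + executionId );
    }
}
{code}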

Remaining issues:

  1. Shouldn't we expect the HQL/Criteria query to produce an ordered list of IDs instead of what's done currently? It would in particular allow partitioning, and it doesn't seem much harder to use. => Actually no, that's a bad idea, since we're not sure all IDs within the resulting bounds would be relevant. On the other hand, if we passed an offset and a limit to partitions, instead of passing a first and last ID, we would solve a lot of the practical issues we're having (partitioning for HQL/Criteria, support for embedded IDs, ...). But maybe there was a particular reason to do it that way?

  2. Can this work with checkpoints? We should fix HSEARCH-2616, add checkpoints for the HQL and Criteria cases, and test them.

  3. The "maxResults" parameter is questionable:

    • why would we only use it when HQL/Criteria is used? Couldn't we simply have something similar to org.hibernate.search.MassIndexer.limitIndexedObjectsTo(long)?

    • why is the limit arbitrarily set to 1 million by default? This could come as a surprise to users.

  4. The following limitations should be documented:

    • There's no query validation before the job starts. If the query is invalid, the job will fail while building the partition plan in the 2nd step, produceLuceneDoc.

    • Partitioning is disabled in the HQL approach. Why? Because parallel execution requires a selection ordered by entity ID (through Criteria or HQL), so that the PartitionMapper can scroll over this selection projected on the ID and split it into multiple sub-selections, each limited by a lower bound and an upper bound. In the HQL approach, however, the query is provided by the user, so no particular order is guaranteed (see the sketch below).
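
To make that constraint concrete, here is a rough sketch (not the actual PartitionMapper code) of how ID intervals can be derived from an ID-ordered scroll; Company and the id property name are placeholders.

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;

public class IdBoundsSketch {

    /** A [lowerBound, upperBound] ID interval assigned to one partition. */
    public static class IdInterval {
        final Object lowerBound;
        final Object upperBound;
        IdInterval(Object lowerBound, Object upperBound) {
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
        }
    }

    /**
     * Scrolls over an ID-ordered projection and records one ID interval every
     * rowsPerPartition rows. This only works because the selection is ordered
     * by ID; with an arbitrary user-provided HQL query there is no such
     * guarantee, hence partitioning is disabled in that case.
     */
    public static List<IdInterval> computeIntervals(Session session, int rowsPerPartition) {
        List<IdInterval> intervals = new ArrayList<>();
        ScrollableResults ids = session
                .createQuery( "select c.id from Company c order by c.id" )
                .scroll( ScrollMode.FORWARD_ONLY );
        Object lowerBound = null;
        Object lastId = null;
        int rowsInCurrentPartition = 0;
        while ( ids.next() ) {
            lastId = ids.get( 0 );
            if ( lowerBound == null ) {
                lowerBound = lastId;
            }
            if ( ++rowsInCurrentPartition == rowsPerPartition ) {
                intervals.add( new IdInterval( lowerBound, lastId ) );
                lowerBound = null;
                rowsInCurrentPartition = 0;
            }
        }
        if ( lowerBound != null ) {
            // Close the last, possibly smaller, partition.
            intervals.add( new IdInterval( lowerBound, lastId ) );
        }
        return intervals;
    }
}
{code}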

Activity

Mincong Huang, July 14, 2017 at 9:16 PM
Edited

To summarize the situation, we need to:

  • Document the current limitations of the HQL/JPQL approach, where:

    • The parallelism is disabled (only a single partition is used);

    • There's no query validation;

    • The query order is not guaranteed;

    • Checkpoints are ignored: in case of a job restart, we don't restart from the last checkpoint but from the very beginning.

  • Keep the current HQL implementation, which uses a single partition.

  • The current implementation does not work with checkpoints, and I don't know how to achieve it... I suppose the only solution is to intercept the HQL and add the checkpoint to the WHERE clause of the query (see the sketch after this list).

  • Ensure the ability to restart the job correctly under HQL/JPQL. The current implementation creates duplicate Lucene documents in case of restart, because there's no purge of the indexed documents and checkpoints do not work. My proposal is to use UpdateLuceneWork instead of AddLuceneWork as a workaround.

  • Clarify the questionable parameter maxResults:

    • Explain that it is NOT only used for customized index scopes (HQL/Criteria), but applies to all index scopes (Full/HQL/Criteria).

    • There is no limit by default.
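
To illustrate the idea of intercepting the HQL and adding the checkpoint to the WHERE clause, here is a naive sketch. It only works for trivial single-alias queries without an ORDER BY clause; a general solution would need to actually parse and rewrite the HQL.

{code:java}
import java.util.Locale;

import org.hibernate.Session;
import org.hibernate.query.Query;

public class CheckpointRestartSketch {

    /**
     * Naive sketch of restarting a user-provided HQL selection from a
     * checkpointed ID: it simply appends a predicate on the entity alias and
     * an ID ordering. This only works for simple queries of the form
     * "select c from Company c [where ...]"; anything more complex would
     * require real HQL parsing.
     */
    public static Query<?> restartFromCheckpoint(Session session, String userHql,
            String entityAlias, Object checkpointId) {
        // Append the checkpoint predicate with "where" or "and" as appropriate.
        String connector = userHql.toLowerCase( Locale.ROOT ).contains( " where " ) ? " and " : " where ";
        String restartedHql = userHql
                + connector + entityAlias + ".id > :checkpointId"
                + " order by " + entityAlias + ".id";
        return session.createQuery( restartedHql )
                .setParameter( "checkpointId", checkpointId );
    }
}
{code}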

Yoann Rodière, May 15, 2017 at 8:16 AM

The approach of using "an offset + a limit to partitions" cannot guarantee the correct coverage of the indexation

Ok, thanks for the explanation. The concurrent insertion issue could probably be worked around with a final, "no-limit" partition, but the concurrent deletion would indeed be problematic.
We may still have similar issues with the approach "one ID interval per partition", most notably with concurrent deletions, but at least those issues do not present the risk of "cascading" to other, unmodified elements.

Anyway... I'm out of ideas. Let's just document the limitation (if it hasn't been documented already).

Also, about maxResults: agreed, a bit of documentation is required. In particular, it seems the limit is applied per entity type (e.g. with maxResults=100 and 2 entity types, you could end up indexing 200 entities), and that should be clear from the documentation.

Mincong Huang, May 13, 2017 at 4:50 PM
Edited

... if we passed an offset and a limit to partitions, instead of passing a first and last ID, we would solve lots of practical issues we're having... (partitioning for HQL/Criteria, support for embedded IDs, ...). But maybe there was a particular reason to do it that way?

The approach of using "an offset + a limit to partitions" cannot guarantee the correct coverage of the indexation. If there's any change in the database after the start of the job execution, e.g. an insertion or a deletion, the indexation will target the wrong ranges and lead to missing or duplicate data.

Here's an example: we want to index 3,000 entities with IDs from 1 to 3000 using 3 partitions (1,000 rows per partition). The offsets are respectively 0, 1000, and 2000, and the limit of each partition is set to 1000 rows (a sketch of such offset-based partition queries follows the list below). Here are some of the cases:

  • If everything goes well, each partition will index its range correctly.

    • the 1st partition will index the range [1, 1000]

    • the 2nd partition will index the range [1001, 2000]

    • the 3rd partition will index the range [2001, 3000]

  • If the row with ID=500 is deleted from the database:

    • the 1st partition will index the range [1, 1001] (excluding the deleted ID=500)

    • the 2nd partition will index the range [1001, 2000]; the entity with ID=1001 is indexed twice.

    • the 3rd partition will index the range [2001, 3000]

  • If a row with ID=500A is inserted into the database (supposing that's possible):

    • the 1st partition will index the range [1, 999], because 500A takes one place and the 1000-row limit ends at ID=999.

    • the 2nd partition will index the range [1000, 1999]

    • the 3rd partition will index the range [2000, 2999]; the entity with ID=3000 is missed.
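
For reference, here is roughly what the offset-based partition queries in this example would look like (a sketch; Company is a placeholder entity). The offsets select row positions rather than IDs, which is exactly why a concurrent insertion or deletion shifts the ranges.

{code:java}
import java.util.List;

import org.hibernate.Session;

public class OffsetPartitionSketch {

    /**
     * Loads the slice of entities assigned to one partition using an offset
     * and a limit. The slice is defined by row positions, not by IDs, so a
     * concurrent insert or delete shifts every subsequent row and leads to
     * duplicate or missing entities, as described in the example above.
     */
    public static List<?> loadPartition(Session session, int partitionIndex, int rowsPerPartition) {
        return session
                .createQuery( "select c from Company c order by c.id" )
                .setFirstResult( partitionIndex * rowsPerPartition ) // offset: 0, 1000, 2000, ...
                .setMaxResults( rowsPerPartition )                   // limit: 1000 rows per partition
                .list();
    }
}
{code}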

The "maxResults" parameter is questionable:
why would we only use it when HQL/Criteria is used? Couldn't we simply have something similar to org.hibernate.search.MassIndexer.limitIndexedObjectsTo(long)?

Actually, the "maxResults" parameter is not only used in partial indexation (the HQL or Criteria approach). It is also used in full indexation: for each partition, the full indexation approach restricts the indexation range with a lower bound and an upper bound. The boundaries are added as criteria, which is why we have the impression that the parameter is only used for HQL/Criteria; that impression is incorrect (see the sketch after the TODO list below). In order to clarify the concept, we need to address the following TODO list:

  • Refactor the coding logic in org.hibernate.search.jsr352.massindexing.impl.steps.lucene.EntityReader

  • Add some Javadoc if needed

  • Correct the documentation in manual-index.asciidoc.
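
As a rough illustration of the point above (not the actual EntityReader code; "id" is an assumed ID property name), the full indexation path reads each partition through a Criteria restricted by the partition's ID bounds, with maxResults applied on top of it.

{code:java}
import java.util.List;

import org.hibernate.Criteria;
import org.hibernate.Session;
import org.hibernate.criterion.Order;
import org.hibernate.criterion.Restrictions;

public class BoundedPartitionReadSketch {

    /**
     * Reads one partition of the full indexation scope: the partition's
     * [lowerBound, upperBound] ID interval is expressed as criteria, and
     * maxResults limits the number of rows read. This is why the parameter
     * affects the full scope as well as the HQL/Criteria scopes.
     */
    public static List<?> readPartition(Session session, Class<?> entityType,
            Object lowerBound, Object upperBound, int maxResults) {
        Criteria criteria = session.createCriteria( entityType )
                .add( Restrictions.ge( "id", lowerBound ) )
                .add( Restrictions.le( "id", upperBound ) )
                .addOrder( Order.asc( "id" ) );
        if ( maxResults > 0 ) {
            criteria.setMaxResults( maxResults );
        }
        return criteria.list();
    }
}
{code}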

why is the limit arbitrarily set to 1 million by default? This could come as a surprise to users.

Yes, you're right. This is not a good value. Let's discuss and handle it in HSEARCH-2707.

Fixed

Details


Created March 2, 2017 at 3:51 PM
Updated December 3, 2024 at 11:53 AM
Resolved July 17, 2017 at 2:51 PM
