Skip non-competitive hits when running a MatchAllDocsQuery with descending score sort with Lucene
Description
Currently, when we run a MatchAllDocsQuery with Lucene (i.e. the user requested a matchAll predicate and there are no nested documents), we scan the whole index.
One of the reasons we scan the whole index is to count all hits. But there's a much more efficient method of counting hits when executing a MatchAllDocsQuery: IndexReader.numDocs(). If we were to use that method, we could disable the counting of hits during the search (by setting the total hit count threshold to 0), removing one reason to scan the whole index.
Then, when users are in this situation and request a sort by descending score, Lucene optimizations would theoretically be able to kick in to make sure that we only scan competitive hits. What's interesting is that (in theory) for a MatchAllDocsQuery, absolutely no hit is competitive after we collected the N first hits requested by the user. That's because the MatchAllDocsQuery assigns the same score to all hits.
In best-case scenarios, we could end up scanning a dozen elements instead of millions, yielding a massive performance improvement.
That's interesting in particular because one of the queries included in Infinispan's performance tests does just that: a MatchAllDocsQuery with a descending score, on an index with many documents.
The only unknown here is whether Lucene enables these optimizations with MatchAllDocsQuery, or generally with constant-score queries. Theoretically that's possible, but whether it's been implemented remains to be seen.
Currently, when we run a MatchAllDocsQuery with Lucene (i.e. the user requested a
matchAll
predicate and there are no nested documents), we scan the whole index.One of the reasons we scan the whole index is to count all hits. But there's a much more efficient method of counting hits when executing a
MatchAllDocsQuery
:IndexReader.numDocs()
. If we were to use that method, we could disable the counting of hits during the search (by setting the total hit count threshold to 0), removing one reason to scan the whole index.Then, when users are in this situation and request a sort by descending score, Lucene optimizations would theoretically be able to kick in to make sure that we only scan competitive hits. What's interesting is that (in theory) for a
MatchAllDocsQuery
, absolutely no hit is competitive after we collected the N first hits requested by the user. That's because theMatchAllDocsQuery
assigns the same score to all hits.In best-case scenarios, we could end up scanning a dozen elements instead of millions, yielding a massive performance improvement.
That's interesting in particular because one of the queries included in Infinispan's performance tests does just that: a
MatchAllDocsQuery
with a descending score, on an index with many documents.The only unknown here is whether Lucene enables these optimizations with
MatchAllDocsQuery
, or generally with constant-score queries. Theoretically that's possible, but whether it's been implemented remains to be seen.