Skip non-competitive hits when running a MatchAllDocsQuery with descending score sort with Lucene

Description

Currently, when we run a MatchAllDocsQuery with Lucene (i.e. the user requested a matchAll predicate and there are no nested documents), we scan the whole index.

One of the reasons we scan the whole index is to count all hits. But there's a much more efficient method of counting hits when executing a MatchAllDocsQuery: IndexReader.numDocs(). If we were to use that method, we could disable the counting of hits during the search (by setting the total hit count threshold to 0), removing one reason to scan the whole index.

Then, when users are in this situation and request a sort by descending score, Lucene optimizations would theoretically be able to kick in to make sure that we only scan competitive hits. What's interesting is that (in theory) for a MatchAllDocsQuery, absolutely no hit is competitive after we collected the N first hits requested by the user. That's because the MatchAllDocsQuery assigns the same score to all hits.

In best-case scenarios, we could end up scanning a dozen elements instead of millions, yielding a massive performance improvement.

That's interesting in particular because one of the queries included in Infinispan's performance tests does just that: a MatchAllDocsQuery with a descending score, on an index with many documents.

The only unknown here is whether Lucene enables these optimizations with MatchAllDocsQuery, or generally with constant-score queries. Theoretically that's possible, but whether it's been implemented remains to be seen.

Activity

Show:
Fixed

Details

Assignee

Reporter

Components

Sprint

Fix versions

Priority

Created October 21, 2020 at 12:49 PM
Updated November 3, 2020 at 10:19 AM
Resolved October 22, 2020 at 3:07 PM