Two-phase projections in the Lucene backend

Description

Currently, we implement projections by adding collectors next to the TopDocsCollector.

The problem with this strategy is that these collectors will then be applied to all matching documents in the index, not just to the top hits.

It's not even limited to the competitive documents (those whose score is higher than that of the lowest-scoring document in the priority queue when they are visited): as we can see in MultiCollector, joining multiple collectors together disables the score-based optimizations that would otherwise allow skipping some documents along the way.

As a result, the distance collector, for example, will need to store in memory as many results as the total number of matching documents in the index, regardless of the limit passed to fetch(limit), which is ridiculous.

Worse, starting with recent changes, the default projection that only retrieves document IDs will always add a collector next to the TopDocsCollector... and this collector will always build a list as large as the total number of matching documents in the index.

For. Each. Single. Search.

We should switch to a two-phase approach:

  1. First phase: a search.search() call that inspects all matching documents to collect the top docs and their scores (TopDocsCollector) and, if necessary, the aggregations (FacetsCollector).

  2. Second phase: an explicit collection that inspects only the top docs to extract data from docvalues (DistanceCollector) or from storage (reader.document(...) using the StoredFieldVisitor; even if the javadoc of Collector advises against it, in this case it would be fine). Maybe we can use collectors, but a different abstraction would be fine too, since we do not need to perform a search, only to inspect a predetermined set of documents.
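
The two phases above can be sketched roughly as follows. This is plain Java with no Lucene dependency, so it is only an illustration of the idea, not the backend's implementation. All names (TwoPhaseProjectionSketch, Hit, collectTopDocs, project) are hypothetical; the float[] of scores stands in for the set of matching documents, and the valueLoader callback stands in for a docvalues or stored-field lookup.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.function.IntFunction;

public class TwoPhaseProjectionSketch {

    /** Hypothetical (docId, score) pair, playing the role of a scored hit. */
    public static final class Hit {
        public final int docId;
        public final float score;

        public Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Phase 1: visit every matching document, but retain at most `limit` hits.
     * Memory use is bounded by `limit`, not by the number of matches.
     */
    public static List<Hit> collectTopDocs(float[] scoresByDocId, int limit) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap: the head is the least competitive hit retained so far.
        PriorityQueue<Hit> queue = new PriorityQueue<>(byScore);
        for (int docId = 0; docId < scoresByDocId.length; docId++) {
            queue.add(new Hit(docId, scoresByDocId[docId]));
            if (queue.size() > limit) {
                queue.poll(); // evict the lowest-scoring hit
            }
        }
        List<Hit> top = new ArrayList<>(queue);
        top.sort(byScore.reversed()); // best hit first
        return top;
    }

    /**
     * Phase 2: load projected values for the top docs only.
     * The loader is called once per retained hit, never for the other matches.
     */
    public static <T> Map<Integer, T> project(List<Hit> topDocs, IntFunction<T> valueLoader) {
        Map<Integer, T> projected = new LinkedHashMap<>();
        for (Hit hit : topDocs) {
            projected.put(hit.docId, valueLoader.apply(hit.docId));
        }
        return projected;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f, 0.1f}; // five matching documents
        String[] storedIds = {"a", "b", "c", "d", "e"};  // stand-in for stored fields
        int[] loads = {0};                               // counts phase-2 lookups

        List<Hit> top = collectTopDocs(scores, 2);
        Map<Integer, String> ids = project(top, docId -> {
            loads[0]++;
            return storedIds[docId];
        });

        System.out.println(ids + " loaded=" + loads[0]); // prints {1=b, 3=d} loaded=2
    }
}
```

With five matching documents and a limit of 2, the value loader runs twice instead of five times, which is the saving the second phase is meant to provide.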

Note that solving this ticket should fix HSEARCH-3786.

Environment

None

Assignee

Yoann Rodière

Reporter

Yoann Rodière

Labels

None

Suitable for new contributors

None

Feedback Requested

None

Components

Fix versions

Affects versions

Priority

Critical