Fixed
Details
Assignee: Yoann Rodière
Reporter: Yoann Rodière
Sprint: None
Priority: Critical
Created December 18, 2019 at 5:14 PM
Updated January 22, 2020 at 2:18 PM
Resolved January 6, 2020 at 12:57 PM
Currently, we implement projections by adding collectors next to the TopDocsCollector.
The problem with this strategy is that these collectors will then be applied to every matching document in the index.
It's not even just the competitive documents (those that have a score higher than the lowest-scoring document in the priority queue when they are visited): as we can see in MultiCollector, joining multiple collectors together will disable score-based optimizations that would allow skipping some documents along the way.
As a result, the distance collector, for example, will need to store in memory as many results as the total number of matching documents in the index, regardless of the limit passed to fetch(limit). Which is ridiculous.
Worse, starting with recent changes, the default projection that only retrieves document IDs will always add a collector next to the TopDocsCollector... and this collector will always build a list as large as the total number of matching documents in the index. For. Each. Single. Search.
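To make the cost concrete, here is a toy model in plain Java of a projection collector running next to a bounded top-docs queue. It does not use Lucene: ToyCollector, ToyTopDocs, ToyProjection and the fake score formula are all hypothetical names invented for this illustration, so treat it as a sketch of the memory behavior rather than of the actual Hibernate Search code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class EagerProjectionSketch {

    /** Hypothetical stand-in for a collector: called once per matching document. */
    interface ToyCollector {
        void collect(int docId, float score);
        int size();
    }

    /** Keeps only the `limit` best-scoring hits, like a bounded top-docs queue. */
    static final class ToyTopDocs implements ToyCollector {
        private final int limit;
        // Min-heap on score: the head is the weakest hit currently kept.
        private final PriorityQueue<float[]> queue =
                new PriorityQueue<float[]>((a, b) -> Float.compare(a[1], b[1]));

        ToyTopDocs(int limit) { this.limit = limit; }

        @Override public void collect(int docId, float score) {
            queue.add(new float[] { docId, score });
            if (queue.size() > limit) {
                queue.poll(); // evict the weakest hit
            }
        }

        @Override public int size() { return queue.size(); }
    }

    /** Stores one projected value per collected document, with no bound. */
    static final class ToyProjection implements ToyCollector {
        private final List<Integer> values = new ArrayList<>();

        @Override public void collect(int docId, float score) {
            values.add(docId * 2); // pretend this is a value read from doc values
        }

        @Override public int size() { return values.size(); }
    }

    /** Runs both collectors side by side; returns {top docs kept, values stored}. */
    static int[] run(int matchingDocs, int limit) {
        ToyCollector topDocs = new ToyTopDocs(limit);
        ToyCollector projection = new ToyProjection();
        for (int docId = 0; docId < matchingDocs; docId++) {
            float score = (docId * 31 % 1000) / 1000f; // arbitrary fake score
            topDocs.collect(docId, score);
            projection.collect(docId, score);
        }
        return new int[] { topDocs.size(), projection.size() };
    }

    public static void main(String[] args) {
        int[] sizes = run(100_000, 10);
        System.out.println("top docs kept: " + sizes[0]
                + ", projected values stored: " + sizes[1]);
        // prints "top docs kept: 10, projected values stored: 100000"
    }
}
```

In this model the projection holds 100,000 entries to serve a limit of 10, which is the behavior described above.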
We should switch to a two-phase approach:
First phase: search.search() call that inspects all documents to collect the top docs and their score (TopDocsCollector), and if necessary the aggregations (FacetsCollector).
Second phase: explicit collection that inspects only top docs to extract data from docvalues (DistanceCollector) or from storage (reader.document(...) using the StoredFieldVisitor: even if the javadoc of Collector advises against it, in this case it would be fine). Maybe we can use collectors, but a different abstraction would be fine, since we do not need to perform a search, but rather to inspect a pre-determined set of documents.
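The two phases could be sketched roughly as follows, again as a plain-Java toy model rather than Lucene code. Hit, topDocs, project and the valueLoader callback are hypothetical names chosen for this example; in a real implementation the loader would presumably read doc values or stored fields for the given document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntFunction;

final class TwoPhaseSketch {

    /** A matching document and its score (hypothetical stand-in for a top-docs entry). */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Phase 1: visit every match, but keep only the `limit` best hits, best first. */
    static List<Hit> topDocs(float[] scores, int limit) {
        // Min-heap on score, so the weakest kept hit is evicted first.
        PriorityQueue<Hit> queue =
                new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score));
        for (int docId = 0; docId < scores.length; docId++) {
            queue.add(new Hit(docId, scores[docId]));
            if (queue.size() > limit) {
                queue.poll();
            }
        }
        List<Hit> result = new ArrayList<>(queue);
        result.sort((a, b) -> Float.compare(b.score, a.score));
        return result;
    }

    /** Phase 2: load projected values for the pre-determined hits only. */
    static <T> List<T> project(List<Hit> hits, IntFunction<T> valueLoader) {
        List<T> values = new ArrayList<>(hits.size());
        for (Hit hit : hits) {
            values.add(valueLoader.apply(hit.docId));
        }
        return values;
    }

    public static void main(String[] args) {
        float[] scores = { 0.2f, 0.9f, 0.1f, 0.7f, 0.5f };
        int[] lookups = { 0 }; // counts how many documents phase 2 touches
        List<String> values = project(topDocs(scores, 2), docId -> {
            lookups[0]++;
            return "value-of-doc-" + docId;
        });
        System.out.println(values + " loaded with " + lookups[0] + " lookups");
        // prints "[value-of-doc-1, value-of-doc-3] loaded with 2 lookups"
    }
}
```

The point of the split is that memory and per-document work in the second phase scale with the limit, not with the number of matching documents.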
Note that solving this ticket should fix HSEARCH-3786.