Aggregations on multi-valued numeric fields for Lucene

Description

See how org.hibernate.search.integrationtest.backend.tck.search.aggregation.SingleFieldAggregationBaseIT#multiValued is disabled due to org.hibernate.search.integrationtest.backend.lucene.testsupport.util.LuceneTckBackendFeatures#aggregationsOnMultiValuedFields.

Before HSEARCH-3839, we couldn't even index multiple values for numeric fields in Lucene. After HSEARCH-3839, we can, but we pick a single value when aggregating, so aggregations are still incorrect.

Ideally, when counting documents per field value, multi-valued documents should be counted once per value that appears in the field. So if a single document has values 1 and 2 for a single field, it should increment the count for both 1 and 2. At least that's what happens on Elasticsearch.

How to test the behavior on Elasticsearch:

curl -XDELETE -H "Content-Type: application/json" localhost:9200/mytest1/ 1>&2 2>/dev/null;
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest1/\?pretty -d'{"mappings":{"properties":{"num":{"type":"integer"}}}}'
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest1/_doc/1 -d'{"num":1}'
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest1/_doc/2 -d'{"num":[1,2]}'
curl -XPOST -H "Content-Type: application/json" localhost:9200/mytest1/_search\?pretty -d'{"aggs":{"foo":{"terms":{"field":"num"}}}}'

Result:

{
  ...
  "aggregations" : {
    "foo" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : 1, "doc_count" : 2 },
        { "key" : 2, "doc_count" : 1 }
      ]
    }
  }
}

So document 2 was counted twice: once for key 1 and once for key 2.
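
The counting rule described above can be sketched in plain Java, using the same two documents as the Elasticsearch test. This is only an illustration of the expected semantics, not the Lucene backend's implementation: the class and method names are hypothetical, the "single value" variant arbitrarily picks the lowest value, and counting duplicate values within one document only once is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; none of these names exist in Hibernate Search or Lucene.
public class MultiValuedTermsCount {

    // Expected behavior: a document increments the count of every distinct value it holds.
    // Assumption: a value repeated within one document is counted only once.
    static Map<Integer, Long> countPerValue(List<List<Integer>> documents) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (List<Integer> values : documents) {
            for (Integer value : new LinkedHashSet<>(values)) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Behavior described in this issue: only one value per document is considered.
    // Assumption: the lowest value is picked; the actual choice may differ.
    static Map<Integer, Long> countSingleValue(List<List<Integer>> documents) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (List<Integer> values : documents) {
            if (values.isEmpty()) {
                continue; // documents without a value contribute to no bucket
            }
            counts.merge(Collections.min(values), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Document 1 has num=1, document 2 has num=[1,2], as in the curl commands.
        List<List<Integer>> documents = List.of(List.of(1), List.of(1, 2));
        System.out.println(countPerValue(documents));    // {1=2, 2=1}
        System.out.println(countSingleValue(documents)); // {1=2}
    }
}
```

With the single-value variant, the bucket for value 2 disappears entirely, which is the kind of incorrect result this issue is about.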

Activity

Yoann Rodière September 25, 2020 at 10:11 AM

This was fixed in commit 4a0d774c7620a5a21108e880854c0e2b268f4cf0 as part of HSEARCH-1929.

Closing as duplicate.

Waldemar Kłaczyński March 6, 2020 at 6:23 PM

Currently, LongRangeFacetCounts and DoubleRangeFacetCounts do not support such capabilities when it comes to flat index structure. For nested structures, this could be simulated by disabling NestedProvider. Then it could work so that it would aggregate separate values from nested documents. But also you can make the new LongRangeFacetCounts and DoubleRangeFacetCounts so that it works according to our assumptions.

Yoann Rodière March 6, 2020 at 7:42 AM

Now there are actually four aggregation options for nested documents and five options for flat documents. But you can add "none", or, if it isn't set, all fields can be aggregated without performing linking functions on them.

Yes, that's the plan. By default, I don't think we should apply "per-document aggregations" (sum, avg, lowest, etc.) in aggregations, so as to behave consistently:

  • Between string aggregations and numeric aggregations: we can't compute sum/avg/... for strings, and lowest/highest don't make much sense for terms found in text.

  • Between Lucene numeric aggregations and Elasticsearch numeric aggregations: Elasticsearch takes into account all values by default, not the sum/avg/lowest/etc.

Also, I don't think we can request per-document sum/avg/lowest/etc. for numeric terms/range aggregations in Elasticsearch, so we can't expose the feature in generic APIs that both Elasticsearch and Lucene must implement. We could move it to Lucene-specific APIs, I suppose, but there isn't really a use case, is there? You just implemented this so that aggregations would somehow work on multi-valued fields?
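
To make the distinction concrete, here is a rough sketch of what a "per-document aggregation" does before bucketing: each document's values are first reduced to a single number, and only that number is counted. The `Mode` enum and the method names are hypothetical and are not part of the Hibernate Search, Lucene, or Elasticsearch APIs; rounding the average to a long is also just an assumption made for the example.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of per-document reduction followed by a terms count.
public class PerDocumentReduction {

    // Hypothetical reduction modes, mirroring the sum/avg/lowest/highest options discussed.
    enum Mode { SUM, AVG, LOWEST, HIGHEST }

    // Reduces the values of one document to a single number.
    static long reduce(List<Long> values, Mode mode) {
        switch (mode) {
            case SUM:
                return values.stream().mapToLong(Long::longValue).sum();
            case AVG:
                // Assumption: the average is rounded to the nearest long.
                return Math.round(values.stream().mapToLong(Long::longValue).average().getAsDouble());
            case LOWEST:
                return Collections.min(values);
            case HIGHEST:
                return Collections.max(values);
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    // Counts documents per reduced value: each document lands in exactly one bucket.
    static Map<Long, Long> countAfterReduction(List<List<Long>> documents, Mode mode) {
        Map<Long, Long> counts = new TreeMap<>();
        for (List<Long> values : documents) {
            if (values.isEmpty()) {
                continue; // nothing to reduce, so no bucket
            }
            counts.merge(reduce(values, mode), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<Long>> documents = List.of(List.of(1L), List.of(1L, 2L));
        System.out.println(countAfterReduction(documents, Mode.LOWEST));  // {1=2}
        System.out.println(countAfterReduction(documents, Mode.HIGHEST)); // {1=1, 2=1}
        System.out.println(countAfterReduction(documents, Mode.SUM));     // {1=1, 3=1}
    }
}
```

Note how SUM produces a bucket for 3, a value that was never indexed, and how none of the modes reproduces the {1=2, 2=1} counts that Elasticsearch returns when it takes all values into account.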

You can practically set the sorting option to none. But it would have to return as many document repetitions as the nested or duplicate values in the flat model field.

Yes, some documents would be counted multiple times. That's what Elasticsearch does by default, and I think it's a decent default.

Especially if paging is used.

Paging is not relevant for aggregations, which are applied on the whole index.
I don't think performance is an issue here, if that's what you're suggesting. The problem is more that we have to move away from our "legacy" implementation of aggregations that relied on Lucene's faceting.

Anyway, this is all something I'm suggesting to do as a second step. After your work, sorts on multi-valued fields work correctly, and aggregations on multi-valued fields work correctly as long as there is effectively only one value per document (which will probably be the case once you add filtering anyway).

Waldemar Kłaczyński March 5, 2020 at 4:51 PM
Edited

Now there are actually four aggregation options for nested documents and five options for flat documents. But you can add "none", or, if it isn't set, all fields can be aggregated without performing linking functions on them.


You can practically set the sorting option to none. But it would have to return as many document repetitions as the nested or duplicate values in the flat model field. Especially if paging is used.

Duplicate

Details

Created March 5, 2020 at 3:59 PM
Updated September 25, 2020 at 10:16 AM
Resolved September 25, 2020 at 10:11 AM