Default value for "minimumShouldMatch" is different between Elasticsearch and Lucene

Description

Let's say we have the following entity:

A query constructed in the following way:

for the given data set:

returns one result (0 documents) if
hibernate.search.default.indexmanager=elasticsearch
and a different result (1 document) if
hibernate.search.default.indexmanager=directory-based

If you apply HSEARCH-3534_lucene.patch to https://github.com/hibernate/hibernate-test-case-templates/tree/master/search/hibernate-search-lucene, the test noneOfShouldMatchedWithinBooleanQueryInsideFilter_differentResults_directoryBasedVSElasticsearch passes, but when you apply HSEARCH-3534_elasticsearch.patch to https://github.com/hibernate/hibernate-test-case-templates/tree/master/search/hibernate-search-elasticsearch/hibernate-search-elasticsearch-5, the same test fails.

Using should and must as siblings in a filter may not make much sense: with the directory-based index manager the should clause is effectively ignored, because a document only has to satisfy the must criteria. Such a combination makes more sense in queries, where should influences scoring. However, I ran into the issue while migrating a huge project where such cases appear from time to time because queries are created dynamically.

Environment

None

Activity

Yoann Rodière
March 25, 2019, 3:35 PM

Thanks for the report and the test cases, now I see what you meant.

I pushed your patches to a fork of the repo for future reference: https://github.com/yrodiere/hibernate-test-case-templates/tree/HSEARCH-3534/

Now, the problem. If I understand correctly, the Elasticsearch team decided it would be a good idea for the boolean junctions to behave differently when they are nested under a filter/must_not clause than when they are not:

  1. If the bool query is in a query context and has a must or filter clause then a document will match the bool query even if none of the should queries match. In this case these clauses are only used to influence the score.

  2. If the bool query is a filter context or has neither must or filter then at least one of the should queries must match a document for it to match the bool query. This behavior may be explicitly controlled by settings the minimum_should_match parameter.

Source: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-bool-query.html

This effectively means that minimum_should_match defaults to 0 in the first case, and to 1 in the second case.

The thing is, this is completely arbitrary and not something we have in Lucene at all.
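To make the difference concrete, here is a minimal sketch that models the effective default as a plain function for each backend. It is only a model of the rule quoted above and of Lucene's usual behavior (a boolean query with only should clauses needs at least one of them to match). The class and method names are hypothetical and are not part of Elasticsearch, Lucene, or Hibernate Search.

```java
// Hypothetical model of the effective default "minimum should match".
// Not real Elasticsearch, Lucene, or Hibernate Search code.
final class MinimumShouldMatchDefaults {

    // Elasticsearch 5.6 rule as quoted from the bool query documentation:
    // at least one should clause is required in a filter context, or when
    // the bool query has neither must nor filter clauses.
    static int elasticsearchDefault(boolean filterContext, boolean hasMustOrFilter, boolean hasShould) {
        if (!hasShould) {
            return 0; // nothing to match, the setting is irrelevant
        }
        return (filterContext || !hasMustOrFilter) ? 1 : 0;
    }

    // Lucene has no notion of query vs. filter context for this purpose:
    // should clauses are optional as soon as a must or filter clause exists.
    static int luceneDefault(boolean hasMustOrFilter, boolean hasShould) {
        if (!hasShould) {
            return 0;
        }
        return hasMustOrFilter ? 0 : 1;
    }

    public static void main(String[] args) {
        // The reported case: should and must as siblings, nested under a filter.
        System.out.println("Elasticsearch: " + elasticsearchDefault(true, true, true));
        System.out.println("Lucene:        " + luceneDefault(true, true));
    }
}
```

In this model the two functions disagree only when a bool query with both should and must (or filter) clauses sits in a filter context, which is the case from the report.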

I can see three solutions:

  1. We change the Lucene backend to implement the same behavior. That might be a bit difficult to achieve, in particular when the user doesn't rely on our DSL. But more importantly, that will be surprising to people already familiar with Lucene.

  2. We change the Elasticsearch backend to work around these defaults and force Lucene's defaults instead. This will be surprising to people already familiar with Elasticsearch.

  3. We don't change anything, and simply document this oddity.

Option 1 seems dodgy, but option 2 seems more reasonable, and a few tests show that it's possible. Let's try to do it, at least in 6.

Goran Jaric
March 25, 2019, 3:57 PM
Edited

Now you know exactly where I am.

Yes, you described it very accurately. Thanks, and thanks for the fast replies so far!

In the meantime I have already started investigating option 2, since it makes the most sense to me. Hopefully most people familiar with Elasticsearch would write the query above by nesting should under a separate bool query that is a sibling of must in the filter, rather than making should itself a sibling of must. For example:
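As a rough illustration of that nesting (a sketch with hypothetical field names and helper methods, not the snippet from the original comment), the query body could be assembled as plain maps like this. The inner bool query contains only should clauses, so under both rules quoted earlier at least one of them has to match.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

// Hypothetical helper that builds an Elasticsearch-style query body as nested maps.
final class NestedShouldSketch {

    // {"term": {field: value}}; the field names used below are made up.
    static Map<String, Object> term(String field, String value) {
        return Collections.<String, Object>singletonMap("term",
                Collections.<String, Object>singletonMap(field, value));
    }

    // {"bool": {occur: [clauses...]}}
    static Map<String, Object> bool(String occur, Object... clauses) {
        return Collections.<String, Object>singletonMap("bool",
                Collections.<String, Object>singletonMap(occur, Arrays.asList(clauses)));
    }

    // filter -> bool(must: [term, bool(should: [term, term])])
    static Map<String, Object> nestedFilter() {
        Map<String, Object> anyOf = bool("should", term("color", "red"), term("color", "blue"));
        Map<String, Object> required = bool("must", term("status", "active"), anyOf);
        return bool("filter", required);
    }

    public static void main(String[] args) {
        System.out.println(nestedFilter());
    }
}
```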

Goran Jaric
April 2, 2019, 8:28 AM
Edited

It could be that this is an isolated edge case, since it is only reproducible with the example I provided above (should in the same junction as must under a filter clause), which means it would be better to handle it in a way that is not too radical.

Fabio Massimo Ercoli
April 4, 2019, 9:31 AM
Edited

Hi.
Thanks for the issue. Yeah, sometimes the backends behave differently.

We're going to force the Elasticsearch backend to use Lucene's defaults, i.e. solution #2 mentioned above.
In particular, we're going to force the default minimum should match to 0 if the should clause has a must clause as a sibling and is inside a filter predicate.

The fix will be applied to major version 6.
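A minimal sketch of that rule, assuming the bool query body is assembled as a plain map before being serialized to JSON. The class and method names are hypothetical; this is not the actual Hibernate Search implementation, only an illustration of setting minimum_should_match explicitly in the one case where the Elasticsearch default differs.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical builder, not the Hibernate Search code.
final class BoolQuerySketch {

    // {"term": {field: value}}; field names are chosen by the caller.
    static Map<String, Object> term(String field, String value) {
        return Collections.<String, Object>singletonMap("term",
                Collections.<String, Object>singletonMap(field, value));
    }

    static Map<String, Object> boolQuery(List<Object> must, List<Object> should, boolean insideFilter) {
        Map<String, Object> bool = new LinkedHashMap<>();
        if (!must.isEmpty()) {
            bool.put("must", must);
        }
        if (!should.isEmpty()) {
            bool.put("should", should);
        }
        // Only override the default where Elasticsearch would otherwise
        // require one should clause: should + must siblings inside a filter.
        if (insideFilter && !must.isEmpty() && !should.isEmpty()) {
            bool.put("minimum_should_match", 0);
        }
        return Collections.<String, Object>singletonMap("bool", bool);
    }

    public static void main(String[] args) {
        List<Object> must = Collections.<Object>singletonList(term("status", "active"));
        List<Object> should = Collections.<Object>singletonList(term("color", "red"));
        System.out.println(boolQuery(must, should, true));
    }
}
```

Leaving the parameter unset in every other case keeps the generated queries unchanged wherever the two backends already agree.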

Goran Jaric
April 4, 2019, 10:16 AM

Great! This is exactly the temporary fix I implemented outside of the hibernate-search-elasticsearch library.

Assignee

Fabio Massimo Ercoli

Reporter

Goran Jaric

Labels

None

Suitable for new contributors

None

Feedback Requested

None

Priority

Major