Create a proper API for sorting definitions
Description
depends on
is duplicated by
is followed up by
Activity
Yoann RodièreAugust 26, 2016 at 9:49 AM
@Emmanuel Bernard Thanks. It seems the PR there will fulfill your requirements: https://github.com/hibernate/hibernate-search/pull/1139
Emmanuel BernardAugust 26, 2016 at 9:15 AM
I think I've caught up with the status, let me know otherwise. And thanks for the relay
Emmanuel BernardAugust 26, 2016 at 9:14 AM
Where is the API going
The most likely scenario is that QueryBuilder
will be made generic in 6 to output:
Lucene
Query
or Elasticsearch JSON when createQuery is calledLucene
Sort
or Elasticsearch JSON when createSort is calledbe
byNative(SortField)
for Lucene andbyNative(String)
for ES
It's not rebuilt from the ground up, just generified to meet our needs. We did explore in Rome something that integrate the DSL and the createQuery
but it was not destroying the DSL, just adding more leaves options if I recall.
Spatial
Ok @Nicolas Helleringer, let's keep withComputeMethod
for later just making usre the DSL could be expanded.
On byDistanceFromSpatialQuery(Query)
I'm fine with dropping it.
Native
I still feel native should be in with a oveloaded method byNative(SortField)
and byNative(String)
, the former only usable with Lucene, the latter only with Elasticsearch. Yes using a custom SortField
to keep the JSON string payload seems fine to me for the time being. In 6, we will be able to introduce a generic type bedending ont he backend targeted. So yes it looks hacky but that's temporary. Make it experimental if you have concerns.
Do we want to support sorting on unmapped fields
Doesn't native gives us a way out?
Anyways, I'm fine with restricting for now.
Nicolas HelleringerAugust 23, 2016 at 8:13 AM
Elasticsearch lets you sort by distance between a point and a coordinate in a document through several methods. SLOPPY_ARC (default), ARC and PLANE. Do you think that is implementable directly via the Lucene Sort/SortField classes like you did DistanceSortField / DistanceComparator.
Is that worth it?
And what is the current implementation doing?
If I do understand the Elasticsearch doc well, ARC is a mathematically correct spherical distance, SLOPPY_ARC is a "faster" computation method and PLANE is planar distance with local projection around the center of the search.
I do think that trying to save some CPU cycle here is overkill and not needed in most cases but I think it is globally doable to have a support for multiple distance computation methods in Hibernate Search.
The current implementation is doing accurate spherical computation.
I need to grab some time to go deeper into Elasticsearch and Hibernate Search code to compare if you need to.
Yoann RodièreAugust 12, 2016 at 10:20 AM
Where is this API going?
Our ultimate goal is to be more independent of Lucene internal API. Currently, most of our machinery is based on Lucene API (we build our index with Lucene Documents for instance) and this has to change.
It's a Search 6 thing though.
Ok. So this API is bound to disappear in its current form.
Is this API going to support native sorts?
Native sorts are important. The fact is that Elasticsearch brings a lot of power to the user and it is really important to allow the user to use this power.
I completely agree with that. I was just saying that maybe we should try to provide a way to chain DSL-built sorts with native sorts (through another API, provided only by the ES module) instead of adding backend-dependent methods to the DSL.
But if the API is to be rebuilt from the ground up in 6, well, I guess I can live with that.
Should we keep andByIndexOrder/andByScore
+1 to keep them. I also thought they might be useful.
Will do. I will also add a way to chain an andByField
after a byScore
, just in case.
Should we keep in(Unit)
Maybe we could simply provide a fixed value for Elasticsearch (probably meter) and document that?
In the long term, we will probably need to be able to define backend specific options but it might not be a priority for now.
Agreed, a fixed value seems the way to go. I'll check out what the unit for our FullTextQuery.SPATIAL_DISTANCE is, but it seems to be kilometers (org.hibernate.search.query.engine.impl.DocumentExtractorImpl.extract(int)
=> org.hibernate.search.query.engine.impl.QueryHits.spatialDistance(int)
=> org.hibernate.search.spatial.impl.DistanceCollector.HitEntry.getDistance(Point)
=> org.hibernate.search.spatial.impl.Point.getDistanceTo(double, double)
). A bit odd, but changing it would be an API break, so let's be consistent...
Is the byDistanceFromSpatialQuery(Query) method making sense
It's a useful shortcut. If we can provide a useful error message if the user is not providing a spatial query, I think it's worth it. If it's going to be cryptic, probably not.
Not sure it's really a shortcut, that was my point. Without it, we'd have this kind of code:
Query luceneQuery = builder.spatial()
.onField( "location" )
.within( 100, Unit.KM )
.ofLatitude( centerLatitude )
.andLongitude( centerLongitude )
.createQuery();
FullTextQuery hibQuery = fullTextSession.createFullTextQuery( luceneQuery, POI.class );
Sort sort = builder.sort()
.byField( "location" )
.fromLatitude( centerLatitude )
.andLongitude( centerLongitude )
.createSort();
hibQuery.setSort( sort );
And with it, that kind of code:
Query luceneQuery = builder.spatial()
.onField( "location" )
.within( 100, Unit.KM )
.ofLatitude( centerLatitude )
.andLongitude( centerLongitude )
.createQuery();
FullTextQuery hibQuery = fullTextSession.createFullTextQuery( luceneQuery, POI.class );
Sort sort = builder.sort()
.byField( "location" )
.byDistanceFromSpatialQuery( luceneQuery )
.createSort();
hibQuery.setSort( sort );
So we gained one line of code. Out of a dozen. And we lost in clarity, though that's a matter of opinion.
Anyway, I'll see if we can at least implement this cleanly.
Do we want to support sorting on unmapped fields
I think it can be considered an acceptable limitation to always sort on common fields. I wouldn't enforce it though so that it continues to work as before for Lucene.
It would be something to keep in mind for Search 6.
Ok to support only common fields. But as to "work as before", the DSL didn't exist "before", so it will only mean that users can provide their own SortField
through the byNative
methods.
Thanks!
Guillaume SmetAugust 12, 2016 at 9:08 AM
Where is this API going?
Our ultimate goal is to be more independent of Lucene internal API. Currently, most of our machinery is based on Lucene API (we build our index with Lucene Documents for instance) and this has to change.
It's a Search 6 thing though.
Is this API going to support native sorts?
Native sorts are important. The fact is that Elasticsearch brings a lot of power to the user and it is really important to allow the user to use this power.
Here is what Emmanuel answered when I raised this very issue:
We could do that but I think that should tie with the generification of the DSL to support different backend outputs (namely Lucene, ES and Solr. But that one can only happen in 6.
So we should do it the dirty way for now and revisit that later.
Should we keep andByIndexOrder/andByScore
+1 to keep them. I also thought they might be useful.
Should we keep in(Unit)
Maybe we could simply provide a fixed value for Elasticsearch (probably meter) and document that?
In the long term, we will probably need to be able to define backend specific options but it might not be a priority for now.
Is the byDistanceFromSpatialQuery(Query) method making sense
It's a useful shortcut. If we can provide a useful error message if the user is not providing a spatial query, I think it's worth it. If it's going to be cryptic, probably not.
Do we want to support sorting on unmapped fields
I think it can be considered an acceptable limitation to always sort on common fields. I wouldn't enforce it though so that it continues to work as before for Lucene.
It would be something to keep in mind for Search 6.
Yoann RodièreAugust 11, 2016 at 4:06 PM
Taking on the work of Emmanuel, I'm now working on this on my own branch (https://github.com/yrodiere/hibernate-search/tree/HSEARCH-1872).
I just created https://hibernate.atlassian.net/browse/HSEARCH-2320#icft=HSEARCH-2320 for everything that will be addressed later on (see the ticket for details). We can always create sub-tasks when the time comes to address those.
Summing up, here is the remaining work:
Making things work in -engine (the tests are failing right now)
Ensuring things work in -elasticsearch
Using proper exceptions
Taking some decisions (see below)
Below are the subjects on which we should take decisions.
Where is this API going?
Is this API going to be, on the long term:
an independent way of building
org.apache.lucene.search.Sort
objects, that will later be provided when executing a query?an independent way of building
HibernateSearch
-defined sort objects (say,HSSort
, like theHSQuery
), that will later be provided when executing aHSQuery
?
Solution 1 seems workable, since the ElasticSearch module currently converts Sort objects to its internal representation, but this discussion makes me doubt that reliance on Lucene types is a long-term objective.
Solution 2 also seems workable, but would be part of a larger work on HibernateSearch APIs, which I'm not aware of (yet).
Is this API going to support native sorts?
See the discussion here: https://github.com/hibernate/hibernate-search/pull/1132#discussion_r72258266
In my opinion, supporting native sorts in this API would be a bit annoying, because the API is currently returning org.apache.lucene.search.Sort
, which is not at all a good fit for putting ElasticSearch JSON sort definitions in it. We could either work it around (having a specific SortField
with a JSON string in it, and a dummy field name), which seems a bit messy, or convert the JSON to a SortField
, which would defeat the purpose of allowing to provide native JSON in the first place, since it would prevent users from using ElasticSearch features not yet supported in HibernateSearch.
Also, the API will IMO be less elegant, since we'll have a method that accept a parameter whose content (be it a plain String or an object) is not clearly constrained. That would normally be a good candidate for a type parameter (TNativeSort
), but given the way we obtain the sort builder, we cannot have an implementation-dependent type parameter. So whatever we do, we won't be able to get much better than exposing a byNative(Object)
method that tells close to nothing to the user. A shame for a domain-specific language.
Maybe we should take the problem the other way around: somewhere in the ElasticSearch module, allow for concatenation of DSL-built sorts and plain JSON strings? Or, maybe, add setSort(Sort ... fields) methods to queries, and add something
Should we keep andByIndexOrder
/andByScore
Those should remain, IMO, because:
Implementation is trivial (as long as Lucene supports that kind of sorts, which seems likely)
They might be useful, particularly
andByScore
. I've provided a use case in the tests: sorting first by a field for which many documents have the same value (a category), and only then by relevance. So you'll have blocks of results, with the elements in each block being sorted by relevance. Granted, it's exotic, but business users are definitely able to come up with such a use case. And when business users want that kind of things in their whole application, it's frustrating being prevented from using the DSL for such a minor detail.Removing them would go against the principle of least surprise. If everything we provide as
by*
method is also provided asandBy*
method, why not those two?
Should we keep in(Unit)
I'd say we should not keep this in the DSL, since it's not supported out-of-the-box for vanilla Lucene queries. All methods in the DSL should be implemented for every backend.
Maybe we can create another ticket for adding that kind of feature to locale Lucene indexes. Anyone find this useful?
Is the byDistanceFromSpatialQuery(Query)
method making sense
The point of this method would be to take a query distance (distance to these coordinates must be below 1km, for instance), extract the point from it, and create a sort by distance to this point.
On the plus side, this relieves the user from passing the reference coordinates twice.
On the minus side, the method could break at runtime: we ask for a Query
, which may or may not have a distance filter, and as such introduces new ways for users to make things explode.
I wonder if it's such a great pain to provide the coordinates twice. Isn't it how it's done almost everywhere? I mean, when you write an ElasticSearch query by hand, or a SolR query, you also have to provide these coordinates twice. Besides, we're only changing the thing to provide twice, here: instead of providing the coordinates to HS twice, we provide the query to HS twice...
I don't see any way to do it better than this, so I'd say we drop it. It's not implemented yet, anyway.
Do we want to support sorting on unmapped fields
When querying multiple indices at once, it may happen that some field we're sorting on is absent from one of the indices. In that case, ElasticSearch requires that type information be provided.
So in that case, we may have to retrieve type information from the metamodel. Problem: the QueryBuilder
, from which we start building the sort, is only aware of one entity we're working on. This might be something to fix, because queries built this way may (and will) still be ran against multiple indices, but that's another matter.
So, let's say the user is querying two indices: A
(fields: z
, a1
, a2
), B
(fields: z
, b1
, b2
), and he wants to sort on z
, then a1
, then b1
. The user won't be able to build his sort with the QueryBuilder
, since whichever entity he chooses to bind his QueryBuilder
to, there would be some missing type information at some point (for either the a1
field or the b1
field).
Now, we can decide either to drop support for that kind of sort in the DSL (and that should be documented), or to provide support for {{QueryBuilder}}s spanning multiple types (say, they would use the first type definition they find on any of the entities they would be bound to). Or both.
Any thoughts would be greatly appreciated!
Emmanuel BernardJuly 27, 2016 at 3:21 PMEdited
From this version, we need to make decisions and remove code if necessary on:
treatMultiValuedAs
=> probably remove now (withMultiValuedMode
) and open an issue to support sortable multivalues at Hibernate Search levelremove
onMissingValue
for distance sort and open an issue to support missing values for distance or at least first/lastRemove
withComputeMethod
on distance sorts and open an issue to support the various compute methods (pending @Nicolas Helleringer 's feedback.remove
withComparator
until we understand the use case and address compatibilitydecide on the support for native (I think it should go)
make a decision on
andByIndexOrder
(this could go away IMO)decide whether we want to support
SortField.Type.STRING_VAL
: I think we can ask people to use the native api for that use casemake a decision on
andByScore
(this could remain IMO)should we keep in(Unit)
I am done with it on my side. Any taker to drive this one home?
Emmanuel BernardJuly 27, 2016 at 3:13 PMEdited
I've pushed a compiling implementation of the DSL and the API we discussed. sha1 of the commit 179708d10
What is missing:
review the TODOs in the code and make a decision
handle exceptions when the field name does not exist and if sortfield.type is detected
decide if we want to let the DSL accept the type explicitly via something like
onField("foo")withType(FieldSettingsDescriptor.TYPE,NumericFieldSettingsDescriptor.NumericEncodingType)
(note quite sure what to pass otherwise to make it non Lucene dependent).convert my quick new SearchException into a proper exception (maybe also AssertionFailure)
write some tests
Emmanuel BernardJuly 27, 2016 at 2:48 PM
What might be missing in the API:
offer option to compare by String vs String_val ? See
SortField.Type
Especially useful for the DSL, but not exclusively as we apply sorting as an independent component from the query definition.
Important to possibly decouple from the Lucene engine in local mode, and also to abstract the user application code from the Lucene API changes.
However while abstraction has some benefits, let's make sure that no functionality is taken away from people having deep Lucene understanding.