Add support for '_routing' Elasticsearch parameter

Description

Consider adding support for the '_routing' parameter when doing CRUD operations against an Elasticsearch cluster. In my opinion, this can be a very effective way to improve the performance of search and update operations, and it gives more control over isolating different domains of data in the index. See the Elasticsearch documentation of this parameter for details.

By default, without custom _routing, the document ID is used as the routing value that determines in which ES shard the document is indexed. ES selects the shard with a formula that takes the configured shard count into account, so that the data is spread (reasonably) evenly over the available shards. However, a user might want to isolate a set of documents in a single shard (determined by a discriminating property, for example). Knowing which shard holds that set, they can then search it by querying that shard explicitly and no other. This can be done with custom _routing. Several routing values can also be given at search time to query more than one shard.
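The shard selection described above can be sketched roughly as follows. This is only an illustration, assuming the commonly documented form shard = hash(routing) % number_of_primary_shards. The class and method names are hypothetical, and CRC32 is a stand-in for the hash that ES really uses, so the shard numbers printed here will not match a real cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch, not Elasticsearch code.
class RoutingSketch {

    // Stand-in hash (assumption): ES uses its own hash function internally.
    // CRC32 is used only so that this runs on a plain JDK.
    static long hash(String routing) {
        CRC32 crc = new CRC32();
        crc.update(routing.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // non-negative 32-bit value held in a long
    }

    // shard = hash(routing) % number_of_primary_shards
    static int shardFor(String routing, int primaryShards) {
        return (int) (hash(routing) % primaryShards);
    }

    public static void main(String[] args) {
        int primaryShards = 3;

        // Default routing: the document ID is the routing value, so the
        // documents of one organisation may end up on different shards.
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, primaryShards));
        }

        // Custom routing: every document of the organisation uses the same
        // routing value, so they all map to the same shard.
        String organisationId = "org-42";
        System.out.println(organisationId + " -> shard "
                + shardFor(organisationId, primaryShards));
    }
}
```

Because the routing value alone decides the shard, a search that passes the same value only needs to visit that one shard.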

Why do I need this? My use case is:
I am building an interface where users can segment a large data set using custom-built filtering queries (using ES). Users can also run full-text searches and apply filters to the results as they choose. Each user belongs to an organisation and only has access to that organisation's data. I have millions of documents to index, across a couple of entity types. I want to isolate the data of a given organisation and direct searches only to the indices and shards that store it. I do not want to search all shards, because that is inefficient on such a large data set. So I split the data into multiple indices, each further split into shards.

Most of these documents are old and not very relevant for search. I put all of that data into an ARCHIVE index. The newer data is split into two LIVE indices, each containing the data of half of the organisations. Each index is split into 3 primary shards, each replicated once, so 6 shards per index. I want to put all the data of a single organisation in a single primary shard (and its replica). Then, when searching that data, I want to use custom routing to select that shard only.

I currently use custom routing successfully with my own manual integration with ES. I use a property contained in each document (organisation id) as the routing value. But I want to use Hibernate Search to sync data between my db and ES, because this is a task best suited for an ORM.
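For illustration, here is a minimal sketch of what passing the organisation id as the routing value could look like at the REST level. It assumes the `routing` query-string parameter of the ES index and search endpoints, and the `/{index}/{type}/{id}` path layout of ES 2.x/5.x. The class, method, index and type names are hypothetical, and this is not my actual client code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that only builds request paths; it sends nothing.
class RoutedRequests {

    // Encodes the routing value for use in the query string (Java 10+).
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Path for indexing one document, routed by organisation id.
    // Assumes index, type and id contain only URL-safe characters.
    static String indexPath(String index, String type, String id, String organisationId) {
        return "/" + index + "/" + type + "/" + id + "?routing=" + enc(organisationId);
    }

    // Path for searching one index, restricted to the shard that the
    // organisation id routes to.
    static String searchPath(String index, String organisationId) {
        return "/" + index + "/_search?routing=" + enc(organisationId);
    }

    public static void main(String[] args) {
        // "live-1" and "invoice" are made-up index and type names.
        System.out.println("PUT  " + indexPath("live-1", "invoice", "17", "org-42"));
        System.out.println("POST " + searchPath("live-1", "org-42"));
    }
}
```

The same routing value has to be used when indexing and when searching, otherwise the search is sent to a shard that does not hold the organisation's documents.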

Activity

Yoann Rodière, February 8, 2021 at 9:58 AM

This is exactly how sharding is implemented in Hibernate Search 6. See https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#concepts-sharding-routing

Closing as out of date.

Ivan Krumov, March 17, 2017 at 12:09 PM

You can work on it on your own; don't wait for my PR. I would like to help, but I don't have enough free time to do this right now.
I am going to try not to use _routing for the time being.

Yoann Rodière, March 17, 2017 at 10:31 AM

> I however cannot upgrade to 5.8 or 5.7 right now, and Sanne Grinovero told me such things will not be backported.

Right... Well, be warned that we made some heavy changes in the internals in 5.8 already, in order to be able to support multiple Elasticsearch versions (2.x and 5.x). As a result we introduced our own high-level abstraction over the REST client. Thus it may not be straightforward to port your work to 5.8.

Please keep us updated though, so we know whether to work on this on our own or to wait for your PR! And of course, feel free to ask for advice should you need it.

Ivan Krumov, March 17, 2017 at 9:07 AM (edited)

The code is not available publicly but I am considering exposing my ES client part (higher-level abstraction over the REST client or Transport client) and mass indexer. They are working fine, only I cannot use that client with Hibernate Search of course.

I will probably send a PR indeed. Right now I am struggling to build upon HS functionality somehow but the best way would be to alter the internals. I however cannot upgrade to 5.8 or 5.7 right now, and Sanne Grinovero told me such things will not be backported.

Yoann Rodière, March 17, 2017 at 8:35 AM

> If you adapt dynamic sharding to map to ES shards instead of ES indexes, that will not fit my use case, because I still need to put some documents in ARCHIVE and others - in LIVE.

I see. Well then, I fear our biggest problem will be how to name this, given we already have a feature called "dynamic sharding" that doesn't really do "sharding" in Elasticsearch terms... We'll think about this.

> I am implementing this ES integration in parallel with you, and it is not live yet. I am also searching for a way to extend Hibernate Search 5.6.1.Final to support _routing (even in a hardcoded way) but I cannot find an extension point for that.

Out of curiosity, even if it's not working yet, is this code available publicly?
By the way, if you want to work on it, feel free to send a PR adding this "_routing" support directly to our implementation (master branch, which is 5.8.0-SNAPSHOT). We'd be glad to merge it if it doesn't impair other features (though we may discuss the public-facing stuff), and it may be easier to implement since you'll be able to access internal engine code.

Out of Date

Details

Created March 8, 2017 at 2:17 PM
Updated February 8, 2021 at 9:58 AM
Resolved February 8, 2021 at 9:58 AM