Query time join

Description

Lucene 3.6 introduces the notion of "Query Time Join": a way to relate Documents from different indexes in order to filter content and retrieve fields. This approach comes at a runtime cost, as an extra pass is involved in processing the query.

The idea is basically that if you search on e.g. Post instances and you need the photo of the User that is part of the Post, you can keep this information separate and retrieve the User on the fly. This way, changes to fields of the User don't require re-indexing all the related Comments. See http://www.searchworkings.org/blog/-/blogs/412000

Query time joining in Lucene is pretty straightforward, and entirely encapsulated in JoinUtil.createJoinQuery. It requires the following arguments:

fromField - The entity field to join in the entity being queried: e.g. user.id
toField - The entity field in the related index to join on: e.g. id.
fromQuery - The query executed to collect the from terms.
fromSearcher - The searcher on which the fromQuery is executed.
multipleValuesPerDocument - Whether the fromField contains more than one value per document (multi-valued field). If this option is set to false the from terms can be collected in a more efficient manner.

Since this doesn't require indexing changes and just affects what is returned, it can simply be implemented as an extension to the QueryBuilder.

I'm not sure at this point, but I believe that query joining doesn't actually retrieve the related document, which would also be a nice feature.

Linked issues

is duplicated by

HSEARCH-1631

Implement QueryTime Join into the DSL

relates to

HSEARCH-2263

Use nested objects mapping and parent-child relationship mapping

HSEARCH-2498

Use a generic representation of queries in the DSL

Activity

Yoann Rodière June 19, 2017 at 10:45 AM

Resurrecting this... There's a great thing with joins: you can search for elements matching multiple conditions in collection properties. For example, search for all groups that have a post with a title containing "lucene" and a body containing "solr". Right now with @IndexedEmbedded, it's not possible (see this question on stackoverflow for instance).
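To illustrate the limitation described above, here is a minimal plain-Java sketch. It does not use Hibernate Search or Lucene; it assumes that embedding a collection flattens the fields of all posts into the group's own document, so each condition is evaluated against the whole collection rather than against a single post. The class and method names (EmbeddedCollectionSketch, matchesFlattened, matchesSamePost) are hypothetical.

```java
import java.util.List;

public class EmbeddedCollectionSketch {

    // Hypothetical stand-in for an indexed Post entity.
    static final class Post {
        final String title;
        final String body;

        Post(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    // Flattened matching (assumed behaviour of an embedded collection):
    // each condition may be satisfied by a different post of the group.
    static boolean matchesFlattened(List<Post> posts, String titleTerm, String bodyTerm) {
        boolean anyTitle = posts.stream().anyMatch(p -> p.title.contains(titleTerm));
        boolean anyBody = posts.stream().anyMatch(p -> p.body.contains(bodyTerm));
        return anyTitle && anyBody;
    }

    // Join-like matching: both conditions must hold for the same post.
    static boolean matchesSamePost(List<Post> posts, String titleTerm, String bodyTerm) {
        return posts.stream()
                .anyMatch(p -> p.title.contains(titleTerm) && p.body.contains(bodyTerm));
    }

    public static void main(String[] args) {
        // No single post has "lucene" in its title AND "solr" in its body.
        List<Post> posts = List.of(
                new Post("lucene tips", "about indexing"),
                new Post("search news", "solr release"));

        System.out.println(matchesFlattened(posts, "lucene", "solr")); // prints "true" (false positive)
        System.out.println(matchesSamePost(posts, "lucene", "solr"));  // prints "false"
    }
}
```

The group in the example is a false positive under flattened matching, which is the case a join on the post index would be expected to rule out.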

So I think we definitely need something in the DSL. Also, having a dedicated feature in the DSL would allow for arbitrary joins, which can be useful from time to time.

We could also add a way to do simpler join queries with indexing metadata, but I think it's a separate subject. It may be addressed as part of for instance, since this seems very close to Elasticsearch's `nested` datatype (though not exactly the same): https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.htm

I'm moving this to 6 because I think we definitely need to at least check that it will be doable in 6 (especially with respect to HSEARCH-2498). I would even be tempted to do a 5.9 just for this feature, but we can't keep postponing 6 forever...

Marc Schipperheyn February 11, 2015 at 2:11 AM

So, QueryTimeJoin basically allows you to filter a resultset based on a selection from a specific index that may be different than the one you're querying.

The way it works is that you basically select a single field based on a query and use that to filter on a field in the query you are executing. So in SQL terms, it can be seen as a WHERE myId IN (select myId from ) type query
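The two passes described above can be sketched in plain Java as follows. This is only an emulation of the idea, not the Lucene JoinUtil API: documents are modelled as maps, the first pass plays the role of the sub-select, and the second pass plays the role of the IN filter. All names (QueryTimeJoinSketch, collectJoinValues, filterByJoinValues) and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class QueryTimeJoinSketch {

    // Pass 1: run a query on the other index and collect the values of one field
    // (the "select myId from ..." part).
    static Set<String> collectJoinValues(List<Map<String, String>> otherIndex,
                                         Predicate<Map<String, String>> query,
                                         String field) {
        Set<String> values = new HashSet<>();
        for (Map<String, String> doc : otherIndex) {
            if (query.test(doc) && doc.containsKey(field)) {
                values.add(doc.get(field));
            }
        }
        return values;
    }

    // Pass 2: keep only the documents of the queried index whose field value
    // is among the collected values (the "WHERE myId IN (...)" part).
    static List<Map<String, String>> filterByJoinValues(List<Map<String, String>> queriedIndex,
                                                        String field,
                                                        Set<String> values) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : queriedIndex) {
            if (values.contains(doc.get(field))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> users = List.of(
                Map.of("id", "1", "name", "alice"),
                Map.of("id", "2", "name", "bob"));
        List<Map<String, String>> posts = List.of(
                Map.of("title", "lucene joins", "userId", "1"),
                Map.of("title", "solr facets", "userId", "2"));

        // Values are kept as strings, in line with the text-field limitation.
        Set<String> ids = collectJoinValues(users, u -> "alice".equals(u.get("name")), "id");
        List<Map<String, String>> hits = filterByJoinValues(posts, "userId", ids);
        System.out.println(hits.get(0).get("title")); // prints "lucene joins"
    }
}
```

The extra pass over the other index is where the additional runtime cost of a query time join comes from.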

One thing to realize is that due to current limitations in this Lucene module, the fields that are used to execute the filter have to be text fields.

In terms of API, perhaps this could be defined as such

Marc Schipperheyn April 5, 2014 at 1:25 PM

An interesting article published about the subject: http://blog.seecr.nl/2014/02/24/a-faster-join-for-solrlucene/

Sanne Grinovero February 21, 2014 at 9:46 PM

Hi Marc, agreed, this looks awesome to have.

Having it in the DSL is for sure one way, but I'm wondering if it could be defined in the indexing metadata? We could produce the join query transparently based on the field names.

I'll flag it as 5.1: we have many things on the roadmap already, and I don't think we'll be able to make it earlier. I'd rather have a quick 5.0 than a release that takes ages, but we can start thinking about this in the scope of the internal refactorings.

Marc Schipperheyn February 19, 2014 at 11:40 AM

In Lucene 4.x this has now become standard and performant. I would recommend adding this functionality through the DSL and adding it to the 5.0 roadmap.

Details

Assignee

Unassigned

Reporter

Marc Schipperheyn

Components

Priority

Major

Created November 24, 2012 at 11:36 PM

Updated September 25, 2023 at 2:48 PM