Tenant IDs containing underscores are not supported with ES

Description

See DocumentIdHelper: if the tenant ID contains an underscore, it is likely that only the part before the underscore will be interpreted as the tenant ID, the remaining part being interpreted as the beginning of the entity ID.

We should escape underscore in the tenant ID, and take these escape into account when extracting back the entity ID.
Also, we should check whether there are other places where we should take this escaping into account (most notably queries).

Environment

None

Activity

Show:
Yoann Rodière
October 27, 2016, 7:57 AM

Some thoughts:

There are multiple strategies for serializing/deserializing the tenant ID / entity ID pair. We could:

  • use an exotic string as separator such as "__HSearch_tenant_id:" (and hope for the best)

  • roll out our own escape/unescape mechanism (`org.apache.commons.lang3.text.translate.CharSequenceTranslator` could help, but this would sill be tricky)

  • serialize the tenant ID / entity ID pair as a JSON array (this would not be

In any case, those strategies should only be used if multi-tenancy is enabled.

Using our own escape/unescape mechanism may result in more readable IDs (if we choose sufficiently exotic characters), but would also be more prone to bugs. We could use off-the-shelf algorithms, though, such as `org.apache.commons.lang3.text.translate.CharSequenceTranslator` (see examples like `org.apache.commons.lang3.StringEscapeUtils.ESCAPE_JAVA` and `org.apache.commons.lang3.StringEscapeUtils.UNESCAPE_JAVA`; it would probably be much simpler in our case), but still, transforming the concatenated string back to a pair of escaped strings won't be easy.

In the end, we may be better served by never trying to interpret the Elasticsearch document ID, and simply using the ID field that was added in a recent PR: https://github.com/hibernate/hibernate-search/pull/1192. That PR has not been merged yet, though.

Guillaume Smet
October 27, 2016, 10:31 AM

note that we don't really need to extract the tenant id as we know its value (the idea of tenant is that we should never access entities from a different tenant). We could simply fix it by removing the tenantId + "_" token from the string instead of trying to parse it.

But in the end, I think my vote goes to considering the _id field as strictly internal to Elasticsearch and never use it.

Yoann Rodière
October 27, 2016, 11:42 AM

You're right, I was making it more complicated than it needs. Glad it doesn't change the solution: the PR is coming.

Assignee

Yoann Rodière

Reporter

Yoann Rodière

Labels

None

Suitable for new contributors

None

Feedback Requested

None

Components

Fix versions

Affects versions

Priority

Major
Configure