Align vector similarity naming between backends

Description

We have Hibernate Search, Lucene, Elasticsearch and OpenSearch all using different names for vector similarity functions; let’s review these and map them to some common values.

We might also want to add a table to the docs explaining what maps to what in different backends.

Activity


Yoann Rodière January 8, 2024 at 10:12 AM

hmm I think it may be a “no” on adding the missing similarity functions :

Looks like they didn’t close the issue, and Lucene added something to prevent the negative score that seems to be the main blocker for OpenSearch to support this, so… I’d say it’s a “maybe, one day”

soooo with that said, let’s keep the enum and just throw an exception with an OpenSearch distribution if something unsupported is used.

+1

Marko Bekhta January 8, 2024 at 9:27 AM

I’d personally prefer MAX_INNER_PRODUCT

yeah +1 on that

Maybe they intend to add them later?

hmm I think it may be a “no” on adding the missing similarity functions :

also just for reference, here’s where the spaces (“similarity functions”) are added to a Lucene lib in OpenSearch https://github.com/opensearch-project/k-NN/blob/271df52ea5d95d0f3b5f8e8b984878ba4b23b97b/src/main/java/org/opensearch/knn/index/util/Lucene.java#L45

soooo with that said, let’s keep the enum and just throw an exception with an OpenSearch distribution if something unsupported is used.
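To illustrate the "keep the enum and throw" option, here is a minimal sketch. The class and enum below are hypothetical stand-ins, not the actual Hibernate Search types, and a standard IllegalArgumentException stands in for whatever exception the backend would really throw. The two OpenSearch names are the ones listed in the comparison table in this issue.

```java
// Hypothetical sketch: these names are illustrative, not Hibernate Search API.
final class OpenSearchSimilarityCheck {

    // Stand-in for the common similarity enum discussed in this issue.
    enum VectorSimilarity { L2, DOT_PRODUCT, COSINE, MAX_INNER_PRODUCT }

    // Translates a common similarity into the OpenSearch name,
    // failing fast for functions OpenSearch does not support.
    static String toOpenSearchName(VectorSimilarity similarity) {
        switch (similarity) {
            case L2:
                return "l2";
            case COSINE:
                return "cosinesimil";
            default:
                // Assumption: a plain IllegalArgumentException is used here
                // only to keep the sketch dependency-free.
                throw new IllegalArgumentException(
                        "Vector similarity '" + similarity
                                + "' is not supported with an OpenSearch distribution");
        }
    }

    private OpenSearchSimilarityCheck() {
    }
}
```

With this shape, adding a function later (if OpenSearch ever supports it) only means adding one more case.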

Yoann Rodière January 8, 2024 at 9:01 AM
Edited

The naming you’re suggesting looks good to me. I’d personally prefer MAX_INNER_PRODUCT over MAXIMUM_INNER_PRODUCT, but it’s not really important.

Alternatively… I was thinking about replacing that enum with a string and just passing it through to the backend. With OpenSearch/Elasticsearch it’ll go straight into the mapping and we’ll let them validate the value; as for the Lucene backend, we can just do something like VectorSimilarityFunction.valueOf(..)

Seems like a good idea to make this future proof. You could have a VectorSimilarities class holding constants, similar to org.hibernate.search.engine.backend.types.IndexFieldTraits/ org.hibernate.search.backend.elasticsearch.types.ElasticsearchIndexFieldTraits, with backend-specific names in the backend-specific classes. DEFAULT would probably become an empty string, and wouldn’t have a dedicated constant anymore.

However, IMO the move to Strings would be more to anticipate future options that are made available in Elasticsearch only… I think it’s unlikely that a similarity function added to Lucene will stay unavailable in Elasticsearch for long. And I’d be tempted to keep DOT_PRODUCT/MAX_INNER_PRODUCT in the “commonly supported” options in the engine, since they are available in Lucene and Elasticsearch, so it looks more like OpenSearch is lagging behind to me. I guess the important thing to understand in order to take this decision is why these options are not available in OpenSearch (yet). Maybe they intend to add them later?
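For comparison, the string-based alternative could look roughly like the sketch below. Everything in it is an assumption made for illustration: LuceneFunctionStandIn is a local stand-in for Lucene's VectorSimilarityFunction enum (so the sketch compiles without Lucene on the classpath), and LuceneVectorSimilarities is a hypothetical constants holder in the spirit of IndexFieldTraits.

```java
// Hypothetical sketch of the "pass a string through" idea.
final class StringSimilaritySketch {

    // Local stand-in for org.apache.lucene.index.VectorSimilarityFunction.
    enum LuceneFunctionStandIn { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT }

    // Hypothetical backend-specific constants holder.
    static final class LuceneVectorSimilarities {
        static final String EUCLIDEAN = "EUCLIDEAN";
        static final String DOT_PRODUCT = "DOT_PRODUCT";
        static final String COSINE = "COSINE";
        static final String MAXIMUM_INNER_PRODUCT = "MAXIMUM_INNER_PRODUCT";

        private LuceneVectorSimilarities() {
        }
    }

    // Lucene backend: resolve the string with valueOf(..) and report
    // unknown names with a clearer message.
    static LuceneFunctionStandIn resolveForLucene(String name) {
        try {
            return LuceneFunctionStandIn.valueOf(name);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown vector similarity '" + name + "'", e);
        }
    }

    // Elasticsearch/OpenSearch backend: the string goes into the mapping
    // unchanged and the cluster validates it.
    static String resolveForElasticsearch(String name) {
        return name;
    }

    private StringSimilaritySketch() {
    }
}
```

The trade-off is visible here: the Lucene side still fails early on a typo, while the Elasticsearch/OpenSearch side only fails once the mapping is sent to the cluster.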

Marko Bekhta January 5, 2024 at 6:00 PM

| Hibernate Search (suggested change) | Hibernate Search (current) | Lucene | Elasticsearch | OpenSearch |
| --- | --- | --- | --- | --- |
| L2 | L2 | EUCLIDEAN | L2_NORM | l2 |
| DOT_PRODUCT | INNER_PRODUCT | DOT_PRODUCT | DOT_PRODUCT | not supported |
| COSINE | COSINE | COSINE | COSINE: float&index version such that vectors are normalized (NORMALIZED_VECTOR_COSINE = def(8_500_005, Version.LUCENE_9_8_0)): DOT_PRODUCT, COSINE otherwise | cosinesimil |
| MAXIMUM_INNER_PRODUCT (add to the list) | - | MAXIMUM_INNER_PRODUCT | MAX_INNER_PRODUCT | not supported |

Initially I wasn’t sure whether MAX_INNER_PRODUCT was supported by Elasticsearch and OpenSearch, which is why I didn’t add it; I assumed the other three would work fine. As it turned out, OpenSearch supports neither dot-product nor max-inner-product for the Lucene engine. Since we are going to throw exceptions when a similarity function is not supported anyway, I suppose we can add max inner product to the list and apply the name changes.

Alternatively… I was thinking about replacing that enum with a string and just passing it through to the backend. With OpenSearch/Elasticsearch it’ll go straight into the mapping and we’ll let them validate the value; as for the Lucene backend, we can just do something like VectorSimilarityFunction.valueOf(..)
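The name mapping from the table in this comment could be captured in code along these lines. This is only a sketch: SimilarityNameTable, VectorSimilarity and Backend are hypothetical names, the common names follow the "suggested change" column (with MAX_INNER_PRODUCT, the shorter spelling preferred later in the thread), and the backend names are copied as they appear in the table.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types are actual Hibernate Search API.
final class SimilarityNameTable {

    // Common names, following the "suggested change" column.
    enum VectorSimilarity { L2, DOT_PRODUCT, COSINE, MAX_INNER_PRODUCT }

    // The three backends compared in the table.
    enum Backend { LUCENE, ELASTICSEARCH, OPENSEARCH }

    private static final Map<VectorSimilarity, Map<Backend, String>> NAMES =
            new EnumMap<>(VectorSimilarity.class);

    static {
        // A null OpenSearch name encodes "not supported" from the table.
        register(VectorSimilarity.L2, "EUCLIDEAN", "L2_NORM", "l2");
        register(VectorSimilarity.DOT_PRODUCT, "DOT_PRODUCT", "DOT_PRODUCT", null);
        register(VectorSimilarity.COSINE, "COSINE", "COSINE", "cosinesimil");
        register(VectorSimilarity.MAX_INNER_PRODUCT,
                "MAXIMUM_INNER_PRODUCT", "MAX_INNER_PRODUCT", null);
    }

    private static void register(VectorSimilarity similarity,
            String lucene, String elasticsearch, String openSearch) {
        Map<Backend, String> perBackend = new EnumMap<>(Backend.class);
        perBackend.put(Backend.LUCENE, lucene);
        perBackend.put(Backend.ELASTICSEARCH, elasticsearch);
        if (openSearch != null) {
            perBackend.put(Backend.OPENSEARCH, openSearch);
        }
        NAMES.put(similarity, perBackend);
    }

    // Returns the backend-specific name, or empty if the backend
    // does not support the given similarity.
    static Optional<String> nameFor(VectorSimilarity similarity, Backend backend) {
        return Optional.ofNullable(NAMES.get(similarity).get(backend));
    }

    private SimilarityNameTable() {
    }
}
```

A lookup table like this could also be used to generate the docs table mentioned in the description.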

Fixed


Created December 12, 2023 at 5:55 PM
Updated January 24, 2024 at 2:23 PM
Resolved January 23, 2024 at 12:22 PM