Solution for fields derived from large blobs

Description

The content of some indexed fields may be derived from large blobs.

In such cases, it would be convenient, and would open the door to optimizations, to have dedicated solutions, i.e. ones that work on an InputStream/URL/…
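For illustration, a rough sketch of what this typically requires today, assuming a plain ValueBridge that maps the blob to a String field (the class name and charset are made up for the example): the whole content has to be materialized in memory, which is exactly what a dedicated InputStream/URL-aware solution could avoid.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.hibernate.search.mapper.pojo.bridge.ValueBridge;
import org.hibernate.search.mapper.pojo.bridge.runtime.ValueBridgeToIndexedValueContext;

/**
 * Sketch only: with the current contract the indexed value is a plain String,
 * so the entire blob content ends up in memory before the backend sees it.
 */
public class BlobContentBridge implements ValueBridge<InputStream, String> {

    @Override
    public String toIndexedValue(InputStream value, ValueBridgeToIndexedValueContext context) {
        if (value == null) {
            return null;
        }
        try (InputStream in = value) {
            // Materializes the whole blob; a stream/URL-based field solution could avoid this.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```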

See also:

Activity

Yoann Rodière, March 6, 2024 at 4:08 PM

Relatedly, the processing of large blobs itself can take time, e.g. turning text into a vector through an AI model. We might want to approach this not as a field type, but as a “backend-level conversion” instead; critically, one that could be, depending on the use case:

  • batched to minimize overall latency – for remote data retrieval (see the sketch after this list);

  • delayed until the last moment (e.g. some backend queue) to minimize memory usage – for large blob retrieval in general (local filesystem or remote URL, it doesn’t matter).
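To make the batching point concrete, here is a minimal sketch; EmbeddingService and EmbeddingBatcher are hypothetical names for the example, not existing APIs. The point is simply that N pending texts cost one remote round-trip instead of N.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client for a remote embedding model; not an existing API.
interface EmbeddingService {
    // One remote round-trip for the whole batch.
    List<float[]> embedAll(List<String> texts);
}

// Collects texts during indexing and resolves them all in one call,
// so many documents share a single remote round-trip.
final class EmbeddingBatcher {
    private final EmbeddingService service;
    private final List<String> pendingTexts = new ArrayList<>();

    EmbeddingBatcher(EmbeddingService service) {
        this.service = service;
    }

    void enqueue(String text) {
        pendingTexts.add(text);
    }

    List<float[]> flush() {
        List<float[]> vectors = service.embedAll(List.copyOf(pendingTexts));
        pendingTexts.clear();
        return vectors; // same order as the enqueued texts
    }
}
```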

We somewhat discussed this here: https://hibernate.zulipchat.com/#narrow/stream/132092-hibernate-search-dev/topic/Batching.20value.20bridges . But I’m not sure this should be something we mix with bridges after all… A new backend-level component would probably make more sense and allow us to also address large blobs in general. The concept in itself is very similar to value bridges though; just more focused and with batching support.

I’m just thinking out loud, but we could imagine binders registering a “batch process” (name to be changed: extractor, loader, processor?) that has access to fields, with bridges just passing values to that batch process, and backends executing those batch processes later and/or asynchronously to “amend” a document.
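Purely as an illustration of that shape (every type and name below is a placeholder for the idea above, not an actual or proposed API), the component could look roughly like this:

```java
import java.util.List;
import java.util.concurrent.CompletionStage;

/** One value handed over by a bridge, plus a handle to the document to amend later. */
record PendingValue<T>(T rawValue, DocumentHandle document) {}

/** Opaque handle letting the batch process write field values into a not-yet-flushed document. */
interface DocumentHandle {
    void addFieldValue(String fieldPath, Object value);
}

/**
 * Registered by a binder. Bridges only hand raw values to the backend, which
 * accumulates them and invokes execute() later, possibly asynchronously, to
 * turn the whole batch into field values ("amending" the documents).
 */
interface BatchProcess<T> {
    CompletionStage<Void> execute(List<PendingValue<T>> batch);
}
```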

Details

Created September 25, 2023 at 12:19 PM
Updated March 6, 2024 at 4:10 PM