Solution for fields derived from large blobs

Description

The content of some indexed fields may be derived from large blobs.

In such cases, it would be convenient, and would open the door to optimizations, to have dedicated solutions, i.e. ones that work on an InputStream/URL/…
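For illustration, a rough sketch of what this typically requires today, assuming a plain ValueBridge that maps the blob to a String field (the class name and charset are made up for the example): the whole content has to be materialized in memory, which is exactly what a dedicated InputStream/URL-aware solution could avoid.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.hibernate.search.mapper.pojo.bridge.ValueBridge;
import org.hibernate.search.mapper.pojo.bridge.runtime.ValueBridgeToIndexedValueContext;

/**
 * Sketch only: with the current contract the indexed value is a plain String,
 * so the entire blob content ends up in memory before the backend sees it.
 */
public class BlobContentBridge implements ValueBridge<InputStream, String> {

    @Override
    public String toIndexedValue(InputStream value, ValueBridgeToIndexedValueContext context) {
        if (value == null) {
            return null;
        }
        try (InputStream in = value) {
            // Materializes the whole blob; a stream/URL-based field solution could avoid this.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```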

See also:

Activity

Yoann Rodière, March 6, 2024 at 4:08 PM

Relatedly, the processing of large blobs itself can take time, e.g. turning text into a vector through an AI model. We might want to approach this not as a field type, but as a “backend-level conversion” instead; critically, one that could be, depending on the use case:

  • batched to minimize overall latency – for remote data retrieval (see the sketch after this list);

  • delayed until the last moment (e.g. some backend queue) to minimize memory usage – for large blob retrieval in general (local filesystem or remote URL, it doesn’t matter).
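To make the batching point concrete, here is a minimal sketch; EmbeddingService and EmbeddingBatcher are hypothetical names for the example, not existing APIs. The point is simply that N pending texts cost one remote round-trip instead of N.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client for a remote embedding model; not an existing API.
interface EmbeddingService {
    // One remote round-trip for the whole batch.
    List<float[]> embedAll(List<String> texts);
}

// Collects texts during indexing and resolves them all in one call,
// so many documents share a single remote round-trip.
final class EmbeddingBatcher {
    private final EmbeddingService service;
    private final List<String> pendingTexts = new ArrayList<>();

    EmbeddingBatcher(EmbeddingService service) {
        this.service = service;
    }

    void enqueue(String text) {
        pendingTexts.add(text);
    }

    List<float[]> flush() {
        List<float[]> vectors = service.embedAll(List.copyOf(pendingTexts));
        pendingTexts.clear();
        return vectors; // same order as the enqueued texts
    }
}
```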

We somewhat discussed this here: https://hibernate.zulipchat.com/#narrow/stream/132092-hibernate-search-dev/topic/Batching.20value.20bridges . But I’m not sure this should be something we mix with bridges after all… A new backend-level component would probably make more sense and allow us to also address large blobs in general. The concept in itself is very similar to value bridges though; just more focused and with batching support.

I’m just thinking out loud, but we could imagine binders registering a “batch process” (name to be changed: extractor, loader, processor?) that has access to fields, with bridges just passing values to that batch process, and backends executing those batch processes later and/or asynchronously to “amend” a document.
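Purely as an illustration of that shape (every type and name below is a placeholder for the idea above, not an actual or proposed API), the component could look roughly like this:

```java
import java.util.List;
import java.util.concurrent.CompletionStage;

/** One value handed over by a bridge, plus a handle to the document to amend later. */
record PendingValue<T>(T rawValue, DocumentHandle document) {}

/** Opaque handle letting the batch process write field values into a not-yet-flushed document. */
interface DocumentHandle {
    void addFieldValue(String fieldPath, Object value);
}

/**
 * Registered by a binder. Bridges only hand raw values to the backend, which
 * accumulates them and invokes execute() later, possibly asynchronously, to
 * turn the whole batch into field values ("amending" the documents).
 */
interface BatchProcess<T> {
    CompletionStage<Void> execute(List<PendingValue<T>> batch);
}
```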

Details

Created September 25, 2023 at 12:19 PM
Updated March 6, 2024 at 4:10 PM