We're updating the issue view to help you get more done. 

Move the "active shards" state out of ShardIdentifierProvider

Description

Currently, when we have to query a given entity with dynamic sharding enabled, we determine the index managers to query by a simple request to the DynamicIndexShardingStrategy, which basically delegates to a ShardIdentifierProvider. This class internally holds a list of all the shards that are relevant for a given entity.

This is nice because it allows different entities to share the same "logical" index, but have a different ShardIdentifierProvider instance, meaning you can have some shards that are exclusive to some entities and won't even be queried for other entities.

The problem is, this prevents Hibernate Search to ever add a shard independently from the ShardIdentifierProvider. This is necessary for instance with JMS/JGroups, where we have multiple instances of a ShardIdentifierProvider over multiple servers, each with its own state. We would need to replicate this state, creating new shards on the master when we detect they are needed on the slave. See an example of issue that arise if we don't.

To be more concrete, the slave calls org.hibernate.search.store.IndexShardingStrategy.getIndexManagerForAddition(Class<?>, Serializable, String, Document) on its own sharding strategy, which adds a new shard to the list of available shards, and returns the index manager. Then we know which index manager to target, and we send its name along with a list of works to the master. On the master, ideally we'd want to retrieve the index manager, and apply the works directly. But we have to call getIndexManagerForAddition(Class<?>, Serializable, String, Document) for each and every work, because it is the only way to update the internal state of the DynamicIndexShardingStrategy/ShardIdentifierProvider.

A perhaps better solution would be to hold the state of the ShardIdentifierProvider (a set of strings representing shard identifiers) out of the ShardIdentifierProvider. We would have to change its methods, though:

  • void initialize(Properties properties, BuildContext buildContext); would now return a Set<String> (the initial state)

  • getShardIdentifier, getShardIdentifierForDeletion and getShardIdentifierForQuery would all take a Set<String> as a parameter (the current state, unmodifiable)
    => Actually, it would be even better if if we could make those methods state-independent. getShardIdentifier probable doesn't need this state, and we could change getShardIdentifierForDeletion as well as getShardIdentifierForQuery, making them return something like an Optional<Set<String>>, absent meaning "no explicit filter, just target all shards" and present meaning "explicit shard filter, only target those shards". Or something more elegant, but which would make the difference between "I don't know, just target everything" and "I know, only target a subset".
    The reason is, for some indexes (e.g. Elasticsearch) we can query all the indexes without worrying about which index exist (by targeting types instead), allowing to get rid of shard state completely if we can toake the difference between "I don't know, just target everything" and "I know, only target a subset": targeting all the shards doesn't require to know which shards exist, and targeting a specific set of shards doesn't either. See for more information.

  • getAllShardIdentifiers would not be needed anymore

When state is actually needed, it could be stored in DynamicShardingStrategy for example (which is an implementation class). Then we could add an getOrCreateIndexManager(String shardIdentifier) method to this class, which would update the state independently from the ShardIdentifierProvider, allowing for external synchronization in the JGroups/JMS case.

Note that for this change to be useful for JMS/JGroups, we would also need to change the way we group tasks for these: we would need to group the works by entityType + shardIdentifier, instead of indexNameBase + shardIdentifier currently. That's because the DynamicShardingStrategies are defined per entity, not per index.

Environment

None

Status

Assignee

Unassigned

Reporter

Yoann Rodière

Labels

None

Suitable for new contributors

None

Pull Request

None

Feedback Requested

None

Components

Fix versions

Priority

Major