Uploaded image for project: 'Hibernate Search'
  1. HSEARCH-2674

Move the "active shards" state out of ShardIdentifierProvider

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.x-backlog
    • Component/s: engine
    • Labels:
      None

      Description

      Currently, when we have to query a given entity with dynamic sharding enabled, we determine the index managers to query by a simple request to the DynamicIndexShardingStrategy, which basically delegates to a ShardIdentifierProvider. This class internally holds a list of all the shards that are relevant for a given entity.

      This is nice because it allows different entities to share the same "logical" index, but have a different ShardIdentifierProvider instance, meaning you can have some shards that are exclusive to some entities and won't even be queried for other entities.

      The problem is, this prevents Hibernate Search to ever add a shard independently from the ShardIdentifierProvider. This is necessary for instance with JMS/JGroups, where we have multiple instances of a ShardIdentifierProvider over multiple servers, each with its own state. We would need to replicate this state, creating new shards on the master when we detect they are needed on the slave. See HSEARCH-1886 Closed an example of issue that arise if we don't.

      To be more concrete, the slave calls org.hibernate.search.store.IndexShardingStrategy.getIndexManagerForAddition(Class<?>, Serializable, String, Document) on its own sharding strategy, which adds a new shard to the list of available shards, and returns the index manager. Then we know which index manager to target, and we send its name along with a list of works to the master. On the master, ideally we'd want to retrieve the index manager, and apply the works directly. But we have to call getIndexManagerForAddition(Class<?>, Serializable, String, Document) for each and every work, because it is the only way to update the internal state of the DynamicIndexShardingStrategy/ShardIdentifierProvider.

      A perhaps better solution would be to hold the state of the ShardIdentifierProvider (a set of strings representing shard identifiers) out of the ShardIdentifierProvider. We would have to change its methods, though:

      • void initialize(Properties properties, BuildContext buildContext); would now return a Set<String> (the initial state)
      • getShardIdentifier, getShardIdentifierForDeletion and getShardIdentifierForQuery would all take a Set<String> as a parameter (the current state, unmodifiable)
        => Actually, it would be even better if if we could make those methods state-independent. getShardIdentifier probable doesn't need this state, and we could change getShardIdentifierForDeletion as well as getShardIdentifierForQuery, making them return something like an Optional<Set<String>>, absent meaning "no explicit filter, just target all shards" and present meaning "explicit shard filter, only target those shards". Or something more elegant, but which would make the difference between "I don't know, just target everything" and "I know, only target a subset".
        The reason is, for some indexes (e.g. Elasticsearch) we can query all the indexes without worrying about which index exist (by targeting types instead), allowing to get rid of shard state completely if we can toake the difference between "I don't know, just target everything" and "I know, only target a subset": targeting all the shards doesn't require to know which shards exist, and targeting a specific set of shards doesn't either. See HSEARCH-2725 Open for more information.
      • getAllShardIdentifiers would not be needed anymore

      When state is actually needed, it could be stored in DynamicShardingStrategy for example (which is an implementation class). Then we could add an getOrCreateIndexManager(String shardIdentifier) method to this class, which would update the state independently from the ShardIdentifierProvider, allowing for external synchronization in the JGroups/JMS case.

      Note that for this change to be useful for JMS/JGroups, we would also need to change the way we group tasks for these: we would need to group the works by entityType + shardIdentifier, instead of indexNameBase + shardIdentifier currently. That's because the DynamicShardingStrategies are defined per entity, not per index.

        Attachments

          Issue links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                yrodiere Yoann Rodière
              • Votes:
                0 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated: