Hibernate-infinispan uses a very inefficient way to perform cache invalidation for bulk operations (JPA CriteriaUpdate/CriteriaDelete). Rather than broadcasting a clear to all nodes in the cluster, the cache is cleared entry by entry. As entity caches are invalidating caches by default, this requires a query to all remote nodes first to collect all keys. These keys are then bundled in a very large message and sent out to all nodes again. During this entire procedure, it seems the cache region is locked on all nodes, causing the entire cluster to stall (I presume this is needed to prevent inserts into the cache between the query and the invalidation phase).
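To make the cost difference concrete, here is a minimal sketch in plain Java that models the two strategies described above. It does not use Hibernate or Infinispan: `ClusterModel`, `invalidateEntryByEntry` and `broadcastClear` are hypothetical names, and the accounting (one key query per node, then one invalidation message per node carrying every collected key) is a simplifying assumption, not a trace of the real protocol.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a cluster of invalidation-mode caches; not Infinispan API.
class ClusterModel {
    // Each node holds only the keys it has cached locally.
    final List<Set<String>> nodes = new ArrayList<>();
    long messages = 0;     // messages exchanged between nodes
    long keysShipped = 0;  // keys carried inside those messages

    ClusterModel(int nodeCount, int keysPerNode) {
        for (int n = 0; n < nodeCount; n++) {
            Set<String> keys = new HashSet<>();
            for (int k = 0; k < keysPerNode; k++) {
                keys.add("node" + n + "-key" + k);
            }
            nodes.add(keys);
        }
    }

    // Strategy from the report: collect all keys, then send them back out.
    void invalidateEntryByEntry() {
        Set<String> allKeys = new HashSet<>();
        for (Set<String> node : nodes) {
            messages++;                  // key query to this node
            keysShipped += node.size();  // its keys travel back to the caller
            allKeys.addAll(node);
        }
        for (Set<String> node : nodes) {
            messages++;                  // one large invalidation message
            keysShipped += allKeys.size();
            node.removeAll(allKeys);
        }
    }

    // Alternative: a single clear command per node that carries no keys.
    void broadcastClear() {
        for (Set<String> node : nodes) {
            messages++;
            node.clear();
        }
    }

    boolean isEmpty() {
        return nodes.stream().allMatch(Set::isEmpty);
    }
}
```

Under these assumptions, a 4-node model with 10,000 keys per node ships 200,000 keys for the entry-by-entry strategy and none for the broadcast clear, which is roughly the effect we observe as the region grows.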
We are seeing this behavior on WildFly 10.1.0, 11.0.0.CR1 and 11 master. The corresponding code in Hibernate is:
The current implementation makes it impossible to perform batch operations on large cache regions (tens of thousands of entries spanning multiple nodes) without blocking the entire cluster for many seconds, sometimes up to a minute. In some places we can change the code to update the entries one by one. In other places, however, this would result in thousands of database queries instead of one, which is far from ideal.
It seems Infinispan lacks a cluster-wide clear command, so I'll be filing a bug report with Infinispan as well. Note that the documentation of Cache.entrySet contains the following sentence: "Use involving execution of this method on a production system is not recommended as they can be quite expensive operations".
WildFly 10.1.0, 11.0.0.CR1, 11.0.0 master
I'd say this is a "won't fix". Switching the configuration to non-transactional solves this (waiting for confirmation). There are a few possible improvements on the logging side, but we'll handle those separately.
Actually BulkOperationCleanupAction calls removeAll which could be implemented in a more efficient way, using cache.clear().
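A rough sketch of the idea, not the actual hibernate-infinispan code: since Infinispan's `Cache` interface extends `java.util.concurrent.ConcurrentMap`, a plain `ConcurrentHashMap` serves as a stand-in here. `RegionSketch` and its method names are hypothetical, and the sketch only counts calls on the map; it does not model locking or cluster traffic.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for a cache region; not the hibernate-infinispan class.
class RegionSketch {
    private final ConcurrentMap<Object, Object> cache;

    RegionSketch(ConcurrentMap<Object, Object> cache) {
        this.cache = cache;
    }

    // Entry-by-entry removal: one remove call per key currently in the cache.
    int removeAllByIteration() {
        int calls = 0;
        for (Object key : cache.keySet()) {
            cache.remove(key);
            calls++;
        }
        return calls;
    }

    // Bulk removal: a single clear call, regardless of the number of entries.
    int removeAllByClear() {
        cache.clear();
        return 1;
    }

    int size() {
        return cache.size();
    }

    // Helper for the example: builds a region pre-filled with `entries` keys.
    static RegionSketch filled(int entries) {
        ConcurrentMap<Object, Object> map = new ConcurrentHashMap<>();
        for (int i = 0; i < entries; i++) {
            map.put("key" + i, "value" + i);
        }
        return new RegionSketch(map);
    }
}
```

For a region created with `RegionSketch.filled(1000)`, `removeAllByIteration()` issues 1000 calls while `removeAllByClear()` issues one; both leave the region empty.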
We've been running a patched WildFly with the changes from https://github.com/rvansa/hibernate-orm/tree/HHH-12036 and non-transactional caches, and I can confirm that this fixes our issues with cache invalidation. We haven't seen any timeouts since last Thursday.
Thanks for the confirmation, I'll file appropriate PRs.
Applied PR upstream.