Configurable QueryPlanCache reference counts (Manuel Dominguez Sarmiento)

Description

Some of our production servers (Hibernate-based apps) have been hanging on full GC sporadically, most of the time after running normally for a few days, sometimes even over a week.

We suspected a memory leak. We used the Eclipse MAT tool to analyze a live heap dump, and found that most of the heap was being used by QueryPlanCache, most specifically, by the soft references held by SoftLimitMRUCache.

We use very large heaps (up to 30 GB in some cases). Since memory is plentiful and the SoftLimitMRUCache is unbounded, the heap eventually fills up until a major stop-the-world GC is necessary to cleanse the SoftLimitMRUCache soft references. We performed several live tests configuring the Concurrent-Mark-Sweep (CMS) GC collector in order to avoid the full GC caused by concurrent mode failures. We played around with the following settings available in the Sun JVM:

-XX:+CMSIncrementalMode
-XX:+CMSIncrementalPacing
-XX:CMSIncrementalDutyCycle=<PCT>
-XX:CMSIncrementalDutyCycleMin=<PCT>
-XX:CMSInitiatingOccupancyFraction=<PCT>
-XX:CMSMarkStackSize=<SIZE>
-XX:CMSMarkStackSizeMax=<SIZE>
-XX:SoftRefLRUPolicyMSPerMB=<MSECS>
-XX:+ParallelRefProcEnabled

Most of these options helped somewhat by allowing soft-reference GC to be performed in parallel, ahead of time before tipping the scales and requiring full GC. However, this did not avoid all problems, and the servers still periodically hang upon concurrent mode failures. These are high-load web servers which process hundreds of hits per second, so full GC is disastrous as garbage cannot be collected fast enough. Full GC would sometimes take over 15 minutes, sometimes it would not even finish doing its job requiring a manual app restart.

Before anyone cries out "well, it's probably the application's fault, why do you have so many different queries? Aren't you using parameterized queries / PreparedStatements?" - the application does in fact produce many, many different queries, but most of them are not reused. Even if the QueryPlanCache is highly effective, most of the absolute number of queries are issued only once. Some use cases are the following:

  • Our system allows ad-hoc reporting and searching capabilities. Each query is typically issued once and never reused.

  • Many of our parameterized queries use IN clauses with variable-length collection/array parameters. I'm unsure whether this affects the cache hit ratio for HQL query plans, but it sure affects native SQL queries, since multiple collection lengths means multiple "?" in the query string, and thus, multiple similar, if slightly different queries polluting the corresponding plan cache.

I'm sure there are other legitimate cases in which an unbounded query plan cache is also a problem, I'm just enumerating the ones we've run into.

This issue is not new. The same problem is described, in one way or another, by HHH-2431, HHH-3191, and HHH-4627. I created a new issue since we've produced a working patch and it would be made more visible, rather than as comments in the previous ones.

The solution involves giving up the unbounded soft-reference-based cache. We introduced two new configuration options:

  • hibernate.query.plan_cache_max_strong_references -> defaults to 128

  • hibernate.query.plan_cache_max_soft_references -> defaults to 2048

Entries are evicted using an LRU policy, or by memory pressure from the GC in the case of soft references. We used 2048 as a default for the soft size since it seems pretty reasonable, but of course it can be tuned to suit the user's needs. Users looking to emulate previous behaviour (we don't see the point, but who knows ...) can set this option to Integer.MAX_VALUE

The patch introduces no new dependencies. It uses Apache Commons Collections' LRUMap (just as the released version does), and does away with the ReferenceMap (which does not support LRU eviction) in order to manage soft references manually on top of an LRUMap.

We see this issue as a top priority and should be applied to the trunk ASAP. SoftLimitMRUCache has seen the most "radical" changes. SimpleMRUCache, QueryPlanCache and Environment only contain minor changes.

Environment

N/A

Activity

Show:
Nick Hristov
January 22, 2011, 1:59 AM

We also have this problem. In our case we have a huge portion of the server memory taken up by this cache ~ 77%, yet most of our queries are parameterized. With a very large heap, servers go down for a while in order to clean up all soft references.

How does this happen? Well, several issues are present:

  • first, queries which contain parameter as a collection get expanded and that resulting expanded query is the key. For example, if we have the following query:

SELECT {u.*} FROM users u WHERE userid IN (:userIds)

it will get expanded by hibernate into (if we give two elements in the set to):

SELECT {u.*} FROM users u WHERE userid IN (:userId0_, :userId1_)

Caching happens against the second query.

  • the second, and more severe problem in my opinion is that NativeSQLQuerySpecification uses the queryReturns as a part of the hashCode() and equals() method... now whether that should be so, that something that you need to decide. But the underlying array objects are NativeSQLQueryRootReturn which does not implement equals() and hashCode()... which means that every single time we get a cache miss.

Nick Hristov
January 26, 2011, 1:07 AM

Looks like the last issue that I was talking about is fixed for hibernate 3.6. Hibernate 3.3 suffers from this problem.

Lari Hotari
October 3, 2011, 6:05 PM

There is a concurrency problem with SoftLimitMRUCache . Please fix issue HHH-5927 .

ArtemA
September 10, 2013, 1:49 PM

i'm experiencing the described behaviour with 4.1.5.SP1. was this issue fixed for that branch or i have to patch it by myself?

PaulF
June 13, 2015, 8:09 AM

Faced this problem too. Hibernate 4.3.5. We face this even harder because we use oracle and so have to use ((a IN (:a1)) OR (a IN (:a2))) where max size of each collection in [a1, a2 ... an] is 1000. And we have to generate these queries dynamically.

Assignee

Steve Ebersole

Reporter

Manuel Dominguez Sarmiento

Fix versions

Labels

None

backPortable

None

Suitable for new contributors

None

Requires Release Note

Affirmative

Pull Request

None

backportDecision

None

Components

Affects versions

Priority

Major
Configure