Querying Elasticsearch triggers exception "Result window is too large"

Description

The following is the exact error:

To reproduce the problem: I have an index with 11,228 products and issued a query that returned 10,722 hits. The UI pages the results 25 products at a time, so page 1 returned the first 25 without any problem. Selecting the last page, which is page 429, produced this error.

Pagination parameters are provided to the FullTextQuery by specifying:

This results in 10,700 being the value for the first result.
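The arithmetic behind that value, and the reason it trips the limit, can be sketched as follows. This is only an illustration: ResultWindow and its methods are hypothetical helpers, not part of Hibernate Search or Elasticsearch, and the sketch assumes the index uses the default index.max_result_window of 10,000, with a request rejected when from + size exceeds that value.

```java
// Hypothetical helper for illustration only; not a Hibernate Search or Elasticsearch API.
final class ResultWindow {
    // Assumed default of index.max_result_window.
    static final int DEFAULT_MAX_RESULT_WINDOW = 10_000;

    // Offset of the first hit on a 1-based page.
    static int firstResult(int pageNumber, int pageSize) {
        return (pageNumber - 1) * pageSize;
    }

    // True when from + size goes past the allowed result window.
    static boolean exceedsWindow(int firstResult, int maxResults, int maxResultWindow) {
        return (long) firstResult + maxResults > maxResultWindow;
    }

    public static void main(String[] args) {
        int first = firstResult(429, 25);
        System.out.println("first result: " + first);
        System.out.println("page 1 rejected: " + exceedsWindow(0, 25, DEFAULT_MAX_RESULT_WINDOW));
        System.out.println("page 429 rejected: " + exceedsWindow(first, 25, DEFAULT_MAX_RESULT_WINDOW));
    }
}
```

Under these assumptions, page 1 (0 + 25) stays inside the window, while page 429 (10,700 + 25) does not, even though only 25 hits are requested.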

Stack Trace:

If I eliminate the call to getResultSize() and execute the getResultList() first, I get the same error.

Environment

None

Activity

Guillaume Smet
April 29, 2016, 9:28 AM

This report is interesting: it shows that the problem occurs not only when we request a large number of results, but also when a query matches a large number of results and we want to extract a small part of them starting after the 10,000th result, for instance.

See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html for reference about this limitation.

Chris Cranford
April 29, 2016, 2:43 PM

Right, and I am not sure whether the Scroll API would be the right solution here either. I honestly haven't dug into the ES integration much beyond the user perspective. While the setting can be changed, this will become an issue at some point for users dealing with big data, whatever the purpose. Setting it to 15,000 / 20,000 / 100,000 only delays the inevitable.

Yoann Rodière
September 22, 2016, 2:33 PM

I just investigated the Scroll API to see if we could wire the ScrollableResults from a FullTextQuery (fullTextQuery.scroll()) to some object taking advantage of Elasticsearch's Scroll API. Well, we cannot, because ScrollableResults offers far more methods than what Elasticsearch provides. So using the Scroll API to implement ScrollableResults would mean throwing UnsupportedOperationException in most methods.
Implementing the basic queries (getResultList()/list(), with an offset and a maximum number of results) with the Scroll API is not possible either, or at least not in an efficient way: the Scroll API does not allow using an offset (the from attribute is ignored), so we would have to scroll through every previous result each time a user uses an offset. For the same performance reasons, we cannot use the Scroll API as a fallback when Elasticsearch throws a "Result window is too large" error at us.
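To make the cost concrete, here is a minimal sketch of what emulating an offset on top of a forward-only cursor would look like. CountingCursor and ScrollPaging are hypothetical names used for illustration; the cursor is an in-memory stand-in, not the Elasticsearch client API, and a real scroll would also pay a network round trip per batch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical stand-in for a forward-only scroll cursor (not the Elasticsearch API).
// It only counts how many hits it has delivered so far.
final class CountingCursor<T> implements Iterator<T> {
    private final Iterator<T> hits;
    private int delivered = 0;

    CountingCursor(Iterator<T> hits) {
        this.hits = hits;
    }

    @Override
    public boolean hasNext() {
        return hits.hasNext();
    }

    @Override
    public T next() {
        T hit = hits.next();
        delivered++;
        return hit;
    }

    int delivered() {
        return delivered;
    }
}

final class ScrollPaging {
    // Emulates (firstResult, maxResults) on a cursor that cannot jump to an offset:
    // every hit before firstResult has to be read and thrown away.
    static <T> List<T> page(Iterator<T> cursor, int firstResult, int maxResults) {
        for (int skipped = 0; skipped < firstResult && cursor.hasNext(); skipped++) {
            cursor.next();
        }
        List<T> page = new ArrayList<>();
        while (page.size() < maxResults && cursor.hasNext()) {
            page.add(cursor.next());
        }
        return page;
    }

    public static void main(String[] args) {
        // 10,722 simulated hits, as in the report; the last page starts at offset 10,700.
        CountingCursor<Integer> cursor =
                new CountingCursor<>(IntStream.range(0, 10_722).boxed().iterator());
        List<Integer> lastPage = page(cursor, 10_700, 25);
        System.out.println(lastPage.size() + " hits returned, " + cursor.delivered() + " hits read");
    }
}
```

With the numbers from the report, returning the 22 hits of the last page requires reading all 10,722 hits, and that cost would be paid again on every paged request.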

Also to be noted, increasing the value of index.max_result_window seems to be discouraged for performance reasons: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/breaking_21_search_changes.html#_from_size_limits

Here are the solutions:

  1. implementing the pre-existing Hibernate ORM scroll() method in such a way that it works as usual with the Lucene backend and also works (with unlimited scrolling) with the Elasticsearch backend, except that the scrollable results of the Elasticsearch backend will throw UnsupportedOperationException in most methods (previous(), last(), setRowNumber(int), ...)

  2. implementing the pre-existing Hibernate ORM scroll() method fully for both the Lucene and Elasticsearch backends, using horribly inefficient workarounds for methods not supported by the Elasticsearch Scroll API (previous(), last(), setRowNumber(int), ...).

Personally, I'd be in favor of solution 1. Offering inefficient methods in an API whose primary aim is to process large datasets efficiently makes no sense to me. But some implementors chose to do just that, like H2, so...
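The difference between the two solutions can be sketched on a deliberately reduced interface. Everything below is hypothetical: SimpleScroll is a three-method stand-in, not Hibernate's ScrollableResults, and the Supplier that "reopens" the scroll stands in for issuing a new scroll request. It illustrates the trade-off and is not the code that was eventually merged.

```java
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical, reduced stand-in for a scrollable result set.
interface SimpleScroll<T> {
    boolean next();
    boolean previous();
    T get();
}

// Solution 1: forward-only; anything the cursor cannot do is rejected.
final class ForwardOnlyScroll<T> implements SimpleScroll<T> {
    private final Iterator<T> cursor;
    private T current;

    ForwardOnlyScroll(Iterator<T> cursor) {
        this.cursor = cursor;
    }

    public boolean next() {
        if (!cursor.hasNext()) {
            current = null;
            return false;
        }
        current = cursor.next();
        return true;
    }

    public boolean previous() {
        throw new UnsupportedOperationException("forward-only scroll");
    }

    public T get() {
        return current;
    }
}

// Solution 2: previous() is emulated by reopening the scroll and re-reading from the start.
final class RewindingScroll<T> implements SimpleScroll<T> {
    private final Supplier<Iterator<T>> scrollOpener;
    private Iterator<T> cursor;
    private int position = -1;   // -1 means "before the first row"
    private boolean onRow = false;
    private T current;
    private int hitsFetched = 0; // total hits read, including re-reads

    RewindingScroll(Supplier<Iterator<T>> scrollOpener) {
        this.scrollOpener = scrollOpener;
        this.cursor = scrollOpener.get();
    }

    public boolean next() {
        if (!cursor.hasNext()) {
            if (onRow) {
                position++;      // move to "after the last row" exactly once
                onRow = false;
            }
            current = null;
            return false;
        }
        current = cursor.next();
        position++;
        onRow = true;
        hitsFetched++;
        return true;
    }

    public boolean previous() {
        int target = position - 1;
        cursor = scrollOpener.get(); // start over: the cursor cannot go backwards
        position = -1;
        onRow = false;
        current = null;
        for (int i = 0; i <= target; i++) {
            next();
        }
        return target >= 0;
    }

    public T get() {
        return current;
    }

    int hitsFetched() {
        return hitsFetched;
    }
}
```

In this sketch, a single previous() call from row n re-reads n rows, which is the kind of workaround solution 2 accepts and solution 1 refuses.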

In any case, this will require non-trivial SPI additions (most notably a "scroll" method in org.hibernate.search.query.engine.spi.HSQuery). This might mean that the fix will only be merged in 6.0; I'll have to check with Sanne, I guess.

I'm starting the work on solution 1. Feel free to ping me if you disagree with the whole approach (better now than when I submit a PR).

Yoann Rodière
September 26, 2016, 1:24 PM

I ended up implementing solution 2, mainly because it was easier than filtering out the -engine/-orm tests that should not execute for the Elasticsearch integration.

Assignee

Yoann Rodière

Reporter

Chris Cranford

Labels

None

Suitable for new contributors

None

Feedback Requested

None

Components

Fix versions

Affects versions

Priority

Major