Timeouts start when the request is submitted to the client, leading to systematic timeout when client is under heavy load

Description

The problem is twofold:

  1. We have arbitrary timeouts when executing requests, and those timeouts start when we submit the request to the client rather than when the request starts executing. Before HSEARCH-2764, this timeout is set up by RestClient.performRequest (the synchronous form), after we set it up ourselves in DefaultElasticsearchClient.

  2. Retry timeouts (managed by RestClient) also start when we submit requests rather than when the request starts executing. See https://github.com/elastic/elasticsearch/issues/25951
    I added a test case for item 2 on my copy of the repo: https://github.com/yrodiere/hibernate-search/tree/HSEARCH-2836 . Note this will no longer be a problem when we upgrade to version 7 of the Elasticsearch client, since they removed this retry feature: https://github.com/elastic/elasticsearch/issues/25951#issuecomment-460929775

Activity

Show:

Sanne Grinovero August 3, 2017 at 10:13 AM

Thanks for the clarifications, understand it now.

Yoann Rodière July 31, 2017 at 5:09 PM

Sorry, the link to the Elasticsearch issue was wrong; the correct link is https://github.com/elastic/elasticsearch/issues/25951

> If we're talking about the timeout option which the end user can set in its own application, then I'm not sure I'd classify this as a bug?

In the "standard" case it's indeed what we want, but not when mass indexing.

When we're mass indexing, each work will take quite some time before it can even be sent to the Elasticsearch server because we pile up requests in a queue. This delay before we send the request is a delay we expect, a delay we can't estimate, and more importantly a delay that doesn't affect the total mass indexing execution time. So taking this delay into consideration in the timeout is not a good idea.

> Some systems set a timeout because they have to provide a reply within some t. After t they have to return with a failure and it's not really relevant what happened: the point is that it could not be performed in that time limit.

Well that's the point in general, but not always. Timeouts can be used for two reasons:

  • because after the timeout, it's simply too late, the client will abort anyway and executing the work won't achieve anything.

  • OR because after the timeout, we can reasonably assume that something went wrong and the work will never terminate.

When mass indexing, we won't abort just because there's a 10 second delay before a given work finishes, especially not if this delay was caused by the work waiting in a client-side queue.

Note that I'm not saying we should change the way timeouts work for every situation, just that they should be made more flexible.

> Often in such cases it's also useful to cancel any further pending work to save resources from being utilized for a task which is no longer being requested.

Yeah, I tried to implement it in HSEARCH-2764, but unfortunately the Elasticsearch Rest client doesn't provide APIs to cancel requests. I suppose I'll have to send them a PR someday.

Sanne Grinovero July 31, 2017 at 3:46 PM

If we're talking about the timeout option which the end user can set in its own application, then I'm not sure I'd classify this as a bug?

Some systems set a timeout because they have to provide a reply within some t. After t they have to return with a failure and it's not really relevant what happened: the point is that it could not be performed in that time limit.

Often in such cases it's also useful to cancel any further pending work to save resources from being utilized for a task which is no longer being requested.

Fixed

Details

Assignee

Reporter

Components

Sprint

Fix versions

Priority

Created July 28, 2017 at 3:25 PM
Updated November 3, 2020 at 10:19 AM
Resolved October 8, 2020 at 4:04 PM