Improve MassIndexer with Elasticsearch by disabling refresh and replication during indexing

Description

Consider having the MassIndexer set the refresh interval to -1 (i.e. infinite) while mass indexing, then set it back to the default value when it's done.
Also consider changing the number of replicas to 0 while indexing, then back to its previous value when done. That's a bit dangerous, so only do it on initial indexing (a fresh index that is not in use yet?).
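
As a rough illustration of the proposal above, the sketch below builds the JSON bodies that could be sent to the Elasticsearch update-settings endpoint (`PUT /<index>/_settings`) before and after mass indexing. The function names are hypothetical and nothing here reflects how Hibernate Search does or would implement it. It assumes `refresh_interval` and `number_of_replicas` are the relevant index settings, and that sending `null` for a setting resets it to its default, which may depend on the Elasticsearch version.

```python
import json


def bulk_load_settings(disable_replicas=False):
    """Build the settings body to apply before mass indexing (hypothetical helper)."""
    settings = {"refresh_interval": "-1"}  # -1 disables periodic refresh
    if disable_replicas:
        # Risky on an index that is already in use; intended for fresh indexes only.
        settings["number_of_replicas"] = 0
    return {"index": settings}


def restore_settings(previous):
    """Build the settings body to apply once mass indexing is done.

    `previous` holds the values read before indexing started. A missing
    refresh_interval becomes None (JSON null), which is assumed to reset
    the setting to its default.
    """
    restored = {"refresh_interval": previous.get("refresh_interval")}
    if "number_of_replicas" in previous:
        restored["number_of_replicas"] = previous["number_of_replicas"]
    return {"index": restored}


if __name__ == "__main__":
    print(json.dumps(bulk_load_settings(disable_replicas=True)))
    print(json.dumps(restore_settings({"number_of_replicas": 1})))
```

Reading the previous values before changing them, rather than hard-coding defaults, avoids overwriting a replica count the user configured explicitly.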

Environment

None

Activity

Sanne Grinovero
November 4, 2016, 2:06 PM

Our current MassIndexer implementations don't really allow an "index upgrade" but always expect the index to be empty, so I'd always apply this flag.

Sanne Grinovero
April 3, 2017, 3:49 PM

We won't change the replica settings, but we are exploring options to make the Refresh operations lighter.

Sanne Grinovero
June 12, 2017, 3:39 PM

This page documents some interesting options for waiting: ?refresh=wait_for and ?refresh=false, to wait for a specific document operation to be applied, or to explicitly skip waiting for one operation.

While here it mentions "To alter this behavior per operation, the wait_for_active_shards request parameter can be used." -> looks like some tuning can be done on a per-setting basis, maybe more?

In the case of the translog settings, maybe we don't even need this: It's reasonable to expect to log at least an error message when an operation failed; we need to take advantage of the async design of the new Elasticsearch client to keep pushing more indexing operations while performing batch indexing, but still have an error-handler context attached to the older operations which have been sent but not ACKed yet.
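
The idea of pushing more indexing operations while keeping an error-handler context attached to the ones not yet acknowledged can be sketched roughly as follows. This is not the Elasticsearch client API: `send_bulk` and `on_error` are hypothetical callables supplied by the caller, and a thread pool stands in for the client's asynchronous design.

```python
from concurrent.futures import ThreadPoolExecutor


def index_all(batches, send_bulk, on_error, max_in_flight=4):
    """Submit every batch without waiting for earlier ones to be acknowledged.

    At most `max_in_flight` batches are sent concurrently; the rest are queued.
    """
    def report(future, batch):
        # Runs when the batch completes; the batch is the error-handler context.
        error = future.exception()
        if error is not None:
            on_error(batch, error)

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for batch in batches:
            future = pool.submit(send_bulk, batch)
            # Bind the batch now so a late failure can still be traced back to it.
            future.add_done_callback(lambda f, b=batch: report(f, b))
    # Leaving the with-block waits for all outstanding batches to complete.


if __name__ == "__main__":
    def fake_send(batch):
        # Stand-in for a real bulk request; rejects one batch on purpose.
        if "bad" in batch:
            raise ValueError("rejected")

    index_all([["a"], ["bad"], ["c"]], fake_send,
              lambda batch, error: print("failed:", batch, error))
```

The point of binding the batch to the callback is that a failure reported long after submission can still be logged with the operations it concerns.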

Yoann Rodière
June 12, 2017, 4:10 PM
Edited

This page documents some interesting options for waiting: ?refresh=wait_for and ?refresh=false, to wait for a specific document operation to be applied, or to explicitly skip waiting for one operation.

Sure, but we already alter the "refresh" parameter based on our configuration and make sure we only execute explicit refreshes when really necessary. We don't use "wait_for", though. Not sure why we would... ?

One thing we could do is skip refreshes altogether when mass indexing, independently of the "refresh_after_write" parameter? Not sure it would change much (because we already send operations in bulk), but it's worth a try.

While here it mentions "To alter this behavior per operation, the wait_for_active_shards request parameter can be used." -> looks like some tuning can be done on a per-setting basis, maybe more?

The default also seems to be the most efficient (only wait for the primary shard). But maybe we could add request parameters when mass indexing just in case the user increased the value of the index setting... ?
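
To make the request-parameter idea concrete, here is a minimal sketch of how a bulk request URL could carry both `refresh` and `wait_for_active_shards` as query parameters. The helper name, base URL and index name are made up for illustration; it only builds the URL and does not send anything.

```python
from urllib.parse import urlencode


def bulk_url(base_url, index, refresh=None, wait_for_active_shards=None):
    """Build a _bulk URL with optional per-request overrides (hypothetical helper)."""
    params = {}
    if refresh is not None:
        params["refresh"] = refresh  # "true", "false" or "wait_for"
    if wait_for_active_shards is not None:
        # Overrides the index-level setting for this request only.
        params["wait_for_active_shards"] = str(wait_for_active_shards)
    url = f"{base_url.rstrip('/')}/{index}/_bulk"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    # Mass indexing: skip refresh, wait for the primary shard only.
    print(bulk_url("http://localhost:9200", "books",
                   refresh="false", wait_for_active_shards=1))
```

Passing `wait_for_active_shards=1` explicitly would keep mass indexing waiting on the primary shard only, even if the user raised the index-level setting.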

In the case of the translog settings, maybe we don't even need this: It's reasonable to expect to log at least an error message when an operation failed; we need to take advantage of the async design of the new Elasticsearch client to keep pushing more indexing operations while performing batch indexing, but still have an error-handler context attached to the older operations which have been sent but not ACKed yet.

Sure:

Yoann Rodière
July 31, 2017, 9:38 AM
Edited

Note that I addressed part of the issue in my PR for HSEARCH-2764, by disabling any kind of explicit refresh in stream works (except before a DeleteByQuery work) and by making stream work execution parallel.
Maybe it will be enough to close this ticket.

Assignee

Unassigned

Reporter

Emmanuel Bernard

Labels

None

Suitable for new contributors

None

Pull Request

None

Feedback Requested

None

Components

Fix versions

Priority

Major