Improve the MassIndexer with Elasticsearch by temporarily disabling refresh and replication

Description

Consider having the MassIndexer set the refresh interval to -1 (i.e. infinite, which disables periodic refresh) while mass indexing, then set it back to the default value when it's done.
Also consider changing the number of replicas to 0 while indexing and back to its original value when done. That's a bit dangerous, so only do it on initial indexing (a fresh index that is not in use yet?).
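
As a rough sketch of the idea (not the Hibernate Search implementation): both settings can be changed through the Elasticsearch update-index-settings API (`PUT /<index>/_settings`), where a `refresh_interval` of `-1` disables periodic refresh and `null` resets it to the default. The `send` callable and the `mass_indexing_settings` helper are hypothetical names used for illustration only; the example records the requests instead of contacting a cluster.

```python
import json
from contextlib import contextmanager


def settings_request(index, settings):
    """Build (method, path, body) for the update-index-settings API."""
    body = json.dumps({"index": settings}, sort_keys=True)
    return ("PUT", f"/{index}/_settings", body)


@contextmanager
def mass_indexing_settings(send, index, restore_refresh=None, replicas=None):
    """Disable periodic refresh (and optionally replicas) inside the block.

    send: hypothetical callable (method, path, body) performing the HTTP call.
    restore_refresh: value to restore afterwards; None resets to the default.
    replicas: original replica count; pass it only for a fresh, unused index.
    """
    tuned = {"refresh_interval": "-1"}
    restored = {"refresh_interval": restore_refresh}
    if replicas is not None:
        tuned["number_of_replicas"] = 0
        restored["number_of_replicas"] = replicas
    send(*settings_request(index, tuned))
    try:
        yield
    finally:
        # Restore even if indexing failed, so refresh is not left disabled.
        send(*settings_request(index, restored))


# Usage: record the requests that would be sent around a mass indexing run.
sent = []
with mass_indexing_settings(lambda *req: sent.append(req), "books", replicas=1):
    pass  # the mass indexer would run here
```

Restoring in a `finally` block is a deliberate choice in this sketch: a failed run should not leave the index without refresh or replicas.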

Activity

Yoann Rodière, July 31, 2017 at 9:38 AM (edited)

Note that I addressed part of the issue in my PR for HSEARCH-2764, by disabling any kind of explicit refresh in stream works (except before a DeleteByQuery work) and by making stream work execution parallel.
Maybe it will be enough to close this ticket.

Yoann Rodière, June 12, 2017 at 4:10 PM (edited)

> This page documents some interesting options for wait: a ?refresh=wait_for and ?refresh=false options to wait for a specific document operation to be applied, or explicitly skip waiting for one operation.

Sure, but we already alter the "refresh" parameter based on our configuration, and we make sure we only execute explicit refreshes when really necessary. We don't use "wait_for", though. Not sure why we would?

One thing we could do is skip refresh altogether when mass indexing, independently of the "refresh_after_write" parameter. Not sure it would change much (because we already send operations in bulk), but it's worth a try.

> While here it mentions "To alter this behavior per operation, the wait_for_active_shards request parameter can be used." -> looks like some tuning can be done on a per-setting base, maybe more?

The default also seems to be the most efficient (only wait for the primary shard). But maybe we could add request parameters when mass indexing, just in case the user increased the value of the index setting?

> In the case of the translog settings, maybe we don't even need this: It's reasonable to expect to log at least an error message when an operation failed; we need to take advantage of the async design of the new Elasticsearch client to keep pushing more indexing operations while performing batch indexing, but still have an error-handler context attached to the older operations which have been sent but not ACKed yet.

Sure: https://hibernate.atlassian.net/browse/HSEARCH-2764#icft=HSEARCH-2764

Sanne Grinovero, June 12, 2017 at 3:39 PM

This page documents some interesting options for waiting: the ?refresh=wait_for and ?refresh=false options, to wait for a specific document operation to be applied or to explicitly skip waiting for one operation.
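
To make the per-request `refresh` options concrete, here is a minimal sketch that builds a `_bulk` request with an explicit `refresh` value. It assumes a recent Elasticsearch version without mapping types (older versions also expect a `_type` in the action line); `bulk_request` is a hypothetical helper, and nothing is sent over the network.

```python
import json


def bulk_request(index, docs, refresh="false"):
    """Build (method, path, body) for one _bulk call indexing docs.

    docs: mapping of document id -> source dict.
    refresh: "false" does not trigger a refresh, "wait_for" replies only
             once a refresh has made the changes visible, "true" forces an
             immediate refresh of the affected shards.
    """
    if refresh not in ("true", "false", "wait_for"):
        raise ValueError(f"unsupported refresh value: {refresh!r}")
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk body is newline-delimited JSON and must end with a newline.
    return ("POST", f"/_bulk?refresh={refresh}", "\n".join(lines) + "\n")


# Usage: a mass indexing batch that explicitly skips the refresh.
method, path, body = bulk_request("books", {"1": {"title": "Dune"}})
```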

While here it mentions "To alter this behavior per operation, the wait_for_active_shards request parameter can be used." -> looks like some tuning can be done on a per-setting basis, maybe more?
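
A sketch of what such a per-request override might look like on a bulk call, assuming the `wait_for_active_shards` query parameter accepts either a positive shard count or `all`. `bulk_path` is a hypothetical helper that only builds the request path.

```python
from urllib.parse import urlencode


def bulk_path(refresh=None, wait_for_active_shards=None):
    """Build the _bulk path, adding only the overrides that were requested."""
    params = {}
    if refresh is not None:
        params["refresh"] = refresh
    if wait_for_active_shards is not None:
        valid = wait_for_active_shards == "all" or (
            isinstance(wait_for_active_shards, int)
            and wait_for_active_shards >= 1
        )
        if not valid:
            raise ValueError("expected a positive shard count or 'all'")
        params["wait_for_active_shards"] = wait_for_active_shards
    query = urlencode(params)
    return "/_bulk?" + query if query else "/_bulk"


# Usage: wait for the primary shard only, whatever the index setting says.
path = bulk_path(refresh="false", wait_for_active_shards=1)
```

Passing the parameter explicitly would make the mass indexer independent of a stricter index-level setting, at the cost of overriding what the user configured.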

In the case of the translog settings, maybe we don't even need this: It's reasonable to expect to log at least an error message when an operation failed; we need to take advantage of the async design of the new Elasticsearch client to keep pushing more indexing operations while performing batch indexing, but still have an error-handler context attached to the older operations which have been sent but not ACKed yet.
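
The pattern described here, pushing new batches while earlier ones are still un-ACKed and keeping an error handler attached to each of them, can be sketched as follows. This is an illustration with Python's `asyncio`, not the Elasticsearch client's API: `send` and `on_error` are hypothetical callables, and `max_in_flight` is an assumed tuning knob.

```python
import asyncio


async def index_all(batches, send, on_error, max_in_flight=4):
    """Send every batch, keeping up to max_in_flight requests pending.

    send: hypothetical coroutine function; returns once the batch is ACKed.
    on_error: callback receiving (batch, exception) for each failed batch.
    """
    slots = asyncio.Semaphore(max_in_flight)

    async def run(batch):
        try:
            await send(batch)
        except Exception as exc:
            # The handler still knows which batch failed, even though
            # later batches may already have been submitted.
            on_error(batch, exc)
        finally:
            slots.release()

    tasks = []
    for batch in batches:
        await slots.acquire()  # waits only when max_in_flight are pending
        tasks.append(asyncio.create_task(run(batch)))
    await asyncio.gather(*tasks)
```

Bounding the number of pending requests keeps memory use predictable while still overlapping network round trips.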

Sanne Grinovero, April 3, 2017 at 3:49 PM

We won't change the replica settings, but we are exploring options to make the Refresh operations lighter.

Sanne Grinovero, November 4, 2016 at 2:06 PM

Our current MassIndexer implementations don't really allow an "index upgrade" but always expect the index to be empty, so I'd always apply this flag.

Details

Created November 4, 2016 at 1:34 PM
Updated September 25, 2023 at 2:48 PM
