
Conversation

Tim-Brooks
Contributor

This is related to #35975. It implements a file-based restore in the
CcrRepository. The restore transfers files from the leader cluster
to the follower cluster. It does not implement any advanced resiliency
features at the moment. Any request failure will end the restore.
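
At a high level, each file is transferred from the leader in chunks requested over the remote connection. A rough sketch of the shape of that loop follows; FileInfo, chunkSize, remoteClient, and writeFileChunk are illustrative assumptions rather than the PR's actual code, while the GetCcrRestoreFileChunk request mirrors the one shown later in the review.

// Illustrative outline of the per-file transfer; names are hypothetical.
for (FileInfo fileToRecover : filesToRestore) {
    long pos = 0;
    while (pos < fileToRecover.length()) {
        int bytesRequested = (int) Math.min(fileToRecover.length() - pos, chunkSize);
        GetCcrRestoreFileChunkRequest request =
                new GetCcrRestoreFileChunkRequest(node, sessionUUID, fileToRecover.name(), bytesRequested);
        GetCcrRestoreFileChunkResponse response =
                remoteClient.execute(GetCcrRestoreFileChunkAction.INSTANCE, request).actionGet();
        writeFileChunk(fileToRecover, pos, response.getChunk()); // write the bytes into the follower's store
        pos += response.getChunk().length();
    }
}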

@Tim-Brooks Tim-Brooks added >non-issue v7.0.0 :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v6.7.0 labels Jan 4, 2019
@Tim-Brooks Tim-Brooks requested a review from bleskes January 4, 2019 00:44
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@Tim-Brooks Tim-Brooks requested a review from ywelsch January 4, 2019 00:44

@ywelsch ywelsch left a comment

Thanks @tbrooks8. My main concern is with the large amount of code copied from BlobStoreRepository, which will be difficult to maintain. Can you investigate ways of reusing code?

@@ -133,6 +133,17 @@ public synchronized void closeSession(String sessionUUID) {
IOUtils.closeWhileHandlingException(restore);
}

// TODO: The Engine.IndexCommitRef might be closed by a different thread while it is in use. We need to
// look into the implications of this.
Contributor

Similar to the RecoveryTarget object, we could use some ref-counting here (see AbstractRefCounted)
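
For readers unfamiliar with the pattern, here is a rough sketch of what ref-counting the session could look like, loosely modeled on how RecoveryTarget extends AbstractRefCounted. The class and member names are illustrative assumptions, and the AbstractRefCounted signatures are recalled from memory of the API at the time; this is not the PR's actual code.

// Illustrative sketch only; RestoreSession and readFileChunk are hypothetical names.
class RestoreSession extends AbstractRefCounted {
    private final Engine.IndexCommitRef commitRef;

    RestoreSession(Engine.IndexCommitRef commitRef) {
        super("restore-session");
        this.commitRef = commitRef;
    }

    void readFileChunk(String fileName, byte[] buffer) throws IOException {
        if (tryIncRef() == false) {
            throw new IllegalStateException("restore session is already closed");
        }
        try {
            // read the next bytes of fileName from the commit into buffer
        } finally {
            decRef();
        }
    }

    @Override
    protected void closeInternal() {
        // runs only once the last reference has been released, so the commit ref
        // cannot be closed out from under a concurrent read
        IOUtils.closeWhileHandlingException(commitRef);
    }
}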

@s1monw s1monw left a comment

left some generic comments

private final ThreadPool threadPool;

@Inject
public TransportGetCcrRestoreFileChunkAction(ThreadPool threadPool, TransportService transportService, ActionFilters actionFilters,
Contributor

can you add a ctor to HandledTransportAction that allows you to specify the threadpool instead of forking internally?
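
(For context, a rough sketch of the "forking internally" pattern referred to here, i.e. the action hopping onto the generic thread pool inside doExecute before doing blocking work; the readChunk helper and exact signatures below are assumptions for illustration, not the PR's code.)

// Illustrative fragment only; readChunk is a hypothetical helper.
@Override
protected void doExecute(Task task, GetCcrRestoreFileChunkRequest request,
                         ActionListener<GetCcrRestoreFileChunkResponse> listener) {
    // hop off the transport thread before doing blocking file I/O
    threadPool.generic().execute(() -> {
        try {
            listener.onResponse(readChunk(request));
        } catch (Exception e) {
            listener.onFailure(e);
        }
    });
}
// The suggestion above would instead let a HandledTransportAction constructor take an
// executor name, so the base class performs this dispatch (and any thread-context
// handling) in one shared place.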

Contributor Author

I would like to do this in a follow-up? There are some tricky components to a generic solution (such as stashing thread-context headers onto the new thread), and there are questions about whether we want to run the action filters on the new thread.

Contributor

not sure I follow. Why is this action different from any other?

@Tim-Brooks Tim-Brooks requested review from ywelsch and s1monw January 11, 2019 02:04
@s1monw s1monw left a comment

left some more comments. We are getting there! Thanks for the iterations.

// Should not access this method while holding global lock as that might block the cluster state
// update thread on IO if it calls afterIndexShardClosed
assert Thread.holdsLock(CcrRestoreSourceService.this) == false : "Should not hold CcrRestoreSourceService lock";
if (cachedInput != null) {
Contributor

I don't like the cachedInput. I think what we should do here is have a Map<String, IndexInput> and close an input after a read once in.getFilePointer() == in.length(). Also, I wonder whether the synchronization on this method is necessary, or whether we should drop the synchronized and instead use a KeyedLock<String>. This way we can implement this:

try (Releasable lock = keyedLock.tryAcquire(fileName)) {
  if (lock == null) {
    throw new IllegalStateException("can't read from the same file on the same session concurrently");
  }
  // do the read
}

this would give us the right guarantees rather than hiding bugs.
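
To make the combined suggestion concrete, here is a rough sketch of both ideas together: per-file IndexInputs kept in a map and closed once fully read, guarded by a KeyedLock instead of synchronizing the whole method. The field and helper names (openInputs, openInput, readFileBytes) are illustrative assumptions, not the PR's code.

// Illustrative sketch of the suggestion above; names are hypothetical.
private final Map<String, IndexInput> openInputs = new ConcurrentHashMap<>();
private final KeyedLock<String> keyedLock = new KeyedLock<>();

long readFileBytes(String fileName, byte[] buffer) throws IOException {
    try (Releasable lock = keyedLock.tryAcquire(fileName)) {
        if (lock == null) {
            throw new IllegalStateException("can't read from the same file on the same session concurrently");
        }
        IndexInput in = openInputs.get(fileName);
        if (in == null) {
            in = openInput(fileName); // open the file from the commit on first access
            openInputs.put(fileName, in);
        }
        long offset = in.getFilePointer();
        in.readBytes(buffer, 0, buffer.length);
        if (in.getFilePointer() == in.length()) {
            // fully read: close the input and drop it from the map
            openInputs.remove(fileName);
            in.close();
        }
        return offset;
    }
}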


int bytesRequested = (int) Math.min(remainingBytes, len);
String fileName = fileToRecover.name();
GetCcrRestoreFileChunkRequest request = new GetCcrRestoreFileChunkRequest(node, sessionUUID, fileName, bytesRequested);
Contributor

I think we should get back the offset from the remote call and then assert that it matches the offset we have stored in pos. This would add another level of safety.
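
A small sketch of where such a check would sit on the requesting side, with pos tracking how many bytes of the file have already been restored; getOffset() and writeFileChunk are assumed names, not the PR's code.

// Illustrative fragment; getOffset() and writeFileChunk are hypothetical names.
// The leader reports where the returned chunk starts; it must line up with the
// position tracked locally before the bytes are written.
assert response.getOffset() == pos : "leader offset " + response.getOffset() + " != local pos " + pos;
writeFileChunk(fileToRecover, pos, response.getChunk());
pos += response.getChunk().length();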

@Tim-Brooks
Contributor Author

I implemented @s1monw's changes.

@Tim-Brooks Tim-Brooks requested a review from s1monw January 12, 2019 02:08
@s1monw s1monw left a comment

LGTM, left 1 comment.

@ywelsch ywelsch left a comment

Looks very good. I've left some nits and one minor comment about the offset field.


@Override
public DiscoveryNode getPreferredTargetNode() {
return node;
Contributor

maybe assert here that node is not null?


public static class GetCcrRestoreFileChunkResponse extends ActionResponse {

private final long offset;
Contributor

offset is typically the beginning of the range that is sent. Having it mean something different here is a potential source of confusion in the future. I see two options: 1) name it differently, or 2) make it the offset that represents the beginning of the range that is sent. I'm strongly in favor of 2), which is a small change to make and provides the same validation on the receiver.

Contributor Author

Went with number 2.

@ywelsch ywelsch left a comment

LGTM. You'll need to merge latest master here to get CI passing.

@Tim-Brooks
Contributor Author

run gradle build tests 1

@Tim-Brooks Tim-Brooks merged commit 5c68338 into elastic:master Jan 14, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 15, 2019
* master: (28 commits)
  Introduce retention lease serialization (elastic#37447)
  Update Delete Watch to allow unknown fields (elastic#37435)
  Make finalize step of recovery source non-blocking (elastic#37388)
  Update the default for include_type_name to false. (elastic#37285)
  Security: remove SSL settings fallback (elastic#36846)
  Adding mapping for hostname field (elastic#37288)
  Relax assertSameDocIdsOnShards assertion
  Reduce recovery time with compress or secure transport (elastic#36981)
  Implement ccr file restore (elastic#37130)
  Fix Eclipse specific compilation issue (elastic#37419)
  Performance fix. Reduce deprecation calls for the same bulk request (elastic#37415)
  [ML] Use String rep of Version in map for serialisation (elastic#37416)
  Cleanup Deadcode in Rest Tests (elastic#37418)
  Mute IndexShardRetentionLeaseTests.testCommit elastic#37420
  unmuted test
  Remove unused index store in directory service
  Improve CloseWhileRelocatingShardsIT (elastic#37348)
  Fix ClusterBlock serialization and Close Index API logic after backport to 6.x (elastic#37360)
  Update the scroll example in the docs (elastic#37394)
  Update analysis.asciidoc (elastic#37404)
  ...
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this pull request Jan 21, 2019
This is related to elastic#35975. It implements a file based restore in the
CcrRepository. The restore transfers files from the leader cluster
to the follower cluster. It does not implement any advanced resiliency
features at the moment. Any request failure will end the restore.
Tim-Brooks added a commit that referenced this pull request Jan 21, 2019
This is related to #35975. It implements a file based restore in the
CcrRepository. The restore transfers files from the leader cluster
to the follower cluster. It does not implement any advanced resiliency
features at the moment. Any request failure will end the restore.
ywelsch added a commit that referenced this pull request Mar 20, 2019
A recent refactoring (#37130) where imports got mixed up (changing Lucene's
IndexNotFoundException to Elasticsearch's IndexNotFoundException) led to many warnings being
logged in case of restoring a fresh snapshot.
@Tim-Brooks Tim-Brooks deleted the ccr_file_recovery branch December 18, 2019 14:47