-
Notifications
You must be signed in to change notification settings - Fork 25.4k
Implement ccr file restore #37130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ccr file restore #37130
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tbrooks8. My main concern is with the large amount of code copied from BlobStoreRepository, which will be difficult to maintain. Can you investigate ways of reusing code?
.../main/java/org/elasticsearch/xpack/ccr/action/repositories/GetCcrRestoreFileChunkAction.java
Outdated
Show resolved
Hide resolved
.../main/java/org/elasticsearch/xpack/ccr/action/repositories/GetCcrRestoreFileChunkAction.java
Outdated
Show resolved
Hide resolved
...main/java/org/elasticsearch/xpack/ccr/action/repositories/GetCcrRestoreFileChunkRequest.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/repository/CcrRepository.java
Outdated
Show resolved
Hide resolved
@@ -133,6 +133,17 @@ public synchronized void closeSession(String sessionUUID) { | |||
IOUtils.closeWhileHandlingException(restore); | |||
} | |||
|
|||
// TODO: The Engine.IndexCommitRef might be closed by a different thread while it is in use. We need to | |||
// look into the implications of this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the RecoveryTarget
object, we could use some ref-counting here (see AbstractRefCounted
)
x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/repository/CcrRepository.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some generic comments
private final ThreadPool threadPool; | ||
|
||
@Inject | ||
public TransportGetCcrRestoreFileChunkAction(ThreadPool threadPool, TransportService transportService, ActionFilters actionFilters, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a ctor to HandledTransportAction that allows you to specify the threadpool instead of forking internally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to do this in a follow up? There are some tricky components to this for a generic solution (such as stashing headers to the new thread, etc). And questions about whether we want to do the action filter on a new thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I follow. Why is this action different from any other?
.../main/java/org/elasticsearch/xpack/ccr/action/repositories/GetCcrRestoreFileChunkAction.java
Outdated
Show resolved
Hide resolved
.../main/java/org/elasticsearch/xpack/ccr/action/repositories/GetCcrRestoreFileChunkAction.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/repository/CcrRepository.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some more comments. We are getting there! thanks for the iterations
...plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/repository/CcrRestoreSourceService.java
Show resolved
Hide resolved
// Should not access this method while holding global lock as that might block the cluster state | ||
// update thread on IO if it calls afterIndexShardClosed | ||
assert Thread.holdsLock(CcrRestoreSourceService.this) == false : "Should not hold CcrRestoreSourceService lock"; | ||
if (cachedInput != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the cachedInput
I think what we should do here is to have a Map<String,IndexInput>
and close it after a read if (in.getFilePointer() == in.length())
Also I wonder if the sync on this method is necessary or if we rather should not synchronized but instead have a KeyedLock<String>
. This way we can implement this:
try (Releasable lock = keyedLock.tryAcquire(fileName)) {
if (lock == null) {
throw new IllegalStateException("can't read from the same file on the same session concurrently");
}
// do the read
}
this would give us the right guarantees rather than hiding bugs.
|
||
int bytesRequested = (int) Math.min(remainingBytes, len); | ||
String fileName = fileToRecover.name(); | ||
GetCcrRestoreFileChunkRequest request = new GetCcrRestoreFileChunkRequest(node, sessionUUID, fileName, bytesRequested); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should get back the offset from the remote call and then assert that is matches with the offset we have stored in pos
This would add another level of safety.
I implemented @s1monw changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM left 1 comment
...plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/repository/CcrRestoreSourceService.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good. I've left some nits and one minor comment about the offset
field.
|
||
@Override | ||
public DiscoveryNode getPreferredTargetNode() { | ||
return node; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe assert here that node is not-null?
...plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/repository/CcrRestoreSourceService.java
Outdated
Show resolved
Hide resolved
...plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/repository/CcrRestoreSourceService.java
Outdated
Show resolved
Hide resolved
...plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/repository/CcrRestoreSourceService.java
Outdated
Show resolved
Hide resolved
...plugin/ccr/src/main/java/org/elasticsearch/xpack/ccr/repository/CcrRestoreSourceService.java
Show resolved
Hide resolved
|
||
public static class GetCcrRestoreFileChunkResponse extends ActionResponse { | ||
|
||
private final long offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
offset is typically the beginning of the range that is sent. Having it different here is a potential source for confusion in the future. I see two options: 1) Name it differently or 2) make it the offset that represents the beginning of the range that is sent. I'm strongly favoring 2), which is a small change to make, and provides the same validation on the receiver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went with number 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. You'll need to merge latest master here to get CI passing.
run gradle build tests 1 |
* master: (28 commits) Introduce retention lease serialization (elastic#37447) Update Delete Watch to allow unknown fields (elastic#37435) Make finalize step of recovery source non-blocking (elastic#37388) Update the default for include_type_name to false. (elastic#37285) Security: remove SSL settings fallback (elastic#36846) Adding mapping for hostname field (elastic#37288) Relax assertSameDocIdsOnShards assertion Reduce recovery time with compress or secure transport (elastic#36981) Implement ccr file restore (elastic#37130) Fix Eclipse specific compilation issue (elastic#37419) Performance fix. Reduce deprecation calls for the same bulk request (elastic#37415) [ML] Use String rep of Version in map for serialisation (elastic#37416) Cleanup Deadcode in Rest Tests (elastic#37418) Mute IndexShardRetentionLeaseTests.testCommit elastic#37420 unmuted test Remove unused index store in directory service Improve CloseWhileRelocatingShardsIT (elastic#37348) Fix ClusterBlock serialization and Close Index API logic after backport to 6.x (elastic#37360) Update the scroll example in the docs (elastic#37394) Update analysis.asciidoc (elastic#37404) ...
This is related to elastic#35975. It implements a file based restore in the CcrRepository. The restore transfers files from the leader cluster to the follower cluster. It does not implement any advanced resiliency features at the moment. Any request failure will end the restore.
This is related to #35975. It implements a file based restore in the CcrRepository. The restore transfers files from the leader cluster to the follower cluster. It does not implement any advanced resiliency features at the moment. Any request failure will end the restore.
A recent refactoring (#37130) where imports got mixed up (changing Lucene's IndexNotFoundException to Elasticsearch's IndexNotFoundException) led to many warnings being logged in case of restoring a fresh snapshot.
A recent refactoring (#37130) where imports got mixed up (changing Lucene's IndexNotFoundException to Elasticsearch's IndexNotFoundException) led to many warnings being logged in case of restoring a fresh snapshot.
A recent refactoring (#37130) where imports got mixed up (changing Lucene's IndexNotFoundException to Elasticsearch's IndexNotFoundException) led to many warnings being logged in case of restoring a fresh snapshot.
A recent refactoring (#37130) where imports got mixed up (changing Lucene's IndexNotFoundException to Elasticsearch's IndexNotFoundException) led to many warnings being logged in case of restoring a fresh snapshot.
This is related to #35975. It implements a file based restore in the
CcrRepository
. The restore transfers files from the leader clusterto the follower cluster. It does not implement any advanced resiliency
features at the moment. Any request failure will end the restore.