-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Abort transfer on force peer delete #6507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Abort transfer on force peer delete #6507
Conversation
@@ -461,6 +461,26 @@ impl TableOfContent { | |||
false | |||
} | |||
|
|||
pub async fn abort_peer_transfers(&self, peer_id: PeerId) -> CollectionResult<()> { | |||
for collection in self.collections.read().await.values() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote to abort both to and from. Because it also does not make sense to transfer to a peer that is being removed.
Then I recommend to also explicitly abort the shard transfer task. Because removing it from the state here, does not stop the ongoing transfer task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because removing it from the state here, does not stop the ongoing transfer task
Ahh right. I didn't realise that register_abort_transfer()
is local and it's not pushing it to consensus. Thanks :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(reposted as a separate comment)
src/actix/api/cluster_api.rs
Outdated
// If we reach here, either there are no shards or force removal is requested | ||
if has_shards { | ||
log::warn!("Aborting peer transfers for peer {peer_id} before force removal"); | ||
toc.abort_peer_transfers(dispatcher.clone(), access, peer_id) | ||
.await?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this should be done differently. Instead of explicitly submitting ShardTransferOperations::Abort
in the API, we should modify the code that is called when RemovePeer
consensus operation is applied, and make it abort transfers as part of RemovePeer
operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this a lot better! Simple and effective.
* Test for reproducing force delete peer bug * drop seed port * use larger transfers to ensure killing before transfer is completed * simplify test cases * kill and wait for same commit before force delete * Don't kill to avoid breaking consesnsus before force delete peer * Abort transfer on force peer delete (#6507) * Abort transfer on force peer delete * ToC changes to abort internally * Abort shard transfer using consensus not just locally * fmt * fix fmt * Abort transfers internally when shard is being removed * remove unused imports
* Test for reproducing force delete peer bug * drop seed port * use larger transfers to ensure killing before transfer is completed * simplify test cases * kill and wait for same commit before force delete * Don't kill to avoid breaking consesnsus before force delete peer * Abort transfer on force peer delete (#6507) * Abort transfer on force peer delete * ToC changes to abort internally * Abort shard transfer using consensus not just locally * fmt * fix fmt * Abort transfers internally when shard is being removed * remove unused imports
Force deletes to a peer should also abort all shards transfers from this peer. Otherwise, they can remain stuck there forever.
This makes the test introduced in #6505 to pass
All Submissions:
dev
branch. Did you create your branch fromdev
?