-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix bug that causes all replicas to die if node is restarted during resharding #6800
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This also seems like a problematic logic, other nodes shouldn't mark their own local shard 2 replicas as |
This comment was marked as resolved.
This comment was marked as resolved.
9bdc13d
to
50b476a
Compare
// We need to abort all resharding transfers | ||
let resharding_transfers = shard_holder | ||
.get_transfers(|t| t.method == Some(ShardTransferMethod::ReshardingStreamRecords)); | ||
for transfer in resharding_transfers { | ||
let _is_resharding_transfer = self | ||
.abort_shard_transfer_and_report_resharding(transfer.key(), &shard_holder) | ||
.await?; | ||
// We don't need to abort resharding again since we are already aborting it in next step | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to also 'report to resharding' here? Aren't we already running the abort resharding logic here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same function in used here where we need this info.
/// 4. Remove temp shard, or mark it as dead | ||
pub async fn abort_shard_transfer( | ||
/// Return if it was a resharding transfer so it can be handled correctly (aborted or ignored) | ||
pub async fn abort_shard_transfer_and_report_resharding( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only called _and_report_resharding
because it returns a boolean defining whether it was a resharding transfer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
b4ef5ff
to
9516880
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, together with #6881.
@KShivendu is doing another test on the side. Once it passes we can merge.
CI passed in our CM test https://github.com/qdrant/cluster-manager/pull/354. Merging this! 🥳 |
* Apply suggestions to fix resharding dead replicas * fmt
…esharding (#6800) * Fix bug that causes all replicas to die if node is restarted during resharding * fix recursive async problem * Avoid write lock unless required * avoid using &mut when & is sufficient * Remove is_in_progress since check_abort_resharding exists (#6806) * On resharding abort, only abort transfers related to current operation * Fix resharding dead replicas improvements (#6881) * Apply suggestions to fix resharding dead replicas * fmt --------- Co-authored-by: timvisee <tim@visee.me>
Fixing the bug found in chaos testing debug cluster (reproduced in a test here). It led to a scenario where all the replicas of a shard get marked as Dead on all nodes and they can't recover from each other since every replica is Dead. The root cause was exactly what I hypothesised here.
Put simply if you:
Resharding/ReshardingScaleDown
replica (say shard 1), we abort resharding and abort any streaming resharding transfers related to n0 (but we don't abort the above mentioned n1:3 -> n2:2 transfer)This PR adds a simple fix that after ensuring basic checks (like the cluster hasn't gone beyond
ReadHashRingCommitted
resharding stage), it will abort any resharding transfer (method == ReshardingStreamRecords
) ongoing in the cluster.All Submissions:
dev
branch. Did you create your branch fromdev
?