Skip to content

Conversation

KShivendu
Copy link
Member

@KShivendu KShivendu commented Jul 3, 2025

Fixing the bug found in chaos testing debug cluster (reproduced in a test here). It led to a scenario where all the replicas of a shard get marked as Dead on all nodes and they can't recover from each other since every replica is Dead. The root cause was exactly what I hypothesised here.

Put simply if you:

  • Have resharding shard streaming transfer between: n1 shard 3 -> n2 shard 2
  • Then if node n0 dies and also had an Resharding/ReshardingScaleDown replica (say shard 1), we abort resharding and abort any streaming resharding transfers related to n0 (but we don't abort the above mentioned n1:3 -> n2:2 transfer)
  • When the n1:3 -> n2:2 is finished, every node sees that resharding isn't ongoing anymore and hence each node marks their Shard 2 as Dead
  • Optional: Even if n0 now comes back online, it will also redo the same operation and also mark it's shard 2 as Dead (if it exists locally)

This PR adds a simple fix that after ensuring basic checks (like the cluster hasn't gone beyond ReadHashRingCommitted resharding stage), it will abort any resharding transfer (method == ReshardingStreamRecords) ongoing in the cluster.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

@KShivendu
Copy link
Member Author

When the n1:3 -> n2:2 is finished, every node sees that resharding isn't ongoing anymore and hence each node marks their Shard 2 as Dead

This also seems like a problematic logic, other nodes shouldn't mark their own local shard 2 replicas as Dead when the transfer was only finished for n2:2. I'll cross check once.

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

Comment on lines 217 to 225
// We need to abort all resharding transfers
let resharding_transfers = shard_holder
.get_transfers(|t| t.method == Some(ShardTransferMethod::ReshardingStreamRecords));
for transfer in resharding_transfers {
let _is_resharding_transfer = self
.abort_shard_transfer_and_report_resharding(transfer.key(), &shard_holder)
.await?;
// We don't need to abort resharding again since we are already aborting it in next step
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to also 'report to resharding' here? Aren't we already running the abort resharding logic here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same function in used here where we need this info.

/// 4. Remove temp shard, or mark it as dead
pub async fn abort_shard_transfer(
/// Return if it was a resharding transfer so it can be handled correctly (aborted or ignored)
pub async fn abort_shard_transfer_and_report_resharding(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only called _and_report_resharding because it returns a boolean defining whether it was a resharding transfer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Member

@timvisee timvisee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, together with #6881.

@KShivendu is doing another test on the side. Once it passes we can merge.

@KShivendu
Copy link
Member Author

CI passed in our CM test https://github.com/qdrant/cluster-manager/pull/354. Merging this! 🥳

* Apply suggestions to fix resharding dead replicas

* fmt
@KShivendu KShivendu merged commit e14a003 into dev Jul 16, 2025
13 checks passed
@KShivendu KShivendu deleted the fix-resharding-dead-replicas branch July 16, 2025 12:21
generall pushed a commit that referenced this pull request Jul 17, 2025
…esharding (#6800)

* Fix bug that causes all replicas to die if node is restarted during resharding

* fix recursive async problem

* Avoid write lock unless required

* avoid using &mut when & is sufficient

* Remove is_in_progress since check_abort_resharding exists (#6806)

* On resharding abort, only abort transfers related to current operation

* Fix resharding dead replicas improvements (#6881)

* Apply suggestions to fix resharding dead replicas

* fmt

---------

Co-authored-by: timvisee <tim@visee.me>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants