Fix bug that causes all replicas to die if node is restarted during resharding #6800

KShivendu · 2025-07-03T08:18:47Z

This fixes the bug found in the chaos testing debug cluster (reproduced in a test here). It led to a scenario where all replicas of a shard get marked as Dead on all nodes, and they can't recover from each other because no healthy replica is left to recover from. The root cause was exactly what I hypothesised here.

Put simply, the failure sequence is:

1. A resharding streaming transfer is in progress: n1 shard 3 -> n2 shard 2.
2. Node n0 dies while it also holds a Resharding/ReshardingScaleDown replica (say shard 1). We abort resharding and abort any streaming resharding transfers related to n0, but we don't abort the n1:3 -> n2:2 transfer mentioned above.
3. When the n1:3 -> n2:2 transfer finishes, every node sees that resharding is no longer ongoing, so each node marks its shard 2 as Dead.
4. Optional: even if n0 comes back online, it redoes the same operation and also marks its shard 2 as Dead (if it exists locally).
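
The sequence above can be replayed on a toy cluster model. This is a minimal sketch, not Qdrant's code: the `Cluster`, `Transfer`, and `ReplicaState` types and all three methods are hypothetical stand-ins, and the rule "mark every replica of the target shard Dead once resharding is no longer in progress" is a simplification of the behaviour described in the list.

```rust
use std::collections::HashMap;

type NodeId = u32;
type ShardId = u32;

// Simplified stand-in for replica states (assumption: the real set is larger).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplicaState {
    Active,
    Resharding,
    Dead,
}

// A streaming resharding transfer between two (node, shard) replicas.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Transfer {
    from: (NodeId, ShardId),
    to: (NodeId, ShardId),
}

#[derive(Debug)]
struct Cluster {
    resharding_in_progress: bool,
    transfers: Vec<Transfer>,
    replicas: HashMap<(NodeId, ShardId), ReplicaState>,
}

impl Cluster {
    // Pre-fix behaviour as described: resharding is aborted, but only the
    // transfers that involve the failed node are cancelled.
    fn abort_resharding_for_failed_node(&mut self, failed: NodeId) {
        self.resharding_in_progress = false;
        self.transfers
            .retain(|t| t.from.0 != failed && t.to.0 != failed);
    }

    // A leftover transfer finishes. Resharding is no longer in progress,
    // so every node marks its replica of the target shard as Dead.
    fn finish_transfer(&mut self, index: usize) {
        let transfer = self.transfers.remove(index);
        if !self.resharding_in_progress {
            let target_shard = transfer.to.1;
            for (key, state) in self.replicas.iter_mut() {
                if key.1 == target_shard {
                    *state = ReplicaState::Dead;
                }
            }
        }
    }

    // True if every replica of `shard` in the model is Dead.
    fn all_replicas_dead(&self, shard: ShardId) -> bool {
        self.replicas
            .iter()
            .filter(|(key, _)| key.1 == shard)
            .all(|(_, state)| *state == ReplicaState::Dead)
    }
}

// Replays the scenario: n0 dies, then n1:3 -> n2:2 finishes.
fn simulate_failure() -> Cluster {
    let mut cluster = Cluster {
        resharding_in_progress: true,
        transfers: vec![Transfer { from: (1, 3), to: (2, 2) }],
        replicas: HashMap::from([
            ((0, 1), ReplicaState::Resharding),
            ((0, 2), ReplicaState::Active),
            ((1, 2), ReplicaState::Active),
            ((1, 3), ReplicaState::Active),
            ((2, 2), ReplicaState::Active),
        ]),
    };
    cluster.abort_resharding_for_failed_node(0); // n0 dies
    cluster.finish_transfer(0); // n1:3 -> n2:2 was never aborted
    cluster
}
```

Calling `simulate_failure()` and then `all_replicas_dead(2)` on the result shows the end state of the model: no transfer is left, and no replica of shard 2 is available as a recovery source.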

This PR adds a simple fix: once the basic checks pass (for example, that the cluster hasn't gone beyond the ReadHashRingCommitted resharding stage), aborting resharding also aborts every resharding transfer (method == ReshardingStreamRecords) ongoing in the cluster.
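
The shape of that fix can be sketched in isolation. This is an illustrative model under stated assumptions, not the code in this PR: `ReshardingStage`, `TransferMethod`, `Transfer`, `AbortError`, and `abort_resharding` are hypothetical simplified types, and the stage ordering is assumed only so that "beyond ReadHashRingCommitted" can be written as a comparison.

```rust
// Assumed stage ordering, earliest first (hypothetical simplification).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ReshardingStage {
    MigratingPoints,
    ReadHashRingCommitted,
    WriteHashRingCommitted,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TransferMethod {
    StreamRecords,
    Snapshot,
    ReshardingStreamRecords,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Transfer {
    from_node: u32,
    to_node: u32,
    method: Option<TransferMethod>,
}

#[derive(Debug, PartialEq, Eq)]
enum AbortError {
    // Resharding has progressed too far to be aborted.
    TooLate(ReshardingStage),
}

// Aborts resharding in the model: after the stage check, every transfer
// whose method is ReshardingStreamRecords is removed, regardless of which
// nodes it runs between. Returns the aborted transfers.
fn abort_resharding(
    stage: ReshardingStage,
    transfers: &mut Vec<Transfer>,
) -> Result<Vec<Transfer>, AbortError> {
    if stage > ReshardingStage::ReadHashRingCommitted {
        return Err(AbortError::TooLate(stage));
    }

    // Split the list into resharding transfers (aborted) and the rest (kept).
    let (aborted, kept): (Vec<Transfer>, Vec<Transfer>) = std::mem::take(transfers)
        .into_iter()
        .partition(|t| t.method == Some(TransferMethod::ReshardingStreamRecords));
    *transfers = kept;
    Ok(aborted)
}
```

In this model the n1 -> n2 transfer is aborted even though neither endpoint is the failed node, which is the case the earlier logic missed. Non-resharding transfers are left untouched.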

All Submissions:

Contributions should target the dev branch. Did you create your branch from dev?
Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

KShivendu · 2025-07-03T08:19:38Z

> When the n1:3 -> n2:2 is finished, every node sees that resharding isn't ongoing anymore and hence each node marks their Shard 2 as Dead

This logic also seems problematic: other nodes shouldn't mark their own local shard 2 replicas as Dead when the transfer only finished for n2:2. I'll cross-check this.

timvisee · 2025-07-15T15:08:47Z

lib/collection/src/collection/resharding.rs

+        // We need to abort all resharding transfers
+        let resharding_transfers = shard_holder
+            .get_transfers(|t| t.method == Some(ShardTransferMethod::ReshardingStreamRecords));
+        for transfer in resharding_transfers {
+            let _is_resharding_transfer = self
+                .abort_shard_transfer_and_report_resharding(transfer.key(), &shard_holder)
+                .await?;
+            // We don't need to abort resharding again since we are already aborting it in next step
+        }


Is it necessary to also 'report to resharding' here? Aren't we already running the abort resharding logic here?

The same function is used here, where we need this info.

timvisee · 2025-07-15T15:19:11Z

lib/collection/src/collection/shard_transfer.rs

-    /// 4. Remove temp shard, or mark it as dead
-    pub async fn abort_shard_transfer(
+    /// Return if it was a resharding transfer so it can be handled correctly (aborted or ignored)
+    pub async fn abort_shard_transfer_and_report_resharding(


Is this only called _and_report_resharding because it returns a boolean indicating whether it was a resharding transfer?
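
The pattern under discussion, one abort function that returns a boolean which some callers use and others ignore, can be sketched as follows. This is a hedged illustration only: `Transfer`, `TransferMethod`, the `String` error type, and the two caller functions are hypothetical, and the real function is async and takes a transfer key and shard holder rather than a plain `Vec`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TransferMethod {
    StreamRecords,
    ReshardingStreamRecords,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Transfer {
    id: u32,
    method: Option<TransferMethod>,
}

// Aborts one transfer and reports whether it was a resharding transfer,
// so the caller can decide how to handle it.
fn abort_transfer_and_report_resharding(
    transfers: &mut Vec<Transfer>,
    id: u32,
) -> Result<bool, String> {
    let index = transfers
        .iter()
        .position(|t| t.id == id)
        .ok_or_else(|| format!("transfer {id} not found"))?;
    let transfer = transfers.remove(index);
    Ok(transfer.method == Some(TransferMethod::ReshardingStreamRecords))
}

// Caller that needs the flag: aborting a single resharding transfer also
// ends the resharding operation in this model.
fn abort_single_transfer(
    transfers: &mut Vec<Transfer>,
    id: u32,
    resharding_active: &mut bool,
) -> Result<(), String> {
    let is_resharding_transfer = abort_transfer_and_report_resharding(transfers, id)?;
    if is_resharding_transfer {
        *resharding_active = false;
    }
    Ok(())
}

// Caller that ignores the flag: resharding is already being aborted, so
// there is nothing more to decide per transfer.
fn abort_resharding(
    transfers: &mut Vec<Transfer>,
    resharding_active: &mut bool,
) -> Result<(), String> {
    let ids: Vec<u32> = transfers
        .iter()
        .filter(|t| t.method == Some(TransferMethod::ReshardingStreamRecords))
        .map(|t| t.id)
        .collect();
    for id in ids {
        let _is_resharding_transfer = abort_transfer_and_report_resharding(transfers, id)?;
    }
    *resharding_active = false;
    Ok(())
}
```

Keeping one function with a reported flag avoids duplicating the abort steps across both call sites, at the cost of a longer name and an ignored return value in the second caller.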

timvisee

Looks good, together with #6881.

@KShivendu is doing another test on the side. Once it passes we can merge.

KShivendu · 2025-07-16T12:19:51Z

CI passed in our CM test https://github.com/qdrant/cluster-manager/pull/354. Merging this! 🥳

…esharding (#6800)

* Fix bug that causes all replicas to die if node is restarted during resharding
* fix recursive async problem
* Avoid write lock unless required
* avoid using &mut when & is sufficient
* Remove is_in_progress since check_abort_resharding exists (#6806)
* On resharding abort, only abort transfers related to current operation
* Fix resharding dead replicas improvements (#6881)
* Apply suggestions to fix resharding dead replicas
* fmt

---------

Co-authored-by: timvisee <tim@visee.me>

timvisee reviewed Jul 3, 2025

KShivendu mentioned this pull request Jul 4, 2025

Remove is_in_progress since check_abort_resharding exists #6806

Merged

3 tasks

KShivendu force-pushed the fix-resharding-dead-replicas branch from 9bdc13d to 50b476a Compare July 4, 2025 12:08

timvisee reviewed Jul 15, 2025

KShivendu mentioned this pull request Jul 16, 2025

Fix resharding dead replicas improvements #6881

Merged

3 tasks

KShivendu and others added 6 commits July 16, 2025 15:38

Fix bug that causes all replicas to die if node is restarted during resharding

5480e60

fix recursive async problem

83c9294

Avoid write lock unless required

2de0c24

avoid using &mut when & is sufficient

d3197e5

Remove is_in_progress since check_abort_resharding exists (#6806)

fbb1122

On resharding abort, only abort transfers related to current operation

9516880

KShivendu force-pushed the fix-resharding-dead-replicas branch from b4ef5ff to 9516880 Compare July 16, 2025 10:08

github-actions bot mentioned this pull request Jul 16, 2025

Flaky test tests::snapshot_test::test_snapshot_collection_normal #5133

Open

timvisee approved these changes Jul 16, 2025

Fix resharding dead replicas improvements (#6881)

e53e354

* Apply suggestions to fix resharding dead replicas * fmt

KShivendu merged commit e14a003 into dev Jul 16, 2025
13 checks passed

KShivendu deleted the fix-resharding-dead-replicas branch July 16, 2025 12:21
