
Conversation

@timvisee timvisee commented Feb 14, 2025

A replica that is receiving a shard transfer might not have all points yet. It could have missed a bunch of insertions, which may be the very reason the shard transfer is ongoing: to recover the replica.

Our replica deactivation logic was a bit too eager, immediately marking a replica as dead if a point is not found.

Marking a replica as dead cascades into other problems. If we mark the target shard of a transfer as dead, the transfer is immediately aborted. If there is a continuous stream of operations modifying points that aren't there yet, the transfer never progresses because it is constantly aborted. This is exactly the scenario we've been seeing in a real cluster.

This PR relaxes the condition: it accepts a missing point error while the replica is undergoing recovery. We should accept it because the replica might not have all points yet. This prevents us from marking the replica as dead too eagerly and cascading into other problems. The change is safe because the ongoing shard transfer will take care of ensuring data consistency: if the recovering shard cannot apply an incoming update now, it will receive that data (again) later as part of the transfer.
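To illustrate the relaxed condition in isolation, here is a minimal, self-contained sketch. The `ReplicaState` and `UpdateError` types and the `should_deactivate` helper are hypothetical stand-ins for Qdrant's internal types, not the actual implementation; only the shape of the check mirrors the change in this PR.

```rust
// Hypothetical, simplified model of the relaxed deactivation check.
// `ReplicaState`, `UpdateError`, and `should_deactivate` are stand-ins,
// not Qdrant's real types.

#[derive(Debug, PartialEq)]
enum ReplicaState {
    Active,
    Partial,
    Recovery,
}

impl ReplicaState {
    fn is_partial_or_recovery(&self) -> bool {
        matches!(self, ReplicaState::Partial | ReplicaState::Recovery)
    }
}

#[derive(Debug)]
enum UpdateError {
    MissingPoint,
    Other,
}

impl UpdateError {
    fn is_missing_point(&self) -> bool {
        matches!(self, UpdateError::MissingPoint)
    }
}

/// Returns true if a failed replica should be marked as dead.
fn should_deactivate(state: &ReplicaState, err: &UpdateError) -> bool {
    // A replica in partial/recovery state is still receiving a shard
    // transfer and may legitimately miss points, so tolerate that error.
    // The transfer itself guarantees the point arrives later.
    !(state.is_partial_or_recovery() && err.is_missing_point())
}

fn main() {
    // A recovering replica missing a point is tolerated...
    assert!(!should_deactivate(&ReplicaState::Recovery, &UpdateError::MissingPoint));
    assert!(!should_deactivate(&ReplicaState::Partial, &UpdateError::MissingPoint));
    // ...but an active replica missing a point is still deactivated,
    // as is a recovering replica that fails with any other error.
    assert!(should_deactivate(&ReplicaState::Active, &UpdateError::MissingPoint));
    assert!(should_deactivate(&ReplicaState::Recovery, &UpdateError::Other));
    println!("ok");
}
```

The key point of the design is that only the exact combination (recovering replica, missing point) is tolerated; any other failure still deactivates the replica as before.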

I've added a test that asserts the new behavior. The test fails when running it without the changes in this PR. That shows the new behavior works as expected.

Tasks

  • Decide whether this change is correct
  • Add test

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

@timvisee timvisee force-pushed the fix-transfer-stop-on-partial-error branch from a9fc048 to 053f6ed Compare February 14, 2025 15:43
@timvisee timvisee changed the title Do not deactivate partial/recovery replica on non-transient error Do not deactivate partial/recovery replica on missing point Feb 14, 2025
@timvisee timvisee requested review from generall and ffuugoo February 14, 2025 16:46
Comment on lines +583 to +589
// Ignore missing point errors if replica is in partial or recovery state
// Partial or recovery state indicates that the replica is receiving a shard transfer,
// it might not have received all the points yet
// See: <https://github.com/qdrant/qdrant/pull/5991>
if peer_state.is_partial_or_recovery() && err.is_missing_point() {
    continue;
}
The change is simple: do not mark a replica as dead on a missing point error if it's in partial/recovery state.

@timvisee timvisee marked this pull request as ready for review February 14, 2025 16:47
@timvisee timvisee merged commit 36fd28a into dev Feb 17, 2025
17 checks passed
@timvisee timvisee deleted the fix-transfer-stop-on-partial-error branch February 17, 2025 08:42
timvisee added a commit that referenced this pull request Feb 17, 2025
* Do not deactivate partial or recovery replica on non-transient error

* Handle missing point edge case in handle_failed_replicas instead

* We already have the replica state

* Add test, move replica during constant set payload, assert shard move

* Add link to pull request in comment
@timvisee timvisee mentioned this pull request Feb 17, 2025