
Conversation

@timvisee timvisee commented Feb 14, 2025

A replica that is receiving a shard transfer might not have all points yet. It could have missed a bunch of insertions, which may be the very reason the shard transfer is ongoing: to recover the replica.

Our replica deactivation logic was a bit too eager, immediately marking a replica as dead if a point is not found.

Marking a replica as dead cascades into other problems. If we mark the target shard of a transfer as dead, the transfer is immediately aborted. If there is a continuous stream of operations modifying points that aren't there yet, the transfer never progresses because it is constantly aborted. This is exactly the scenario we've been seeing in a real cluster.

This PR relaxes the condition: it accepts a missing point error while the replica is undergoing recovery. We should accept it because the replica might not have all points yet. This prevents us from marking the replica as dead too eagerly and cascading into other problems. The change is safe because the ongoing shard transfer will take care of ensuring data consistency: if the recovering shard cannot apply an incoming update now, it will receive that data (again) later as part of the transfer.
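To illustrate the relaxed condition in isolation, here is a minimal, self-contained sketch. The `ReplicaState` and `UpdateError` types and the `should_deactivate` helper are hypothetical stand-ins for Qdrant's internal types, not the actual implementation; only the shape of the check mirrors the change in this PR.

```rust
// Hypothetical, simplified model of the relaxed deactivation check.
// `ReplicaState`, `UpdateError`, and `should_deactivate` are stand-ins,
// not Qdrant's real types.

#[derive(Debug, PartialEq)]
enum ReplicaState {
    Active,
    Partial,
    Recovery,
}

impl ReplicaState {
    fn is_partial_or_recovery(&self) -> bool {
        matches!(self, ReplicaState::Partial | ReplicaState::Recovery)
    }
}

#[derive(Debug)]
enum UpdateError {
    MissingPoint,
    Other,
}

impl UpdateError {
    fn is_missing_point(&self) -> bool {
        matches!(self, UpdateError::MissingPoint)
    }
}

/// Returns true if a failed replica should be marked as dead.
fn should_deactivate(state: &ReplicaState, err: &UpdateError) -> bool {
    // A replica in partial/recovery state is still receiving a shard
    // transfer and may legitimately miss points, so tolerate that error.
    // The transfer itself guarantees the point arrives later.
    !(state.is_partial_or_recovery() && err.is_missing_point())
}

fn main() {
    // A recovering replica missing a point is tolerated...
    assert!(!should_deactivate(&ReplicaState::Recovery, &UpdateError::MissingPoint));
    assert!(!should_deactivate(&ReplicaState::Partial, &UpdateError::MissingPoint));
    // ...but an active replica missing a point is still deactivated,
    // as is a recovering replica that fails with any other error.
    assert!(should_deactivate(&ReplicaState::Active, &UpdateError::MissingPoint));
    assert!(should_deactivate(&ReplicaState::Recovery, &UpdateError::Other));
    println!("ok");
}
```

The key point of the design is that only the exact combination (recovering replica, missing point) is tolerated; any other failure still deactivates the replica as before.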

I've added a test that asserts the new behavior. The test fails when running it without the changes in this PR. That shows the new behavior works as expected.

Tasks

  • Decide whether this change is correct
  • Add test

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

@timvisee timvisee force-pushed the fix-transfer-stop-on-partial-error branch from a9fc048 to 053f6ed Compare February 14, 2025 15:43
@timvisee timvisee changed the title Do not deactivate partial/recovery replica on non-transient error Do not deactivate partial/recovery replica on missing point Feb 14, 2025
@timvisee timvisee requested review from generall and ffuugoo February 14, 2025 16:46
Comment on lines +583 to +589
// Ignore missing point errors if replica is in partial or recovery state
// Partial or recovery state indicates that the replica is receiving a shard transfer,
// it might not have received all the points yet
// See: <https://github.com/qdrant/qdrant/pull/5991>
if peer_state.is_partial_or_recovery() && err.is_missing_point() {
    continue;
}
The change is simple: do not mark a replica as dead on a missing point error if it's in partial/recovery state.

@timvisee timvisee marked this pull request as ready for review February 14, 2025 16:47
@timvisee timvisee merged commit 36fd28a into dev Feb 17, 2025
17 checks passed
@timvisee timvisee deleted the fix-transfer-stop-on-partial-error branch February 17, 2025 08:42
timvisee added a commit that referenced this pull request Feb 17, 2025
* Do not deactivate partial or recovery replica on non-transient error

* Handle missing point edge case in handle_failed_replicas instead

* We already have the replica state

* Add test, move replica during constant set payload, assert shard move

* Add link to pull request in comment
@timvisee timvisee mentioned this pull request Feb 17, 2025