Do not deactivate partial/recovery replica on missing point #5991
Merged
Conversation
a9fc048 to 053f6ed
ffuugoo reviewed Feb 14, 2025

timvisee commented Feb 14, 2025
Comment on lines +583 to +589
// Ignore missing point errors if replica is in partial or recovery state
// Partial or recovery state indicates that the replica is receiving a shard transfer,
// it might not have received all the points yet
// See: <https://github.com/qdrant/qdrant/pull/5991>
if peer_state.is_partial_or_recovery() && err.is_missing_point() {
    continue;
}
The change is simple: do not mark a replica as dead on a missing point error if it's in partial/recovery state.
generall approved these changes Feb 14, 2025
timvisee added a commit that referenced this pull request Feb 17, 2025
* Do not deactivate partial or recovery replica on non-transient error
* Handle missing point edge case in handle_failed_replicas instead
* We already have the replica state
* Add test, move replica during constant set payload, assert shard move
* Add link to pull request in comment
A replica that is receiving a shard transfer might not have all points yet. It could have missed a number of insertions, which may be exactly why the shard transfer is ongoing: to recover the replica.

Our replica deactivation logic was too eager, immediately marking a replica as dead if a point was not found.
Marking a replica as dead cascades into other problems. If we mark the target shard of a transfer as dead, the transfer is immediately aborted. With a continuous stream of operations modifying points that aren't there yet, the transfer never progresses because it is constantly aborted. This is exactly the scenario we've been seeing in a real cluster.

This PR relaxes the condition: a missing point error is accepted while the replica is undergoing recovery, since the replica might legitimately not have all points yet. That prevents us from marking the replica as dead too eagerly and triggering the cascade above. The change is safe because the ongoing shard transfer ensures data consistency: if the recovering shard cannot apply an incoming update now, it will apply it (again) later as part of the transfer.
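The relaxed rule can be sketched as a small predicate. This is a hypothetical, simplified illustration, assuming invented names (`should_deactivate`, `UpdateError`), not qdrant's actual failure-handling code:

```rust
/// Simplified stand-in for a replica's state in the cluster.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplicaState {
    Active,
    Dead,
    Partial,
    Recovery,
}

/// Hypothetical error classification for a failed update on a replica.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UpdateError {
    MissingPoint,
    Other,
}

/// Decide whether a failed update should deactivate (kill) the replica.
/// Missing points are expected while a transfer is still copying data,
/// so they must not abort the transfer by marking the replica dead.
fn should_deactivate(state: ReplicaState, err: UpdateError) -> bool {
    let mid_transfer = matches!(state, ReplicaState::Partial | ReplicaState::Recovery);
    !(mid_transfer && err == UpdateError::MissingPoint)
}
```

Under this rule, a missing point on a `Partial` or `Recovery` replica is tolerated, while any other error, or a missing point on an `Active` replica, still deactivates it.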
I've added a test that asserts the new behavior. The test fails without the changes in this PR, which shows the new behavior works as expected.
Tasks

All Submissions:
* Did you create your branch from dev?

New Feature Submissions:
* Have you formatted your code with the cargo +nightly fmt --all command prior to submission?
* Have you checked your code with the cargo clippy --all --all-features command?