Abort resharding if any Resharding replica is to be marked dead #6364
> Aborting resharding and marking is
Good point. I think I mostly thought about "resharding transfer" and "resharded/migrated points updates" when I was doing scale-down, but forgot to think how regular updates would interact with it... 😬 Anyway, I'll take a closer look at the PR now.
LGTM. I think that's how it should have looked from the start. Great catch, and good fix, thanks!
If we abort resharding after marking as `Dead`, the abort would blindly mark the replica as `Active` again and overwrite the `Dead` state. That's why now we abort resharding first and then mark as `Dead`.
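The two orderings being compared here can be sketched as follows. This is an illustrative toy model, not the actual qdrant code: `ReplicaState`, `Replica`, `abort_resharding`, and `set_replica_state` are hypothetical stand-ins, and the sketch assumes abort blindly re-activates the replica (as discussed in this thread).

```rust
// Hypothetical model of replica state handling during resharding abort.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplicaState {
    Active,
    Dead,
    Resharding,
}

struct Replica {
    state: ReplicaState,
}

// Assumption from this thread: aborting resharding "blindly" re-activates
// the replica, whatever state it was in.
fn abort_resharding(replica: &mut Replica) {
    replica.state = ReplicaState::Active;
}

fn set_replica_state(replica: &mut Replica, state: ReplicaState) {
    replica.state = state;
}

fn main() {
    // Problematic order: mark Dead first, then abort. The abort
    // overwrites Dead with Active, losing the Dead marking.
    let mut replica = Replica { state: ReplicaState::Resharding };
    set_replica_state(&mut replica, ReplicaState::Dead);
    abort_resharding(&mut replica);
    assert_eq!(replica.state, ReplicaState::Active); // Dead state was lost

    // Fixed order: abort first, then mark Dead. The Dead state survives.
    let mut replica = Replica { state: ReplicaState::Resharding };
    abort_resharding(&mut replica);
    set_replica_state(&mut replica, ReplicaState::Dead);
    assert_eq!(replica.state, ReplicaState::Dead);
}
```

The sketch only captures the final-state difference between the two orderings; the real code also has to deal with consensus and locking, as the rest of the thread discusses.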
Regarding the previous edge case:

Also, please correct me if I'm wrong: there's a small time window between the shard being marked `Active` (while aborting resharding) and it then getting marked as `Dead`.
Activating and then marking the replica as dead again is definitely a problem.
Because we release the shard holder lock before aborting, the replica state might blink to `Active` for a split second if there is another reader. That means readers might rely on the replica and read corrupt state.
I'm not entirely sure yet what the best way of resolving this is, as there are a few different approaches. One may be to not blindly activate replicas on abort. I'd have to think about it for a second.
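The window described above can be made concrete with a small single-threaded sketch: abort flips the state to `Active` in one critical section, and only a later critical section marks it `Dead`, so a reader scheduled in between observes a transiently `Active` replica. The `ReplicaState` enum and the use of a bare `RwLock` are simplified assumptions, not the actual qdrant shard holder.

```rust
use std::sync::RwLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplicaState {
    Active,
    Dead,
    Resharding,
}

fn main() {
    let state = RwLock::new(ReplicaState::Resharding);

    // Abort resharding: blindly activate the replica, then release the lock.
    *state.write().unwrap() = ReplicaState::Active;

    // A concurrent reader can run here, before the Dead marking lands,
    // and route reads to a replica that may hold corrupt or stale data.
    let seen_by_reader = *state.read().unwrap();

    // Only now is the replica marked Dead.
    *state.write().unwrap() = ReplicaState::Dead;

    assert_eq!(seen_by_reader, ReplicaState::Active);
    assert_eq!(*state.read().unwrap(), ReplicaState::Dead);
}
```

In the real system the reader is another task taking the shard holder lock, but the ordering problem is the same: any observer between the two writes sees `Active`.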
> the replica down so it's not applied by this node so leader wants to mark all replicas on that resharding peer as `Dead`
Minor correction: the operation only marks one (the current) replica as dead, not all replicas on that node.
I suggest different handling and a few fixes. I've implemented them in this sub-PR here: #6394
* Force delete points when aborting resharding, even if replica is dead
* Abort resharding after marking replica as dead
* Only abort resharding if we mark related replica as dead
* Just rely on shard replica state
* Propagate shard transfer error right away
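The "only abort resharding if we mark the related replica as dead" item from this list amounts to a guard on the state transition. A rough sketch of such a guard, with entirely hypothetical names (not the qdrant API):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplicaState {
    Active,
    Dead,
    Resharding,
    ReshardingScaleDown,
}

/// Decide whether marking a replica with `replica_state` as `new_state`
/// should also abort an in-progress resharding operation.
fn should_abort_resharding(
    resharding_in_progress: bool,
    replica_state: ReplicaState,
    new_state: ReplicaState,
) -> bool {
    resharding_in_progress
        // Only the transition into Dead triggers an abort...
        && new_state == ReplicaState::Dead
        // ...and only for the replica actually involved in resharding.
        && matches!(
            replica_state,
            ReplicaState::Resharding | ReplicaState::ReshardingScaleDown
        )
}

fn main() {
    // Marking a resharding replica dead aborts resharding...
    assert!(should_abort_resharding(
        true,
        ReplicaState::Resharding,
        ReplicaState::Dead
    ));
    // ...but marking an unrelated Active replica dead does not...
    assert!(!should_abort_resharding(
        true,
        ReplicaState::Active,
        ReplicaState::Dead
    ));
    // ...and neither does a non-Dead transition.
    assert!(!should_abort_resharding(
        true,
        ReplicaState::Resharding,
        ReplicaState::Active
    ));
}
```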
With the merged changes from #6394 I think this is good to go.
…nt#6364)

* Abort resharding if any Resharding replica is to be marked dead
* Abort resharding before marking shard as Dead
* add comments
* Abort resharding after marking as dead (qdrant#6394)
  * Force delete points when aborting resharding, even if replica is dead
  * Abort resharding after marking replica as dead
  * Only abort resharding if we mark related replica as dead
  * Just rely on shard replica state
  * Propagate shard transfer error right away

Co-authored-by: Tim Visée <tim+github@visee.me>
Nodes can go down while resharding and we found an edge case where it can cause problems.
Whenever a node goes down, the leader detects this (on the next update request) because of `sync_local_state()`, and it notifies other peers by appending an operation to the consensus WAL. Now all online nodes are expected to apply this operation locally and try to mark the resharding replica as `Dead`.

However, the logic to mark replicas in the `Resharding`/`ReshardingScaleDown` state as `Dead` behaves like this:

* If the node that went down is the one driving resharding (`resharding_state.peer_id = peer_that_went_down`), we abort resharding (this also marks the shard as `Active`).
* If any other node containing a `Resharding`/`ReshardingScaleDown` replica goes down, we ignore the error, and hence the leader keeps trying forever to mark the `Resharding`/`ReshardingScaleDown` replica as `Dead` (bloating the consensus).

After this PR, this logic changes to:

* If any node containing a `Resharding`/`ReshardingScaleDown` replica goes down, we first mark it as `Dead`, and then we abort resharding (which no longer marks the shard as `Active`, since it's `Dead` instead of `Resharding`/`ReshardingScaleDown`).

Edit: if any node containing a `Resharding`/`ReshardingScaleDown` replica goes down, we first abort resharding and then mark it as `Dead`. The reason for this is the following edge case:

1. A `Resharding`/`ReshardingScaleDown` replica goes down.
2. The replica would be marked `Active` after aborting resharding, and we would `return Ok()`.
3. An `Active` replica state means it's expected to be in sync (it won't ask for updates after it comes back online).
4. So we mark it as `Dead` after aborting resharding, so we don't miss any updates.

Also, it's better to trigger the abort if any node with a resharding replica goes down. Imagine the node is marked as `Dead`: then on recovery it will take updates from other replicas, hence reverting all the work done by resharding. But CM will simply ignore this problem, since it relies on `replicas_migrated` (which already includes this node) and doesn't retry resharding for such replicas, leaving the replicas out of sync.

TODO:
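Putting the pieces together, the peer-down handling this PR describes can be sketched end to end. All types and function names here are simplified assumptions for illustration, not the actual qdrant implementation:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReplicaState {
    Active,
    Dead,
    Resharding,
    ReshardingScaleDown,
}

struct Replica {
    peer_id: u64,
    state: ReplicaState,
}

/// Handle a peer going down: if it held a resharding replica, abort
/// resharding first, then mark the downed replica Dead so it re-syncs
/// when it comes back online.
fn handle_peer_down(replicas: &mut [Replica], down_peer: u64) {
    let resharding_replica_down = replicas.iter().any(|r| {
        r.peer_id == down_peer
            && matches!(
                r.state,
                ReplicaState::Resharding | ReplicaState::ReshardingScaleDown
            )
    });

    if resharding_replica_down {
        // 1. Abort resharding. This re-activates resharding replicas...
        for r in replicas.iter_mut() {
            if matches!(
                r.state,
                ReplicaState::Resharding | ReplicaState::ReshardingScaleDown
            ) {
                r.state = ReplicaState::Active;
            }
        }
    }

    // 2. ...and only then is the downed replica marked Dead, so an
    // Active state never survives for a replica that may have missed
    // updates while it was offline.
    for r in replicas.iter_mut() {
        if r.peer_id == down_peer {
            r.state = ReplicaState::Dead;
        }
    }
}

fn main() {
    let mut replicas = vec![
        Replica { peer_id: 1, state: ReplicaState::Active },
        Replica { peer_id: 2, state: ReplicaState::Resharding },
    ];
    handle_peer_down(&mut replicas, 2);
    assert_eq!(replicas[0].state, ReplicaState::Active);
    assert_eq!(replicas[1].state, ReplicaState::Dead);
}
```

The ordering in this sketch mirrors the final approach above: abort first, mark `Dead` second, so the `Dead` marking is never overwritten by the abort's activation step.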