Conversation

@KShivendu (Member) commented Apr 10, 2025

Nodes can go down while resharding and we found an edge case where it can cause problems.

Whenever a node goes down, the leader detects this (on the next update request) via sync_local_state() and notifies the other peers by appending

CollectionMeta(SetShardReplicaState(SetShardReplicaState { collection_name: "benchmark", shard_id: 0, peer_id: XYZ, state: Dead, from_state: None }))

to the consensus WAL. Now all online nodes are expected to apply this operation locally and try to mark the resharding replica as Dead.

However, the logic that marks Resharding/ReshardingScaleDown replicas as Dead currently behaves like this:

  • If the node that went down was the main resharding replica (resharding_state.peer_id = peer_that_went_down), we abort resharding (which also marks the shard as Active).
  • If any other node with a Resharding/ReshardingScaleDown replica goes down, we ignore the error, so the leader keeps trying forever to mark the replica as Dead (bloating the consensus WAL).
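The old behavior can be sketched roughly like this (illustrative Rust only; the enum, struct, and function names here are made up for this sketch and are not Qdrant's actual API):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum ReplicaState {
    Active,
    Dead,
    Resharding,
}

struct ReshardingState {
    /// Peer driving the resharding (the "main" resharding replica).
    driving_peer: u64,
}

/// Old logic: abort (and re-activate) only if the driving peer died;
/// otherwise ignore the error, so the replica keeps its Resharding state
/// and the leader's "mark Dead" proposal never converges.
fn old_handle_peer_down(
    resharding: &ReshardingState,
    dead_peer: u64,
    replica_state: &mut ReplicaState,
) -> bool {
    if dead_peer == resharding.driving_peer {
        // Aborting resharding also flipped the shard back to Active.
        *replica_state = ReplicaState::Active;
        true // resharding aborted
    } else {
        // Error ignored: state stays Resharding, consensus retries forever.
        false
    }
}
```

The second arm is the problematic one: nothing ever transitions the replica out of Resharding, so the leader's SetShardReplicaState proposal can never be satisfied.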

After this PR, the logic changes to:

  • (initial version of this PR) If any node containing a Resharding/ReshardingScaleDown replica goes down, we first mark it as Dead and then abort resharding (which no longer marks the shard as Active, since it is Dead rather than Resharding/ReshardingScaleDown).
  • (revised, see the Apr 15 comments below) If any node containing a Resharding/ReshardingScaleDown replica goes down, we first abort resharding and then mark it as Dead.

The reason is the following edge case:

  • Imagine a node with a Resharding/ReshardingScaleDown replica goes down.
  • An update X comes in; the replica is down, so the update is not applied by this node.
  • However, we currently mark the replica as Active after aborting resharding and return Ok(). An Active replica is expected to be in sync, so it won't ask for missed updates after it comes back online.
  • Hence, it's better to mark the replica as Dead after aborting resharding, so we don't miss any updates.

Also, it's better to trigger the abort whenever any node with a resharding replica goes down. If the node is merely marked as Dead, then on recovery it will take updates from other replicas and thereby revert all the work done by resharding. But CM will simply ignore this problem, since it relies on replicas_migrated (which already includes this node) and doesn't retry resharding for such replicas, leaving them out of sync.
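The fixed flow described above — abort resharding, then leave the replica Dead so it resyncs on recovery — might be sketched like this (illustrative names only, not the actual Qdrant code):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum ReplicaState {
    Active,
    Dead,
    Resharding,
}

/// Fixed logic: any resharding replica going down aborts resharding,
/// and the replica ends up Dead rather than Active.
fn new_handle_peer_down(replica_state: &mut ReplicaState) {
    // abort_resharding() would run here; it may transiently set Active...
    *replica_state = ReplicaState::Active;
    // ...but we then mark the replica Dead, so that on recovery it asks
    // the other replicas for the updates it missed while down.
    *replica_state = ReplicaState::Dead;
}

/// A Dead replica must resync on recovery; an Active one assumes it is
/// already in sync and would silently miss update X.
fn must_resync_on_recovery(state: ReplicaState) -> bool {
    state == ReplicaState::Dead
}
```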

TODO:

  • Add tests

@coderabbitai bot (Contributor) commented Apr 10, 2025

📝 Walkthrough

The pull request refactors the set_shard_replica_state method in the Collection implementation to always update the replica state before handling resharding aborts and shard transfer aborts. Previously, if the current replica state was a resharding state and the new state was Dead, the method would drop the shard holder lock, abort resharding if applicable, and return early without updating the replica state.

The updated method first updates the replica state by calling ensure_replica_with_state. Then, if the new state is Dead, it collects related shard transfers and drops the shard holder lock, conditionally aborts resharding if the current state is a resharding state, and aborts all related shard transfers, propagating any errors. The early return after aborting resharding is removed, ensuring the replica state update always occurs first.

Additional changes include adding a new force boolean parameter to several update_local method calls across multiple modules, allowing forced updates regardless of replica state, and consolidating resharding state checks via a dedicated method. No changes were made to the declarations of exported or public entities except for the addition of the force parameter in relevant method signatures.
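Based on that walkthrough, the reordered control flow might look roughly like this sketch (the trace vector stands in for the real side effects; the actual method takes locks and performs async I/O, and these step names are illustrative):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Active,
    Dead,
    Resharding,
}

/// Sketch of the refactored ordering: always persist the new replica
/// state first, then (only when marking Dead) run the abort logic.
fn set_shard_replica_state(current: State, new_state: State, trace: &mut Vec<&'static str>) {
    // 1. Always update the replica state first (ensure_replica_with_state).
    trace.push("update_replica_state");
    if new_state == State::Dead {
        // 2. Collect related shard transfers and drop the shard holder lock.
        trace.push("collect_transfers_and_drop_lock");
        // 3. Abort resharding only if the replica was in a resharding state.
        if current == State::Resharding {
            trace.push("abort_resharding");
        }
        // 4. Abort related shard transfers, propagating errors.
        trace.push("abort_transfers");
    }
    // No early return: the state update above has always happened.
}
```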


📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between d2e248c and c68280c.

📒 Files selected for processing (8)
  • lib/collection/src/collection/clean.rs (1 hunks)
  • lib/collection/src/collection/mod.rs (1 hunks)
  • lib/collection/src/collection/point_ops.rs (2 hunks)
  • lib/collection/src/collection/sharding_keys.rs (1 hunks)
  • lib/collection/src/shards/replica_set/mod.rs (2 hunks)
  • lib/collection/src/shards/replica_set/update.rs (4 hunks)
  • lib/collection/src/shards/shard_holder/resharding.rs (2 hunks)
  • lib/collection/src/tests/points_dedup.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/collection/src/collection/mod.rs
🔇 Additional comments (12)
lib/collection/src/collection/sharding_keys.rs (1)

127-132: Added required force parameter to update_local call.

The addition of false as the fourth parameter to update_local correctly implements the updated method signature. The false value indicates that this update operation should respect the usual replica state constraints and not force an update regardless of state.

lib/collection/src/collection/clean.rs (1)

336-343: Added required force parameter to update_local call.

The addition of false as the fourth parameter matches the updated update_local method signature. This is consistent with the non-forced update approach used across the codebase for regular point deletion operations.

lib/collection/src/collection/point_ops.rs (2)

53-58: Added required force parameter to update_local call.

The addition of false as the fourth parameter to update_local matches the updated method signature. This ensures that updates to all local shards follow the standard state-check behavior.


107-107: Added required force parameter to update_local call for weak ordering.

This change correctly adds the required fourth parameter to the update_local method call, maintaining consistency with the updated method signature throughout the codebase.

lib/collection/src/tests/points_dedup.rs (1)

137-138: Updated test code with required force parameter.

The test has been correctly updated to include the new false parameter in the update_local call, ensuring that the test continues to work with the updated method signature.

lib/collection/src/shards/shard_holder/resharding.rs (2)

318-318: Good refactoring using the is_resharding() method.

The code now leverages the dedicated is_resharding() method which encapsulates the check for both Resharding and ReshardingScaleDown states, making the code more maintainable and aligned with the DRY principle.


329-334: Support for forced deletion when a replica may be dead.

The added comment and forced deletion are necessary improvements to handle scenarios where a replica might be dead during resharding abort. This ensures that the cleanup process completes successfully even if the replica is in a Dead state.

The change aligns perfectly with the PR objective to correctly abort resharding when any resharding replica is marked as dead. This prevents consensus log bloat from repeated marking attempts.

lib/collection/src/shards/replica_set/mod.rs (2)

894-897: Added force parameter to support unconditional deletion.

The force boolean parameter is a well-designed addition that allows point deletion to bypass state checks when necessary, particularly during resharding abort cleanup. This change propagates properly to the update_local call and is consistent with the overall PR objective.


954-954: Correctly passing the force parameter to update_local.

The force parameter is properly forwarded to the update_local method, ensuring the deletion operation can be forced when needed. This is essential for the resharding abort logic to clean up properly when replicas are in a Dead state.

lib/collection/src/shards/replica_set/update.rs (3)

27-39: Well-documented forced update behavior.

The added documentation clearly explains the purpose and constraints of the force parameter - that it allows operations to be applied unconditionally regardless of replica state, and that it's for internal use only. This is good practice that helps prevent misuse.


54-56: Optimized hardware measurement handling during resharding.

The improved condition avoids unnecessarily replacing the hardware measurement accumulator when it's already disposable. This is a small but meaningful optimization that reduces unnecessary object creation.


65-66: Critical addition for forced operations regardless of replica state.

This new match arm is the core implementation that allows operations to bypass normal state checks when force is true. This is essential for the resharding abort cleanup process to work correctly with dead replicas, addressing the central issue in the PR.

The implementation correctly places this condition before the specific state checks, ensuring it takes precedence and allows operations to be applied unconditionally when necessary.
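The precedence described here can be sketched as a simple predicate (a rough illustration only — the real update_local handles operations, clock tags, and more replica states than shown):

```rust
#[derive(Clone, Copy)]
enum ReplicaState {
    Active,
    Dead,
    Resharding,
}

/// Whether a local update would be applied: `force` takes precedence
/// over any per-state check, so forced operations run unconditionally.
fn applies_update(state: ReplicaState, force: bool) -> bool {
    match (state, force) {
        // Forced operations are applied regardless of replica state.
        (_, true) => true,
        // Normal path: only states expected to receive updates apply them.
        (ReplicaState::Active | ReplicaState::Resharding, false) => true,
        // A dead replica rejects non-forced updates.
        (ReplicaState::Dead, false) => false,
    }
}
```

Placing the forced arm first mirrors the review comment: the bypass must win before the state-specific arms are consulted, which is what lets the abort cleanup delete points on a Dead replica.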

@timvisee timvisee requested a review from ffuugoo April 10, 2025 14:53
@ffuugoo (Contributor) commented Apr 11, 2025

Aborting resharding and marking the replica as Active was intentional, because:

  • in case of resharding-up, if we abort resharding, we simply remove the whole shard, so we don't even need to bother with Dead
  • in case of resharding-down, my logic was that we migrate additional points into an existing shard, so if we abort resharding, the node still has all of the original points and we can switch it back to Active
  • but it seems I did not think it completely through 😅

An update X comes in and the replica down so it's not applied by this node

Good point. I think I mostly thought about "resharding transfer" and "resharded/migrated points updates" when I was doing scale-down, but forgot to think about how regular updates would interact with it... 😬

Anyway, I'll take a closer look at the PR now.

@ffuugoo (Contributor) left a comment:

LGTM. I think that's how it should have looked from the start. Great catch, and good fix, thanks!

@KShivendu (Member, Author) commented Apr 15, 2025

If we mark the replica as Dead first and abort resharding afterwards, abort_resharding calls delete_local_points, which fails (because of this), and the resharding peer never drops its resharding_state. This leads to resharding getting stuck (since the states are out of sync).

That's why now we abort resharding first and then mark as Dead

@KShivendu (Member, Author) commented Apr 15, 2025

Regarding the previous edge case:

  • Imagine a node with a Resharding/ReshardingScaleDown replica goes down.
  • An update X comes in; the replica is down, so the update is not applied by this node, and the leader wants to mark all replicas on that resharding peer as Dead.
  • So we abort resharding (which marks the replicas as Active) and then mark the replica as Dead.
  • This means that when the node comes back online, it will request the missed updates, so it works as expected.

Also, please correct me if I'm wrong: there's a small time window between the shard being marked Active (while aborting resharding) and it being marked Dead. If the node goes down again within that window, is the consensus operation retried when the node comes back online, or will it be ignored (i.e., do we just assume it's applied even if it fails)?

@timvisee (Member) left a comment:

Activating and then marking the replica as dead again is definitely a problem.

Because we release the shard holder lock before aborting, the replica state might blink to Active for a split second if there is another reader. That means readers might rely on the replica and read corrupt state.

I'm not entirely sure yet what the best way of resolving this is, as there are a few different approaches. One may be to not blindly activate replicas on abort. I'd have to think about it for a second.

the replica down so it's not applied by this node so leader wants to mark all replicas on that resharding peer as Dead

Minor correction: the operation only marks one (the current) replica as dead, not all replicas on that node.

@timvisee (Member) commented:

I suggest different handling and a few fixes. I've implemented them in this sub-PR here: #6394

* Force delete points when aborting resharding, even if replica is dead

* Abort resharding after marking replica as dead

* Only abort resharding if we mark related replica as dead

* Just rely on shard replica state

* Propagate shard transfer error right away

@timvisee (Member) left a comment:

With the merged changes from #6394 I think this is good to go.

@KShivendu KShivendu merged commit 751fb5a into dev Apr 16, 2025
17 checks passed
@KShivendu KShivendu deleted the abort-resharding-on-replica-dead branch April 16, 2025 16:49
@KShivendu KShivendu mentioned this pull request Apr 17, 2025
pull bot pushed a commit to kp-forks/qdrant that referenced this pull request Apr 21, 2025
…nt#6364)

* Abort resharding if any Resharding replica is to be marked dead

* Abort resharding before marking shard as Dead

* add comments

* Abort resharding after marking as dead (qdrant#6394)

* Force delete points when aborting resharding, even if replica is dead

* Abort resharding after marking replica as dead

* Only abort resharding if we mark related replica as dead

* Just rely on shard replica state

* Propagate shard transfer error right away

---------

Co-authored-by: Tim Visée <tim+github@visee.me>
@coderabbitai coderabbitai bot mentioned this pull request Jun 27, 2025