
Conversation

KShivendu (Member)

Add new test for force deleting peer while transfers are ongoing. Currently, it looks like Qdrant doesn't abort shard transfers if the source peer is force deleted.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?


coderabbitai bot commented May 8, 2025

📝 Walkthrough


This change introduces a new integration test in the consensus test suite that verifies forced deletion of a peer during active shard transfers in a distributed cluster. The test sets up a three-peer cluster with two shards and three replicas, creates a distributed collection, and inserts data to establish state. It simulates a peer failure by killing one peer process, triggers shard recovery with additional data insertion, and restarts the killed peer to initiate shard transfers. The test forcibly deletes the source peer of a shard transfer via the cluster API with a force flag and asserts that ongoing shard transfers are aborted. Additionally, the Collection struct's remove_shards_at_peer method is modified to abort all ongoing shard transfers related to the peer before removing shards.
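The force-delete operation the test exercises boils down to one cluster API call. Below is a minimal sketch of that call; the endpoint path (`DELETE /cluster/peer/{peer_id}?force=true`) is taken from the test's own helper, while the base URL and peer ID are illustrative placeholders, and stdlib `urllib` stands in for the `requests` library the test actually uses.

```python
from urllib import request

def force_delete_peer_url(peer_api_uri: str, peer_id: int) -> str:
    """Build the cluster API URL that force-removes a peer, matching the test helper."""
    return f"{peer_api_uri}/cluster/peer/{peer_id}?force=true"

def force_delete_peer(peer_api_uri: str, peer_id: int) -> None:
    """Send the force-delete request; urlopen raises HTTPError on a non-2xx response."""
    req = request.Request(force_delete_peer_url(peer_api_uri, peer_id), method="DELETE")
    with request.urlopen(req) as resp:
        assert resp.status == 200, f"Failed to force delete peer {peer_id}"
```

The `force=true` flag is what distinguishes this from a regular peer removal: it removes the peer even though replicas (and, per this PR, in-flight transfers) still reference it.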

Suggested reviewers

  • agourlay
  • ffuugoo



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4fb0b5c and 993f06a.

📒 Files selected for processing (1)
  • lib/collection/src/collection/mod.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
🔇 Additional comments (1)
lib/collection/src/collection/mod.rs (1)

583-585: Good addition to abort transfers before removing shards at peer

This change properly ensures that all ongoing shard transfers related to a peer are aborted before removing the peer's shards. This fixes the issue where shard transfers weren't being aborted when a peer was forcibly deleted.
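The fix itself is in Rust (`remove_shards_at_peer` in `lib/collection/src/collection/mod.rs`) and is not quoted in this review. As a hedged illustration of the ordering it enforces, abort transfers involving the peer first, then drop its shards, here is a toy Python model; every name in it is invented for the sketch and is not Qdrant's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Transfer:
    shard_id: int
    from_peer: int
    to_peer: int

@dataclass
class ToyCollection:
    """Toy stand-in for the Collection struct (illustrative only)."""
    shards_at: dict = field(default_factory=dict)   # peer_id -> set of shard ids
    transfers: list = field(default_factory=list)   # ongoing shard transfers

    def abort_shard_transfers_for_peer(self, peer_id: int) -> None:
        # Drop every transfer in which the peer participates, as source or target
        self.transfers = [
            t for t in self.transfers
            if t.from_peer != peer_id and t.to_peer != peer_id
        ]

    def remove_shards_at_peer(self, peer_id: int) -> None:
        # The ordering fixed by this PR: abort related transfers *before*
        # removing the shards, so no transfer keeps running against
        # shards that no longer exist.
        self.abort_shard_transfers_for_peer(peer_id)
        self.shards_at.pop(peer_id, None)
```

Without the abort step, removing the shards first would leave transfers referencing a peer whose shards are already gone, which is the dangling-transfer state the new test reproduces.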


coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (5)
tests/consensus_tests/test_force_delete_peer.py (5)

11-16: Add docstring for helper function.

This function is well-implemented but lacks documentation explaining its purpose and behavior.

 def force_delete_peer(peer_api_uri: str, peer_id: int):
+    """
+    Force deletes a peer from the cluster.
+    
+    Args:
+        peer_api_uri: URI of the peer API to send the request to
+        peer_id: ID of the peer to delete
+    
+    Raises:
+        AssertionError: If the request fails
+    """
     response = requests.delete(
         f"{peer_api_uri}/cluster/peer/{peer_id}?force=true",
     )
     assert response.status_code == 200, f"Failed to force delete peer: {response.text}"


18-22: Add docstring for get_peer_id function.

Similar to the previous function, this helper function would benefit from documentation explaining its purpose and return value.

 def get_peer_id(peer_api_uri: str) -> int:
+    """
+    Gets the peer ID from the cluster info endpoint.
+    
+    Args:
+        peer_api_uri: URI of the peer API to send the request to
+        
+    Returns:
+        int: The ID of the peer
+        
+    Raises:
+        AssertionError: If the request fails
+    """
     response = requests.get(f"{peer_api_uri}/cluster")
     assert response.status_code == 200, f"Failed to get peer ID: {response.text}"
     return response.json()["result"]["peer_id"]


42-46: Add comment explaining which peer is being killed.

The comment "Kill last peer" is somewhat ambiguous. Add more context about why this specific peer is chosen and what effect it will have.

-    # Now
-    # Kill last peer
+    # Kill the last peer to simulate a node failure
+    # This will cause its replicas to be marked as dead
     p = processes.pop()
     p.kill()
     sleep(1)  # Give killed peer time to release WAL lock


48-49: Clarify the purpose of additional data insertion.

The comment doesn't fully explain why inserting more points triggers recovery transfer.

-    # Upsert points to mark dummy replica as dead, that will trigger recovery transfer
+    # Upsert points to trigger write operations that will mark the killed peer's replicas
+    # as dead, which will then trigger recovery transfers when the peer is restarted
     upsert_random_points(peer_api_uris[0], 100)

78-84: Add assertion error message and consider using wait function.

Consider adding an error message to make test failures more informative, and using a wait function instead of a fixed sleep to make the test less flaky.

-    sleep(1)
+    # Wait a moment for the force delete operation to be processed
+    sleep(1)
 
     # We expect transfers to be aborted
     transfers = get_collection_cluster_info(peer_api_uris[0], COLLECTION_NAME)[
         "shard_transfers"
     ]
-    assert len(transfers) == 0
+    assert len(transfers) == 0, "Shard transfers were not aborted after force deleting the source peer"

Consider using a wait function for more reliable testing:

def wait_for_no_transfers(peer_api_uri: str, collection_name: str, timeout: int = 10):
    """Wait until there are no shard transfers for the collection."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        transfers = get_collection_cluster_info(peer_api_uri, collection_name)["shard_transfers"]
        if len(transfers) == 0:
            return
        sleep(0.5)
    raise TimeoutError(f"Transfers were not aborted within {timeout} seconds")

# Then use it instead of sleep + assert
wait_for_no_transfers(peer_api_uris[0], COLLECTION_NAME)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f0b48a and 6365d82.

📒 Files selected for processing (1)
  • tests/consensus_tests/test_force_delete_peer.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/consensus_tests/test_force_delete_peer.py (2)
tests/consensus_tests/fixtures.py (1)
  • upsert_random_points (83-129)
tests/consensus_tests/utils.py (6)
  • assert_project_root (110-113)
  • start_cluster (220-245)
  • wait_collection_exists_and_active_on_all_peers (753-757)
  • start_peer (147-180)
  • wait_for_collection_shard_transfers_count (616-622)
  • get_collection_cluster_info (294-298)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_force_delete_peer.py

2-2: from .utils import * used; unable to detect undefined names

(F403)


12-12: requests may be undefined, or defined from star imports

(F405)


19-19: requests may be undefined, or defined from star imports

(F405)


24-24: assert_project_root may be undefined, or defined from star imports

(F405)


27-27: start_cluster may be undefined, or defined from star imports

(F405)


30-30: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


44-44: processes may be undefined, or defined from star imports

(F405)


52-52: start_peer may be undefined, or defined from star imports

(F405)


59-59: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


61-61: get_collection_cluster_info may be undefined, or defined from star imports

(F405)


71-71: processes may be undefined, or defined from star imports

(F405)


72-72: Local variable url is assigned to but never used

Remove assignment to unused variable url

(F841)


81-81: get_collection_cluster_info may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests-consensus
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test
🔇 Additional comments (4)
tests/consensus_tests/test_force_delete_peer.py (4)

6-9: LGTM - Clear constant definitions.

The constants are well-defined with descriptive names that indicate their purpose in the test.


58-65: LGTM - Good verification of shard transfer state.

This section correctly waits for and verifies the expected shard transfer activity.



75-76: LGTM - Force deletion of peer.

This section correctly implements the force deletion of the source peer, which is the key operation being tested.


1-85:

❓ Verification inconclusive

Test verifies incorrect behavior.

Based on the PR objective, this test aims to verify that "Qdrant does not abort shard transfers if the source peer is force deleted." However, the test is written to expect that transfers are aborted (line 84: assert len(transfers) == 0). This contradicts the stated behavior in the PR description.

Please verify whether the test is:

  1. Correctly implementing a new desired behavior (aborting transfers when a peer is force deleted), or
  2. Incorrectly expecting the opposite of the current behavior described in the PR

🏁 Script executed:

#!/bin/bash
# Look for any existing behavior or related issues in the codebase

# Check if there are any issues or pull requests mentioning force delete and shard transfers
echo "Searching for related issues or PRs:"
rg -i "force delete.*shard transfer|shard transfer.*force delete" -g "*.md" -g "*.txt"

# Look for any similar test cases
echo "Searching for similar test cases:"
rg -i "force_delete|shard_transfer" -g "*/test_*.py"

Length of output: 452


Inconsistent test assertion on shard transfer behavior after force delete

The test in tests/consensus_tests/test_force_delete_peer.py currently asserts that shard transfers are aborted after a peer is force‐deleted:

  • Line 84:
    assert len(transfers) == 0

However, the PR description states that Qdrant should not abort ongoing shard transfers when the source peer is force‐deleted. Please confirm the intended behavior and update the assertion accordingly (for example, expecting len(transfers) == 1 if transfers must continue).


Comment on lines 68 to 73

    # Stop the 'source' node to simulate an unreachable node
    source_peer_url = peer_id_to_url[from_peer_id]
    peer_idx = peer_api_uris.index(source_peer_url)
    p = processes.pop(peer_idx)
    url = peer_api_uris.pop(peer_idx)
    sleep(1)  # Give killed peer time to release WAL lock

🛠️ Refactor suggestion

Handle the unused variable and clarify comment.

The variable url is assigned but never used, and the comment could be clearer about the purpose of stopping the source node.

-    # Stop the 'source' node to simulate an unreachable node
+    # Stop the source node of the shard transfer to simulate a failure during transfer
     source_peer_url = peer_id_to_url[from_peer_id]
     peer_idx = peer_api_uris.index(source_peer_url)
     p = processes.pop(peer_idx)
-    url = peer_api_uris.pop(peer_idx)
+    peer_api_uris.pop(peer_idx)  # Remove the URL of the killed peer from the list
     sleep(1)  # Give killed peer time to release WAL lock
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    # Stop the source node of the shard transfer to simulate a failure during transfer
    source_peer_url = peer_id_to_url[from_peer_id]
    peer_idx = peer_api_uris.index(source_peer_url)
    p = processes.pop(peer_idx)
    peer_api_uris.pop(peer_idx)  # Remove the URL of the killed peer from the list
    sleep(1)  # Give killed peer time to release WAL lock

Comment on lines 23 to 38

def test_force_delete_source_peer_during_transfers(tmp_path: pathlib.Path):
    assert_project_root()

    # seed port to reuse the same port for the restarted nodes
    peer_api_uris, peer_dirs, bootstrap_uri = start_cluster(tmp_path, N_PEERS, 20000)

    create_collection(peer_api_uris[0], shard_number=N_SHARDS, replication_factor=N_REPLICAS)
    wait_collection_exists_and_active_on_all_peers(
        collection_name=COLLECTION_NAME, peer_api_uris=peer_api_uris
    )

    peer_id_to_url = {}
    for peer_api_uri in peer_api_uris:
        peer_id = get_peer_id(peer_api_uri)
        peer_id_to_url[peer_id] = peer_api_uri


🛠️ Refactor suggestion

Add docstring to explain test purpose and steps.

The test function should include a docstring that explains its purpose and the steps it follows to test the behavior.

 def test_force_delete_source_peer_during_transfers(tmp_path: pathlib.Path):
+    """
+    Tests that shard transfers are aborted when the source peer is force deleted.
+    
+    Test steps:
+    1. Start a cluster with multiple peers
+    2. Create a collection with multiple shards and replicas
+    3. Insert initial data
+    4. Kill one peer to simulate failure
+    5. Insert more data to trigger recovery transfer
+    6. Restart the killed peer
+    7. Wait for shard transfer to start
+    8. Force delete the source peer of the transfer
+    9. Verify that transfers are aborted
+    """
     assert_project_root()
📝 Committable suggestion


Suggested change

def test_force_delete_source_peer_during_transfers(tmp_path: pathlib.Path):
    """
    Tests that shard transfers are aborted when the source peer is force deleted.

    Test steps:
    1. Start a cluster with multiple peers
    2. Create a collection with multiple shards and replicas
    3. Insert initial data
    4. Kill one peer to simulate failure
    5. Insert more data to trigger recovery transfer
    6. Restart the killed peer
    7. Wait for shard transfer to start
    8. Force delete the source peer of the transfer
    9. Verify that transfers are aborted
    """
    assert_project_root()

    # seed port to reuse the same port for the restarted nodes
    peer_api_uris, peer_dirs, bootstrap_uri = start_cluster(tmp_path, N_PEERS, 20000)

    create_collection(peer_api_uris[0], shard_number=N_SHARDS, replication_factor=N_REPLICAS)
    wait_collection_exists_and_active_on_all_peers(
        collection_name=COLLECTION_NAME, peer_api_uris=peer_api_uris
    )

    peer_id_to_url = {}
    for peer_api_uri in peer_api_uris:
        peer_id = get_peer_id(peer_api_uri)
        peer_id_to_url[peer_id] = peer_api_uri

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
tests/consensus_tests/test_force_delete_peer.py (3)

2-2: Avoid wildcard imports.

The wildcard import (from .utils import *) makes it difficult to track which symbols are being imported and where they come from. This can lead to naming conflicts and makes the code harder to understand and maintain.

-from .utils import *
+from .utils import (assert_project_root, start_cluster, start_peer, 
+                    wait_collection_exists_and_active_on_all_peers, 
+                    wait_for_collection_shard_transfers_count, 
+                    get_collection_cluster_info, processes)


23-23: Add docstring to explain test purpose and steps.

The test function should include a docstring that explains its purpose and the steps it follows to test the behavior.

def test_force_delete_source_peer_during_transfers(tmp_path: pathlib.Path):
+    """
+    Tests that shard transfers are aborted when the source peer is force deleted.
+    
+    Test steps:
+    1. Start a cluster with multiple peers
+    2. Create a collection with multiple shards and replicas
+    3. Insert initial data
+    4. Kill one peer to simulate failure
+    5. Insert more data to trigger recovery transfer
+    6. Restart the killed peer
+    7. Wait for shard transfer to start
+    8. Force delete the source peer of the transfer
+    9. Verify that transfers are aborted
+    """
    assert_project_root()

67-72: Handle the unused variable and clarify comment.

The variable url is assigned but never used, and the comment could be clearer about the purpose of stopping the source node.

-    # Stop the 'source' node to simulate an unreachable node
+    # Stop the source node of the shard transfer to simulate a failure during transfer
     source_peer_url = peer_id_to_url[from_peer_id]
     peer_idx = peer_api_uris.index(source_peer_url)
     p = processes.pop(peer_idx)
-    url = peer_api_uris.pop(peer_idx)
+    peer_api_uris.pop(peer_idx)  # Remove the URL of the killed peer from the list
     sleep(1)  # Give killed peer time to release WAL lock

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6365d82 and 5b5cbd2.

📒 Files selected for processing (1)
  • tests/consensus_tests/test_force_delete_peer.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/consensus_tests/test_force_delete_peer.py (2)
tests/consensus_tests/fixtures.py (1)
  • upsert_random_points (83-129)
tests/consensus_tests/utils.py (6)
  • assert_project_root (110-113)
  • start_cluster (220-245)
  • wait_collection_exists_and_active_on_all_peers (753-757)
  • start_peer (147-180)
  • wait_for_collection_shard_transfers_count (616-622)
  • get_collection_cluster_info (294-298)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_force_delete_peer.py

2-2: from .utils import * used; unable to detect undefined names

(F403)


12-12: requests may be undefined, or defined from star imports

(F405)


19-19: requests may be undefined, or defined from star imports

(F405)


24-24: assert_project_root may be undefined, or defined from star imports

(F405)


26-26: start_cluster may be undefined, or defined from star imports

(F405)


29-29: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


43-43: processes may be undefined, or defined from star imports

(F405)


51-51: start_peer may be undefined, or defined from star imports

(F405)


58-58: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


60-60: get_collection_cluster_info may be undefined, or defined from star imports

(F405)


70-70: processes may be undefined, or defined from star imports

(F405)


71-71: Local variable url is assigned to but never used

Remove assignment to unused variable url

(F841)


80-80: get_collection_cluster_info may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: lint
🔇 Additional comments (4)
tests/consensus_tests/test_force_delete_peer.py (4)

11-16: Force delete peer function looks good.

The function makes a DELETE request with the force flag and properly validates the response. This is the core functionality being tested.



18-22: Peer ID retrieval function looks good.

The function correctly retrieves and parses the peer ID from the cluster API response.



77-83: Good validation of test expectation.

The test correctly verifies that shard transfers are aborted after force deleting the source peer, which is the main behavior being tested.



38-56: Good simulation of peer failure and recovery scenario.

This section effectively simulates a real-world scenario where a peer fails and is later restarted, triggering shard transfers. The comments clearly explain the purpose of each step.


coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
tests/consensus_tests/test_force_delete_peer.py (3)

1-4: Replace wildcard imports with explicit imports for better maintainability.

Using wildcard imports (from .utils import *) makes it difficult to track which symbols are being imported. This can lead to naming conflicts and makes the code harder to understand and maintain.

 import pathlib
-from .utils import *
+from .utils import (
+    assert_project_root, start_cluster, start_peer, processes,
+    wait_collection_exists_and_active_on_all_peers,
+    wait_for_collection_shard_transfers_count,
+    get_collection_cluster_info
+)
 from .fixtures import upsert_random_points, create_collection
 from time import sleep
+import requests
🧰 Tools
🪛 Ruff (0.8.2)

2-2: from .utils import * used; unable to detect undefined names

(F403)


23-23: Add docstring to explain test purpose and steps.

Adding a docstring would make the test function's purpose and methodology clearer to other developers.

 def test_force_delete_source_peer_during_transfers(tmp_path: pathlib.Path):
+    """
+    Tests that shard transfers are aborted when the source peer is force deleted.
+    
+    Test steps:
+    1. Start a cluster with multiple peers
+    2. Create a collection with multiple shards and replicas
+    3. Insert initial data
+    4. Kill one peer to simulate failure
+    5. Insert more data to trigger recovery transfer
+    6. Restart the killed peer
+    7. Wait for shard transfer to start
+    8. Force delete the source peer of the transfer
+    9. Verify that transfers are aborted
+    """
     assert_project_root()

66-70: Fix unused variable and clarify comment about stopping the source node.

The variable url is assigned but never used. Also, the comment could be more specific about why you're stopping the source node.

-    # Stop the 'source' node to simulate an unreachable node
+    # Stop the source node of the shard transfer to simulate a failure during transfer
     source_peer_url = peer_id_to_url[from_peer_id]
     peer_idx = peer_api_uris.index(source_peer_url)
     p = processes.pop(peer_idx)
-    url = peer_api_uris.pop(peer_idx)
+    peer_api_uris.pop(peer_idx)  # Remove the URL of the killed peer from the list

🧹 Nitpick comments (3)
tests/consensus_tests/test_force_delete_peer.py (3)

63-63: Improve assertion message with more context.

Add an assertion message to make it clearer what's being tested and why the last peer is expected to be the destination.

-    assert transfers[0]["to"] == list(peer_id_to_url.keys())[-1] # last peer was restarted
+    assert transfers[0]["to"] == list(peer_id_to_url.keys())[-1], "Transfer should be to the restarted peer"

75-75: Replace arbitrary sleep with a more reliable wait mechanism.

Using a fixed sleep duration can make tests flaky. Consider implementing a polling mechanism that waits for a specific condition or explain why the sleep is necessary in this case.

-    sleep(1)
+    # Wait briefly for cluster state to update after peer deletion
+    sleep(1)  # This delay is necessary to ensure cluster state is updated before checking transfers
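Where such waits recur across tests, a generic polling helper keeps them explicit and bounded instead of relying on fixed sleeps. A minimal sketch (the `wait_until` name and signature are hypothetical, not part of the existing `utils` module):

```python
import time

def wait_until(check, timeout: float = 10.0, interval: float = 0.25) -> None:
    """Poll `check` until it returns True, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")
```

The fixed `sleep(1)` could then become a `wait_until(...)` call against whatever cluster-state condition the test actually depends on, failing fast with a clear error when that condition is never reached.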

77-80: Add a detailed assertion message for test conclusion.

Make the final assertion more informative by explaining the expected behavior.

     # We expect transfers to be aborted
     wait_for_collection_shard_transfers_count(
-        peer_api_uris[0], COLLECTION_NAME, 0
+        peer_api_uris[0], COLLECTION_NAME, 0,
+        # Verification that after force-deleting the source peer, no shard transfers remain
     )
🧰 Tools
🪛 Ruff (0.8.2)

78-78: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5b5cbd2 and 7fb0f24.

📒 Files selected for processing (1)
  • tests/consensus_tests/test_force_delete_peer.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/consensus_tests/test_force_delete_peer.py (2)
tests/consensus_tests/fixtures.py (1)
  • upsert_random_points (83-129)
tests/consensus_tests/utils.py (6)
  • assert_project_root (110-113)
  • start_cluster (220-245)
  • wait_collection_exists_and_active_on_all_peers (753-757)
  • start_peer (147-180)
  • wait_for_collection_shard_transfers_count (616-622)
  • get_collection_cluster_info (294-298)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_force_delete_peer.py

2-2: from .utils import * used; unable to detect undefined names

(F403)


12-12: requests may be undefined, or defined from star imports

(F405)


19-19: requests may be undefined, or defined from star imports

(F405)


24-24: assert_project_root may be undefined, or defined from star imports

(F405)


26-26: start_cluster may be undefined, or defined from star imports

(F405)


29-29: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


42-42: processes may be undefined, or defined from star imports

(F405)


50-50: start_peer may be undefined, or defined from star imports

(F405)


57-57: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


59-59: get_collection_cluster_info may be undefined, or defined from star imports

(F405)


69-69: processes may be undefined, or defined from star imports

(F405)


70-70: Local variable url is assigned to but never used

Remove assignment to unused variable url

(F841)


78-78: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (ubuntu-latest)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
tests/consensus_tests/test_force_delete_peer.py (1)

67-71: 🛠️ Refactor suggestion

Add a timeout to the transfer abort verification.

If the transfers are not aborted as expected, the current wait function might hang indefinitely. Consider adding a timeout parameter or checking for specific failure conditions.

     # We expect transfers to be aborted
     wait_for_collection_shard_transfers_count(
-        peer_api_uris[0], COLLECTION_NAME, 0
+        peer_api_uris[0], COLLECTION_NAME, 0, max_wait=30  # Add a timeout of 30 seconds
     )
+    
+    # Verify the peer was actually deleted
+    cluster_info = requests.get(f"{peer_api_uris[0]}/cluster").json()["result"]
+    assert from_peer_id not in [peer["id"] for peer in cluster_info["peers"]], "Source peer was not actually deleted"
🧰 Tools
🪛 Ruff (0.8.2)

68-68: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)

♻️ Duplicate comments (2)
tests/consensus_tests/test_force_delete_peer.py (2)

1-5: Add explicit imports for clarity and maintainability.

The code uses wildcard imports and implicitly relies on the requests module that's imported through the star import. For better code clarity, maintainability, and to avoid potential issues if the utility module changes, use explicit imports.

import pathlib
from .utils import *
from .fixtures import upsert_random_points, create_collection
from time import sleep
+import requests

Additionally, I recommend explicitly importing the specific functions you're using from .utils rather than using a wildcard import:

-from .utils import *
+from .utils import (
+    assert_project_root, start_cluster, start_peer,
+    wait_collection_exists_and_active_on_all_peers,
+    wait_for_collection_shard_transfers_count,
+    get_collection_cluster_info, processes
+)
🧰 Tools
🪛 Ruff (0.8.2)

2-2: from .utils import * used; unable to detect undefined names

(F403)


21-29: Add a docstring to explain test purpose and steps.

The test function should include a docstring that explains its purpose and the steps it follows to test the behavior.

def test_force_delete_source_peer_during_transfers(tmp_path: pathlib.Path):
+    """
+    Tests that shard transfers are aborted when the source peer is force deleted.
+    
+    Test steps:
+    1. Start a cluster with multiple peers
+    2. Create a collection with multiple shards and replicas
+    3. Insert initial data
+    4. Kill one peer to simulate failure
+    5. Insert more data to trigger recovery transfer
+    6. Restart the killed peer
+    7. Wait for shard transfer to start
+    8. Force delete the source peer of the transfer
+    9. Verify that transfers are aborted
+    """
    assert_project_root()

    peer_api_uris, peer_dirs, bootstrap_uri = start_cluster(tmp_path, N_PEERS)
🧰 Tools
🪛 Ruff (0.8.2)

22-22: assert_project_root may be undefined, or defined from star imports

(F405)


24-24: start_cluster may be undefined, or defined from star imports

(F405)


27-27: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

🧹 Nitpick comments (4)
tests/consensus_tests/test_force_delete_peer.py (4)

9-14: Function looks good but add a docstring.

The helper function is well-implemented with proper error handling. Consider adding a docstring to explain the purpose of this function and its parameters.

def force_delete_peer(peer_api_uri: str, peer_id: int):
+    """
+    Force delete a peer from the cluster.
+    
+    Args:
+        peer_api_uri: The API URI of any active peer in the cluster
+        peer_id: The ID of the peer to be deleted
+    
+    Raises:
+        AssertionError: If the request fails
+    """
    response = requests.delete(
        f"{peer_api_uri}/cluster/peer/{peer_id}?force=true",
    )
    assert response.status_code == 200, f"Failed to force delete peer: {response.text}"
🧰 Tools
🪛 Ruff (0.8.2)

10-10: requests may be undefined, or defined from star imports

(F405)


16-20: Add a docstring to the helper function.

Similar to the previous function, adding a docstring would improve code readability and maintainability.

def get_peer_id(peer_api_uri: str) -> int:
+    """
+    Get the ID of a peer by querying its cluster information.
+    
+    Args:
+        peer_api_uri: The API URI of the peer
+        
+    Returns:
+        int: The ID of the peer
+        
+    Raises:
+        AssertionError: If the request fails
+    """
    response = requests.get(f"{peer_api_uri}/cluster")
    assert response.status_code == 200, f"Failed to get peer ID: {response.text}"
    return response.json()["result"]["peer_id"]
🧰 Tools
🪛 Ruff (0.8.2)

17-17: requests may be undefined, or defined from star imports

(F405)


39-43: Clarify the peer selection strategy.

The test kills the last peer from the processes list but there's no explicit verification that this corresponds to the last peer in peer_api_uris. A comment would help clarify this relationship.

-    # Kill last peer
+    # Kill last peer (processes and peer_api_uris are populated in the same order by start_cluster)
     p = processes.pop()
     p.kill()
     sleep(1)  # Give killed peer time to release WAL lock
🧰 Tools
🪛 Ruff (0.8.2)

40-40: processes may be undefined, or defined from star imports

(F405)


54-61: Add verification of the shard transfer.

The test waits for a shard transfer to start and checks the transfer count, but it would be good to add more detailed verification of the transfer state and logging for debugging.

     # Wait for start of shard transfer
     wait_for_collection_shard_transfers_count(peer_api_uris[0], COLLECTION_NAME, 1)

     transfers = get_collection_cluster_info(peer_api_uris[0], COLLECTION_NAME)[
         "shard_transfers"
     ]
     assert len(transfers) == 1
+    print(f"Observed shard transfer: {transfers[0]}")
     assert transfers[0]["to"] == list(peer_id_to_url.keys())[-1] # last peer was restarted
+    assert "from" in transfers[0], f"Missing source peer in transfer: {transfers[0]}"
     from_peer_id = transfers[0]["from"]
🧰 Tools
🪛 Ruff (0.8.2)

55-55: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


57-57: get_collection_cluster_info may be undefined, or defined from star imports

(F405)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7fb0f24 and b9e3a70.

📒 Files selected for processing (1)
  • tests/consensus_tests/test_force_delete_peer.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/consensus_tests/test_force_delete_peer.py (2)
tests/consensus_tests/fixtures.py (1)
  • upsert_random_points (83-129)
tests/consensus_tests/utils.py (6)
  • assert_project_root (110-113)
  • start_cluster (220-245)
  • wait_collection_exists_and_active_on_all_peers (753-757)
  • start_peer (147-180)
  • wait_for_collection_shard_transfers_count (616-622)
  • get_collection_cluster_info (294-298)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_force_delete_peer.py

2-2: from .utils import * used; unable to detect undefined names

(F403)


10-10: requests may be undefined, or defined from star imports

(F405)


17-17: requests may be undefined, or defined from star imports

(F405)


22-22: assert_project_root may be undefined, or defined from star imports

(F405)


24-24: start_cluster may be undefined, or defined from star imports

(F405)


27-27: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


40-40: processes may be undefined, or defined from star imports

(F405)


48-48: start_peer may be undefined, or defined from star imports

(F405)


55-55: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


57-57: get_collection_cluster_info may be undefined, or defined from star imports

(F405)


68-68: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
tests/consensus_tests/test_force_delete_peer.py (3)

1-5: Explicit imports should be used instead of wildcard imports.

The code currently uses a wildcard import (from .utils import *) which makes it difficult to track which symbols are being imported. This could lead to maintainability issues and potential naming conflicts. Additionally, the code uses the requests module without explicitly importing it.

import pathlib
-from .utils import *
+from .utils import (assert_project_root, start_cluster, start_peer, 
+                    wait_collection_exists_and_active_on_all_peers, 
+                    wait_for_collection_shard_transfers_count, 
+                    get_collection_cluster_info, wait_for_same_commit,
+                    processes)
from .fixtures import upsert_random_points, create_collection
from time import sleep
+import requests
🧰 Tools
🪛 Ruff (0.8.2)

2-2: from .utils import * used; unable to detect undefined names

(F403)


21-22: Add a docstring to the test function.

A docstring would help explain the purpose of the test and the steps it takes, making it easier for other developers to understand the test case.

def test_force_delete_source_peer_during_transfers(tmp_path: pathlib.Path):
+    """
+    Tests that shard transfers are aborted when the source peer is force deleted.
+    
+    Test steps:
+    1. Start a cluster with multiple peers
+    2. Create a collection with multiple shards and replicas
+    3. Insert initial data
+    4. Kill one peer to simulate failure
+    5. Insert more data to trigger recovery transfer
+    6. Restart the killed peer
+    7. Wait for shard transfer to start
+    8. Force delete the source peer of the transfer
+    9. Verify that transfers are aborted
+    """
    assert_project_root()
🧰 Tools
🪛 Ruff (0.8.2)

22-22: assert_project_root may be undefined, or defined from star imports

(F405)


64-69: Improve the comment and handling of the source peer removal.

The comment could be clearer and the code for killing the peer could be more explicit.

-    # Stop the 'source' node to simulate an unreachable node which needs to be deleted
+    # Stop the source node of the shard transfer to simulate a failure during transfer
     source_peer_url = peer_id_to_url[from_peer_id]
     peer_idx = peer_api_uris.index(source_peer_url)
-    processes.pop(peer_idx).kill() # Kill the source peer
-    peer_api_uris.pop(peer_idx) # Remove from urls so we don't try to call it
+    source_process = processes.pop(peer_idx) 
+    source_process.kill()  # Kill the source peer
+    peer_api_uris.pop(peer_idx)  # Remove from urls so we don't try to call it
🧰 Tools
🪛 Ruff (0.8.2)

67-67: processes may be undefined, or defined from star imports

(F405)

🧹 Nitpick comments (3)
tests/consensus_tests/test_force_delete_peer.py (3)

39-42: Add a more robust way to identify the peer to kill.

The current implementation simply kills the last peer in the processes list, which might be brittle if the order of processes changes. Consider using a more explicit way to identify which peer to kill.

-    # Kill last peer
-    p = processes.pop()
-    p.kill()
+    # Kill the last peer to simulate a node failure
+    last_peer_process = processes.pop()
+    last_peer_process.kill()
     sleep(1)  # Give killed peer time to release WAL lock
🧰 Tools
🪛 Ruff (0.8.2)

40-40: processes may be undefined, or defined from star imports

(F405)


57-63: Verify the data transfer properties more thoroughly.

The assertions are good, but could be more comprehensive to ensure the transfer is happening as expected. Consider adding assertions for other properties of the transfer.

    transfers = get_collection_cluster_info(peer_api_uris[0], COLLECTION_NAME)[
        "shard_transfers"
    ]
    assert len(transfers) == 1
+    # Verify that transfer is going to the restarted peer
    assert transfers[0]["to"] == list(peer_id_to_url.keys())[-1] # last peer was restarted
+    # Verify that transfer is happening from a valid source peer
    from_peer_id = transfers[0]["from"]
+    assert from_peer_id in peer_id_to_url, f"Invalid source peer ID: {from_peer_id}"
🧰 Tools
🪛 Ruff (0.8.2)

57-57: get_collection_cluster_info may be undefined, or defined from star imports

(F405)


75-78: Add an assertion to verify cluster stability after force deleting a peer.

After verifying that transfers are aborted, it would be valuable to ensure the cluster remains stable and operational.

    # We expect transfers to be aborted
    wait_for_collection_shard_transfers_count(
        peer_api_uris[0], COLLECTION_NAME, 0
    )
+    
+    # Verify that the cluster remains stable after force deleting a peer
+    # Try to insert some more points to ensure the cluster is still operational
+    upsert_random_points(peer_api_uris[0], 50)
+    
+    # Verify that all remaining peers have the same view of the cluster
+    wait_for_same_commit(peer_api_uris)
🧰 Tools
🪛 Ruff (0.8.2)

76-76: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9e3a70 and 4b54820.

📒 Files selected for processing (1)
  • tests/consensus_tests/test_force_delete_peer.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/consensus_tests/test_force_delete_peer.py (2)
tests/consensus_tests/fixtures.py (1)
  • upsert_random_points (83-129)
tests/consensus_tests/utils.py (7)
  • assert_project_root (110-113)
  • start_cluster (220-245)
  • wait_collection_exists_and_active_on_all_peers (753-757)
  • start_peer (147-180)
  • wait_for_collection_shard_transfers_count (616-622)
  • get_collection_cluster_info (294-298)
  • wait_for_same_commit (592-597)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_force_delete_peer.py

2-2: from .utils import * used; unable to detect undefined names

(F403)


10-10: requests may be undefined, or defined from star imports

(F405)


17-17: requests may be undefined, or defined from star imports

(F405)


22-22: assert_project_root may be undefined, or defined from star imports

(F405)


24-24: start_cluster may be undefined, or defined from star imports

(F405)


27-27: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


40-40: processes may be undefined, or defined from star imports

(F405)


48-48: start_peer may be undefined, or defined from star imports

(F405)


55-55: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


57-57: get_collection_cluster_info may be undefined, or defined from star imports

(F405)


67-67: processes may be undefined, or defined from star imports

(F405)


70-70: wait_for_same_commit may be undefined, or defined from star imports

(F405)


76-76: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-low-resources
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
🔇 Additional comments (4)
tests/consensus_tests/test_force_delete_peer.py (4)

9-14: LGTM: Helper function for force deleting a peer.

The function is well-defined with proper type hints and error handling. It's a good abstraction for the test.

🧰 Tools
🪛 Ruff (0.8.2)

10-10: requests may be undefined, or defined from star imports

(F405)


16-19: LGTM: Helper function for retrieving peer ID.

This utility function provides a clean abstraction for getting a peer's ID with proper error handling and type annotations.

🧰 Tools
🪛 Ruff (0.8.2)

17-17: requests may be undefined, or defined from star imports

(F405)


54-56: LGTM: Waiting for shard transfer initialization.

Good usage of the wait_for_collection_shard_transfers_count utility function to ensure the test waits for the expected condition before proceeding.

🧰 Tools
🪛 Ruff (0.8.2)

55-55: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


72-73: LGTM: Force deleting the source peer.

Correctly uses the helper function to force delete the source peer of the shard transfer.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/consensus_tests/test_force_delete_peer.py (2)

1-5: 🛠️ Refactor suggestion

Clean up imports to improve code maintainability.

The code uses wildcard imports (from .utils import *) which makes it difficult to track which symbols are imported. It also uses the requests module without explicitly importing it.

import pathlib
-from .utils import *
+from .utils import (assert_project_root, start_cluster, start_peer, 
+                    wait_collection_exists_and_active_on_all_peers, 
+                    wait_for_collection_shard_transfers_count, 
+                    get_collection_cluster_info, processes)
from .fixtures import upsert_random_points, create_collection
from time import sleep
+import requests
🧰 Tools
🪛 Ruff (0.8.2)

2-2: from .utils import * used; unable to detect undefined names

(F403)


21-22: 🛠️ Refactor suggestion

Add docstring to explain test purpose and steps.

The test function should include a docstring that explains its purpose and the steps it follows to test the behavior.

def test_force_delete_source_peer_during_transfers(tmp_path: pathlib.Path):
+    """
+    Tests that shard transfers are aborted when the source peer is force deleted.
+    
+    Test steps:
+    1. Start a cluster with 3 peers
+    2. Create a collection with 2 shards and 3 replicas
+    3. Insert initial data
+    4. Kill one peer to simulate failure
+    5. Insert more data to trigger recovery transfer
+    6. Restart the killed peer
+    7. Wait for shard transfer to start
+    8. Force delete the source peer of the transfer
+    9. Verify that transfers are aborted
+    """
    assert_project_root()
🧰 Tools
🪛 Ruff (0.8.2)

22-22: assert_project_root may be undefined, or defined from star imports

(F405)

🧹 Nitpick comments (5)
tests/consensus_tests/test_force_delete_peer.py (5)

9-14: Function looks good, but add docstring.

The function to force delete a peer is well-implemented, but missing a docstring to explain its purpose and parameters.

def force_delete_peer(peer_api_uri: str, peer_id: int):
+    """
+    Force delete a peer from the cluster with the specified ID.
+    
+    Args:
+        peer_api_uri: URI of any active peer in the cluster
+        peer_id: ID of the peer to be forcibly deleted
+    """
    response = requests.delete(
        f"{peer_api_uri}/cluster/peer/{peer_id}?force=true",
    )
    assert response.status_code == 200, f"Failed to force delete peer: {response.text}"
🧰 Tools
🪛 Ruff (0.8.2)

10-10: requests may be undefined, or defined from star imports

(F405)


16-20: Function looks good, but add docstring.

The function to get peer ID is well-implemented, but missing a docstring to explain its purpose and parameters.

def get_peer_id(peer_api_uri: str) -> int:
+    """
+    Get the peer ID for a peer given its API URI.
+    
+    Args:
+        peer_api_uri: URI of the peer
+        
+    Returns:
+        The peer ID as an integer
+    """
    response = requests.get(f"{peer_api_uri}/cluster")
    assert response.status_code == 200, f"Failed to get peer ID: {response.text}"
    return response.json()["result"]["peer_id"]
🧰 Tools
🪛 Ruff (0.8.2)

17-17: requests may be undefined, or defined from star imports

(F405)


39-42: Improve comment clarity for killed peer.

The comment "Kill last peer" could be clearer about which peer is being killed and why.

-    # Kill last peer
+    # Kill the last peer to simulate a node failure
     p = processes.pop()
     p.kill()
     sleep(1)  # Give killed peer time to release WAL lock
🧰 Tools
🪛 Ruff (0.8.2)

40-40: processes may be undefined, or defined from star imports

(F405)


44-45: Clarify why upserting points triggers recovery.

The comment mentions marking a dummy replica as dead, but it would be helpful to explain how upserting points accomplishes this.

-    # Upsert points to mark dummy replica as dead, that will trigger recovery transfer
+    # Upsert points - since one peer is down, these updates won't reach it
+    # This will mark its replicas as stale, triggering recovery transfer when it rejoins
     upsert_random_points(peer_api_uris[0], 100)

54-62: Good verification of the transfer target; consider also verifying the shard ID.

The code correctly waits for and verifies the shard transfer, but it might be useful to capture and verify the specific shard ID being transferred.

    # Wait for start of shard transfer
    wait_for_collection_shard_transfers_count(peer_api_uris[0], COLLECTION_NAME, 1)

    transfers = get_collection_cluster_info(peer_api_uris[0], COLLECTION_NAME)[
        "shard_transfers"
    ]
    assert len(transfers) == 1
    assert transfers[0]["to"] == list(peer_id_to_url.keys())[-1] # last peer was restarted
    from_peer_id = transfers[0]["from"]
+    shard_id = transfers[0]["shard_id"]  # Capture shard ID for debugging/verification
🧰 Tools
🪛 Ruff (0.8.2)

55-55: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


57-57: get_collection_cluster_info may be undefined, or defined from star imports

(F405)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b54820 and 4fb0b5c.

📒 Files selected for processing (1)
  • tests/consensus_tests/test_force_delete_peer.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/consensus_tests/test_force_delete_peer.py (2)
tests/consensus_tests/fixtures.py (1)
  • upsert_random_points (83-129)
tests/consensus_tests/utils.py (6)
  • assert_project_root (110-113)
  • start_cluster (220-245)
  • wait_collection_exists_and_active_on_all_peers (753-757)
  • start_peer (147-180)
  • wait_for_collection_shard_transfers_count (616-622)
  • get_collection_cluster_info (294-298)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_force_delete_peer.py

2-2: from .utils import * used; unable to detect undefined names

(F403)


10-10: requests may be undefined, or defined from star imports

(F405)


17-17: requests may be undefined, or defined from star imports

(F405)


22-22: assert_project_root may be undefined, or defined from star imports

(F405)


24-24: start_cluster may be undefined, or defined from star imports

(F405)


27-27: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


40-40: processes may be undefined, or defined from star imports

(F405)


48-48: start_peer may be undefined, or defined from star imports

(F405)


55-55: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


57-57: get_collection_cluster_info may be undefined, or defined from star imports

(F405)


73-73: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: lint
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test
🔇 Additional comments (4)
tests/consensus_tests/test_force_delete_peer.py (4)

31-35: LGTM! Useful mapping between peer IDs and URLs.

The code builds a mapping between peer IDs and their API URLs, which is essential for later operations in the test.


47-53: LGTM! Good approach for restarting the peer.

The code correctly restarts the previously killed peer, which should trigger shard transfers to synchronize the data it missed while being down.

🧰 Tools
🪛 Ruff (0.8.2)

48-48: start_peer may be undefined, or defined from star imports

(F405)


69-71: LGTM! Force deleting the peer is the key test action.

This is the core action being tested - forcing deletion of a peer while it's the source of an ongoing shard transfer.


72-75: LGTM! Good verification of expected behavior.

The test correctly verifies that shard transfers are aborted after force deleting the source peer, which is the behavior this test aims to verify.

🧰 Tools
🪛 Ruff (0.8.2)

73-73: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)

* Abort transfer on force peer delete

* ToC changes to abort internally

* Abort shard transfer using consensus not just locally

* fmt

* fix fmt

* Abort transfers internally when shard is being removed

* remove unused imports
@KShivendu KShivendu changed the title from "Test for reproducing force delete peer bug" to "Abort transfer on force peer delete and add test for bug" May 15, 2025
@KShivendu KShivendu merged commit db51467 into dev May 15, 2025
17 checks passed
@KShivendu KShivendu deleted the force-delete-peer-test branch May 15, 2025 09:59
@coderabbitai coderabbitai bot mentioned this pull request May 21, 2025
3 tasks
generall pushed a commit that referenced this pull request May 22, 2025
* Test for reproducing force delete peer bug

* drop seed port

* use larger transfers to ensure killing before transfer is completed

* simplify test cases

* kill and wait for same commit before force delete

* Don't kill to avoid breaking consensus before force delete peer

* Abort transfer on force peer delete (#6507)

* Abort transfer on force peer delete

* ToC changes to abort internally

* Abort shard transfer using consensus not just locally

* fmt

* fix fmt

* Abort transfers internally when shard is being removed

* remove unused imports