Skip to content

Conversation

KShivendu
Copy link
Member

@KShivendu KShivendu commented Apr 1, 2025

Add a new integration test that ensures that Qdrant is able to load a dirty shard (shard initializing flag still present). This test is expected to fail currently because Qdrant isn't doing the right thing. There will be another PR on top for this that fixes the test failure.

It currently loads a dummy shard and stays like that (since we have handle_collection_load_errors: true in our development.yaml). Ideally it should initiate a transfer and get the local shard fixed.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

Copy link
Contributor

coderabbitai bot commented Apr 1, 2025

📝 Walkthrough

Walkthrough

The pull request introduces a new function, corrupt_shard_dir, in the consensus tests to simulate corruption in a shard directory by deleting specific directories within a shard path. It also adds a new test function, test_dirty_shard_handling_with_active_replicas, which is parameterized to evaluate different shard transfer methods: "snapshot", "stream_records", and "wal_delta". This test sets up a cluster, creates a collection, and populates it with data, then simulates a dirty shard scenario and verifies the recovery process for local and remote shards. Additionally, the existing test_failed_snapshot_recovery function has been updated to include a newline for formatting consistency. In the utils module, the check_all_replicas_active function has been modified to include error handling for connection issues, returning False in case of a requests.exceptions.ConnectionError.

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/consensus_tests/test_failed_snapshot_recovery.py (2)

218-296: Mark test as expected failure.

The test test_dirty_shard_handling_with_active_replicas is well-structured and documents the current issue with Qdrant's handling of dirty shards. However, your comment on line 273-276 indicates that this test is expected to fail, which aligns with the PR objective of illustrating the missing functionality.

Consider marking this test as an expected failure to reflect its current status:

+@pytest.mark.xfail(reason="Qdrant loads a dummy shard instead of initiating a transfer to fix the local shard")
 @pytest.mark.parametrize("transfer_method", ["snapshot", "stream_records", "wal_delta"])
 def test_dirty_shard_handling_with_active_replicas(tmp_path: pathlib.Path, transfer_method: str):

This way, the test will run but won't cause the test suite to fail, and the expectation is clearly documented.

🧰 Tools
🪛 Ruff (0.8.2)

218-218: pytest may be undefined, or defined from star imports

(F405)


220-220: assert_project_root may be undefined, or defined from star imports

(F405)


222-222: start_cluster may be undefined, or defined from star imports

(F405)


231-231: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


235-235: wait_for_same_commit may be undefined, or defined from star imports

(F405)


251-251: Path may be undefined, or defined from star imports

(F405)


254-254: processes may be undefined, or defined from star imports

(F405)


258-258: start_peer may be undefined, or defined from star imports

(F405)


263-263: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


266-266: wait_for_all_replicas_active may be undefined, or defined from star imports

(F405)


229-233: Consider extracting common setup into a fixture.

This setup code is very similar to the one in test_failed_snapshot_recovery. If more tests like this are expected in the future, consider extracting the common setup logic into a pytest fixture for reusability.

🧰 Tools
🪛 Ruff (0.8.2)

231-231: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33d062a and 9b1c82b.

📒 Files selected for processing (2)
  • tests/consensus_tests/test_failed_snapshot_recovery.py (1 hunks)
  • tests/consensus_tests/utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/consensus_tests/test_failed_snapshot_recovery.py

218-218: pytest may be undefined, or defined from star imports

(F405)


220-220: assert_project_root may be undefined, or defined from star imports

(F405)


222-222: start_cluster may be undefined, or defined from star imports

(F405)


231-231: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


235-235: wait_for_same_commit may be undefined, or defined from star imports

(F405)


251-251: Path may be undefined, or defined from star imports

(F405)


254-254: processes may be undefined, or defined from star imports

(F405)


258-258: start_peer may be undefined, or defined from star imports

(F405)


263-263: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


266-266: wait_for_all_replicas_active may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: test (macos-latest)
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (2)
tests/consensus_tests/utils.py (1)

477-488: Good improvement in error handling.

The addition of a try-except block to catch ConnectionError in check_all_replicas_active is a sensible enhancement. This makes the function more robust when network issues occur during testing, especially for scenarios where a peer might be temporarily unavailable during restart.

🧰 Tools
🪛 Ruff (0.8.2)

477-477: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

tests/consensus_tests/test_failed_snapshot_recovery.py (1)

247-252: Good documentation of edge case scenario.

The test correctly simulates a real-world scenario where a node might be killed during shard initialization. This kind of edge case testing is valuable for improving system robustness.

🧰 Tools
🪛 Ruff (0.8.2)

251-251: Path may be undefined, or defined from star imports

(F405)

@generall
Copy link
Member

generall commented Apr 1, 2025

I assume the CI failure is intended?

@KShivendu
Copy link
Member Author

I assume the CI failure is intended?

Yes.

@KShivendu KShivendu force-pushed the test-shard-initializing-behaviour branch from 9b1c82b to c9a638f Compare April 3, 2025 09:04
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (1)

273-277: 🛠️ Refactor suggestion

Expected test failure needs clarification

The test contains an assertion that is explicitly expected to fail. While this aligns with the PR objective, having intentionally failing tests in the codebase may cause confusion:

  1. Consider adding pytest.mark.xfail to explicitly mark this test as expected to fail
  2. Add a tracking issue number in the comment to indicate when this should be fixed
  3. Consider explicitly disabling handle_collection_load_errors flag to better mimic a real scenario
- assert local_shards[0]["points_count"] == 1000  # FAILS because of dummy shard
-
- # Qdrant loads a dummy shard (since handle_collection_load_errors=true) so we get 0 points instead of 1000 and test fails
- # Ideally Qdrant should try to recover this shard automatically using other replicas
+ # FIXME: This currently fails because Qdrant loads a dummy shard instead of properly recovering
+ # Issue: <add-issue-number-here>
+ pytest.xfail("Qdrant currently loads a dummy shard when handle_collection_load_errors=true")
+ assert local_shards[0]["points_count"] == 1000
🧹 Nitpick comments (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (1)

222-226: Consider explicitly setting handle_collection_load_errors

The test relies on the handle_collection_load_errors setting being true (from development.yaml), but doesn't explicitly set it. To better control the test environment and avoid dependency on configuration files, consider explicitly setting this value in the test.

  peer_api_uris, peer_dirs, bootstrap_uri = start_cluster(
      tmp_path,
      N_PEERS,
-     extra_env={"QDRANT__STORAGE__SHARD_TRANSFER_METHOD": transfer_method },
+     extra_env={
+         "QDRANT__STORAGE__SHARD_TRANSFER_METHOD": transfer_method,
+         "QDRANT__STORAGE__HANDLE_COLLECTION_LOAD_ERRORS": "true"
+     },
  )
🧰 Tools
🪛 Ruff (0.8.2)

222-222: start_cluster may be undefined, or defined from star imports

(F405)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b1c82b and c9a638f.

📒 Files selected for processing (2)
  • tests/consensus_tests/test_failed_snapshot_recovery.py (1 hunks)
  • tests/consensus_tests/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/consensus_tests/utils.py
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/consensus_tests/test_failed_snapshot_recovery.py

218-218: pytest may be undefined, or defined from star imports

(F405)


220-220: assert_project_root may be undefined, or defined from star imports

(F405)


222-222: start_cluster may be undefined, or defined from star imports

(F405)


231-231: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


235-235: wait_for_same_commit may be undefined, or defined from star imports

(F405)


251-251: Path may be undefined, or defined from star imports

(F405)


254-254: processes may be undefined, or defined from star imports

(F405)


258-258: start_peer may be undefined, or defined from star imports

(F405)


263-263: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


266-266: wait_for_all_replicas_active may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consensus
  • GitHub Check: test (windows-latest)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
🔇 Additional comments (5)
tests/consensus_tests/test_failed_snapshot_recovery.py (5)

215-216: LGTM - added newline for better readability

The addition of a newline improves code readability and follows proper Python style guidelines.


218-226: Parametrized test approach looks good

The test is well-parametrized to cover all three shard transfer methods ("snapshot", "stream_records", "wal_delta"), which provides comprehensive coverage of different scenarios.

🧰 Tools
🪛 Ruff (0.8.2)

218-218: pytest may be undefined, or defined from star imports

(F405)


220-220: assert_project_root may be undefined, or defined from star imports

(F405)


222-222: start_cluster may be undefined, or defined from star imports

(F405)


247-252: Consider testing additional failure scenarios

The current test simulates a dirty shard by creating an initialization flag, which is a good start. However, as suggested in previous comments, we should also consider testing more scenarios like corrupting files on disk or completely removing local shard data.

🧰 Tools
🪛 Ruff (0.8.2)

251-251: Path may be undefined, or defined from star imports

(F405)


262-267: Improve shard transfer detection

The current approach may miss transfers that start and complete quickly. Consider implementing the two-step waiting approach as suggested in previous comments.

- # Wait for end of shard transfer
- wait_for_collection_shard_transfers_count(peer_api_uris[0], COLLECTION_NAME, 0)
+ # Wait for transfer to start and then complete
+ wait_for_collection_shard_transfers_count(peer_api_uris[0], COLLECTION_NAME, 1)
+ wait_for_collection_shard_transfers_count(peer_api_uris[0], COLLECTION_NAME, 0)
🧰 Tools
🪛 Ruff (0.8.2)

263-263: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


266-266: wait_for_all_replicas_active may be undefined, or defined from star imports

(F405)


292-295: LGTM - Thorough verification of search results

The test properly verifies that search results remain consistent after the recovery process, which is a crucial aspect of ensuring data integrity.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/consensus_tests/test_failed_snapshot_recovery.py (4)

79-88: Fix variable name typo and consider existence checks when removing directories.

In corrupt_shard_dir(), the variable segements_path is misspelled and might cause confusion. Additionally, consider adding existence checks or exception handling before removing directories to avoid potential errors if they do not exist.

Example fix:

-    segements_path = shard_path / "segments"
+    segments_path = shard_path / "segments"

229-229: Explicitly import pytest if not defined in star imports.

The decorator @pytest.mark.parametrize might trigger warnings from static analysis tools about undefined names under star imports. Consider adding import pytest explicitly to improve clarity and satisfy lint checks.

🧰 Tools
🪛 Ruff (0.8.2)

229-229: pytest may be undefined, or defined from star imports

(F405)


275-278: Confirm that the “processes” reference is defined.

The reference to processes.pop() implies that processes originates from a star import and may not be obvious to new contributors or static analysis tools. Consider clarifying its definition or explicitly importing it to improve transparency.

🧰 Tools
🪛 Ruff (0.8.2)

275-275: processes may be undefined, or defined from star imports

(F405)


288-321: Monitor for potential test flakiness in shard transfer waits.

The calls to wait_for_collection_shard_transfers_count rely on transfer timings. Under certain conditions or slower environments, these waits might time out, leading to intermittent failures. Consider introducing a slightly larger timeout or additional waiting logic to ensure stability across environments.

🧰 Tools
🪛 Ruff (0.8.2)

288-288: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


291-291: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


294-294: wait_for_all_replicas_active may be undefined, or defined from star imports

(F405)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9a638f and 9f90e15.

📒 Files selected for processing (1)
  • tests/consensus_tests/test_failed_snapshot_recovery.py (3 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (4)
lib/collection/src/shards/local_shard/mod.rs (2)
  • shard_path (461-463)
  • wal_path (465-467)
lib/collection/src/shards/mod.rs (1)
  • shard_path (40-42)
tests/consensus_tests/utils.py (7)
  • assert_project_root (90-93)
  • start_cluster (189-214)
  • wait_collection_exists_and_active_on_all_peers (710-714)
  • wait_for_same_commit (549-554)
  • start_peer (116-149)
  • wait_for_collection_shard_transfers_count (573-579)
  • wait_for_all_replicas_active (533-538)
lib/collection/src/shards/shard_holder/mod.rs (2)
  • len (583-585)
  • get_local_shards (764-772)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_failed_snapshot_recovery.py

229-229: pytest may be undefined, or defined from star imports

(F405)


231-231: assert_project_root may be undefined, or defined from star imports

(F405)


237-237: start_cluster may be undefined, or defined from star imports

(F405)


246-246: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


250-250: wait_for_same_commit may be undefined, or defined from star imports

(F405)


267-267: Path may be undefined, or defined from star imports

(F405)


270-270: Path may be undefined, or defined from star imports

(F405)


275-275: processes may be undefined, or defined from star imports

(F405)


279-279: start_peer may be undefined, or defined from star imports

(F405)


288-288: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


291-291: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


294-294: wait_for_all_replicas_active may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
🔇 Additional comments (2)
tests/consensus_tests/test_failed_snapshot_recovery.py (2)

226-227: No concerns identified.

The additional assertion lines here are straightforward and align with the desired outcome.


230-273: Verify whether shard corruption is intentionally skipped.

The line corrupt_shard_dir(shard_path) is commented out. Since you’ve defined this utility, please confirm if you intend to skip shard directory corruption for this test. If not, uncomment the call or remove the function if it is no longer needed.

🧰 Tools
🪛 Ruff (0.8.2)

231-231: assert_project_root may be undefined, or defined from star imports

(F405)


237-237: start_cluster may be undefined, or defined from star imports

(F405)


246-246: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


250-250: wait_for_same_commit may be undefined, or defined from star imports

(F405)


267-267: Path may be undefined, or defined from star imports

(F405)


270-270: Path may be undefined, or defined from star imports

(F405)

@KShivendu KShivendu force-pushed the test-shard-initializing-behaviour branch from 9f90e15 to a5867bc Compare April 8, 2025 06:45
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (1)

79-87: Fix typo in variable name

There's a typo in the variable name segements_path which should be segments_path. Variable names should be consistent with their usage throughout the code.

def corrupt_shard_dir(shard_path: pathlib.Path):
-    segements_path = shard_path / "segments"
-    segment_names = filter(lambda x: x.is_dir(), pathlib.Path(segements_path).iterdir())
+    segments_path = shard_path / "segments"
+    segment_names = filter(lambda x: x.is_dir(), pathlib.Path(segments_path).iterdir())

    first_segment_path = next(segment_names)
    shutil.rmtree(first_segment_path)

    wal_path = shard_path / "wal"
    shutil.rmtree(wal_path)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f90e15 and a5867bc.

📒 Files selected for processing (2)
  • tests/consensus_tests/test_failed_snapshot_recovery.py (3 hunks)
  • tests/consensus_tests/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/consensus_tests/utils.py
🧰 Additional context used
🧬 Code Definitions (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (2)
tests/consensus_tests/utils.py (7)
  • assert_project_root (90-93)
  • start_cluster (189-214)
  • wait_collection_exists_and_active_on_all_peers (710-714)
  • wait_for_same_commit (549-554)
  • kill (33-39)
  • start_peer (116-149)
  • wait_for_all_replicas_active (533-538)
tests/consensus_tests/fixtures.py (2)
  • upsert_random_points (83-129)
  • random_dense_vector (19-20)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_failed_snapshot_recovery.py

229-229: pytest may be undefined, or defined from star imports

(F405)


231-231: assert_project_root may be undefined, or defined from star imports

(F405)


237-237: start_cluster may be undefined, or defined from star imports

(F405)


246-246: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)


250-250: wait_for_same_commit may be undefined, or defined from star imports

(F405)


267-267: Path may be undefined, or defined from star imports

(F405)


270-270: Path may be undefined, or defined from star imports

(F405)


275-275: processes may be undefined, or defined from star imports

(F405)


279-279: start_peer may be undefined, or defined from star imports

(F405)


288-288: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


291-291: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


294-294: wait_for_all_replicas_active may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consistency
  • GitHub Check: test-low-resources
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consensus
  • GitHub Check: test (windows-latest)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: lint
🔇 Additional comments (5)
tests/consensus_tests/test_failed_snapshot_recovery.py (5)

272-272: Commented-out function call

The corrupt_shard_dir function is implemented but not used as it's commented out. This raises questions about the test coverage. Is the corruption intentionally disabled?

Based on previous review comments, it appears this function was added as a suggestion, but the logic primarily depends on the shard initializing flag, which explains why it might be commented out.

If the corruption is intentionally not applied, consider adding a comment explaining why or removing the commented-out code.


229-235: LGTM! Comprehensive test parameters and configuration

The test is well-designed with different transfer methods and explicitly disables error handling to better simulate production scenarios. This aligns with the previous review comment where it was suggested to explicitly disable the handle_collection_load_errors flag.

🧰 Tools
🪛 Ruff (0.8.2)

229-229: pytest may be undefined, or defined from star imports

(F405)


231-231: assert_project_root may be undefined, or defined from star imports

(F405)


265-268: Simulation strategy looks good

The approach of creating the initializing flag to simulate a dirty shard is a good testing strategy. This simulates Qdrant being killed during shard directory movement, which can happen in production environments.

🧰 Tools
🪛 Ruff (0.8.2)

267-267: Path may be undefined, or defined from star imports

(F405)


288-291: Verify transfer completion sequence is correct

The code has been updated according to the previous review comment to first wait for a transfer to start and then wait for it to complete. This is the right sequence as transfers might not start instantly.

🧰 Tools
🪛 Ruff (0.8.2)

288-288: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


291-291: wait_for_collection_shard_transfers_count may be undefined, or defined from star imports

(F405)


316-320: Good search verification

The test properly verifies that search results remain consistent before and after recovery, which is an important validation for data integrity after shard recovery.

@KShivendu
Copy link
Member Author

The same commits are now part of #6293. So closing this now.

@KShivendu KShivendu closed this Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants