-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Test behaviour of Qdrant with shard initializing flag #6292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughThe pull request introduces a new function, ✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/consensus_tests/test_failed_snapshot_recovery.py (2)
218-296
: Mark test as expected failure.The test
test_dirty_shard_handling_with_active_replicas
is well-structured and documents the current issue with Qdrant's handling of dirty shards. However, your comment on line 273-276 indicates that this test is expected to fail, which aligns with the PR objective of illustrating the missing functionality.Consider marking this test as an expected failure to reflect its current status:
+@pytest.mark.xfail(reason="Qdrant loads a dummy shard instead of initiating a transfer to fix the local shard") @pytest.mark.parametrize("transfer_method", ["snapshot", "stream_records", "wal_delta"]) def test_dirty_shard_handling_with_active_replicas(tmp_path: pathlib.Path, transfer_method: str):
This way, the test will run but won't cause the test suite to fail, and the expectation is clearly documented.
🧰 Tools
🪛 Ruff (0.8.2)
218-218:
pytest
may be undefined, or defined from star imports(F405)
220-220:
assert_project_root
may be undefined, or defined from star imports(F405)
222-222:
start_cluster
may be undefined, or defined from star imports(F405)
231-231:
wait_collection_exists_and_active_on_all_peers
may be undefined, or defined from star imports(F405)
235-235:
wait_for_same_commit
may be undefined, or defined from star imports(F405)
251-251:
Path
may be undefined, or defined from star imports(F405)
254-254:
processes
may be undefined, or defined from star imports(F405)
258-258:
start_peer
may be undefined, or defined from star imports(F405)
263-263:
wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports(F405)
266-266:
wait_for_all_replicas_active
may be undefined, or defined from star imports(F405)
229-233
: Consider extracting common setup into a fixture.This setup code is very similar to the one in
test_failed_snapshot_recovery
. If more tests like this are expected in the future, consider extracting the common setup logic into a pytest fixture for reusability.🧰 Tools
🪛 Ruff (0.8.2)
231-231:
wait_collection_exists_and_active_on_all_peers
may be undefined, or defined from star imports(F405)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/consensus_tests/test_failed_snapshot_recovery.py
(1 hunks)tests/consensus_tests/utils.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/consensus_tests/test_failed_snapshot_recovery.py
218-218: pytest
may be undefined, or defined from star imports
(F405)
220-220: assert_project_root
may be undefined, or defined from star imports
(F405)
222-222: start_cluster
may be undefined, or defined from star imports
(F405)
231-231: wait_collection_exists_and_active_on_all_peers
may be undefined, or defined from star imports
(F405)
235-235: wait_for_same_commit
may be undefined, or defined from star imports
(F405)
251-251: Path
may be undefined, or defined from star imports
(F405)
254-254: processes
may be undefined, or defined from star imports
(F405)
258-258: start_peer
may be undefined, or defined from star imports
(F405)
263-263: wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports
(F405)
266-266: wait_for_all_replicas_active
may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: test (macos-latest)
- GitHub Check: test (windows-latest)
- GitHub Check: test-consensus
- GitHub Check: test
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
- GitHub Check: test
🔇 Additional comments (2)
tests/consensus_tests/utils.py (1)
477-488
: Good improvement in error handling.The addition of a try-except block to catch
ConnectionError
incheck_all_replicas_active
is a sensible enhancement. This makes the function more robust when network issues occur during testing, especially for scenarios where a peer might be temporarily unavailable during restart.🧰 Tools
🪛 Ruff (0.8.2)
477-477: Do not use mutable data structures for argument defaults
Replace with
None
; initialize within function(B006)
tests/consensus_tests/test_failed_snapshot_recovery.py (1)
247-252
: Good documentation of edge case scenario.The test correctly simulates a real-world scenario where a node might be killed during shard initialization. This kind of edge case testing is valuable for improving system robustness.
🧰 Tools
🪛 Ruff (0.8.2)
251-251:
Path
may be undefined, or defined from star imports(F405)
I assume the CI failure is intended? |
Yes. |
9b1c82b
to
c9a638f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (1)
273-277
: 🛠️ Refactor suggestionExpected test failure needs clarification
The test contains an assertion that is explicitly expected to fail. While this aligns with the PR objective, having intentionally failing tests in the codebase may cause confusion:
- Consider adding
pytest.mark.xfail
to explicitly mark this test as expected to fail- Add a tracking issue number in the comment to indicate when this should be fixed
- Consider explicitly disabling
handle_collection_load_errors
flag to better mimic a real scenario- assert local_shards[0]["points_count"] == 1000 # FAILS because of dummy shard - - # Qdrant loads a dummy shard (since handle_collection_load_errors=true) so we get 0 points instead of 1000 and test fails - # Ideally Qdrant should try to recover this shard automatically using other replicas + # FIXME: This currently fails because Qdrant loads a dummy shard instead of properly recovering + # Issue: <add-issue-number-here> + pytest.xfail("Qdrant currently loads a dummy shard when handle_collection_load_errors=true") + assert local_shards[0]["points_count"] == 1000
🧹 Nitpick comments (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (1)
222-226
: Consider explicitly setting handle_collection_load_errorsThe test relies on the
handle_collection_load_errors
setting being true (from development.yaml), but doesn't explicitly set it. To better control the test environment and avoid dependency on configuration files, consider explicitly setting this value in the test.peer_api_uris, peer_dirs, bootstrap_uri = start_cluster( tmp_path, N_PEERS, - extra_env={"QDRANT__STORAGE__SHARD_TRANSFER_METHOD": transfer_method }, + extra_env={ + "QDRANT__STORAGE__SHARD_TRANSFER_METHOD": transfer_method, + "QDRANT__STORAGE__HANDLE_COLLECTION_LOAD_ERRORS": "true" + }, )🧰 Tools
🪛 Ruff (0.8.2)
222-222:
start_cluster
may be undefined, or defined from star imports(F405)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/consensus_tests/test_failed_snapshot_recovery.py
(1 hunks)tests/consensus_tests/utils.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/consensus_tests/utils.py
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/consensus_tests/test_failed_snapshot_recovery.py
218-218: pytest
may be undefined, or defined from star imports
(F405)
220-220: assert_project_root
may be undefined, or defined from star imports
(F405)
222-222: start_cluster
may be undefined, or defined from star imports
(F405)
231-231: wait_collection_exists_and_active_on_all_peers
may be undefined, or defined from star imports
(F405)
235-235: wait_for_same_commit
may be undefined, or defined from star imports
(F405)
251-251: Path
may be undefined, or defined from star imports
(F405)
254-254: processes
may be undefined, or defined from star imports
(F405)
258-258: start_peer
may be undefined, or defined from star imports
(F405)
263-263: wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports
(F405)
266-266: wait_for_all_replicas_active
may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: test-consensus
- GitHub Check: test (windows-latest)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test (ubuntu-latest)
🔇 Additional comments (5)
tests/consensus_tests/test_failed_snapshot_recovery.py (5)
215-216
: LGTM - added newline for better readabilityThe addition of a newline improves code readability and follows proper Python style guidelines.
218-226
: Parametrized test approach looks goodThe test is well-parametrized to cover all three shard transfer methods ("snapshot", "stream_records", "wal_delta"), which provides comprehensive coverage of different scenarios.
🧰 Tools
🪛 Ruff (0.8.2)
218-218:
pytest
may be undefined, or defined from star imports(F405)
220-220:
assert_project_root
may be undefined, or defined from star imports(F405)
222-222:
start_cluster
may be undefined, or defined from star imports(F405)
247-252
: Consider testing additional failure scenariosThe current test simulates a dirty shard by creating an initialization flag, which is a good start. However, as suggested in previous comments, we should also consider testing more scenarios like corrupting files on disk or completely removing local shard data.
🧰 Tools
🪛 Ruff (0.8.2)
251-251:
Path
may be undefined, or defined from star imports(F405)
262-267
: Improve shard transfer detectionThe current approach may miss transfers that start and complete quickly. Consider implementing the two-step waiting approach as suggested in previous comments.
- # Wait for end of shard transfer - wait_for_collection_shard_transfers_count(peer_api_uris[0], COLLECTION_NAME, 0) + # Wait for transfer to start and then complete + wait_for_collection_shard_transfers_count(peer_api_uris[0], COLLECTION_NAME, 1) + wait_for_collection_shard_transfers_count(peer_api_uris[0], COLLECTION_NAME, 0)🧰 Tools
🪛 Ruff (0.8.2)
263-263:
wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports(F405)
266-266:
wait_for_all_replicas_active
may be undefined, or defined from star imports(F405)
292-295
: LGTM - Thorough verification of search resultsThe test properly verifies that search results remain consistent after the recovery process, which is a crucial aspect of ensuring data integrity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/consensus_tests/test_failed_snapshot_recovery.py (4)
79-88
: Fix variable name typo and consider existence checks when removing directories.In
corrupt_shard_dir()
, the variablesegements_path
is misspelled and might cause confusion. Additionally, consider adding existence checks or exception handling before removing directories to avoid potential errors if they do not exist.Example fix:
- segements_path = shard_path / "segments" + segments_path = shard_path / "segments"
229-229
: Explicitly import pytest if not defined in star imports.The decorator
@pytest.mark.parametrize
might trigger warnings from static analysis tools about undefined names under star imports. Consider addingimport pytest
explicitly to improve clarity and satisfy lint checks.🧰 Tools
🪛 Ruff (0.8.2)
229-229:
pytest
may be undefined, or defined from star imports(F405)
275-278
: Confirm that the “processes” reference is defined.The reference to
processes.pop()
implies thatprocesses
originates from a star import and may not be obvious to new contributors or static analysis tools. Consider clarifying its definition or explicitly importing it to improve transparency.🧰 Tools
🪛 Ruff (0.8.2)
275-275:
processes
may be undefined, or defined from star imports(F405)
288-321
: Monitor for potential test flakiness in shard transfer waits.The calls to
wait_for_collection_shard_transfers_count
rely on transfer timings. Under certain conditions or slower environments, these waits might time out, leading to intermittent failures. Consider introducing a slightly larger timeout or additional waiting logic to ensure stability across environments.🧰 Tools
🪛 Ruff (0.8.2)
288-288:
wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports(F405)
291-291:
wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports(F405)
294-294:
wait_for_all_replicas_active
may be undefined, or defined from star imports(F405)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/consensus_tests/test_failed_snapshot_recovery.py
(3 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (4)
lib/collection/src/shards/local_shard/mod.rs (2)
shard_path
(461-463)wal_path
(465-467)lib/collection/src/shards/mod.rs (1)
shard_path
(40-42)tests/consensus_tests/utils.py (7)
assert_project_root
(90-93)start_cluster
(189-214)wait_collection_exists_and_active_on_all_peers
(710-714)wait_for_same_commit
(549-554)start_peer
(116-149)wait_for_collection_shard_transfers_count
(573-579)wait_for_all_replicas_active
(533-538)lib/collection/src/shards/shard_holder/mod.rs (2)
len
(583-585)get_local_shards
(764-772)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_failed_snapshot_recovery.py
229-229: pytest
may be undefined, or defined from star imports
(F405)
231-231: assert_project_root
may be undefined, or defined from star imports
(F405)
237-237: start_cluster
may be undefined, or defined from star imports
(F405)
246-246: wait_collection_exists_and_active_on_all_peers
may be undefined, or defined from star imports
(F405)
250-250: wait_for_same_commit
may be undefined, or defined from star imports
(F405)
267-267: Path
may be undefined, or defined from star imports
(F405)
270-270: Path
may be undefined, or defined from star imports
(F405)
275-275: processes
may be undefined, or defined from star imports
(F405)
279-279: start_peer
may be undefined, or defined from star imports
(F405)
288-288: wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports
(F405)
291-291: wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports
(F405)
294-294: wait_for_all_replicas_active
may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-low-resources
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus
- GitHub Check: test (windows-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test (ubuntu-latest)
🔇 Additional comments (2)
tests/consensus_tests/test_failed_snapshot_recovery.py (2)
226-227
: No concerns identified.The additional assertion lines here are straightforward and align with the desired outcome.
230-273
: Verify whether shard corruption is intentionally skipped.The line
corrupt_shard_dir(shard_path)
is commented out. Since you’ve defined this utility, please confirm if you intend to skip shard directory corruption for this test. If not, uncomment the call or remove the function if it is no longer needed.🧰 Tools
🪛 Ruff (0.8.2)
231-231:
assert_project_root
may be undefined, or defined from star imports(F405)
237-237:
start_cluster
may be undefined, or defined from star imports(F405)
246-246:
wait_collection_exists_and_active_on_all_peers
may be undefined, or defined from star imports(F405)
250-250:
wait_for_same_commit
may be undefined, or defined from star imports(F405)
267-267:
Path
may be undefined, or defined from star imports(F405)
270-270:
Path
may be undefined, or defined from star imports(F405)
9f90e15
to
a5867bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (1)
79-87
: Fix typo in variable nameThere's a typo in the variable name
segements_path
which should besegments_path
. Variable names should be consistent with their usage throughout the code.def corrupt_shard_dir(shard_path: pathlib.Path): - segements_path = shard_path / "segments" - segment_names = filter(lambda x: x.is_dir(), pathlib.Path(segements_path).iterdir()) + segments_path = shard_path / "segments" + segment_names = filter(lambda x: x.is_dir(), pathlib.Path(segments_path).iterdir()) first_segment_path = next(segment_names) shutil.rmtree(first_segment_path) wal_path = shard_path / "wal" shutil.rmtree(wal_path)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/consensus_tests/test_failed_snapshot_recovery.py
(3 hunks)tests/consensus_tests/utils.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/consensus_tests/utils.py
🧰 Additional context used
🧬 Code Definitions (1)
tests/consensus_tests/test_failed_snapshot_recovery.py (2)
tests/consensus_tests/utils.py (7)
assert_project_root
(90-93)start_cluster
(189-214)wait_collection_exists_and_active_on_all_peers
(710-714)wait_for_same_commit
(549-554)kill
(33-39)start_peer
(116-149)wait_for_all_replicas_active
(533-538)tests/consensus_tests/fixtures.py (2)
upsert_random_points
(83-129)random_dense_vector
(19-20)
🪛 Ruff (0.8.2)
tests/consensus_tests/test_failed_snapshot_recovery.py
229-229: pytest
may be undefined, or defined from star imports
(F405)
231-231: assert_project_root
may be undefined, or defined from star imports
(F405)
237-237: start_cluster
may be undefined, or defined from star imports
(F405)
246-246: wait_collection_exists_and_active_on_all_peers
may be undefined, or defined from star imports
(F405)
250-250: wait_for_same_commit
may be undefined, or defined from star imports
(F405)
267-267: Path
may be undefined, or defined from star imports
(F405)
270-270: Path
may be undefined, or defined from star imports
(F405)
275-275: processes
may be undefined, or defined from star imports
(F405)
279-279: start_peer
may be undefined, or defined from star imports
(F405)
288-288: wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports
(F405)
291-291: wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports
(F405)
294-294: wait_for_all_replicas_active
may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consistency
- GitHub Check: test-low-resources
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: test-consensus
- GitHub Check: test (windows-latest)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test (ubuntu-latest)
- GitHub Check: lint
🔇 Additional comments (5)
tests/consensus_tests/test_failed_snapshot_recovery.py (5)
272-272
: Commented-out function callThe
corrupt_shard_dir
function is implemented but not used as it's commented out. This raises questions about the test coverage. Is the corruption intentionally disabled?Based on previous review comments, it appears this function was added as a suggestion, but the logic primarily depends on the shard initializing flag, which explains why it might be commented out.
If the corruption is intentionally not applied, consider adding a comment explaining why or removing the commented-out code.
229-235
: LGTM! Comprehensive test parameters and configurationThe test is well-designed with different transfer methods and explicitly disables error handling to better simulate production scenarios. This aligns with the previous review comment where it was suggested to explicitly disable the
handle_collection_load_errors
flag.🧰 Tools
🪛 Ruff (0.8.2)
229-229:
pytest
may be undefined, or defined from star imports(F405)
231-231:
assert_project_root
may be undefined, or defined from star imports(F405)
265-268
: Simulation strategy looks goodThe approach of creating the initializing flag to simulate a dirty shard is a good testing strategy. This simulates Qdrant being killed during shard directory movement, which can happen in production environments.
🧰 Tools
🪛 Ruff (0.8.2)
267-267:
Path
may be undefined, or defined from star imports(F405)
288-291
: Verify transfer completion sequence is correctThe code has been updated according to the previous review comment to first wait for a transfer to start and then wait for it to complete. This is the right sequence as transfers might not start instantly.
🧰 Tools
🪛 Ruff (0.8.2)
288-288:
wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports(F405)
291-291:
wait_for_collection_shard_transfers_count
may be undefined, or defined from star imports(F405)
316-320
: Good search verificationThe test properly verifies that search results remain consistent before and after recovery, which is an important validation for data integrity after shard recovery.
The same commits are now part of #6293. So closing this now. |
Add a new integration test that ensures that Qdrant is able to load a dirty shard (shard initializing flag still present). This test is expected to fail currently because Qdrant isn't doing the right thing. There will be another PR on top for this that fixes the test failure.
It currently loads a dummy shard and stays like that (since we have
handle_collection_load_errors: true
in ourdevelopment.yaml
). Ideally it should initiate a transfer and get the local shard fixed.All Submissions:
dev
branch. Did you create your branch fromdev
?