Skip to content

Conversation

ffuugoo
Copy link
Contributor

@ffuugoo ffuugoo commented Jun 19, 2025

Get rid of PartialSnapshotMeta::search_lock reimplemented read request handling during partial recovery based on ShardReplicaSet::local and PartialSnapshotMeta::recovery_lock.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully ran tests with your changes locally?

@ffuugoo ffuugoo changed the title WIP: Fan out if local shard is unavailable during search, due to partial snapshot Fix read latency spikes during partial snapshot recovery Jun 20, 2025
@ffuugoo ffuugoo force-pushed the partial-snapshot-local-shard-read-lock branch from aba711b to cc013cd Compare June 20, 2025 15:45
@KShivendu
Copy link
Member

@ffuugoo have started a docker image build https://github.com/qdrant/qdrant/actions/runs/15821075811

Should be deployed at ghcr.io/qdrant/qdrant:partial-snapshot-local-shard-read-lock. Will try to deploy in our testing clusters for you to test.

@ffuugoo ffuugoo changed the title Fix read latency spikes during partial snapshot recovery Improve read request handling during partial-snapshot recovery Jun 23, 2025
@@ -122,8 +134,6 @@ impl ShardReplicaSet {
)));
};

let _partial_snapshot_search_lock = self.try_take_search_read_lock()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check was useless/harmful.

Partial snapshot recovery takes local.write(). If we already managed to take local.read() during execute_local_read_operation, there's no need to try_take_search_read_lock. In worst case it will make us reject read request that we could have successfully performed.

None => (None, false, None),
};

let local_is_active = self.peer_is_active(self.this_peer_id());

let local_operation = if local_is_active {
let local_operation = async {
let _partial_snapshot_search_lock = self.try_take_search_read_lock()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check was useless/harmful.

Similar to try_take_search_read_lock above. If we already managed to take local.read(), this check may force us to reject read request, that we could have successfully served.

@ffuugoo ffuugoo force-pushed the partial-snapshot-local-shard-read-lock branch from 880f94a to f6bda1f Compare June 23, 2025 11:08
/// Limits parallel *recover* partial snapshot requests. We are using `RwLock`, so that multiple
/// read requests can check if recovery is in progress (by doing `try_read`) without blocking
/// each other.
recovery_lock: Arc<tokio::sync::RwLock<()>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now (sometimes) check if recovery_lock is locked during execute_local_read_operation, and so I've swapped Mutex for RwLock, so that concurrent read operations can do try_read without blocking each other.

@ffuugoo ffuugoo marked this pull request as ready for review June 23, 2025 13:13
@ffuugoo ffuugoo requested a review from timvisee June 23, 2025 13:13

This comment was marked as resolved.

@timvisee timvisee removed their request for review June 25, 2025 09:50
@ffuugoo ffuugoo force-pushed the partial-snapshot-local-shard-read-lock branch from e150f14 to 5d7ff74 Compare June 25, 2025 11:13
@ffuugoo ffuugoo requested review from timvisee and generall June 25, 2025 11:13
Copy link
Member

@timvisee timvisee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

This makes sense. I find the use of a watch channel to cancel a waiting read lock a bit convoluted, but I get why we're using it here. Another lock would not give us the same flexibility.

IIRC you mentioned in call this has the potential for a race, but I don't see a concrete problem.

@ffuugoo ffuugoo requested a review from timvisee June 25, 2025 12:45
Remove accidental `\`

Co-authored-by: Tim Visée <tim+github@visee.me>
@ffuugoo
Copy link
Contributor Author

ffuugoo commented Jun 25, 2025

Thanks!

This makes sense. I find the use of a watch channel to cancel a waiting read lock a bit convoluted, but I get why we're using it here. Another lock would not give us the same flexibility.

IIRC you mentioned in call this has the potential for a race, but I don't see a concrete problem.

No more race, but I've made yet another update because a simple RwLock should be enough in the end. 🤦‍♀️

@timvisee timvisee merged commit 4705bff into dev Jun 26, 2025
18 checks passed
@timvisee timvisee deleted the partial-snapshot-local-shard-read-lock branch June 26, 2025 08:35
generall pushed a commit that referenced this pull request Jul 17, 2025
* Fix read request handling during partial snapshot recovery

* fixup! Fix read request handling during partial snapshot recovery

Just use a second lock 😔

* fixup! Fix read request handling during partial snapshot recovery

Remove accidental `\`

Co-authored-by: Tim Visée <tim+github@visee.me>

---------

Co-authored-by: Tim Visée <tim+github@visee.me>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants