
Conversation

@ffuugoo (Contributor) commented Jul 14, 2025

TL;DR:

  • I've noticed that all latency spikes we observe in the RW-seg cluster follow the same pattern
  • following the logs, I found that we call std::thread::JoinHandle::join on the async runtime during partial snapshot recovery
  • I suspect that the spikes are caused by this "blocking call on the async runtime"
  • this PR implements the most trivial workaround, so that we can quickly merge it and see if it fixes the issue on the RW-seg cluster (a sketch of the change is included after this list)
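
For reference, the merge commit below is titled "Use `spawn_blocking` instead of `block_in_place`". The sketch here only illustrates the general shape of such a change; the function names, signatures, and error handling are made up for this example and are not the actual Qdrant code:

```rust
use std::thread;

// Before (sketch): the blocking `join` runs on an async worker thread
// (wrapped in `block_in_place`, per the commit title). The calling thread is
// tied up until the recovery thread exits.
async fn wait_for_recovery_before(handle: thread::JoinHandle<()>) {
    tokio::task::block_in_place(|| {
        handle.join().expect("recovery thread panicked");
    });
}

// After (sketch): `spawn_blocking` moves the blocking `join` onto tokio's
// dedicated blocking thread pool, so the async worker threads stay free to
// drive other requests.
async fn wait_for_recovery_after(handle: thread::JoinHandle<()>) {
    tokio::task::spawn_blocking(move || {
        handle.join().expect("recovery thread panicked");
    })
    .await
    .expect("spawn_blocking task panicked");
}
```

On the multi-threaded runtime, `block_in_place` does hand the worker's queued tasks off to other workers, but it still takes one worker thread out of the async pool for the duration of the call; `spawn_blocking` keeps the blocking wait off the async workers entirely.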

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using the `cargo +nightly fmt --all` command prior to submission?
  3. Have you checked your code using the `cargo clippy --all --all-features` command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

@ffuugoo requested review from timvisee and generall July 14, 2025 12:26


Use `spawn_blocking` instead of `block_in_place`
@ffuugoo merged commit c6e38d1 into dev Jul 14, 2025
16 checks passed
@ffuugoo deleted the partial-snapshot-fix-read-latency-spikes branch July 14, 2025 13:03
@ffuugoo (Contributor, Author) commented Jul 14, 2025

So, here's a breakdown:

[Screenshot: log excerpt, 2025-07-14 14:33:41]
  • partial snapshot recovery is started (bottom log line)
  • there's a Local shard 1 not found message
    • which simply means that the local replica is unavailable, because a partial snapshot is being recovered on the node
    • (the wording is a bit confusing in this case)
  • however, by the time this message is logged in execute_cluster_read_operation, all "locks" required to execute the read request should already have been acquired
    • so, the only thing being executed during these 2 seconds should be the network requests to remote nodes
    • (and preparing the response to the user, but that is pure CPU work that should be practically instant)
  • judging by the timestamp of the message, it correlates exactly with the total response time
    • e.g., 07:55:58.294 - 07:55:56.006 = 2.288s
  • we can also see Recovering shard... and Recovered collection... logs right between Local shard 1 not found and the top log line
    • and all spikes follow this same sequence: start recovery, local shard not found, recovering/recovered, search finished (with timestamps correlating exactly)
  • the network on rw-seg is fast, and remote replicas should not be blocked by anything
    • (the partial snapshot is recovered on the same node that is executing the client request)
  • but if there's something blocking the async runtime, that might explain why requests are stalling...
    • and lo and behold, there is a blocking call right in the code path that we know is executed in parallel with our request 🙄 (see the reproducer sketch below)
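
To illustrate the suspected mechanism, here is a small, self-contained reproducer (not Qdrant code; the single worker thread is only there to make the stall deterministic). A task that calls `std::thread::JoinHandle::join` directly on a tokio worker delays an unrelated async request sharing the runtime:

```rust
use std::time::{Duration, Instant};

#[tokio::main(flavor = "multi_thread", worker_threads = 1)]
async fn main() {
    let start = Instant::now();

    // "Partial snapshot recovery": spawn an OS thread and join it directly on
    // the async runtime, parking the worker thread for ~2 seconds.
    let recovery = tokio::spawn(async {
        let handle = std::thread::spawn(|| std::thread::sleep(Duration::from_secs(2)));
        handle.join().unwrap();
    });

    // "Client read request": needs only ~10ms of work, but shares the worker
    // (and the timer driver) with the recovery task, so it completes after ~2s.
    let request = tokio::spawn(async move {
        tokio::time::sleep(Duration::from_millis(10)).await;
        println!("request finished after {:?}", start.elapsed());
    });

    let _ = tokio::join!(recovery, request);
}
```

Moving the `join` into `tokio::task::spawn_blocking(move || handle.join()).await` (the approach taken in this PR) lets the request complete in roughly 10ms in this sketch, since the blocking wait no longer occupies an async worker.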
