
Conversation

@ffuugoo (Contributor) commented Jul 14, 2025

TL;DR:

  • I've noticed that all latency spikes we observe in the RW-seg cluster follow the same pattern
  • following the logs, I found that we call std::thread::JoinHandle::join on the async runtime during partial snapshot recovery
  • I suspect that the spikes are caused by this "blocking call on the async runtime"
  • this PR implements the most trivial workaround, so that we can quickly merge it and see if it fixes the issue on the RW-seg cluster (a sketch of the change is included after this list)
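
For reference, the merge commit below is titled "Use `spawn_blocking` instead of `block_in_place`". The sketch here only illustrates the general shape of such a change; the function names, signatures, and error handling are made up for this example and are not the actual Qdrant code:

```rust
use std::thread;

// Before (sketch): the blocking `join` runs on an async worker thread
// (wrapped in `block_in_place`, per the commit title). The calling thread is
// tied up until the recovery thread exits.
async fn wait_for_recovery_before(handle: thread::JoinHandle<()>) {
    tokio::task::block_in_place(|| {
        handle.join().expect("recovery thread panicked");
    });
}

// After (sketch): `spawn_blocking` moves the blocking `join` onto tokio's
// dedicated blocking thread pool, so the async worker threads stay free to
// drive other requests.
async fn wait_for_recovery_after(handle: thread::JoinHandle<()>) {
    tokio::task::spawn_blocking(move || {
        handle.join().expect("recovery thread panicked");
    })
    .await
    .expect("spawn_blocking task panicked");
}
```

On the multi-threaded runtime, `block_in_place` does hand the worker's queued tasks off to other workers, but it still takes one worker thread out of the async pool for the duration of the call; `spawn_blocking` keeps the blocking wait off the async workers entirely.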

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using the `cargo +nightly fmt --all` command prior to submission?
  3. Have you checked your code using the `cargo clippy --all --all-features` command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

@ffuugoo requested review from timvisee and generall July 14, 2025 12:26


Use `spawn_blocking` instead of `block_in_place`
@ffuugoo merged commit c6e38d1 into dev Jul 14, 2025
16 checks passed
@ffuugoo deleted the partial-snapshot-fix-read-latency-spikes branch July 14, 2025 13:03
@ffuugoo (Contributor, Author) commented Jul 14, 2025

So, here's a breakdown:

[Screenshot: log excerpt, 2025-07-14 14:33:41]
  • partial snapshot recovery is started (bottom log line)
  • there's a Local shard 1 not found message
    • which simply means that the local replica is unavailable, because a partial snapshot is being recovered on the node
    • (the wording is a bit confusing in this case)
  • however, by the time this message is logged in execute_cluster_read_operation, all "locks" required to execute the read request should already have been acquired
    • so, the only thing being executed during these 2 seconds should be the network requests to remote nodes
    • (and preparing the response to the user, but that is pure CPU work that should be practically instant)
  • judging by the timestamp of the message, it correlates exactly with the total response time
    • e.g., 07:55:58.294 - 07:55:56.006 = 2.288s
  • we can also see Recovering shard... and Recovered collection... logs right between Local shard 1 not found and the top log line
    • and all spikes follow this same sequence: start recovery, local shard not found, recovering/recovered, search finished (with timestamps correlating exactly)
  • the network on rw-seg is fast, and remote replicas should not be blocked by anything
    • (the partial snapshot is recovered on the same node that is executing the client request)
  • but if there's something blocking the async runtime, that might explain why requests are stalling...
    • and lo and behold, there is a blocking call right in the code path that we know is executed in parallel with our request 🙄 (see the reproducer sketch below)
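
To illustrate the suspected mechanism, here is a small, self-contained reproducer (not Qdrant code; the single worker thread is only there to make the stall deterministic). A task that calls `std::thread::JoinHandle::join` directly on a tokio worker delays an unrelated async request sharing the runtime:

```rust
use std::time::{Duration, Instant};

#[tokio::main(flavor = "multi_thread", worker_threads = 1)]
async fn main() {
    let start = Instant::now();

    // "Partial snapshot recovery": spawn an OS thread and join it directly on
    // the async runtime, parking the worker thread for ~2 seconds.
    let recovery = tokio::spawn(async {
        let handle = std::thread::spawn(|| std::thread::sleep(Duration::from_secs(2)));
        handle.join().unwrap();
    });

    // "Client read request": needs only ~10ms of work, but shares the worker
    // (and the timer driver) with the recovery task, so it completes after ~2s.
    let request = tokio::spawn(async move {
        tokio::time::sleep(Duration::from_millis(10)).await;
        println!("request finished after {:?}", start.elapsed());
    });

    let _ = tokio::join!(recovery, request);
}
```

Moving the `join` into `tokio::task::spawn_blocking(move || handle.join()).await` (the approach taken in this PR) lets the request complete in roughly 10ms in this sketch, since the blocking wait no longer occupies an async worker.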
