Skip to content

Conversation

timvisee
Copy link
Member

@timvisee timvisee commented Feb 28, 2025

Optimize the background task that takes care of cleaning up shards, specifically deleting points from it that don't belong in it anymore.

This change makes the cleanup process a lot faster, which means it'll complete in less wall clock time.

The primary change is about how we select what points to delete. Before, we used a hash ring filter during the scroll request to only scroll points that don't belong into the shard anymore. Now, we simply scroll all point IDs and filter the list after the scroll request. The key thing here is that the hash ring check is very expensive. Before, the hash ring check was done a lot more often across all segments in the shard. Now we only check each point ID once.

To summarize, this PR does two things:

  1. move hash ring check outside of scroll operation
  2. use wait=false on all delete operations, except the last one

I don't have a fancy graph, but I do have performance numbers. This is on a single node, with 5 million points, resharding from 1 to 2 shards, with 2 or 100 segments.

Time for shard cleanup to complete:

  1. Before this PR:
    • 2 segments: 115 seconds
    • 100 segments: 72 seconds
  2. With hash ring check outside scroll operation:
    • 2 segments: 35 seconds
    • 100 segments: 53 seconds
  3. With wait=false:
    • 2 segments: 20 seconds
    • 100 segments: 43 seconds

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

@timvisee timvisee changed the title Optimize cleanup of shards, using in resharding Optimize cleanup of shards, used in resharding Feb 28, 2025
@agourlay
Copy link
Member

Out of curiosity, why is a check using the Hashring so expensive?

@generall
Copy link
Member

I am not entirely understand why number of comparisons is different. In scroll we do not check all points, but only points which are candidates to our limit+offset pool. Is that because we search top-N results in each segment independently?

@generall generall merged commit 6fd3bba into dev Feb 28, 2025
17 checks passed
@generall generall deleted the resharding-optimize-clean-shard branch February 28, 2025 17:33
@generall
Copy link
Member

CPU usage before and after this fix:
image

@timvisee
Copy link
Member Author

timvisee commented Mar 3, 2025

Is that because we search top-N results in each segment independently?

Yes

CPU usage before and after this fix:
image

Awesome! Thank you for sharing the results. The difference seems bigger than I expected.

I assume that part of the spikes in the new part of the graph are just because of indexing.

timvisee added a commit that referenced this pull request Mar 21, 2025
* Don't filter hash ring during scroll, but after

* Only use wait=true in last delete batch while cleaning points

* Link to pull request
@timvisee timvisee mentioned this pull request Mar 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants