
Conversation

@timvisee timvisee (Member) commented Mar 26, 2025

Optimize the resharding stream records shard transfer function.

This applies two techniques to optimize during resharding:

  1. preselect the list of point IDs before reading point data, preventing a lot of read data from being discarded
  2. oversample the batch size because we only send a fraction of points, preventing the transfer of small batches

Before this PR, a resharding transfer would read all points from the shard, including all vector and payload data. But only a fraction of those read points are actually transferred to the remote shard. That means that 50% (1->2 shards) or even 95% (19->20 shards) of the read data was immediately discarded. That's a waste of resources!

Reading all that data can be very expensive, especially on a cluster that has huge vectors/payloads or very high memory pressure. This PR changes the logic to scroll just point IDs first. We then make a preselection of point IDs by filtering them through the resharding hash ring. Finally, we retrieve vectors and payloads for exactly that preselection, so that 100% of the read data is transferred.

Along with that, this PR also applies oversampling to the batch size, accounting for the points that will be filtered out by the hash ring. In a resharding transfer going from 3 to 4 shards, only 25% of the points in the shard will be transferred. With a batch size of 100, we'd previously transfer only roughly 25 points per batch. Now the batch size is multiplied by 4, so we still transfer roughly 100 points per batch.

In code, I've extracted the batch-reading logic into two separate functions: one for regular batches, and one specialized for batching with a hash ring that applies the above optimizations.

A basic benchmark is included in a comment below.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

@timvisee timvisee changed the title Resharding: only read vectors/payload of points to transfer Resharding: optimize resharding stream records transfer Mar 27, 2025
@timvisee timvisee marked this pull request as ready for review March 27, 2025 09:44
@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR optimizes the resharding stream records transfer by reducing unnecessary data reads and applying an oversampling technique to preserve batch sizes when using a hash ring filter. It refactors the batching logic into two functions—one for regular transfers and one for transfers using hash rings—and applies oversampling to account for filtering out points.

Comments suppressed due to low confidence (2)

lib/collection/src/shards/forward_proxy_shard.rs:117

  • [nitpick] Consider rephrasing 'real number of transferred points' to 'actual number of transferred points' for clarity.
/// Returns new point offset and real number of transferred points. The new point offset can be

lib/collection/src/shards/forward_proxy_shard.rs:250

  • Although the design likely ensures 'new' is non-empty, it might be beneficial to assert or handle the case where the new shard list is empty to avoid a multiplication by zero in the oversample_factor calculation.
HashRingRouter::Resharding { old: _, new } => new.len(),


@KShivendu KShivendu (Member) left a comment


I think there's a small mistake. Rest LGTM

I wonder how much speedup we will observe in practice :D

@timvisee timvisee (Member, Author) commented Mar 27, 2025

I did some testing; the results are about what I expected.

Setup:

  • 4 peers (all on same host)
  • 3 shards, resharding to 4
  • replication factor 1
  • 512d vectors
  • integer and large text payload
  • everything on-disk
  • collection size differs per type of storage, and is not super relevant

I measured the wall-clock time for the three resharding transfers that happen when resharding up.

Result:

  • On ramdisk:
    • Before PR: 17s, 23s, 30s
    • With PR: 17s, 23s, 30s
  • On NVMe drive:
    • Before PR: 16s, 17s, 17s
    • With PR: 10s, 12s, 16s
  • On old SATA HDD:
    • Before PR: 81s, 110s, 116s
    • With PR: 81s, 85s, 88s

This is definitely not a scientific test, but this naive approach already shows an improvement. I expect (but did not test) the effect to be a lot greater with networked storage and/or with high memory pressure.

bfb command to set up the collection:

bfb --uri http://127.0.0.1:6334/,http://127.0.0.2:6334/,http://127.0.0.3:6334/,http://127.0.0.4:6334/ -t3 -p3 -n 500000 -d 512 --shards 3 --replication-factor 1 --indexing-threshold 0 --skip-wait-index --int-payloads 100 --text-payloads --text-payload-length 512 --timestamp-payload --retry 3 --on-disk-payload --on-disk-payload-index --on-disk-index true --on-disk-vectors true --timeout 20

@timvisee timvisee merged commit 9382df0 into dev Mar 28, 2025
17 checks passed
@timvisee timvisee deleted the resharding-only-read-points-to-transfer branch March 28, 2025 10:38
timvisee added a commit that referenced this pull request Mar 31, 2025
* In a resharding transfer, only read vectors/payloads of relevant points

* Extract batch reading logic into function

* Apply oversampling in case of resharding

* Apply review suggestions

* Fix parsing point IDs from first scroll batch
@timvisee timvisee mentioned this pull request Mar 31, 2025