Resharding: optimize resharding stream records transfer #6260
Conversation
Pull Request Overview
This PR optimizes the resharding stream records transfer by reducing unnecessary data reads and applying an oversampling technique to preserve batch sizes when using a hash ring filter. It refactors the batching logic into two functions—one for regular transfers and one for transfers using hash rings—and applies oversampling to account for filtering out points.
Comments suppressed due to low confidence (2)
lib/collection/src/shards/forward_proxy_shard.rs:117
- [nitpick] Consider rephrasing 'real number of transferred points' to 'actual number of transferred points' for clarity.
/// Returns new point offset and real number of transferred points. The new point offset can be
lib/collection/src/shards/forward_proxy_shard.rs:250
- Although the design likely ensures 'new' is non-empty, it might be beneficial to assert or handle the case where the new shard list is empty to avoid a multiplication by zero in the oversample_factor calculation.
HashRingRouter::Resharding { old: _, new } => new.len(),
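A minimal sketch of the kind of guard this comment suggests, using a simplified stand-in for `HashRingRouter` and a hypothetical `oversample_factor` helper rather than the actual code:

```rust
/// Simplified stand-in for the real `HashRingRouter`, for illustration only.
#[allow(dead_code)]
enum HashRingRouter {
    Single,
    Resharding { old: Vec<u32>, new: Vec<u32> },
}

/// Derive the oversampling factor, guarding against an empty `new` shard set
/// so the batch size is never multiplied by zero.
fn oversample_factor(router: &HashRingRouter) -> usize {
    let shard_count = match router {
        HashRingRouter::Single => 1,
        HashRingRouter::Resharding { old: _, new } => new.len(),
    };
    debug_assert!(shard_count > 0, "resharding hash ring should route to at least one shard");
    shard_count.max(1) // fall back to no oversampling if the list is somehow empty
}

fn main() {
    let router = HashRingRouter::Resharding {
        old: vec![0, 1, 2],
        new: vec![0, 1, 2, 3],
    };
    assert_eq!(100 * oversample_factor(&router), 400);
}
```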
I think there's a small mistake. Rest LGTM
I wonder how much speed-up we will observe in practice :D
I did some testing; the results are about what I expected. Setup:
I measured the wall clock time for the three resharding transfers that happen when resharding up. Result:
This is definitely not a scientific test, but this naive approach already shows an improvement. I expect (but did not test) the effect to be a lot greater with networked storage and/or with high memory pressure.

bfb command to set up the collection:

```
bfb --uri http://127.0.0.1:6334/,http://127.0.0.2:6334/,http://127.0.0.3:6334/,http://127.0.0.4:6334/ -t3 -p3 -n 500000 -d 512 --shards 3 --replication-factor 1 --indexing-threshold 0 --skip-wait-index --int-payloads 100 --text-payloads --text-payload-length 512 --timestamp-payload --retry 3 --on-disk-payload --on-disk-payload-index --on-disk-index true --on-disk-vectors true --timeout 20
```
* In a resharding transfer, only read vectors/payloads of relevant points
* Extract batch reading logic into function
* Apply oversampling in case of resharding
* Apply review suggestions
* Fix parsing point IDs from first scroll batch
Optimize the resharding stream records shard transfer function.
This applies two techniques to optimize the transfer during resharding:
Before this PR, a resharding transfer would read all points from the shard, including all vector and payload data. But only a fraction of those read points are actually transferred to the remote shard. That means that 50% (1->2 shards) or even 95% (19->20 shards) of the read data was immediately discarded. That's a waste of resources!
Reading all that data can be very expensive, especially on a cluster with huge vectors/payloads or with very high memory pressure. This PR changes the logic to scroll just the point IDs first. We then make a preselection of point IDs by filtering them through the resharding hash ring. Finally, we retrieve vectors and payloads for exactly that preselection, so that 100% of the read data will be transferred.
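A rough, self-contained sketch of that two-phase read, using simplified stand-in types; the names (`Shard`, `scroll_point_ids`, `retrieve`, the modulo hash ring) are assumptions for illustration and not the actual shard API:

```rust
use std::collections::HashMap;

/// Simplified stand-in types for illustration; the real shard, hash ring and
/// record types in `forward_proxy_shard.rs` look different.
type PointId = u64;

#[allow(dead_code)]
struct Record {
    id: PointId,
    vector: Vec<f32>,
    payload: String,
}

struct Shard {
    ids: Vec<PointId>,                          // ordered point IDs
    data: HashMap<PointId, (Vec<f32>, String)>, // heavy vector/payload data
}

impl Shard {
    /// Phase 1: scroll IDs only -- no vector or payload data is touched.
    fn scroll_point_ids(&self, offset: usize, limit: usize) -> (&[PointId], usize) {
        let end = (offset + limit).min(self.ids.len());
        (&self.ids[offset..end], end)
    }

    /// Phase 2: read vectors and payloads for exactly the preselected IDs.
    fn retrieve(&self, ids: &[PointId]) -> Vec<Record> {
        ids.iter()
            .filter_map(|id| {
                self.data.get(id).map(|(v, p)| Record {
                    id: *id,
                    vector: v.clone(),
                    payload: p.clone(),
                })
            })
            .collect()
    }
}

/// Pretend hash ring: routes a point to one of `shard_count` shards.
fn routes_to(id: PointId, shard_count: u64) -> u64 {
    id % shard_count
}

/// Read one batch for a resharding transfer: scroll IDs, preselect the ones the
/// ring routes to the target shard, then read heavy data for that preselection only.
fn read_batch(shard: &Shard, target: u64, shard_count: u64, offset: usize, batch: usize) -> (Vec<Record>, usize) {
    let (ids, next_offset) = shard.scroll_point_ids(offset, batch);
    let selected: Vec<PointId> = ids
        .iter()
        .copied()
        .filter(|&id| routes_to(id, shard_count) == target)
        .collect();
    (shard.retrieve(&selected), next_offset)
}

fn main() {
    let shard = Shard {
        ids: (0..10).collect(),
        data: (0..10)
            .map(|id| (id, (vec![0.0_f32; 4], format!("payload {id}"))))
            .collect(),
    };
    // Transfer the points routed to shard 3 out of 4 shards, one batch of 8 scrolled IDs.
    let (records, next) = read_batch(&shard, 3, 4, 0, 8);
    let transferred: Vec<_> = records.iter().map(|r| r.id).collect();
    println!("transferred IDs {transferred:?}, next offset {next}");
}
```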
Along with that, this PR also applies oversampling to the batch size, to account for the points that will be filtered out by the hash ring. If we do a resharding transfer going from 3 to 4 shards, only 25% of the points in the shard will be transferred. With a batch size of 100, that would previously mean transferring only roughly 25 points in each batch. Now the batch size is multiplied by 4 so that we still transfer roughly 100 points in each batch.
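A small worked example of that batch-size arithmetic (the numbers match the 3 -> 4 shard case above; the variable names are just for illustration):

```rust
// Worked example of the oversampling arithmetic above (3 -> 4 shards).
fn main() {
    let batch_size = 100_usize;
    let new_shard_count = 4_usize;

    // Without oversampling, the hash ring keeps only ~1/4 of each scrolled batch.
    let transferred_before = batch_size / new_shard_count; // ~25 points per batch

    // With oversampling, we scroll batch_size * new_shard_count IDs, so that after
    // filtering roughly batch_size points remain in every transferred batch.
    let scrolled = batch_size * new_shard_count; // 400 IDs scrolled
    let transferred_after = scrolled / new_shard_count; // ~100 points per batch

    println!("before: ~{transferred_before} points/batch, after: ~{transferred_after} points/batch");
}
```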
In the code, I've extracted the logic for reading a batch into two separate functions: one for regular batches, and one specialized for batching with a hash ring that applies the above optimizations.
Basic benchmark here.
All Submissions:
* Did you create your branch from the `dev` branch?

New Feature Submissions:
* Have you formatted your code with the `cargo +nightly fmt --all` command prior to submission?
* Have you checked your code with the `cargo clippy --all --all-features` command?