-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Batched reader for segment construction #6487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Add sequential reads and improve BatchedVectorReader * implement get_many_sequential --------- Co-authored-by: generall <andrey@vasnetsov.com>
📝 WalkthroughWalkthroughThis change introduces sequential vector access methods across multiple vector storage implementations and the core Suggested reviewers
Tip ⚡️ Faster reviews with caching
Enjoy the performance boost—your workflow just got faster. 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (13)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (6)
lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (1)
199-207
: Minor optimisation: avoid extra allocation
T::into_float_multivector
already returns aCowMultiVector
. Wrapping it inside a temporary and then converting toCowVector::MultiDense
creates an extraCow
layer. A small rearrangement avoids this and shortens the call-chain:- .map(|multi_dense_vector| { - CowVector::MultiDense(T::into_float_multivector(CowMultiVector::Borrowed( - multi_dense_vector, - ))) - }) + .map(|mv| CowVector::from(T::into_float_multivector(CowMultiVector::Borrowed(mv))))(Not performance-critical, but keeps the hot path lean.)
lib/segment/src/vector_storage/vector_storage_base.rs (1)
640-666
: Match-arm explosion – consider a macro orenum_dispatch
The new
get_vector_sequential
implementation duplicates the long
pattern-matching blocks already present forget_vector
,datatype
,
etc. Each addition of a storage variant now requires copying another
block, making the code error-prone.Rust crates like
enum_dispatch
or a simple local macro could collapse
these matches into a single declarative list:macro_rules! delegate { ($self:expr, $key:expr, $method:ident) => { match $self { $(VectorStorageEnum::$variant(v) => v.$method($key),)* } }; }(Not a blocker for this PR, but worth addressing before the list grows
again.)lib/segment/src/segment_constructor/batched_reader.rs (4)
30-37
: DeriveDebug
& tighten visibility
PointData
andBatchedVectorReader
will likely appear in logs or
tests. DerivingDebug
(and potentiallyClone
) makes inspection and
reuse easier without impacting performance.-pub struct PointData { +#[derive(Debug)] +pub struct PointData { ... -pub struct BatchedVectorReader<'a> { +#[derive(Debug)] +pub struct BatchedVectorReader<'a> {
40-56
: Avoid pre-filling the buffer with default vectorsInitialising the whole buffer with
CowVector::default()
costsO(BATCH_SIZE)
and performs unnecessary allocations. A more efficient pattern is to
allocate the vector with capacity andpush
actual items, or keep a
Vec<MaybeUninit<_>>
and assume full overwrite.- // We need to allocate the buffer with the size of the batch, - // but we don't know the size of the vectors. - // So we use a placeholder vector with size 0. - let buffer = vec![(CowVector::default(), false); BATCH_SIZE]; + // Allocate empty; will be extended to at most `BATCH_SIZE` + // in `refill_buffer`. + let buffer = Vec::with_capacity(BATCH_SIZE);Remember to
buffer.clear()
inrefill_buffer
before repopulating.
68-82
: Missedseg_to_points_buffer.clear()
leads to accumulating entries
drain()
empties the map but leaves the allocation intact; successive
refills will grow a newVec
for every point pushed, never reusing the
previous capacity. Callingseg_to_points_buffer.clear()
before reuse
keeps allocations stable:- // Read by segments, as we want to localize reads as much as possible. + self.seg_to_points_buffer.clear(); + + // Read by segments to localise I/O.
83-91
: Per-point fetch negates sequential read advantageInside each segment loop we still call
get_vector_sequential
one point at a time. For dense storages able to
return slices, a batched variant (get_dense_batch
,get_many_sequential
, …)
would eliminate syscall/lock overhead and maximise readahead.If feasible:
// 1. Collect internal ids for this segment. // 2. Storage-specific fast path to load them in one call. // 3. Fall back to current per-id loop otherwise.This would better exploit the sequential APIs added in this PR.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
lib/segment/src/fixtures/index_fixtures.rs
(1 hunks)lib/segment/src/segment_constructor/batched_reader.rs
(1 hunks)lib/segment/src/segment_constructor/mod.rs
(1 hunks)lib/segment/src/segment_constructor/segment_builder.rs
(2 hunks)lib/segment/src/vector_storage/chunked_mmap_vectors.rs
(2 hunks)lib/segment/src/vector_storage/chunked_vector_storage.rs
(2 hunks)lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs
(1 hunks)lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs
(1 hunks)lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs
(1 hunks)lib/segment/src/vector_storage/in_ram_persisted_vectors.rs
(2 hunks)lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs
(2 hunks)lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs
(2 hunks)lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs
(1 hunks)lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs
(1 hunks)lib/segment/src/vector_storage/vector_storage_base.rs
(3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (7)
lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (7)
lib/segment/src/fixtures/index_fixtures.rs (1)
get_vector_sequential
(65-67)lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
get_vector_sequential
(126-131)lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
get_vector_sequential
(223-226)lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
get_vector_sequential
(231-234)lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (1)
get_vector_sequential
(364-366)lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
get_vector_sequential
(186-193)lib/segment/src/vector_storage/vector_storage_base.rs (2)
get_vector_sequential
(65-65)get_vector_sequential
(640-666)
lib/segment/src/fixtures/index_fixtures.rs (5)
lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
get_vector_sequential
(223-226)lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
get_vector_sequential
(231-234)lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (1)
get_vector_sequential
(364-366)lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
get_vector_sequential
(194-197)lib/segment/src/vector_storage/vector_storage_base.rs (2)
get_vector_sequential
(65-65)get_vector_sequential
(640-666)
lib/segment/src/vector_storage/chunked_vector_storage.rs (2)
lib/segment/src/vector_storage/chunked_mmap_vectors.rs (2)
get_sequential
(360-362)get_many_sequential
(410-412)lib/segment/src/vector_storage/in_ram_persisted_vectors.rs (2)
get_sequential
(48-50)get_many_sequential
(99-102)
lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (8)
lib/segment/src/fixtures/index_fixtures.rs (1)
get_vector_sequential
(65-67)lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
get_vector_sequential
(223-226)lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
get_vector_sequential
(231-234)lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (1)
get_vector_sequential
(364-366)lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
get_vector_sequential
(186-193)lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (1)
get_vector_sequential
(199-207)lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
get_vector_sequential
(194-197)lib/segment/src/vector_storage/vector_storage_base.rs (2)
get_vector_sequential
(65-65)get_vector_sequential
(640-666)
lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (8)
lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (2)
get_multi_opt_sequential
(123-140)get_vector_sequential
(199-207)lib/segment/src/vector_storage/vector_storage_base.rs (3)
get_multi_opt_sequential
(157-158)get_vector_sequential
(65-65)get_vector_sequential
(640-666)lib/segment/src/fixtures/index_fixtures.rs (1)
get_vector_sequential
(65-67)lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
get_vector_sequential
(126-131)lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
get_vector_sequential
(223-226)lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
get_vector_sequential
(231-234)lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
get_vector_sequential
(186-193)lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
get_vector_sequential
(194-197)
lib/segment/src/vector_storage/vector_storage_base.rs (8)
lib/segment/src/fixtures/index_fixtures.rs (1)
get_vector_sequential
(65-67)lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
get_vector_sequential
(126-131)lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
get_vector_sequential
(223-226)lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
get_vector_sequential
(231-234)lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (2)
get_vector_sequential
(364-366)get_multi_opt_sequential
(313-319)lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
get_vector_sequential
(186-193)lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (2)
get_vector_sequential
(199-207)get_multi_opt_sequential
(123-140)lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
get_vector_sequential
(194-197)
lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (4)
lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (2)
get_multi_opt_sequential
(313-319)get_vector_sequential
(364-366)lib/segment/src/vector_storage/vector_storage_base.rs (3)
get_multi_opt_sequential
(157-158)get_vector_sequential
(65-65)get_vector_sequential
(640-666)lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
get_vector_sequential
(126-131)lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
get_vector_sequential
(186-193)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-low-resources
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consistency
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: storage-compat-test
- GitHub Check: lint
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (17)
lib/segment/src/segment_constructor/mod.rs (1)
1-1
: LGTM - Added new module declaration.The addition of the
batched_reader
module is a clean change that enhances the organization of the segment construction functionality.lib/segment/src/fixtures/index_fixtures.rs (1)
65-67
: Implementation matches trait requirement properly.This implementation of
get_vector_sequential
correctly delegates to the existingget_vector
method, which is appropriate for an in-memory test fixture where no sequential optimization is possible. This matches the pattern used in other in-memory vector storage implementations.lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
194-197
: Well-documented implementation of sequential access.The implementation correctly delegates to the existing
get_vector
method with a helpful comment explaining why no optimization is applied for in-memory storage. This is consistent with the implementation pattern in other vector storage types.lib/segment/src/segment_constructor/segment_builder.rs (2)
42-42
: LGTM - Added imports for the new batched reader functionality.Properly importing
BatchedVectorReader
andPointData
from the new module.
345-350
: Improved vector retrieval with batched processing.The refactoring replaces manual iteration with the new
BatchedVectorReader
, which should improve performance by:
- Batching vector reads (processing 256 points at a time)
- Grouping points by segment index to optimize sequential access
- Leveraging the new
get_vector_sequential
methods across vector storage implementationsThis change aligns perfectly with the PR objective of introducing batched reading for segment construction.
lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
223-226
: Implementation pattern aligns with trait extension for sequential access.The implementation correctly delegates to the existing
get_vector
method since this is an in-memory storage where sequential access doesn't provide performance benefits. The comment makes this design decision clear.lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
231-234
: Implementation follows the expected pattern for sequential vector access.This implementation correctly delegates to
get_vector
with an explanatory comment indicating that no optimizations are currently available for gridstore. The comment's "yet" suggests potential future optimization opportunities.lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
126-131
: Good implementation leveraging underlying sequential access optimizations.This implementation properly leverages the
get_sequential
method from the underlyingChunkedVectorStorage
implementation, which can provide optimized sequential reads. The vector type conversion is handled correctly for consistency with the rest of the API.lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
186-193
: Effective use of sequential access optimization for memory-mapped storage.The implementation properly utilizes
get_vector_opt_sequential
from the underlyingmmap_store
, which can provide significant performance improvements for sequential access patterns in memory-mapped storage. The vector type conversion matches the pattern used in other similar implementations.lib/segment/src/vector_storage/chunked_mmap_vectors.rs (2)
360-362
: Implementation adds sequential access optimization.The
get_sequential
method delegates to the existingget
method withforce_sequential=true
, enabling sequential memory access optimizations when retrieving single vectors.
409-412
: Implementation adds sequential batch access optimization.The
get_many_sequential
method properly delegates to the existingget_many
method withforce_sequential=true
, enabling sequential memory access optimizations when retrieving multiple vectors.lib/segment/src/vector_storage/chunked_vector_storage.rs (2)
21-21
: Sequential vector retrieval API added to trait.This extends the trait interface to support optimized sequential access patterns.
51-52
: Sequential batch retrieval API added to trait with proper documentation.The method documentation maintains consistency with the existing
get_many
method documentation.lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (2)
313-319
: Implementation of sequential access for multi-vector storage.The implementation correctly falls back to the non-sequential version with a helpful comment explaining why no optimization is done for in-memory storage.
364-366
: Implementation of sequential vector access for standard vector interface.The method correctly delegates to the existing implementation since no sequential optimization is possible for this in-memory storage type.
lib/segment/src/vector_storage/in_ram_persisted_vectors.rs (1)
48-50
: Sequential access delegated to underlying storage.The implementation correctly forwards to the sequential access method of the underlying mmap storage.
lib/segment/src/segment_constructor/batched_reader.rs (1)
94-99
: Buffer contents beyond the last batch retain large vectorsOn the final iteration where
points_to_insert.len() % BATCH_SIZE != 0
,
slots that are no longer used keepCowVector
s from the previous batch,
holding references and memory longer than needed.Calling
buffer.truncate(batch_len)
after filling (or switching to the
push pattern suggested above) releases them earlier and reduces peak
RSS.
/// Returns None if key is not found | ||
fn get_multi_opt_sequential( | ||
&self, | ||
key: PointOffsetType, | ||
) -> Option<TypedMultiDenseVectorRef<T>> { | ||
self.offsets | ||
.get_sequential(key as VectorOffsetType) | ||
.and_then(|mmap_offset| { | ||
let mmap_offset = mmap_offset.first().expect("mmap_offset must not be empty"); | ||
self.vectors.get_many_sequential( | ||
mmap_offset.offset as VectorOffsetType, | ||
mmap_offset.count as usize, | ||
) | ||
}) | ||
.map(|flattened_vectors| TypedMultiDenseVectorRef { | ||
flattened_vectors, | ||
dim: self.vectors.dim(), | ||
}) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Potential panic on empty slice & duplication with get_multi_opt
get_multi_opt_sequential
unwraps the first element of mmap_offset
with expect
, identical to the non-sequential variant.
If offsets.get_sequential(..)
returns an empty slice (it returns Some(&[])
when the key exists but the chunk is present yet empty), the call to first()
will yield None
and we will panic even though the function signature promises Option
.
Consider defensive handling and re-using the existing implementation to keep behaviour consistent and avoid code duplication:
- .and_then(|mmap_offset| {
- let mmap_offset = mmap_offset.first().expect("mmap_offset must not be empty");
- self.vectors.get_many_sequential(
- mmap_offset.offset as VectorOffsetType,
- mmap_offset.count as usize,
- )
- })
+ .and_then(|mmap_offset| mmap_offset.first().and_then(|off| {
+ self.vectors
+ .get_many_sequential(off.offset as VectorOffsetType, off.count as usize)
+ }))
(You might even delegate to get_multi_opt
when the fast path fails.)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
/// Returns None if key is not found | |
fn get_multi_opt_sequential( | |
&self, | |
key: PointOffsetType, | |
) -> Option<TypedMultiDenseVectorRef<T>> { | |
self.offsets | |
.get_sequential(key as VectorOffsetType) | |
.and_then(|mmap_offset| { | |
let mmap_offset = mmap_offset.first().expect("mmap_offset must not be empty"); | |
self.vectors.get_many_sequential( | |
mmap_offset.offset as VectorOffsetType, | |
mmap_offset.count as usize, | |
) | |
}) | |
.map(|flattened_vectors| TypedMultiDenseVectorRef { | |
flattened_vectors, | |
dim: self.vectors.dim(), | |
}) | |
} | |
/// Returns None if key is not found | |
fn get_multi_opt_sequential( | |
&self, | |
key: PointOffsetType, | |
) -> Option<TypedMultiDenseVectorRef<T>> { | |
self.offsets | |
.get_sequential(key as VectorOffsetType) | |
.and_then(|mmap_offset| mmap_offset.first().and_then(|off| { | |
self.vectors | |
.get_many_sequential(off.offset as VectorOffsetType, off.count as usize) | |
})) | |
.map(|flattened_vectors| TypedMultiDenseVectorRef { | |
flattened_vectors, | |
dim: self.vectors.dim(), | |
}) | |
} |
/// Get the vector by the given key with potential optimizations for sequential reads. | ||
fn get_vector_sequential(&self, key: PointOffsetType) -> CowVector; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Provide a default fallback for get_vector_sequential
All concrete storages had to implement a trivial
get_vector_sequential
that just calls get_vector
.
Adding a default implementation inside the trait eliminates boiler-plate
and prevents future breakage when a new storage is added but the
sequential fast-path is not important.
- fn get_vector_sequential(&self, key: PointOffsetType) -> CowVector;
+ fn get_vector_sequential(&self, key: PointOffsetType) -> CowVector {
+ // Fallback to the random-access version. Storages can override for
+ // true sequential I/O optimisation.
+ self.get_vector(key)
+ }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
/// Get the vector by the given key with potential optimizations for sequential reads. | |
fn get_vector_sequential(&self, key: PointOffsetType) -> CowVector; | |
/// Get the vector by the given key with potential optimizations for sequential reads. | |
fn get_vector_sequential(&self, key: PointOffsetType) -> CowVector { | |
// Fallback to the random-access version. Storages can override for | |
// true sequential I/O optimisation. | |
self.get_vector(key) | |
} |
fn get_multi_opt_sequential(&self, key: PointOffsetType) | ||
-> Option<TypedMultiDenseVectorRef<T>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Same fallback for get_multi_opt_sequential
For symmetry with the dense case we can avoid duplicating “no-op”
implementations in every in-memory multi-vector storage:
- fn get_multi_opt_sequential(&self, key: PointOffsetType)
- -> Option<TypedMultiDenseVectorRef<T>>;
+ fn get_multi_opt_sequential(
+ &self,
+ key: PointOffsetType,
+ ) -> Option<TypedMultiDenseVectorRef<T>> {
+ self.get_multi_opt(key)
+ }
* naive implementation of batched reader for segment construction * Make reads sequential in BatchedVectorReader (#6508) * Add sequential reads and improve BatchedVectorReader * implement get_many_sequential --------- Co-authored-by: generall <andrey@vasnetsov.com> * implement get_many_sequential --------- Co-authored-by: Jojii <15957865+JojiiOfficial@users.noreply.github.com>
Main measurements are in
#6508