
Conversation


@generall generall commented May 5, 2025

Main measurements are in #6508.

* Add sequential reads and improve BatchedVectorReader

* implement get_many_sequential

---------

Co-authored-by: generall <andrey@vasnetsov.com>
@generall generall marked this pull request as ready for review May 9, 2025 11:03

coderabbitai bot commented May 9, 2025

📝 Walkthrough

This change introduces sequential vector access methods across multiple vector storage implementations and the core VectorStorage trait. The new method get_vector_sequential is added to the VectorStorage trait and implemented for all concrete vector storage types, often delegating to existing methods or providing a new sequential access path where possible. The ChunkedVectorStorage trait and its implementations also receive get_sequential and get_many_sequential methods for optimized sequential reads. The MultiVectorStorage trait is similarly extended with a sequential access method. A new BatchedVectorReader iterator is introduced to facilitate efficient, batched vector reads during segment construction, and the SegmentBuilder update logic is refactored to use this iterator. Additionally, a new internal module batched_reader is added to the segment_constructor module to support this functionality.
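The trait extension described above can be sketched in miniature. This is a hedged illustration, not the Qdrant code: the real `VectorStorage` trait, `CowVector`, and `PointOffsetType` are richer, and the storage struct here is hypothetical. It shows the shape of the pattern: a sequential variant on the trait that storages may override with an optimized path, while in-memory storages simply delegate.

```rust
// Simplified stand-in for Qdrant's PointOffsetType.
type PointOffsetType = u32;

trait VectorStorage {
    fn get_vector(&self, key: PointOffsetType) -> Vec<f32>;

    // Sequential variant: may exploit readahead in mmap-backed storages.
    // Default falls back to random access, so in-memory storages need
    // no extra code (this default is one of the review suggestions below;
    // the PR itself implements the method on each concrete type).
    fn get_vector_sequential(&self, key: PointOffsetType) -> Vec<f32> {
        self.get_vector(key)
    }
}

// Hypothetical in-memory storage used only for this sketch.
struct InMemoryStorage {
    vectors: Vec<Vec<f32>>,
}

impl VectorStorage for InMemoryStorage {
    fn get_vector(&self, key: PointOffsetType) -> Vec<f32> {
        self.vectors[key as usize].clone()
    }
}

fn main() {
    let storage = InMemoryStorage {
        vectors: vec![vec![1.0, 2.0], vec![3.0, 4.0]],
    };
    // For in-memory storage, the sequential path is just random access.
    assert_eq!(storage.get_vector_sequential(1), storage.get_vector(1));
    println!("ok");
}
```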

Suggested reviewers

  • coszio



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 19e5758 and d6a0d3c.

📒 Files selected for processing (1)
  • lib/segment/src/vector_storage/in_ram_persisted_vectors.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/segment/src/vector_storage/in_ram_persisted_vectors.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: storage-compat-test
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: integration-tests
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: lint
  • GitHub Check: test-low-resources
  • GitHub Check: test-consensus-compose

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (6)
lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (1)

199-207: Minor optimisation: avoid extra allocation

T::into_float_multivector already returns a CowMultiVector. Wrapping it inside a temporary and then converting to CowVector::MultiDense creates an extra Cow layer. A small rearrangement avoids this and shortens the call-chain:

-            .map(|multi_dense_vector| {
-                CowVector::MultiDense(T::into_float_multivector(CowMultiVector::Borrowed(
-                    multi_dense_vector,
-                )))
-            })
+            .map(|mv| CowVector::from(T::into_float_multivector(CowMultiVector::Borrowed(mv))))

(Not performance-critical, but keeps the hot path lean.)

lib/segment/src/vector_storage/vector_storage_base.rs (1)

640-666: Match-arm explosion – consider a macro or enum_dispatch

The new get_vector_sequential implementation duplicates the long
pattern-matching blocks already present for get_vector, datatype,
etc. Each addition of a storage variant now requires copying another
block, making the code error-prone.

Rust crates like enum_dispatch or a simple local macro could collapse
these matches into a single declarative list:

macro_rules! delegate {
    // The variant list is passed explicitly so the repetition is bound.
    ($self:expr, $key:expr, $method:ident, [$($variant:ident),* $(,)?]) => {
        match $self {
            $(VectorStorageEnum::$variant(v) => v.$method($key),)*
        }
    };
}

(Not a blocker for this PR, but worth addressing before the list grows
again.)

lib/segment/src/segment_constructor/batched_reader.rs (4)

30-37: Derive Debug & tighten visibility

PointData and BatchedVectorReader will likely appear in logs or
tests. Deriving Debug (and potentially Clone) makes inspection and
reuse easier without impacting performance.

-pub struct PointData {
+#[derive(Debug)]
+pub struct PointData {
...
-pub struct BatchedVectorReader<'a> {
+#[derive(Debug)]
+pub struct BatchedVectorReader<'a> {

40-56: Avoid pre-filling the buffer with default vectors

Initialising the whole buffer with CowVector::default() costs O(BATCH_SIZE)
and performs unnecessary allocations. A more efficient pattern is to
allocate the vector with capacity and push actual items, or keep a
Vec<MaybeUninit<_>> and assume full overwrite.

-        // We need to allocate the buffer with the size of the batch,
-        // but we don't know the size of the vectors.
-        // So we use a placeholder vector with size 0.
-        let buffer = vec![(CowVector::default(), false); BATCH_SIZE];
+        // Allocate empty; will be extended to at most `BATCH_SIZE`
+        // in `refill_buffer`.
+        let buffer = Vec::with_capacity(BATCH_SIZE);

Remember to buffer.clear() in refill_buffer before repopulating.
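The clear-then-push pattern suggested above can be sketched as follows. This is an illustrative standalone example, not the PR's code: `refill_buffer` and the `(Vec<f32>, bool)` item type are simplified stand-ins for the real `(CowVector, bool)` buffer entries.

```rust
const BATCH_SIZE: usize = 256;

// Refill the buffer in place: `clear` drops stale entries but keeps the
// allocation, so no placeholder values are ever constructed and the
// capacity is reused across batches.
fn refill_buffer(buffer: &mut Vec<(Vec<f32>, bool)>, batch: &[Vec<f32>]) {
    buffer.clear();
    for v in batch.iter().take(BATCH_SIZE) {
        buffer.push((v.clone(), true));
    }
}

fn main() {
    let mut buffer: Vec<(Vec<f32>, bool)> = Vec::with_capacity(BATCH_SIZE);
    refill_buffer(&mut buffer, &[vec![1.0], vec![2.0]]);
    assert_eq!(buffer.len(), 2);

    let cap = buffer.capacity();
    refill_buffer(&mut buffer, &[vec![3.0]]);
    // Capacity is unchanged: clear + push within capacity never reallocates.
    assert_eq!(buffer.capacity(), cap);
    println!("ok");
}
```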


68-82: Missed seg_to_points_buffer.clear() leads to accumulating entries

drain() empties the map but leaves the allocation intact; successive
refills will grow a new Vec for every point pushed, never reusing the
previous capacity. Calling seg_to_points_buffer.clear() before reuse
keeps allocations stable:

-        // Read by segments, as we want to localize reads as much as possible.
+        self.seg_to_points_buffer.clear();
+
+        // Read by segments to localise I/O.

83-91: Per-point fetch negates sequential read advantage

Inside each segment loop we still call
get_vector_sequential one point at a time. For dense storages able to
return slices, a batched variant (get_dense_batch, get_many_sequential, …)
would eliminate syscall/lock overhead and maximise readahead.

If feasible:

// 1. Collect internal ids for this segment.
// 2. Storage-specific fast path to load them in one call.
// 3. Fall back to current per-id loop otherwise.

This would better exploit the sequential APIs added in this PR.
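The collect-then-bulk-read step can be sketched like this. It is a hedged illustration under assumed names: `group_by_segment` is hypothetical, and the per-segment id lists it produces would be handed to a storage-specific batch call such as `get_many_sequential` where one exists, with the per-id loop as the fallback.

```rust
use std::collections::BTreeMap;

// Group (segment index, internal id) pairs by segment and sort ids
// ascending within each segment, so that each segment can be read with
// one localized, mostly-sequential pass instead of per-point fetches.
fn group_by_segment(points: &[(usize, u32)]) -> BTreeMap<usize, Vec<u32>> {
    let mut groups: BTreeMap<usize, Vec<u32>> = BTreeMap::new();
    for &(segment, internal_id) in points {
        groups.entry(segment).or_default().push(internal_id);
    }
    for ids in groups.values_mut() {
        ids.sort_unstable(); // ascending ids maximize readahead benefit
    }
    groups
}

fn main() {
    let points = [(1, 7), (0, 3), (1, 2), (0, 9)];
    let groups = group_by_segment(&points);
    assert_eq!(groups[&0], vec![3, 9]);
    assert_eq!(groups[&1], vec![2, 7]);
    println!("ok");
}
```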

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1bdf1f3 and 19e5758.

📒 Files selected for processing (15)
  • lib/segment/src/fixtures/index_fixtures.rs (1 hunks)
  • lib/segment/src/segment_constructor/batched_reader.rs (1 hunks)
  • lib/segment/src/segment_constructor/mod.rs (1 hunks)
  • lib/segment/src/segment_constructor/segment_builder.rs (2 hunks)
  • lib/segment/src/vector_storage/chunked_mmap_vectors.rs (2 hunks)
  • lib/segment/src/vector_storage/chunked_vector_storage.rs (2 hunks)
  • lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1 hunks)
  • lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1 hunks)
  • lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1 hunks)
  • lib/segment/src/vector_storage/in_ram_persisted_vectors.rs (2 hunks)
  • lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (2 hunks)
  • lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (2 hunks)
  • lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1 hunks)
  • lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1 hunks)
  • lib/segment/src/vector_storage/vector_storage_base.rs (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (7)
lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (7)
lib/segment/src/fixtures/index_fixtures.rs (1)
  • get_vector_sequential (65-67)
lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
  • get_vector_sequential (126-131)
lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
  • get_vector_sequential (223-226)
lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
  • get_vector_sequential (231-234)
lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (1)
  • get_vector_sequential (364-366)
lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
  • get_vector_sequential (186-193)
lib/segment/src/vector_storage/vector_storage_base.rs (2)
  • get_vector_sequential (65-65)
  • get_vector_sequential (640-666)
lib/segment/src/fixtures/index_fixtures.rs (5)
lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
  • get_vector_sequential (223-226)
lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
  • get_vector_sequential (231-234)
lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (1)
  • get_vector_sequential (364-366)
lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
  • get_vector_sequential (194-197)
lib/segment/src/vector_storage/vector_storage_base.rs (2)
  • get_vector_sequential (65-65)
  • get_vector_sequential (640-666)
lib/segment/src/vector_storage/chunked_vector_storage.rs (2)
lib/segment/src/vector_storage/chunked_mmap_vectors.rs (2)
  • get_sequential (360-362)
  • get_many_sequential (410-412)
lib/segment/src/vector_storage/in_ram_persisted_vectors.rs (2)
  • get_sequential (48-50)
  • get_many_sequential (99-102)
lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (8)
lib/segment/src/fixtures/index_fixtures.rs (1)
  • get_vector_sequential (65-67)
lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
  • get_vector_sequential (223-226)
lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
  • get_vector_sequential (231-234)
lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (1)
  • get_vector_sequential (364-366)
lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
  • get_vector_sequential (186-193)
lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (1)
  • get_vector_sequential (199-207)
lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
  • get_vector_sequential (194-197)
lib/segment/src/vector_storage/vector_storage_base.rs (2)
  • get_vector_sequential (65-65)
  • get_vector_sequential (640-666)
lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (8)
lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (2)
  • get_multi_opt_sequential (123-140)
  • get_vector_sequential (199-207)
lib/segment/src/vector_storage/vector_storage_base.rs (3)
  • get_multi_opt_sequential (157-158)
  • get_vector_sequential (65-65)
  • get_vector_sequential (640-666)
lib/segment/src/fixtures/index_fixtures.rs (1)
  • get_vector_sequential (65-67)
lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
  • get_vector_sequential (126-131)
lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
  • get_vector_sequential (223-226)
lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
  • get_vector_sequential (231-234)
lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
  • get_vector_sequential (186-193)
lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
  • get_vector_sequential (194-197)
lib/segment/src/vector_storage/vector_storage_base.rs (8)
lib/segment/src/fixtures/index_fixtures.rs (1)
  • get_vector_sequential (65-67)
lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
  • get_vector_sequential (126-131)
lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)
  • get_vector_sequential (223-226)
lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)
  • get_vector_sequential (231-234)
lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (2)
  • get_vector_sequential (364-366)
  • get_multi_opt_sequential (313-319)
lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
  • get_vector_sequential (186-193)
lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (2)
  • get_vector_sequential (199-207)
  • get_multi_opt_sequential (123-140)
lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
  • get_vector_sequential (194-197)
lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (4)
lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (2)
  • get_multi_opt_sequential (313-319)
  • get_vector_sequential (364-366)
lib/segment/src/vector_storage/vector_storage_base.rs (3)
  • get_multi_opt_sequential (157-158)
  • get_vector_sequential (65-65)
  • get_vector_sequential (640-666)
lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
  • get_vector_sequential (126-131)
lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)
  • get_vector_sequential (186-193)
🔇 Additional comments (17)
lib/segment/src/segment_constructor/mod.rs (1)

1-1: LGTM - Added new module declaration.

The addition of the batched_reader module is a clean change that enhances the organization of the segment construction functionality.

lib/segment/src/fixtures/index_fixtures.rs (1)

65-67: Implementation matches trait requirement properly.

This implementation of get_vector_sequential correctly delegates to the existing get_vector method, which is appropriate for an in-memory test fixture where no sequential optimization is possible. This matches the pattern used in other in-memory vector storage implementations.

lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)

194-197: Well-documented implementation of sequential access.

The implementation correctly delegates to the existing get_vector method with a helpful comment explaining why no optimization is applied for in-memory storage. This is consistent with the implementation pattern in other vector storage types.

lib/segment/src/segment_constructor/segment_builder.rs (2)

42-42: LGTM - Added imports for the new batched reader functionality.

Properly importing BatchedVectorReader and PointData from the new module.


345-350: Improved vector retrieval with batched processing.

The refactoring replaces manual iteration with the new BatchedVectorReader, which should improve performance by:

  1. Batching vector reads (processing 256 points at a time)
  2. Grouping points by segment index to optimize sequential access
  3. Leveraging the new get_vector_sequential methods across vector storage implementations

This change aligns perfectly with the PR objective of introducing batched reading for segment construction.
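The batched iteration shape can be sketched as below. Only `BATCH_SIZE = 256` comes from the review comment; `process_in_batches` and its callback are illustrative names, standing in for `BatchedVectorReader`'s refill-and-drain cycle.

```rust
const BATCH_SIZE: usize = 256;

// Consume point ids in fixed-size chunks; each chunk corresponds to one
// buffer refill, inside which reads can be grouped by segment.
fn process_in_batches(point_ids: &[u32], mut handle: impl FnMut(&[u32])) -> usize {
    let mut batches = 0;
    for chunk in point_ids.chunks(BATCH_SIZE) {
        handle(chunk); // one refill + drain per chunk
        batches += 1;
    }
    batches
}

fn main() {
    let ids: Vec<u32> = (0..600).collect();
    let mut seen = 0usize;
    let batches = process_in_batches(&ids, |chunk| seen += chunk.len());
    assert_eq!(seen, 600);
    assert_eq!(batches, 3); // 256 + 256 + 88
    println!("ok");
}
```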

lib/segment/src/vector_storage/dense/simple_dense_vector_storage.rs (1)

223-226: Implementation pattern aligns with trait extension for sequential access.

The implementation correctly delegates to the existing get_vector method since this is an in-memory storage where sequential access doesn't provide performance benefits. The comment makes this design decision clear.

lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs (1)

231-234: Implementation follows the expected pattern for sequential vector access.

This implementation correctly delegates to get_vector with an explanatory comment indicating that no optimizations are currently available for gridstore. The comment's "yet" suggests potential future optimization opportunities.

lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)

126-131: Good implementation leveraging underlying sequential access optimizations.

This implementation properly leverages the get_sequential method from the underlying ChunkedVectorStorage implementation, which can provide optimized sequential reads. The vector type conversion is handled correctly for consistency with the rest of the API.

lib/segment/src/vector_storage/dense/memmap_dense_vector_storage.rs (1)

186-193: Effective use of sequential access optimization for memory-mapped storage.

The implementation properly utilizes get_vector_opt_sequential from the underlying mmap_store, which can provide significant performance improvements for sequential access patterns in memory-mapped storage. The vector type conversion matches the pattern used in other similar implementations.

lib/segment/src/vector_storage/chunked_mmap_vectors.rs (2)

360-362: Implementation adds sequential access optimization.

The get_sequential method delegates to the existing get method with force_sequential=true, enabling sequential memory access optimizations when retrieving single vectors.


409-412: Implementation adds sequential batch access optimization.

The get_many_sequential method properly delegates to the existing get_many method with force_sequential=true, enabling sequential memory access optimizations when retrieving multiple vectors.

lib/segment/src/vector_storage/chunked_vector_storage.rs (2)

21-21: Sequential vector retrieval API added to trait.

This extends the trait interface to support optimized sequential access patterns.


51-52: Sequential batch retrieval API added to trait with proper documentation.

The method documentation maintains consistency with the existing get_many method documentation.

lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (2)

313-319: Implementation of sequential access for multi-vector storage.

The implementation correctly falls back to the non-sequential version with a helpful comment explaining why no optimization is done for in-memory storage.


364-366: Implementation of sequential vector access for standard vector interface.

The method correctly delegates to the existing implementation since no sequential optimization is possible for this in-memory storage type.

lib/segment/src/vector_storage/in_ram_persisted_vectors.rs (1)

48-50: Sequential access delegated to underlying storage.

The implementation correctly forwards to the sequential access method of the underlying mmap storage.

lib/segment/src/segment_constructor/batched_reader.rs (1)

94-99: Buffer contents beyond the last batch retain large vectors

On the final iteration where points_to_insert.len() % BATCH_SIZE != 0,
slots that are no longer used keep CowVectors from the previous batch,
holding references and memory longer than needed.

Calling buffer.truncate(batch_len) after filling (or switching to the
push pattern suggested above) releases them earlier and reduces peak
RSS.

Comment on lines +122 to +140
/// Returns None if key is not found
fn get_multi_opt_sequential(
    &self,
    key: PointOffsetType,
) -> Option<TypedMultiDenseVectorRef<T>> {
    self.offsets
        .get_sequential(key as VectorOffsetType)
        .and_then(|mmap_offset| {
            let mmap_offset = mmap_offset.first().expect("mmap_offset must not be empty");
            self.vectors.get_many_sequential(
                mmap_offset.offset as VectorOffsetType,
                mmap_offset.count as usize,
            )
        })
        .map(|flattened_vectors| TypedMultiDenseVectorRef {
            flattened_vectors,
            dim: self.vectors.dim(),
        })
}

🛠️ Refactor suggestion

Potential panic on empty slice & duplication with get_multi_opt

get_multi_opt_sequential unwraps the first element of mmap_offset with expect, identical to the non-sequential variant.
If offsets.get_sequential(..) returns an empty slice (it can return Some(&[]) when the key maps to a chunk that exists but is empty), first() yields None and the expect panics, even though the function signature promises an Option.

Consider defensive handling and re-using the existing implementation to keep behaviour consistent and avoid code duplication:

-            .and_then(|mmap_offset| {
-                let mmap_offset = mmap_offset.first().expect("mmap_offset must not be empty");
-                self.vectors.get_many_sequential(
-                    mmap_offset.offset as VectorOffsetType,
-                    mmap_offset.count as usize,
-                )
-            })
+            .and_then(|mmap_offset| mmap_offset.first().and_then(|off| {
+                self.vectors
+                    .get_many_sequential(off.offset as VectorOffsetType, off.count as usize)
+            }))

(You might even delegate to get_multi_opt when the fast path fails.)


Comment on lines +64 to +66
/// Get the vector by the given key with potential optimizations for sequential reads.
fn get_vector_sequential(&self, key: PointOffsetType) -> CowVector;


🛠️ Refactor suggestion

Provide a default fallback for get_vector_sequential

All concrete storages had to implement a trivial
get_vector_sequential that just calls get_vector.
Adding a default implementation inside the trait eliminates boiler-plate
and prevents future breakage when a new storage is added but the
sequential fast-path is not important.

-    fn get_vector_sequential(&self, key: PointOffsetType) -> CowVector;
+    fn get_vector_sequential(&self, key: PointOffsetType) -> CowVector {
+        // Fallback to the random-access version. Storages can override for
+        // true sequential I/O optimisation.
+        self.get_vector(key)
+    }

Comment on lines +157 to +158
fn get_multi_opt_sequential(&self, key: PointOffsetType)
-> Option<TypedMultiDenseVectorRef<T>>;

🛠️ Refactor suggestion

Same fallback for get_multi_opt_sequential

For symmetry with the dense case we can avoid duplicating “no-op”
implementations in every in-memory multi-vector storage:

-    fn get_multi_opt_sequential(&self, key: PointOffsetType)
-    -> Option<TypedMultiDenseVectorRef<T>>;
+    fn get_multi_opt_sequential(
+        &self,
+        key: PointOffsetType,
+    ) -> Option<TypedMultiDenseVectorRef<T>> {
+        self.get_multi_opt(key)
+    }

@generall generall requested a review from JojiiOfficial May 9, 2025 11:20
@generall generall merged commit aa95b9d into dev May 9, 2025
17 checks passed
@generall generall deleted the batch-reader-for-segment-construction branch May 9, 2025 14:24
generall added a commit that referenced this pull request May 22, 2025
* naive implementation of batched reader for segment construction

* Make reads sequential in BatchedVectorReader (#6508)

* Add sequential reads and improve BatchedVectorReader

* implement get_many_sequential

---------

Co-authored-by: generall <andrey@vasnetsov.com>

* implement get_many_sequential

---------

Co-authored-by: Jojii <15957865+JojiiOfficial@users.noreply.github.com>