Skip to content

Conversation

coszio
Copy link
Contributor

@coszio coszio commented May 20, 2025

builds on top of #6528

This PR:

  • adds a few improvements to generic posting list
  • swaps the implementation for a compressed posting list in full text index

todo

  • make sure it can reuse structures created from the previous implementation
  • add hw measurement (in the next PR)
  • remove previous implementation (since the compatibility test depends from this, we can postpone it to do it later)

@coszio coszio force-pushed the full-text-with-generic-posting-list branch from eadb3fb to e708e96 Compare May 21, 2025 23:04
Comment on lines 31 to 33
let smallest_posting_idx = postings
.iter()
.enumerate()
.min_by_key(|(_idx, posting)| posting.len())
.map(|(idx, _posting)| idx)
.unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but have we considered sorting all posting lists, so that the smallest posting lists are checked first?

@coszio coszio force-pushed the full-text-with-generic-posting-list branch from 5272bfc to c2d4339 Compare May 22, 2025 15:22
Base automatically changed from generalize-compressed-posting-list to dev May 22, 2025 16:40
@coszio coszio force-pushed the full-text-with-generic-posting-list branch from c2d4339 to 13d9138 Compare May 22, 2025 18:31
@coszio coszio changed the title [WIP] Full text index with generic posting list Full text index with generic posting list May 22, 2025
@coszio coszio marked this pull request as ready for review May 22, 2025 18:39
@coszio coszio requested review from timvisee and generall May 22, 2025 18:39
Copy link
Contributor Author

@coszio coszio May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is a copy of mmap_postings.rs before the changes in this PR.

This file, along with the entire compressed_posting module can safely be deleted later. For now we keep them to ensure compatibility of file format

Copy link
Contributor

coderabbitai bot commented May 22, 2025

📝 Walkthrough

"""

Walkthrough

This change refactors the posting list implementation in the full-text index system by introducing a generic, zero-copy, little-endian-compliant posting list structure using the zerocopy crate. It replaces previous custom compressed posting list types and readers with new generic types (PostingList, PostingListView, RemainderPosting, etc.) and updates all related modules to use these unified abstractions. Iterator logic is enhanced with new methods and trait bounds, and public APIs are adjusted for broader export and trait compatibility. The memory-mapped postings implementation is updated to use these new types, and compatibility tests are introduced to ensure correctness. Several methods for converting posting lists to vectors or checking emptiness are removed or replaced.

Possibly related PRs

  • Generalize compressed posting list #6528: Introduces and generalizes the compressed posting list abstraction, which this PR extends with zerocopy integration, iterator enhancements, and API refinements, directly modifying the same core structures and modules.

Suggested reviewers

  • timvisee
  • generall
    """

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70001c7 and 05a739d.

📒 Files selected for processing (1)
  • lib/posting_list/src/posting_list.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/posting_list/src/posting_list.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🔭 Outside diff range comments (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)

1-17: ⚠️ Potential issue

Missing size_of import mirrors the issue in the legacy file

size_of is referenced dozens of times but never brought into scope.

 use std::io::Write;
 use std::path::PathBuf;
+use std::mem::size_of;   // ← add
🧹 Nitpick comments (8)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1)

51-59: Consider using test assertion libraries for better error messages.

While the current assertion is functional, consider using a testing library like pretty_assertions for more readable output when assertions fail, especially when comparing complex data structures.

- assert!(
-   id_model == id_old && id_model == elem_new.id,
-   "Mismatch at token_id {token_id}: offset: {offset}, old: {id_old}, new: {}, model: {id_model}",
-   elem_new.id
- );
+ use pretty_assertions::assert_eq;
+ assert_eq!(id_model, id_old, "Model vs Old mismatch at token_id {token_id}, offset: {offset}");
+ assert_eq!(id_model, elem_new.id, "Model vs New mismatch at token_id {token_id}, offset: {offset}");
lib/posting_list/src/visitor.rs (1)

159-163: Unnecessary temporary when re-emitting remainder element

You allocate id then immediately call id.get(); the binding is redundant.

-let id = e.id;
-
-PostingElement { id: id.get(), value }
+PostingElement {
+    id: e.id.get(),
+    value,
+}
lib/posting_list/src/iterator.rs (1)

70-77: Avoid double-cloning PostingElement in next()

next() currently clones the element once to feed current_elem and again to hand the value to the caller (Option::inspect + clone).
For H::Value types larger than zero bytes this is unnecessary churn.

A lighter variant keeps a reference or performs a single move:

-        let next_opt = self.visitor.get_by_offset(self.offset).inspect(|_| {
-            self.offset += 1;
-        });
-        self.current_elem = next_opt.clone();
-        next_opt
+        let next_opt = self.visitor.get_by_offset(self.offset);
+        if let Some(ref elem) = next_opt {
+            self.offset += 1;
+            self.current_elem = Some(elem.clone());
+        }
+        next_opt

This preserves semantics while eliminating one clone per element.

lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (1)

24-30: Rename the filter parameter for clarity

Inside the iterator pipeline the parameter filter is invoked inside a .filter(...) adaptor, which easily leads to “filter inside filter” mental gymnastics.
Consider renaming the argument to something like doc_filter or pre_filter to make the intent crystal-clear.

-fn intersect_compressed_postings_iterator<'a>(
-    mut postings: Vec<IdsPostingListView<'a>>,
-    filter: impl Fn(PointOffsetType) -> bool + 'a,
+fn intersect_compressed_postings_iterator<'a>(
+    mut postings: Vec<IdsPostingListView<'a>>,
+    doc_filter: impl Fn(PointOffsetType) -> bool + 'a,

…and update the usage accordingly.

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (1)

164-173: Builder loop could be replaced by a zero-copy or bulk constructor

The conversion allocates a new PostingBuilder for every posting and pushes IDs one-by-one:

for id in posting.iter() {
    builder.add_id(id);
}

If PostingBuilder (or IdsPostingList) already provides a from_iter, extend, or zero-copy from_sorted_slice API, using it will:

  1. Avoid the per-ID method dispatch overhead.
  2. Allow the builder to pre-allocate the exact capacity, reducing reallocations.
  3. Shorten the conversion code.

Consider exposing such an API in posting_list and using it here.

lib/posting_list/src/posting_list.rs (1)

105-110: iter() forces Clone unnecessarily

Requiring H::Value: Clone prevents iterating over lists whose values are non-cloneable (e.g., Box<dyn …>).
Consider yielding &H::Value (or Cow) from PostingIterator so that cloning is not mandatory.
This widens API usability without performance drawbacks for Copy types.

lib/posting_list/src/view.rs (2)

68-78: IntoIterator imposes a Clone bound on values – is it necessary?

Requiring H::Value: Clone propagates to every caller that wants to iterate
over a PostingListView, even if they only need borrowed data. If the value
type is large (e.g. a weighted payload), the extra Clone can be costly or
impossible.

If the iterator already yields references (or Copy types), you can drop the
bound and avoid the constraint:

-impl<'a, H: ValueHandler> IntoIterator for PostingListView<'a, H>
-where
-    H::Value: Clone,
+impl<'a, H: ValueHandler> IntoIterator for PostingListView<'a, H>

85-96: to_owned performs full allocations – consider zero-copy alternatives

Cloning id_data, var_size_data, and the entire chunks/remainders
arrays can be expensive for large posting lists. If the only goal is to
obtain an owned version that lives longer than the source slice, a cheap
Arc<[u8]>/Arc<[T]> (or Box<[_]>) could provide ownership without data
duplication.

This will reduce peak memory and improve conversion speed for large indexes.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ca170c and 13d9138.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • lib/posting_list/Cargo.toml (1 hunks)
  • lib/posting_list/src/builder.rs (4 hunks)
  • lib/posting_list/src/iterator.rs (2 hunks)
  • lib/posting_list/src/lib.rs (2 hunks)
  • lib/posting_list/src/posting_list.rs (4 hunks)
  • lib/posting_list/src/tests.rs (1 hunks)
  • lib/posting_list/src/value_handler.rs (3 hunks)
  • lib/posting_list/src/view.rs (9 hunks)
  • lib/posting_list/src/visitor.rs (5 hunks)
  • lib/segment/Cargo.toml (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/compressed_posting/compressed_chunks_reader.rs (0 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/posting_list.rs (0 hunks)
  • lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (5 hunks)
💤 Files with no reviewable changes (2)
  • lib/segment/src/index/field_index/full_text_index/posting_list.rs
  • lib/segment/src/index/field_index/full_text_index/compressed_posting/compressed_chunks_reader.rs
🧰 Additional context used
🧬 Code Graph Analysis (3)
lib/posting_list/src/lib.rs (2)
lib/posting_list/src/posting_list.rs (2)
  • view (80-98)
  • visitor (100-103)
lib/posting_list/src/view.rs (1)
  • visitor (81-83)
lib/posting_list/src/visitor.rs (3)
lib/posting_list/src/posting_list.rs (1)
  • is_empty (116-118)
lib/posting_list/src/view.rs (1)
  • is_empty (230-232)
lib/segment/src/index/field_index/full_text_index/posting_list.rs (1)
  • is_empty (34-36)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (4)
lib/posting_list/src/posting_list.rs (2)
  • visitor (100-103)
  • iter (105-110)
lib/posting_list/src/view.rs (1)
  • visitor (81-83)
lib/posting_list/src/visitor.rs (1)
  • contains (100-118)
lib/segment/src/index/field_index/full_text_index/posting_list.rs (2)
  • contains (39-41)
  • iter (44-46)
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests-consensus
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (26)
lib/posting_list/Cargo.toml (1)

15-15: Good addition of zerocopy dependency.

Adding the zerocopy crate as a dependency aligns with the PR's goal of implementing a generic, zero-copy posting list structure. This will enable more efficient memory layout guarantees and safer serialization/deserialization operations.

lib/segment/Cargo.toml (1)

120-120: Appropriate integration of posting_list crate.

Adding the posting_list crate as a dependency is a good approach for integrating the new generic posting list implementation. This dependency enables the segment library to use the improved posting list abstractions and data structures.

lib/segment/src/index/field_index/full_text_index/mod.rs (1)

11-12: Good approach for maintaining backward compatibility.

Adding the #[cfg(test)] attribute to the compressed_posting module is a sound strategy during this transition phase. It ensures the old implementation is available for compatibility tests while removing it from production code, aligning with the PR objective of replacing the compressed posting list implementation while maintaining compatibility.

lib/posting_list/src/tests.rs (1)

112-112: Appropriate trait bound addition for testing.

Adding the std::fmt::Debug trait bound to H::Value in the test function is beneficial for producing meaningful assertion failure messages. This aligns with the broader changes where Debug was removed from production code trait bounds but kept for testing purposes.

lib/posting_list/src/value_handler.rs (4)

14-14: Appropriate removal of Debug constraint.

Removing the Debug trait bound from Value type makes the ValueHandler trait more flexible and less restrictive, which is good practice for generic abstractions.


16-16: Appropriate removal of Debug constraint.

Similar to the Value type, removing the Debug bound from Sized type increases flexibility while maintaining the essential Sized + Copy bounds needed for the type's functionality.


34-34: Good addition of trait derivations.

Adding #[derive(Debug, Clone, Copy)] to SizedHandler ensures the struct implements these traits, which is appropriate since it's meant to be debuggable and copyable.


54-54: Good addition of trait derivations.

Adding #[derive(Debug, Clone, Copy)] to UnsizedHandler ensures the struct implements these traits, which is appropriate since it's meant to be debuggable and copyable.

lib/posting_list/src/lib.rs (3)

18-18: Appropriate removal of Debug constraint.

Removing the Debug trait bound from SizedValue aligns with the changes in value_handler.rs, creating a more flexible trait that focuses only on the essential requirements (Sized + Copy).


24-24: Appropriate removal of Debug constraint.

Removing the Debug trait bound from UnsizedValue makes this trait more flexible and consistent with the changes to other traits in this refactoring.


51-54: Good expansion of public API.

Exporting additional types (RemainderPosting, ValueHandler, PostingListComponents, and PostingVisitor) appropriately expands the public API surface to support the new generic posting list architecture. This enables users of the library to work with these abstractions directly.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4)

13-13: Good replacement of ChunkReader import.

Replacing the import of ChunkReader with IdsPostingListView correctly implements the refactoring goal of using the generic posting list instead of the specialized compressed posting list.


24-28: Good approach for maintaining backward compatibility.

Adding test-only modules for old implementations is a prudent approach for ensuring backward compatibility during this refactoring. This aligns with the PR objective of maintaining compatibility with existing data structures.


137-137: Appropriate return type update.

Changing the return type of iter_postings to impl Iterator<Item = Option<IdsPostingListView<'a>>> correctly reflects the refactoring to use the generic posting list types.


282-283: Method call updated correctly.

Changing from a method like .reader() to .visitor() properly adapts to the new API of the generic posting list. The visitor() method provides access to the functionality needed to check if a posting list contains a specific point ID.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1)

13-61: Well-designed compatibility test.

This test effectively verifies compatibility between old and new posting list implementations by:

  1. Generating random sets of document IDs with various lengths
  2. Creating posting lists using the old implementation
  3. Writing them to a file using the old format
  4. Reading them back using both old and new implementations
  5. Verifying that both implementations return identical IDs

This comprehensive test is crucial for ensuring backward compatibility during this refactoring.

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (1)

289-293: Borrow checker issue mirrors the previous one

new_posting.iter() yields PostingElement; you then map the id and call
orig_posting.contains(point_id), which is fine.
However the id field is a U32 wrapper in the new code. Call .get() to
obtain a PointOffsetType explicitly and make the intent obvious:

-    .map(|elem| elem.id)
+    .map(|elem| elem.id.get())
lib/posting_list/src/builder.rs (1)

83-87: Overflow check for PointOffsetType → U32 conversion

id is a PointOffsetType (aliased to u32 today) but the codebase has
started experimenting with u64 in places. If that change lands, a silent
wrap-around here would corrupt postings.

Add an explicit check:

-remainders.push(RemainderPosting {
-    id: U32::from(id),
+remainders.push(RemainderPosting {
+    id: U32::from(
+        u32::try_from(id).expect("point id must fit into 32 bits")
+    ),
     value,
 });
lib/posting_list/src/visitor.rs (3)

33-35: is_empty is fine but duplicate of len() == 0

The helper is handy; consider delegating to self.len() == 0 or marking the
method #[inline] to make the intention explicit.
No action required.


50-98: Edge-case: search_greater_or_equal skips the first remainder

When id is smaller than every element in the posting list
(find_chunk returns None) the remainder search starts immediately.
If the list contains chunks, search_in_remainders looks into remainders only,
so the very first element (initial id of chunk 0) is ignored.

Consider short-circuiting before remainder search:

// after `find_chunk`
if chunk_index.is_none() {
    // the first element in the list is the smallest id
    return Some(0);
}

Otherwise a query for an id below the posting list range yields None instead
of offset 0.


167-170: Extra Clone bound correctly added

The iterator now yields cloned values – the new bound is required and properly
scoped.
Looks good.

lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (1)

42-55: Duplicate document IDs can slip through the intersection

The algorithm emits one value per occurrence in the smallest posting list.
If that list contains duplicates, the resulting stream will also contain duplicates even though an intersection is typically expected to behave like a mathematical set.

If duplicates are not meaningful for downstream consumers, deduplicate here:

let and_iter = smallest_posting_iterator
    .map(|elem| elem.id)
    .dedup() // from itertools, or roll your own simple `scan`-based dedup
    .filter(move |id| { /* unchanged */ })

Alternatively, guarantee uniqueness earlier when building IdsPostingList.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs (1)

1-13: ⚠️ Potential issue

size_of is used but never imported – compilation will fail

size_of::<T>() is referenced multiple times in the file, yet std::mem::size_of is not in scope.
Rust will not implicitly import it, producing an unresolved import error during build.

 use std::io::Write;
 use std::path::PathBuf;
+use std::mem::size_of;   // ← add

 // …
⛔ Skipped due to learnings
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5954
File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228
Timestamp: 2025-03-12T13:40:56.980Z
Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5954
File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228
Timestamp: 2025-03-12T13:40:56.980Z
Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)

181-187: Risk of silent overflow when remainder_count exceeds 255

remainder_count is stored as u8, but remainders.len() is cast without a check.
While current builders keep remainders below 128 (CHUNK_LEN), a future change could exceed this and silently wrap.

Consider changing the header field to u16 (and adjust serialization accordingly) or at least assert:

assert!(remainders.len() <= u8::MAX as usize);
lib/posting_list/src/posting_list.rs (1)

112-118: len() can over-report when the last chunk is not completely filled

len() assumes every stored chunk contains CHUNK_LEN elements:

self.chunks.len() * CHUNK_LEN + self.remainders.len()

If the builder ever stores a partially-filled chunk (e.g., for an all-full compression path in the future),
this will overestimate. Either document that each chunk is guaranteed full, or compute the actual count
by decoding the last chunk’s header/offset.

lib/posting_list/src/view.rs (1)

24-30: Type asymmetry for last_id between view and components is confusing

PostingListView stores last_id as Option<PointOffsetType> while
PostingListComponents uses Option<U32>, requiring ad-hoc conversions in
both directions. Aligning these types (e.g. using U32 everywhere) would
simplify the API and prevent accidental loss of endianness guarantees.

Comment on lines 142 to 149
let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
BitPackerImpl::new().decompress_strictly_sorted(
chunk.initial_id.checked_sub(1),
&self.id_data[chunk.offset as usize..chunk.offset as usize + compressed_size],
BitPackerImpl::new().decompress_sorted(
chunk.initial_id.get(),
&self.id_data
[chunk.offset.get() as usize..chunk.offset.get() as usize + compressed_size],
decompressed_chunk,
chunk_bits as u8,
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add overflow / range-check when recovering chunk_bits

chunk_bits is reconstructed via integer arithmetic and then cast to u8.
If the compressed size is corrupted or larger than expected, the computed
value may exceed u8::MAX, leading to silent truncation and UB-adjacent
behaviour inside bitpacking.

Consider guarding the cast in debug builds (or even release) to fail fast:

-        let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
+        let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
+        debug_assert!(
+            chunk_bits <= u8::MAX as usize,
+            "bit width {} exceeds u8::MAX – corrupted posting list?",
+            chunk_bits
+        );
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
BitPackerImpl::new().decompress_strictly_sorted(
chunk.initial_id.checked_sub(1),
&self.id_data[chunk.offset as usize..chunk.offset as usize + compressed_size],
BitPackerImpl::new().decompress_sorted(
chunk.initial_id.get(),
&self.id_data
[chunk.offset.get() as usize..chunk.offset.get() as usize + compressed_size],
decompressed_chunk,
chunk_bits as u8,
);
let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
+ debug_assert!(
+ chunk_bits <= u8::MAX as usize,
+ "bit width {} exceeds u8::MAX – corrupted posting list?",
+ chunk_bits
+ );
BitPackerImpl::new().decompress_sorted(
chunk.initial_id.get(),
&self.id_data
[chunk.offset.get() as usize..chunk.offset.get() as usize + compressed_size],
decompressed_chunk,
chunk_bits as u8,
);
🤖 Prompt for AI Agents
In lib/posting_list/src/view.rs around lines 142 to 149, the calculation of
chunk_bits involves integer arithmetic and is cast to u8 without checks, risking
overflow and silent truncation if compressed_size is corrupted or too large. Add
a range check before casting chunk_bits to u8 to ensure it does not exceed
u8::MAX, and handle the error appropriately, such as by panicking in debug
builds or returning a Result error, to fail fast and avoid undefined behavior.

@coszio coszio force-pushed the full-text-with-generic-posting-list branch from 3566915 to 726975b Compare May 26, 2025 19:30
@coszio coszio force-pushed the full-text-with-generic-posting-list branch from 726975b to 1b12a92 Compare May 26, 2025 19:33
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
lib/posting_list/src/view.rs (1)

143-143: Add overflow protection when calculating chunk_bits.

This calculation can overflow and cause silent truncation if compressed_size is corrupted or unexpectedly large, leading to undefined behavior in bitpacking operations.

The same overflow issue flagged in previous reviews still exists. Consider adding the suggested range check:

 let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
+debug_assert!(
+    chunk_bits <= u8::MAX as usize,
+    "bit width {} exceeds u8::MAX – corrupted posting list?",
+    chunk_bits
+);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 726975b and 1b12a92.

📒 Files selected for processing (5)
  • lib/posting_list/src/iterator.rs (2 hunks)
  • lib/posting_list/src/posting_list.rs (4 hunks)
  • lib/posting_list/src/tests.rs (9 hunks)
  • lib/posting_list/src/view.rs (9 hunks)
  • lib/posting_list/src/visitor.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • lib/posting_list/src/tests.rs
  • lib/posting_list/src/posting_list.rs
  • lib/posting_list/src/iterator.rs
  • lib/posting_list/src/visitor.rs
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.482Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
lib/posting_list/src/view.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.482Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: storage-compat-test
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: lint
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (5)
lib/posting_list/src/view.rs (5)

69-79: IntoIterator implementation looks well-designed.

The IntoIterator implementation properly delegates to the visitor pattern and includes appropriate trait bounds requiring H::Value: Clone. This maintains consistency with the iterator design.


86-97: Good design choice for to_owned method.

The comment correctly explains why this isn't implemented as the ToOwned trait - the borrow relationship would be problematic. The implementation properly clones all components to create an owned PostingList.


145-153: Correct usage of zerocopy types in decompress_chunk.

The changes properly use .get() methods to access fields from zerocopy types (U32), ensuring correct little-endian byte order handling. The slice bounds are calculated correctly.


163-174: Improved type safety in ids_range method.

The return type change to RangeInclusive<u32> provides better type safety. The logic correctly handles empty posting lists and properly extracts IDs using .get() methods from zerocopy types.


184-219: Consistent zerocopy type usage in search methods.

All ID comparisons now properly use .get() to extract values from U32 types, maintaining consistency with the zerocopy migration. The binary search logic remains correct.


/// A non-owning view of [`PostingList`].
#[derive(Debug, Clone)]
#[derive(Debug)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

❓ Verification inconclusive

Verify impact of removing Clone derive from PostingListView.

The Clone derive has been removed from PostingListView. This is a breaking change that may affect code depending on cloning this type.


🏁 Script executed:

#!/bin/bash
# Description: Check for code that depends on cloning PostingListView
# Expected: Find any usage of .clone() on PostingListView instances

rg -A 3 -B 3 "PostingListView.*\.clone\(\)" 
rg -A 3 -B 3 "\.clone\(\).*PostingListView"
ast-grep --pattern 'let $_ = $var.clone()' | grep -i postinglistview

Length of output: 162


No internal .clone() usage found on PostingListView
We ran ripgrep and ast-grep across the repo to locate any .clone() calls on PostingListView—none were found. Removing Clone from PostingListView does not break internal code. However, since this type is part of the public API, this is a breaking change for downstream consumers:

  • Bump the crate’s major version.
  • Document the removal of Clone in the changelog or migration guide.
🤖 Prompt for AI Agents
In lib/posting_list/src/view.rs at line 15, the Clone derive was removed from
PostingListView, which breaks downstream code relying on cloning this type. To
fix this, update the crate's major version to reflect the breaking change and
add clear documentation about the removal of Clone in the changelog or migration
guide to inform users of the API change.

@@ -8,28 +8,75 @@ use crate::visitor::PostingVisitor;

pub struct PostingIterator<'a, H: ValueHandler> {
visitor: PostingVisitor<'a, H>,
current_id: Option<PointOffsetType>,
current_elem: Option<PostingElement<H::Value>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change now requires Copy. What the reason for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that we can cache the value and avoid going into the posting list if not necessary

/// Internal structure to store the remainder of the posting list. So that IntoBytes can be derived
#[derive(Clone, Debug, FromBytes, Immutable, IntoBytes, KnownLayout)]
#[repr(C)]
pub struct RemainderPosting<S: Sized> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is internal, can it be not pub? Or at elast pub(crate)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be public because we need to construct a PostingListView from the specific mmap postings implementation. I guess the correct description would be

Structure to store (...) The difference with PostingElement is that this is zero-copy friendly

#[repr(C)]
pub struct RemainderPosting<S: Sized> {
pub id: U32,
pub value: S,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably a nitpick, but why S here and V in PostingElement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is not necessarily the actual value for unsized values, but rather whatever is used as fixed-sized value (in case of unsized values, it is a u32 pointing into var-sized-data)

@coszio coszio merged commit 7a05502 into dev May 27, 2025
17 checks passed
@coszio coszio deleted the full-text-with-generic-posting-list branch May 27, 2025 22:53
@generall generall added this to the Phrase matching milestone Jul 17, 2025
generall added a commit that referenced this pull request Jul 17, 2025
* integrate generic posting list in full-text index

* improve intersection fn

improve iterator intersection

* make old module just for test

* recover old implementation of MmapPostings for tests

* add compatibility test

* fix compression

* clippy

* nits

* check if first id is already greater

protects against the case where the iterator's current element hasn't
been extracted yet

* update current element when there is no greater or equal

* reuse visitor in test

* optimize alignment

* bound checks and refactor is_in_range

* review

* edit comment

---------

Co-authored-by: generall <andrey@vasnetsov.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants