Skip to content

Conversation

coszio
Copy link
Contributor

@coszio coszio commented May 12, 2025

(sorry for the huge PR)

This posting list implementation should be able to replace (at least) the current one used in full-text index, and extend it so that it is possible to implement mmap friendly phrase-matching.

The main idea is that users of this crate can implement one of these two traits for their types:

  • SizedValue - which is just a marker for a Sized + Copy type
  • VarSizedValue - which needs the type to implement functions to serialize and deserialize into/from bytes.

And just with this, get the functionality of a compressed posting list, where the ids are compressed, and have a value associated to each id.
If there is no value associated, the () (unit) type is used, and there should be no performance nor space penalty from this.

The visitor interface currently only includes contains(point_id) and get_by_offset(offset) methods. But it can also be turned into an iterator.

@coszio coszio force-pushed the generalize-compressed-posting-list branch 2 times, most recently from b683f5c to da4f3a9 Compare May 15, 2025 04:29
@coszio coszio marked this pull request as ready for review May 15, 2025 21:31
@coszio coszio changed the title [WIP] Generalize compressed posting list Generalize compressed posting list May 15, 2025
@coszio
Copy link
Contributor Author

coszio commented May 15, 2025

@coderabbitai review

@qdrant qdrant deleted a comment from coderabbitai bot May 15, 2025
@coszio coszio requested a review from Copilot May 15, 2025 21:34
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR generalizes the compressed posting list implementation so it can support both fixed-size and variable-size values while enabling mmap-friendly phrase matching.

  • Introduces a new visitor, view, and iterator for the posting list.
  • Implements a unified builder pattern with a ValueHandler trait including distinct implementations for fixed-size and variable-size values.
  • Adds extensive tests to validate the new behavior.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Show a summary per file
File Description
lib/posting_list/src/visitor.rs Implements a visitor to traverse the posting list using cached chunks
lib/posting_list/src/view.rs Provides a view abstraction for decompression and search operations
lib/posting_list/src/value_handler.rs Defines the ValueHandler trait with fixed-size and variable-size implementations
lib/posting_list/src/tests.rs Adds test cases to compare the posting list against a sorted vector
lib/posting_list/src/posting_list.rs Implements posting list view and visitor interface
lib/posting_list/src/lib.rs Exposes modules and re-exports key types for public use
lib/posting_list/src/iterator.rs Implements an iterator over the posting list using the visitor
lib/posting_list/src/builder.rs Provides a unified builder interface for different value types
lib/posting_list/Cargo.toml & root Cargo.toml Updates dependency versions and includes the new posting_list module
Comments suppressed due to low confidence (1)

lib/posting_list/src/tests.rs:31

  • Missing import for std::mem::{size_of, size_of_val}. Please add this at the top of the file to ensure the tests compile correctly.
let chunks_size = size_of_val(&posting_list.chunks[0]);

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (7)
lib/posting_list/Cargo.toml (2)

6-7: Edition 2024 requires Rust ≥ 1.78 — verify MSRV

The root crate advertises rust-version = "1.85", which is compatible, but if posting_list ever gets compiled stand-alone (publish toggled on, or users cherry-pick the crate) the MSRV will implicitly be the first stable supporting edition 2024.
Consider:

- edition = "2024"
+ edition = "2024"          # requires rustc >= 1.78
+ rust-version = "1.85"     # keep in sync with workspace

12-15: Expose feature flags on transitive deps

bitpacking is declared with workspace default features.
If any future optimisation (e.g. simd) must be enabled/disabled, forwarding a crate-level feature avoids workspace-wide coupling:

[features]
default = []
simd = ["bitpacking/simd"]

Nothing to change now, just flagging for maintainability.

lib/posting_list/src/lib.rs (1)

18-23: SizedValue blanket impl missing for all Copy primitives

Right now only (), u32, u64 implement SizedValue.
If callers want to attach e.g. i32 or f32 as fixed-size payloads they must write their own impls.

Consider a blanket implementation:

-pub trait SizedValue: Sized + Copy + std::fmt::Debug {}
+pub trait SizedValue: Sized + Copy + std::fmt::Debug {}
+
+impl<T> SizedValue for T where T: Sized + Copy + std::fmt::Debug {}

to make the marker fully generic.
If the intention is to whitelist only well-tested types, please add a comment explaining the rationale.

lib/posting_list/src/posting_list.rs (1)

61-80: Expose the view as slices instead of &Vec<_> references

PostingList::view() currently copies the &Vec<T> references verbatim.
Taking the slice (&[T]) instead of the whole container gives read-only semantics, shrinks the fat-pointer size, and clearly communicates that the view is immutable.

-            id_data,
-            chunks,
+            id_data: id_data.as_slice(),
+            chunks: chunks.as_slice(),

(Repeat for var_size_data, remainders.)

Changing the field types in PostingListView to slices should be mechanical and will not affect callers.

lib/posting_list/src/iterator.rs (1)

9-13: current_id is never read – remove or expose a getter

The iterator updates current_id on every next() call but never uses it internally nor exposes it publicly. Dead fields add maintenance cost.

• If callers need the last id, provide a getter.
• Otherwise, drop the field to save one word per iterator.

lib/posting_list/src/visitor.rs (1)

52-62: Minor perf: avoid decompressing on every contains call

contains always calls self.decompressed_chunk(chunk_index) which internally checks and potentially decompresses the chunk. When scanning many ids in the same chunk this check is repeated.

Cache the result locally:

let decompressed = self.decompressed_chunk(chunk_index);
decompressed.binary_search(&id).is_ok()

This saves one Option comparison per lookup and makes the control flow slightly clearer.

lib/posting_list/src/builder.rs (1)

60-73: Avoid recomputing compression metadata and potential off-by-one bugs

chunk_bits and chunk_size are computed here, but the same values are recomputed in the second loop (lines 83–92). Besides the extra cycles, any future change in the computation formula risks the two places diverging silently.

Capture them once and reuse:

-            let chunk_bits = bitpacker.num_bits_strictly_sorted(initial.checked_sub(1), chunk_ids);
-            let chunk_size = BitPackerImpl::compressed_block_size(chunk_bits);
+            let chunk_bits =
+                bitpacker.num_bits_strictly_sorted(initial.checked_sub(1), chunk_ids);
+            let chunk_size =
+                BitPackerImpl::compressed_block_size(chunk_bits);
+
+            // Persist so we can use them in the compression loop without
+            // re-deriving.
+            let meta = (chunk_bits as u8, chunk_size as u32);

Store the (bits, size) tuple inside PostingChunk (or a side vector) and read it back in the compression loop.
This reduces cognitive load and guarantees consistency.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73e7b17 and bccc2ce.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • Cargo.toml (2 hunks)
  • lib/posting_list/Cargo.toml (1 hunks)
  • lib/posting_list/src/builder.rs (1 hunks)
  • lib/posting_list/src/iterator.rs (1 hunks)
  • lib/posting_list/src/lib.rs (1 hunks)
  • lib/posting_list/src/posting_list.rs (1 hunks)
  • lib/posting_list/src/tests.rs (1 hunks)
  • lib/posting_list/src/value_handler.rs (1 hunks)
  • lib/posting_list/src/view.rs (1 hunks)
  • lib/posting_list/src/visitor.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/posting_list/src/posting_list.rs (3)
lib/posting_list/src/view.rs (1)
  • visitor (30-32)
lib/posting_list/src/iterator.rs (1)
  • new (16-22)
lib/posting_list/src/visitor.rs (1)
  • new (22-28)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (5)
Cargo.toml (2)

254-262: Workspace member added – remember to run the full CI matrix

Adding lib/posting_list to [workspace].members means all existing CI jobs (clippy, tests, doc, MSRV) will now try to build it.
Make sure the CI config (*.github/workflows, Makefile, etc.) does not exclude the new crate; otherwise the first green build could be misleading.


164-170:

❓ Verification inconclusive

Pinning bitpacking = "0.9.2": please double-check for CVEs & feature mismatches

Good call adding the bitpacking crate, but compressed-integer crates occasionally receive security / correctness fixes (e.g. undefined behaviour on malformed input).
Before merging, please verify that 0.9.2 is still the latest non-yanked version and that no RUSTSEC advisories apply.


🏁 Script executed:

#!/usr/bin/env bash
# Check for advisories & newer versions of bitpacking
cargo install --locked cargo-audit >/dev/null 2>&1
cargo audit -d bitpacking

Length of output: 160


Pinning bitpacking = "0.9.2": please manually verify CVEs & version status

Good call adding the bitpacking crate, but compressed‐integer libraries sometimes receive security or correctness fixes (e.g. UB on malformed input). Before merging:

  • Confirm that 0.9.2 is still the latest non-yanked version on crates.io
  • Ensure no RUSTSEC advisories apply to bitpacking (e.g. by running cargo audit -d bitpacking locally)

File: Cargo.toml
Lines: approximately 164–170

lib/posting_list/src/builder.rs (1)

81-93: Integer-math relies on perfect divisibility – add debug assertions

chunk_bits is recovered via

let chunk_bits = compressed_size * 8 / CHUNK_SIZE;

If an invariant ever changes (e.g. CHUNK_SIZE != 128 or compressed_block_size
starts adding padding bytes), the division will truncate and silently corrupt
the decompressed IDs. A cheap safeguard:

-            let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_SIZE;
+            debug_assert_eq!(
+                (compressed_size * 8) % CHUNK_SIZE,
+                0,
+                "compressed size {compressed_size} is not a multiple of CHUNK_SIZE={CHUNK_SIZE}"
+            );
+            let chunk_bits = compressed_size * 8 / CHUNK_SIZE;
lib/posting_list/src/tests.rs (1)

31-37: ⚠️ Potential issue

Missing std::mem imports break compilation

size_of / size_of_val are used but never brought into scope.

+use std::mem::{size_of, size_of_val};

Without the import the test crate will not compile.

⛔ Skipped due to learnings
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5954
File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228
Timestamp: 2025-03-12T13:40:56.980Z
Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5954
File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228
Timestamp: 2025-03-12T13:40:56.980Z
Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.
lib/posting_list/src/view.rs (1)

128-138: start_chunk precondition can panic in debug builds

debug_assert!(id >= chunks_slice[0].initial_id); assumes that the caller
passes a start_chunk that is not ahead of the sought id. If an
up-stream optimisation guesses a too-large start_chunk, the assertion will
trigger in debug but will silently return wrong results in release.

Either document the contract explicitly or guard defensively:

-        debug_assert!(id >= chunks_slice[0].initial_id);
+        if cfg!(debug_assertions) {
+            assert!(
+                id >= chunks_slice[0].initial_id,
+                "`start_chunk` ({start_chunk}) is past the target id ({id})"
+            );
+        }

Comment on lines +74 to +83
fn get_value<N>(sized_value: Self::Sized, next_sized_value: N, var_data: &[u8]) -> Self::Value
where
N: Fn() -> Option<Self::Sized>,
{
let range = match next_sized_value() {
Some(next_value) => sized_value as usize..next_value as usize,
None => sized_value as usize..var_data.len(),
};
V::from_bytes(&var_data[range])
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Missing bounds checks could panic on corrupted data

get_value trusts both offsets to lie within var_data.
A single‐bit flip in storage would cause a slice-panic and potentially abort the process.

Add debug or release-mode checks:

-        let range = match next_sized_value() {
+        debug_assert!(
+            sized_value as usize <= var_data.len(),
+            "Corrupted posting list: start offset {} ≥ data length {}",
+            sized_value, var_data.len()
+        );
+
+        let range = match next_sized_value() {
             Some(next_value) => sized_value as usize..next_value as usize,
             None => sized_value as usize..var_data.len(),
         };

A graceful error (Result) would be even better, but a debug assert at minimum prevents undefined behaviour.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn get_value<N>(sized_value: Self::Sized, next_sized_value: N, var_data: &[u8]) -> Self::Value
where
N: Fn() -> Option<Self::Sized>,
{
let range = match next_sized_value() {
Some(next_value) => sized_value as usize..next_value as usize,
None => sized_value as usize..var_data.len(),
};
V::from_bytes(&var_data[range])
}
fn get_value<N>(sized_value: Self::Sized, next_sized_value: N, var_data: &[u8]) -> Self::Value
where
N: Fn() -> Option<Self::Sized>,
{
debug_assert!(
sized_value as usize <= var_data.len(),
"Corrupted posting list: start offset {} ≥ data length {}",
sized_value,
var_data.len()
);
let range = match next_sized_value() {
Some(next_value) => sized_value as usize..next_value as usize,
None => sized_value as usize..var_data.len(),
};
V::from_bytes(&var_data[range])
}
🤖 Prompt for AI Agents
In lib/posting_list/src/value_handler.rs around lines 74 to 83, the get_value
function slices var_data using offsets without bounds checks, risking a panic if
data is corrupted. Add debug assertions to verify that sized_value and
next_sized_value offsets are within var_data bounds before slicing. This
prevents undefined behavior by catching invalid offsets early. For a more robust
fix, consider changing the function to return a Result to handle errors
gracefully.

This comment was marked as outdated.

Copy link
Member

@timvisee timvisee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! We reviewed this in a call.

Could you add a test case that checks out of bound values? Below the lower and above the upper bound, and possibly also a value that is in-bound but doesn't exist.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
lib/posting_list/src/view.rs (1)

108-123: Correct handling of chunk decompression

The decompression logic is well-implemented. The use of checked_sub(1) without unwrapping on line 118 is intentional and correct, as the BitPacker::decompress_strictly_sorted function expects an Option<PointOffsetType> as its first parameter.

🧹 Nitpick comments (1)
lib/posting_list/src/view.rs (1)

153-191: Consider adding runtime validation for assumptions in production builds

The find_chunk method includes debug assertions to validate assumptions about the ID being within range, but these are only checked in debug builds. For a public method, consider adding runtime checks in production builds as well.

pub fn find_chunk(&self, id: PointOffsetType, start_chunk: Option<usize>) -> Option<usize> {
    let remainders = self.remainders;
    let chunks = self.chunks;

    if chunks.is_empty() {
        return None;
    }

    // check if id is in the remainders list
    if remainders.first().is_some_and(|elem| id >= elem.id) {
        return None;
    }

    let start_chunk = start_chunk.unwrap_or(0);
    let chunks_slice = &chunks[start_chunk..];
    if chunks_slice.is_empty() {
        return None;
    }

+   // Validate assumptions in production builds as well
+   if id < chunks_slice[0].initial_id || self.last_id.is_some_and(|last_id| id > last_id) {
+       return None;
+   }

    // No need to check if id is under range of posting list,
    // this function assumes it is within the range
    debug_assert!(id >= chunks_slice[0].initial_id);
    debug_assert!(self.last_id.is_some_and(|last_id| id <= last_id));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96f4c23 and c567e61.

📒 Files selected for processing (2)
  • lib/posting_list/src/view.rs (1 hunks)
  • lib/posting_list/src/visitor.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/posting_list/src/visitor.rs
🧰 Additional context used
🧠 Learnings (1)
lib/posting_list/src/view.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.018Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: lint
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: test-consistency
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
  • GitHub Check: test-consensus-compose
🔇 Additional comments (11)
lib/posting_list/src/view.rs (11)

1-21: Well-designed struct with clear separation of concerns and proper use of generics

The PostingListView struct is well designed, using type parameters and lifetimes appropriately. Using a phantom marker for the handler type is a good pattern for maintaining type safety without adding runtime overhead.


23-29: Good design for component extraction

The PostingListComponents struct is a clean way to bundle related components for external use, improving API ergonomics by avoiding multiple return values.


31-47: Clean implementation for ID-only view construction

The from_ids_components method provides a clear way to create an ID-only view, with appropriate initialization of all fields including the empty var_size_data slice.


49-65: Good specialized implementation for sized values

This implementation provides specialized construction for posting lists with sized values, which helps maintain type safety while reusing common patterns.


67-70: Concise visitor creation method

The visitor method cleanly transfers ownership of the view to create a new visitor, which is an appropriate pattern for this use case.


72-89: Effective use of destructuring for component extraction

The components method uses Rust's destructuring syntax effectively to extract all relevant fields from the view.


91-106: Generic construction from components

The from_components method provides a generic way to construct views from components, which works well with the specialized methods for specific use cases.


124-130: Good separation of checked and unchecked accessor methods

Having both checked (sized_values) and unchecked (sized_values_unchecked) methods with appropriate visibility modifiers is a good practice. The unchecked version is correctly marked as pub(crate) to restrict its usage to the crate where safety can be enforced.


132-147: Thorough range checking implementation

The is_in_range method thoroughly checks boundary conditions and handles empty lists gracefully using pattern matching with early returns.


193-197: Efficient binary search for remainders

The search_in_remainders method effectively uses Rust's binary search capabilities and appropriately returns an Option type.


200-207: Straightforward length and emptiness checks

The len and is_empty methods are implemented correctly and provide a clear interface for checking the size of the posting list.

@timvisee
Copy link
Member

Thanks for the additional tests! 🙏

@coszio coszio merged commit 2ca170c into dev May 22, 2025
17 checks passed
@coszio coszio deleted the generalize-compressed-posting-list branch May 22, 2025 16:40
generall pushed a commit that referenced this pull request May 22, 2025
* use chunks_exact in compressed sparse posting list builder

* generalized impls

make both impls similar

new design

make impls similar

generalize impls

* separate into files in new crate

* additional builder fns

new fn for builder

* retrieve current iterator position

* simplify traits

* restructure

* add iterator

* clippy

add contains fn

* fix find_chunk

* move builders to builder

* revamp generics and traits

* add model test vs var-sized posting list

* fmt

* clippy

* improve traits and generics

* fix get_by_offset

* generalize tests

* improve test

* restructure view into a new file

* to and from components

* value handler takes closure to prevent perf penalty

* clippy

* remove unused deps

* revert changes in sparse index

* self review nits

* reword

* clippy

* edit comment

* lock CHUNK_LEN, but assert it is synced with BitPacker4x

* remove Sized constraint in iterator

* rename VarSized* to Unsized*

And add more from_components helpers

* avoid intermediate allocation when writing UnsizedValue

* checked u32 conversions

* fix test

* clippy

* Add debug assertion, assert that point ID is in range

* Bound check in get_by_offset

* improve tests to check against various lengths and offsets

* clippy 😤

---------

Co-authored-by: timvisee <tim@visee.me>
n0x29a pushed a commit that referenced this pull request May 23, 2025
* use chunks_exact in compressed sparse posting list builder

* generalized impls

make both impls similar

new design

make impls similar

generalize impls

* separate into files in new crate

* additional builder fns

new fn for builder

* retrieve current iterator position

* simplify traits

* restructure

* add iterator

* clippy

add contains fn

* fix find_chunk

* move builders to builder

* revamp generics and traits

* add model test vs var-sized posting list

* fmt

* clippy

* improve traits and generics

* fix get_by_offset

* generalize tests

* improve test

* restructure view into a new file

* to and from components

* value handler takes closure to prevent perf penalty

* clippy

* remove unused deps

* revert changes in sparse index

* self review nits

* reword

* clippy

* edit comment

* lock CHUNK_LEN, but assert it is synced with BitPacker4x

* remove Sized constraint in iterator

* rename VarSized* to Unsized*

And add more from_components helpers

* avoid intermediate allocation when writing UnsizedValue

* checked u32 conversions

* fix test

* clippy

* Add debug assertion, assert that point ID is in range

* Bound check in get_by_offset

* improve tests to check against various lengths and offsets

* clippy 😤

---------

Co-authored-by: timvisee <tim@visee.me>
@coderabbitai coderabbitai bot mentioned this pull request Jun 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants