Generalize compressed posting list #6528
Conversation
b683f5c to da4f3a9
@coderabbitai review
Pull Request Overview
This PR generalizes the compressed posting list implementation so it can support both fixed-size and variable-size values while enabling mmap-friendly phrase matching.
- Introduces a new visitor, view, and iterator for the posting list.
- Implements a unified builder pattern with a ValueHandler trait including distinct implementations for fixed-size and variable-size values.
- Adds extensive tests to validate the new behavior.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| lib/posting_list/src/visitor.rs | Implements a visitor to traverse the posting list using cached chunks |
| lib/posting_list/src/view.rs | Provides a view abstraction for decompression and search operations |
| lib/posting_list/src/value_handler.rs | Defines the ValueHandler trait with fixed-size and variable-size implementations |
| lib/posting_list/src/tests.rs | Adds test cases to compare the posting list against a sorted vector |
| lib/posting_list/src/posting_list.rs | Implements posting list view and visitor interface |
| lib/posting_list/src/lib.rs | Exposes modules and re-exports key types for public use |
| lib/posting_list/src/iterator.rs | Implements an iterator over the posting list using the visitor |
| lib/posting_list/src/builder.rs | Provides a unified builder interface for different value types |
| lib/posting_list/Cargo.toml & root Cargo.toml | Updates dependency versions and includes the new posting_list module |
Comments suppressed due to low confidence (1)
lib/posting_list/src/tests.rs:31
- Missing import for std::mem::{size_of, size_of_val}. Please add this at the top of the file to ensure the tests compile correctly.
let chunks_size = size_of_val(&posting_list.chunks[0]);
Actionable comments posted: 7
🧹 Nitpick comments (7)
lib/posting_list/Cargo.toml (2)
6-7: Edition 2024 requires Rust ≥ 1.78 — verify MSRV

The root crate advertises `rust-version = "1.85"`, which is compatible, but if `posting_list` ever gets compiled stand-alone (publish toggled on, or users cherry-pick the crate) the MSRV will implicitly be the first stable supporting edition 2024.

Consider:

    - edition = "2024"
    + edition = "2024"       # requires rustc >= 1.78
    + rust-version = "1.85"  # keep in sync with workspace
12-15: Expose feature flags on transitive deps

`bitpacking` is declared with workspace default features. If any future optimisation (e.g. `simd`) must be enabled/disabled, forwarding a crate-level feature avoids workspace-wide coupling:

    [features]
    default = []
    simd = ["bitpacking/simd"]

Nothing to change now, just flagging for maintainability.
lib/posting_list/src/lib.rs (1)
18-23: `SizedValue` blanket impl missing for all Copy primitives

Right now only `()`, `u32`, and `u64` implement `SizedValue`. If callers want to attach e.g. `i32` or `f32` as fixed-size payloads they must write their own impls.

Consider a blanket implementation:

    -pub trait SizedValue: Sized + Copy + std::fmt::Debug {}
    +pub trait SizedValue: Sized + Copy + std::fmt::Debug {}
    +
    +impl<T> SizedValue for T where T: Sized + Copy + std::fmt::Debug {}

to make the marker fully generic. If the intention is to whitelist only well-tested types, please add a comment explaining the rationale.

lib/posting_list/src/posting_list.rs (1)
61-80: Expose the view as slices instead of `&Vec<_>` references

`PostingList::view()` currently copies the `&Vec<T>` references verbatim. Taking the slice (`&[T]`) instead of the whole container gives read-only semantics, shrinks the fat-pointer size, and clearly communicates that the view is immutable.

    - id_data,
    - chunks,
    + id_data: id_data.as_slice(),
    + chunks: chunks.as_slice(),

(Repeat for `var_size_data` and `remainders`.) Changing the field types in `PostingListView` to slices should be mechanical and will not affect callers.

lib/posting_list/src/iterator.rs (1)
9-13: `current_id` is never read – remove or expose a getter

The iterator updates `current_id` on every `next()` call but never uses it internally nor exposes it publicly. Dead fields add maintenance cost.

• If callers need the last id, provide a getter.
• Otherwise, drop the field to save one word per iterator.

lib/posting_list/src/visitor.rs (1)
52-62: Minor perf: avoid decompressing on every `contains` call

`contains` always calls `self.decompressed_chunk(chunk_index)`, which internally checks and potentially decompresses the chunk. When scanning many ids in the same chunk this check is repeated.

Cache the result locally:

    let decompressed = self.decompressed_chunk(chunk_index);
    decompressed.binary_search(&id).is_ok()

This saves one `Option` comparison per lookup and makes the control flow slightly clearer.

lib/posting_list/src/builder.rs (1)
60-73: Avoid recomputing compression metadata and potential off-by-one bugs

`chunk_bits` and `chunk_size` are computed here, but the same values are recomputed in the second loop (lines 83–92). Besides the extra cycles, any future change in the computation formula risks the two places diverging silently.

Capture them once and reuse:

    - let chunk_bits = bitpacker.num_bits_strictly_sorted(initial.checked_sub(1), chunk_ids);
    - let chunk_size = BitPackerImpl::compressed_block_size(chunk_bits);
    + let chunk_bits =
    +     bitpacker.num_bits_strictly_sorted(initial.checked_sub(1), chunk_ids);
    + let chunk_size =
    +     BitPackerImpl::compressed_block_size(chunk_bits);
    +
    + // Persist so we can use them in the compression loop without
    + // re-deriving.
    + let meta = (chunk_bits as u8, chunk_size as u32);

Store the `(bits, size)` tuple inside `PostingChunk` (or a side vector) and read it back in the compression loop. This reduces cognitive load and guarantees consistency.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (10)
- Cargo.toml (2 hunks)
- lib/posting_list/Cargo.toml (1 hunks)
- lib/posting_list/src/builder.rs (1 hunks)
- lib/posting_list/src/iterator.rs (1 hunks)
- lib/posting_list/src/lib.rs (1 hunks)
- lib/posting_list/src/posting_list.rs (1 hunks)
- lib/posting_list/src/tests.rs (1 hunks)
- lib/posting_list/src/value_handler.rs (1 hunks)
- lib/posting_list/src/view.rs (1 hunks)
- lib/posting_list/src/visitor.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/posting_list/src/posting_list.rs (3)
- lib/posting_list/src/view.rs (1): visitor (30-32)
- lib/posting_list/src/iterator.rs (1): new (16-22)
- lib/posting_list/src/visitor.rs (1): new (22-28)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: test-consistency
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: storage-compat-test
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (5)
Cargo.toml (2)
254-262: Workspace member added – remember to run the full CI matrix

Adding `lib/posting_list` to `[workspace].members` means all existing CI jobs (clippy, tests, doc, MSRV) will now try to build it. Make sure the CI config (.github/workflows, Makefile, etc.) does not exclude the new crate; otherwise the first green build could be misleading.
164-170: ❓ Verification inconclusive

Pinning `bitpacking = "0.9.2"`: please double-check for CVEs & feature mismatches

Good call adding the `bitpacking` crate, but compressed-integer crates occasionally receive security / correctness fixes (e.g. undefined behaviour on malformed input). Before merging, please verify that 0.9.2 is still the latest non-yanked version and that no RUSTSEC advisories apply.

🏁 Script executed:

    #!/usr/bin/env bash
    # Check for advisories & newer versions of bitpacking
    cargo install --locked cargo-audit >/dev/null 2>&1
    cargo audit -d bitpacking

Length of output: 160

Pinning `bitpacking = "0.9.2"`: please manually verify CVEs & version status

Good call adding the `bitpacking` crate, but compressed-integer libraries sometimes receive security or correctness fixes (e.g. UB on malformed input). Before merging:

- Confirm that 0.9.2 is still the latest non-yanked version on crates.io
- Ensure no RUSTSEC advisories apply to `bitpacking` (e.g. by running `cargo audit -d bitpacking` locally)

File: Cargo.toml
Lines: approximately 164–170

lib/posting_list/src/builder.rs (1)
81-93: Integer-math relies on perfect divisibility – add debug assertions

`chunk_bits` is recovered via

    let chunk_bits = compressed_size * 8 / CHUNK_SIZE;

If an invariant ever changes (e.g. `CHUNK_SIZE` != 128 or `compressed_block_size` starts adding padding bytes), the division will truncate and silently corrupt the decompressed IDs. A cheap safeguard:

    - let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_SIZE;
    + debug_assert_eq!(
    +     (compressed_size * 8) % CHUNK_SIZE,
    +     0,
    +     "compressed size {compressed_size} is not a multiple of CHUNK_SIZE={CHUNK_SIZE}"
    + );
    + let chunk_bits = compressed_size * 8 / CHUNK_SIZE;
31-37: ⚠️ Potential issue

Missing `std::mem` imports break compilation

`size_of` / `size_of_val` are used but never brought into scope.

    +use std::mem::{size_of, size_of_val};

Without the import the test crate will not compile.
⛔ Skipped due to learnings
Learnt from: JojiiOfficial PR: qdrant/qdrant#5954 File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228 Timestamp: 2025-03-12T13:40:56.980Z Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.
lib/posting_list/src/view.rs (1)
128-138: `start_chunk` precondition can panic in debug builds

    debug_assert!(id >= chunks_slice[0].initial_id);

assumes that the caller passes a `start_chunk` that is not ahead of the sought `id`. If an up-stream optimisation guesses a too-large `start_chunk`, the assertion will trigger in debug but will silently return wrong results in release. Either document the contract explicitly or guard defensively:

    - debug_assert!(id >= chunks_slice[0].initial_id);
    + if cfg!(debug_assertions) {
    +     assert!(
    +         id >= chunks_slice[0].initial_id,
    +         "`start_chunk` ({start_chunk}) is past the target id ({id})"
    +     );
    + }
    fn get_value<N>(sized_value: Self::Sized, next_sized_value: N, var_data: &[u8]) -> Self::Value
    where
        N: Fn() -> Option<Self::Sized>,
    {
        let range = match next_sized_value() {
            Some(next_value) => sized_value as usize..next_value as usize,
            None => sized_value as usize..var_data.len(),
        };
        V::from_bytes(&var_data[range])
    }
🛠️ Refactor suggestion
Missing bounds checks could panic on corrupted data
`get_value` trusts both offsets to lie within `var_data`. A single-bit flip in storage would cause a slice-panic and potentially abort the process.
Add debug or release-mode checks:
- let range = match next_sized_value() {
+ debug_assert!(
+ sized_value as usize <= var_data.len(),
+ "Corrupted posting list: start offset {} ≥ data length {}",
+ sized_value, var_data.len()
+ );
+
+ let range = match next_sized_value() {
Some(next_value) => sized_value as usize..next_value as usize,
None => sized_value as usize..var_data.len(),
};
A graceful error (`Result`) would be even better, but a debug assert at minimum prevents undefined behaviour.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    fn get_value<N>(sized_value: Self::Sized, next_sized_value: N, var_data: &[u8]) -> Self::Value
    where
        N: Fn() -> Option<Self::Sized>,
    {
        debug_assert!(
            sized_value as usize <= var_data.len(),
            "Corrupted posting list: start offset {} ≥ data length {}",
            sized_value,
            var_data.len()
        );
        let range = match next_sized_value() {
            Some(next_value) => sized_value as usize..next_value as usize,
            None => sized_value as usize..var_data.len(),
        };
        V::from_bytes(&var_data[range])
    }
🤖 Prompt for AI Agents
In lib/posting_list/src/value_handler.rs around lines 74 to 83, the get_value
function slices var_data using offsets without bounds checks, risking a panic if
data is corrupted. Add debug assertions to verify that sized_value and
next_sized_value offsets are within var_data bounds before slicing. This
prevents undefined behavior by catching invalid offsets early. For a more robust
fix, consider changing the function to return a Result to handle errors
gracefully.
make both impls similar new design make impls similar generalize impls
new fn for builder
Thanks! We reviewed this in a call.
Could you add a test case that checks out-of-bounds values? Below the lower and above the upper bound, and possibly also a value that is in-bound but doesn't exist.
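For illustration, a minimal sketch of what such a test could look like. The builder and constructor names here are assumptions, not the crate's actual API; only `contains(point_id)` is named elsewhere in this PR:

```rust
// Hypothetical test sketch; `PostingListBuilder`, `add`, `build` and `visitor`
// are assumed names for illustration only.
#[test]
fn contains_rejects_out_of_bounds_ids() {
    let ids = [10u32, 20, 30];

    // Build a posting list of ids with unit values (no payload).
    let mut builder = PostingListBuilder::new();
    for id in ids {
        builder.add(id, ());
    }
    let posting_list = builder.build();
    let mut visitor = posting_list.visitor();

    // Existing ids are found.
    for id in ids {
        assert!(visitor.contains(id));
    }

    // Below the lower bound, above the upper bound, and an in-bound gap.
    assert!(!visitor.contains(5));
    assert!(!visitor.contains(31));
    assert!(!visitor.contains(25));
}
```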
Actionable comments posted: 0
♻️ Duplicate comments (1)
lib/posting_list/src/view.rs (1)
108-123: Correct handling of chunk decompression

The decompression logic is well-implemented. The use of `checked_sub(1)` without unwrapping on line 118 is intentional and correct, as the `BitPacker::decompress_strictly_sorted` function expects an `Option<PointOffsetType>` as its first parameter.
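For context, a minimal round-trip sketch of the strictly-sorted bitpacking API discussed above, assuming the `bitpacking` 0.9 crate with `BitPacker4x` (the `compress_strictly_sorted` call is inferred by symmetry with the functions named in this review, so treat it as illustrative rather than authoritative):

```rust
use bitpacking::{BitPacker, BitPacker4x};

// Round-trips one block of strictly increasing ids through the bitpacker.
// `initial` is an Option: None for the first block, Some(last id of the
// previous block) otherwise, which is why `checked_sub(1)` fits naturally.
fn roundtrip(ids: &[u32], initial: Option<u32>) -> Vec<u32> {
    assert_eq!(ids.len(), BitPacker4x::BLOCK_LEN);
    let bitpacker = BitPacker4x::new();

    let num_bits = bitpacker.num_bits_strictly_sorted(initial, ids);
    let mut compressed = vec![0u8; BitPacker4x::compressed_block_size(num_bits)];
    bitpacker.compress_strictly_sorted(initial, ids, &mut compressed, num_bits);

    let mut decompressed = vec![0u32; BitPacker4x::BLOCK_LEN];
    bitpacker.decompress_strictly_sorted(initial, &compressed, &mut decompressed, num_bits);
    decompressed
}
```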
🧹 Nitpick comments (1)
lib/posting_list/src/view.rs (1)
153-191: Consider adding runtime validation for assumptions in production builds

The `find_chunk` method includes debug assertions to validate assumptions about the ID being within range, but these are only checked in debug builds. For a public method, consider adding runtime checks in production builds as well.

    pub fn find_chunk(&self, id: PointOffsetType, start_chunk: Option<usize>) -> Option<usize> {
        let remainders = self.remainders;
        let chunks = self.chunks;

        if chunks.is_empty() {
            return None;
        }

        // check if id is in the remainders list
        if remainders.first().is_some_and(|elem| id >= elem.id) {
            return None;
        }

        let start_chunk = start_chunk.unwrap_or(0);
        let chunks_slice = &chunks[start_chunk..];
        if chunks_slice.is_empty() {
            return None;
        }

    +   // Validate assumptions in production builds as well
    +   if id < chunks_slice[0].initial_id || self.last_id.is_some_and(|last_id| id > last_id) {
    +       return None;
    +   }

        // No need to check if id is under range of posting list,
        // this function assumes it is within the range
        debug_assert!(id >= chunks_slice[0].initial_id);
        debug_assert!(self.last_id.is_some_and(|last_id| id <= last_id));
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- lib/posting_list/src/view.rs (1 hunks)
- lib/posting_list/src/visitor.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/posting_list/src/visitor.rs
🧰 Additional context used
🧠 Learnings (1)
lib/posting_list/src/view.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.018Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: lint
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: test-consistency
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: test-consensus-compose
🔇 Additional comments (11)
lib/posting_list/src/view.rs (11)
1-21: Well-designed struct with clear separation of concerns and proper use of generics

The `PostingListView` struct is well designed, using type parameters and lifetimes appropriately. Using a phantom marker for the handler type is a good pattern for maintaining type safety without adding runtime overhead.

23-29: Good design for component extraction

The `PostingListComponents` struct is a clean way to bundle related components for external use, improving API ergonomics by avoiding multiple return values.

31-47: Clean implementation for ID-only view construction

The `from_ids_components` method provides a clear way to create an ID-only view, with appropriate initialization of all fields including the empty `var_size_data` slice.

49-65: Good specialized implementation for sized values

This implementation provides specialized construction for posting lists with sized values, which helps maintain type safety while reusing common patterns.

67-70: Concise visitor creation method

The `visitor` method cleanly transfers ownership of the view to create a new visitor, which is an appropriate pattern for this use case.

72-89: Effective use of destructuring for component extraction

The `components` method uses Rust's destructuring syntax effectively to extract all relevant fields from the view.

91-106: Generic construction from components

The `from_components` method provides a generic way to construct views from components, which works well with the specialized methods for specific use cases.

124-130: Good separation of checked and unchecked accessor methods

Having both checked (`sized_values`) and unchecked (`sized_values_unchecked`) methods with appropriate visibility modifiers is a good practice. The unchecked version is correctly marked as `pub(crate)` to restrict its usage to the crate where safety can be enforced.

132-147: Thorough range checking implementation

The `is_in_range` method thoroughly checks boundary conditions and handles empty lists gracefully using pattern matching with early returns.

193-197: Efficient binary search for remainders

The `search_in_remainders` method effectively uses Rust's binary search capabilities and appropriately returns an `Option` type.

200-207: Straightforward length and emptiness checks

The `len` and `is_empty` methods are implemented correctly and provide a clear interface for checking the size of the posting list.
Thanks for the additional tests! 🙏
* use chunks_exact in compressed sparse posting list builder
* generalized impls make both impls similar new design make impls similar generalize impls
* separate into files in new crate
* additional builder fns new fn for builder
* retrieve current iterator position
* simplify traits
* restructure
* add iterator
* clippy add contains fn
* fix find_chunk
* move builders to builder
* revamp generics and traits
* add model test vs var-sized posting list
* fmt
* clippy
* improve traits and generics
* fix get_by_offset
* generalize tests
* improve test
* restructure view into a new file
* to and from components
* value handler takes closure to prevent perf penalty
* clippy
* remove unused deps
* revert changes in sparse index
* self review nits
* reword
* clippy
* edit comment
* lock CHUNK_LEN, but assert it is synced with BitPacker4x
* remove Sized constraint in iterator
* rename VarSized* to Unsized* And add more from_components helpers
* avoid intermediate allocation when writing UnsizedValue
* checked u32 conversions
* fix test
* clippy
* Add debug assertion, assert that point ID is in range
* Bound check in get_by_offset
* improve tests to check against various lengths and offsets
* clippy 😤

---------

Co-authored-by: timvisee <tim@visee.me>
(sorry for the huge PR)
This posting list implementation should be able to replace (at least) the current one used in the full-text index, and extend it so that it is possible to implement mmap-friendly phrase matching.
The main idea is that users of this crate can implement one of these two traits for their types:

- `SizedValue` - which is just a marker for a `Sized + Copy` type
- `VarSizedValue` - which needs the type to implement functions to serialize and deserialize into/from bytes

And just with this, they get the functionality of a compressed posting list, where the ids are compressed and each id has a value associated with it.
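As an illustration, here is a hypothetical implementation of the variable-size trait for a phrase-matching-style payload. Only `from_bytes` is visible in the review snippets above; the trait name and the serialization method shown here are assumptions:

```rust
// Hypothetical sketch: `to_bytes` is an assumed serialization method;
// only `from_bytes` appears in the review snippets.
#[derive(Debug, Clone, PartialEq)]
struct Positions(Vec<u32>);

impl VarSizedValue for Positions {
    /// Serialize the token positions as little-endian u32s.
    fn to_bytes(&self) -> Vec<u8> {
        self.0.iter().flat_map(|p| p.to_le_bytes()).collect()
    }

    /// Deserialize from the byte range stored in the posting list.
    fn from_bytes(bytes: &[u8]) -> Self {
        Positions(
            bytes
                .chunks_exact(4)
                .map(|chunk| u32::from_le_bytes(chunk.try_into().unwrap()))
                .collect(),
        )
    }
}
```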
If there is no associated value, the `()` (unit) type is used, and there should be no performance or space penalty from this.

The visitor interface currently only includes the `contains(point_id)` and `get_by_offset(offset)` methods, but it can also be turned into an iterator.
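A rough usage sketch under the same assumptions (builder, constructor, and iterator names are illustrative, not the crate's actual API; only `contains` and `get_by_offset` are named above):

```rust
// Hypothetical usage sketch; `PostingListBuilder`, `add`, `build`, `visitor`
// and `iter` are assumed names based on the description above.
let mut builder = PostingListBuilder::new();
builder.add(2, Positions(vec![0, 7]));
builder.add(5, Positions(vec![3]));
let posting_list = builder.build();

// Visitor-style access: membership check and access by offset.
let mut visitor = posting_list.visitor();
assert!(visitor.contains(5));
let _first = visitor.get_by_offset(0);

// Or iterate over the elements in id order.
for element in posting_list.iter() {
    println!("{element:?}");
}
```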