-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Full text index with generic posting list #6565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
eadb3fb
to
e708e96
Compare
let smallest_posting_idx = postings | ||
.iter() | ||
.enumerate() | ||
.min_by_key(|(_idx, posting)| posting.len()) | ||
.map(|(idx, _posting)| idx) | ||
.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but have we considered sorting all posting lists, so that the smallest posting lists are checked first?
5272bfc
to
c2d4339
Compare
improve iterator intersection
c2d4339
to
13d9138
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is a copy of mmap_postings.rs
before the changes in this PR.
This file, along with the entire compressed_posting
module can safely be deleted later. For now we keep them to ensure compatibility of file format
📝 Walkthrough""" WalkthroughThis change refactors the posting list implementation in the full-text index system by introducing a generic, zero-copy, little-endian-compliant posting list structure using the Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (13)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
🔭 Outside diff range comments (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
1-17
:⚠️ Potential issueMissing
size_of
import mirrors the issue in the legacy file
size_of
is referenced dozens of times but never brought into scope.use std::io::Write; use std::path::PathBuf; +use std::mem::size_of; // ← add
🧹 Nitpick comments (8)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1)
51-59
: Consider using test assertion libraries for better error messages.While the current assertion is functional, consider using a testing library like
pretty_assertions
for more readable output when assertions fail, especially when comparing complex data structures.- assert!( - id_model == id_old && id_model == elem_new.id, - "Mismatch at token_id {token_id}: offset: {offset}, old: {id_old}, new: {}, model: {id_model}", - elem_new.id - ); + use pretty_assertions::assert_eq; + assert_eq!(id_model, id_old, "Model vs Old mismatch at token_id {token_id}, offset: {offset}"); + assert_eq!(id_model, elem_new.id, "Model vs New mismatch at token_id {token_id}, offset: {offset}");lib/posting_list/src/visitor.rs (1)
159-163
: Unnecessary temporary when re-emitting remainder elementYou allocate
id
then immediately callid.get()
; the binding is redundant.-let id = e.id; -… -PostingElement { id: id.get(), value } +PostingElement { + id: e.id.get(), + value, +}lib/posting_list/src/iterator.rs (1)
70-77
: Avoid double-cloningPostingElement
innext()
next()
currently clones the element once to feedcurrent_elem
and again to hand the value to the caller (Option::inspect
+clone
).
ForH::Value
types larger than zero bytes this is unnecessary churn.A lighter variant keeps a reference or performs a single move:
- let next_opt = self.visitor.get_by_offset(self.offset).inspect(|_| { - self.offset += 1; - }); - self.current_elem = next_opt.clone(); - next_opt + let next_opt = self.visitor.get_by_offset(self.offset); + if let Some(ref elem) = next_opt { + self.offset += 1; + self.current_elem = Some(elem.clone()); + } + next_optThis preserves semantics while eliminating one clone per element.
lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (1)
24-30
: Rename thefilter
parameter for clarityInside the iterator pipeline the parameter
filter
is invoked inside a.filter(...)
adaptor, which easily leads to “filter inside filter” mental gymnastics.
Consider renaming the argument to something likedoc_filter
orpre_filter
to make the intent crystal-clear.-fn intersect_compressed_postings_iterator<'a>( - mut postings: Vec<IdsPostingListView<'a>>, - filter: impl Fn(PointOffsetType) -> bool + 'a, +fn intersect_compressed_postings_iterator<'a>( + mut postings: Vec<IdsPostingListView<'a>>, + doc_filter: impl Fn(PointOffsetType) -> bool + 'a,…and update the usage accordingly.
lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (1)
164-173
: Builder loop could be replaced by a zero-copy or bulk constructorThe conversion allocates a new
PostingBuilder
for every posting and pushes IDs one-by-one:for id in posting.iter() { builder.add_id(id); }If
PostingBuilder
(orIdsPostingList
) already provides afrom_iter
,extend
, or zero-copyfrom_sorted_slice
API, using it will:
- Avoid the per-ID method dispatch overhead.
- Allow the builder to pre-allocate the exact capacity, reducing reallocations.
- Shorten the conversion code.
Consider exposing such an API in
posting_list
and using it here.lib/posting_list/src/posting_list.rs (1)
105-110
:iter()
forcesClone
unnecessarilyRequiring
H::Value: Clone
prevents iterating over lists whose values are non-cloneable (e.g.,Box<dyn …>
).
Consider yielding&H::Value
(orCow
) fromPostingIterator
so that cloning is not mandatory.
This widens API usability without performance drawbacks forCopy
types.lib/posting_list/src/view.rs (2)
68-78
:IntoIterator
imposes aClone
bound on values – is it necessary?Requiring
H::Value: Clone
propagates to every caller that wants to iterate
over aPostingListView
, even if they only need borrowed data. If the value
type is large (e.g. a weighted payload), the extraClone
can be costly or
impossible.If the iterator already yields references (or
Copy
types), you can drop the
bound and avoid the constraint:-impl<'a, H: ValueHandler> IntoIterator for PostingListView<'a, H> -where - H::Value: Clone, +impl<'a, H: ValueHandler> IntoIterator for PostingListView<'a, H>
85-96
:to_owned
performs full allocations – consider zero-copy alternativesCloning
id_data
,var_size_data
, and the entirechunks
/remainders
arrays can be expensive for large posting lists. If the only goal is to
obtain an owned version that lives longer than the source slice, a cheap
Arc<[u8]>
/Arc<[T]>
(orBox<[_]>
) could provide ownership without data
duplication.This will reduce peak memory and improve conversion speed for large indexes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (20)
lib/posting_list/Cargo.toml
(1 hunks)lib/posting_list/src/builder.rs
(4 hunks)lib/posting_list/src/iterator.rs
(2 hunks)lib/posting_list/src/lib.rs
(2 hunks)lib/posting_list/src/posting_list.rs
(4 hunks)lib/posting_list/src/tests.rs
(1 hunks)lib/posting_list/src/value_handler.rs
(3 hunks)lib/posting_list/src/view.rs
(9 hunks)lib/posting_list/src/visitor.rs
(5 hunks)lib/segment/Cargo.toml
(1 hunks)lib/segment/src/index/field_index/full_text_index/compressed_posting/compressed_chunks_reader.rs
(0 hunks)lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs
(5 hunks)lib/segment/src/index/field_index/full_text_index/inverted_index.rs
(2 hunks)lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs
(7 hunks)lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs
(4 hunks)lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs
(1 hunks)lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs
(1 hunks)lib/segment/src/index/field_index/full_text_index/mod.rs
(1 hunks)lib/segment/src/index/field_index/full_text_index/posting_list.rs
(0 hunks)lib/segment/src/index/field_index/full_text_index/postings_iterator.rs
(5 hunks)
💤 Files with no reviewable changes (2)
- lib/segment/src/index/field_index/full_text_index/posting_list.rs
- lib/segment/src/index/field_index/full_text_index/compressed_posting/compressed_chunks_reader.rs
🧰 Additional context used
🧬 Code Graph Analysis (3)
lib/posting_list/src/lib.rs (2)
lib/posting_list/src/posting_list.rs (2)
view
(80-98)visitor
(100-103)lib/posting_list/src/view.rs (1)
visitor
(81-83)
lib/posting_list/src/visitor.rs (3)
lib/posting_list/src/posting_list.rs (1)
is_empty
(116-118)lib/posting_list/src/view.rs (1)
is_empty
(230-232)lib/segment/src/index/field_index/full_text_index/posting_list.rs (1)
is_empty
(34-36)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (4)
lib/posting_list/src/posting_list.rs (2)
visitor
(100-103)iter
(105-110)lib/posting_list/src/view.rs (1)
visitor
(81-83)lib/posting_list/src/visitor.rs (1)
contains
(100-118)lib/segment/src/index/field_index/full_text_index/posting_list.rs (2)
contains
(39-41)iter
(44-46)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: test-consensus-compose
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consistency
- GitHub Check: integration-tests-consensus
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (26)
lib/posting_list/Cargo.toml (1)
15-15
: Good addition of zerocopy dependency.Adding the zerocopy crate as a dependency aligns with the PR's goal of implementing a generic, zero-copy posting list structure. This will enable more efficient memory layout guarantees and safer serialization/deserialization operations.
lib/segment/Cargo.toml (1)
120-120
: Appropriate integration of posting_list crate.Adding the posting_list crate as a dependency is a good approach for integrating the new generic posting list implementation. This dependency enables the segment library to use the improved posting list abstractions and data structures.
lib/segment/src/index/field_index/full_text_index/mod.rs (1)
11-12
: Good approach for maintaining backward compatibility.Adding the
#[cfg(test)]
attribute to the compressed_posting module is a sound strategy during this transition phase. It ensures the old implementation is available for compatibility tests while removing it from production code, aligning with the PR objective of replacing the compressed posting list implementation while maintaining compatibility.lib/posting_list/src/tests.rs (1)
112-112
: Appropriate trait bound addition for testing.Adding the
std::fmt::Debug
trait bound toH::Value
in the test function is beneficial for producing meaningful assertion failure messages. This aligns with the broader changes whereDebug
was removed from production code trait bounds but kept for testing purposes.lib/posting_list/src/value_handler.rs (4)
14-14
: Appropriate removal of Debug constraint.Removing the
Debug
trait bound fromValue
type makes theValueHandler
trait more flexible and less restrictive, which is good practice for generic abstractions.
16-16
: Appropriate removal of Debug constraint.Similar to the
Value
type, removing theDebug
bound fromSized
type increases flexibility while maintaining the essentialSized + Copy
bounds needed for the type's functionality.
34-34
: Good addition of trait derivations.Adding
#[derive(Debug, Clone, Copy)]
toSizedHandler
ensures the struct implements these traits, which is appropriate since it's meant to be debuggable and copyable.
54-54
: Good addition of trait derivations.Adding
#[derive(Debug, Clone, Copy)]
toUnsizedHandler
ensures the struct implements these traits, which is appropriate since it's meant to be debuggable and copyable.lib/posting_list/src/lib.rs (3)
18-18
: Appropriate removal of Debug constraint.Removing the
Debug
trait bound fromSizedValue
aligns with the changes in value_handler.rs, creating a more flexible trait that focuses only on the essential requirements (Sized + Copy
).
24-24
: Appropriate removal of Debug constraint.Removing the
Debug
trait bound fromUnsizedValue
makes this trait more flexible and consistent with the changes to other traits in this refactoring.
51-54
: Good expansion of public API.Exporting additional types (
RemainderPosting
,ValueHandler
,PostingListComponents
, andPostingVisitor
) appropriately expands the public API surface to support the new generic posting list architecture. This enables users of the library to work with these abstractions directly.lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4)
13-13
: Good replacement of ChunkReader import.Replacing the import of
ChunkReader
withIdsPostingListView
correctly implements the refactoring goal of using the generic posting list instead of the specialized compressed posting list.
24-28
: Good approach for maintaining backward compatibility.Adding test-only modules for old implementations is a prudent approach for ensuring backward compatibility during this refactoring. This aligns with the PR objective of maintaining compatibility with existing data structures.
137-137
: Appropriate return type update.Changing the return type of
iter_postings
toimpl Iterator<Item = Option<IdsPostingListView<'a>>>
correctly reflects the refactoring to use the generic posting list types.
282-283
: Method call updated correctly.Changing from a method like
.reader()
to.visitor()
properly adapts to the new API of the generic posting list. Thevisitor()
method provides access to the functionality needed to check if a posting list contains a specific point ID.lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1)
13-61
: Well-designed compatibility test.This test effectively verifies compatibility between old and new posting list implementations by:
- Generating random sets of document IDs with various lengths
- Creating posting lists using the old implementation
- Writing them to a file using the old format
- Reading them back using both old and new implementations
- Verifying that both implementations return identical IDs
This comprehensive test is crucial for ensuring backward compatibility during this refactoring.
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (1)
289-293
: Borrow checker issue mirrors the previous one
new_posting.iter()
yieldsPostingElement
; you then map theid
and call
orig_posting.contains(point_id)
, which is fine.
However theid
field is aU32
wrapper in the new code. Call.get()
to
obtain aPointOffsetType
explicitly and make the intent obvious:- .map(|elem| elem.id) + .map(|elem| elem.id.get())lib/posting_list/src/builder.rs (1)
83-87
: Overflow check forPointOffsetType → U32
conversion
id
is aPointOffsetType
(aliased tou32
today) but the codebase has
started experimenting withu64
in places. If that change lands, a silent
wrap-around here would corrupt postings.Add an explicit check:
-remainders.push(RemainderPosting { - id: U32::from(id), +remainders.push(RemainderPosting { + id: U32::from( + u32::try_from(id).expect("point id must fit into 32 bits") + ), value, });lib/posting_list/src/visitor.rs (3)
33-35
:is_empty
is fine but duplicate oflen() == 0
The helper is handy; consider delegating to
self.len() == 0
or marking the
method#[inline]
to make the intention explicit.
No action required.
50-98
: Edge-case:search_greater_or_equal
skips the first remainderWhen
id
is smaller than every element in the posting list
(find_chunk
returnsNone
) the remainder search starts immediately.
If the list contains chunks,search_in_remainders
looks into remainders only,
so the very first element (initial id of chunk 0) is ignored.Consider short-circuiting before remainder search:
// after `find_chunk` if chunk_index.is_none() { // the first element in the list is the smallest id return Some(0); }Otherwise a query for an id below the posting list range yields
None
instead
of offset 0.
167-170
: ExtraClone
bound correctly addedThe iterator now yields cloned values – the new bound is required and properly
scoped.
Looks good.lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (1)
42-55
: Duplicate document IDs can slip through the intersectionThe algorithm emits one value per occurrence in the smallest posting list.
If that list contains duplicates, the resulting stream will also contain duplicates even though an intersection is typically expected to behave like a mathematical set.If duplicates are not meaningful for downstream consumers, deduplicate here:
let and_iter = smallest_posting_iterator .map(|elem| elem.id) .dedup() // from itertools, or roll your own simple `scan`-based dedup .filter(move |id| { /* unchanged */ })Alternatively, guarantee uniqueness earlier when building
IdsPostingList
.lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs (1)
1-13
:⚠️ Potential issue
size_of
is used but never imported – compilation will fail
size_of::<T>()
is referenced multiple times in the file, yetstd::mem::size_of
is not in scope.
Rust will not implicitly import it, producing anunresolved import
error during build.use std::io::Write; use std::path::PathBuf; +use std::mem::size_of; // ← add // …
⛔ Skipped due to learnings
Learnt from: JojiiOfficial PR: qdrant/qdrant#5954 File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228 Timestamp: 2025-03-12T13:40:56.980Z Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.
Learnt from: JojiiOfficial PR: qdrant/qdrant#5954 File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228 Timestamp: 2025-03-12T13:40:56.980Z Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
181-187
: Risk of silent overflow whenremainder_count
exceeds 255
remainder_count
is stored asu8
, butremainders.len()
is cast without a check.
While current builders keep remainders below 128 (CHUNK_LEN
), a future change could exceed this and silently wrap.Consider changing the header field to
u16
(and adjust serialization accordingly) or at least assert:assert!(remainders.len() <= u8::MAX as usize);lib/posting_list/src/posting_list.rs (1)
112-118
:len()
can over-report when the last chunk is not completely filled
len()
assumes every stored chunk containsCHUNK_LEN
elements:self.chunks.len() * CHUNK_LEN + self.remainders.len()If the builder ever stores a partially-filled chunk (e.g., for an all-full compression path in the future),
this will overestimate. Either document that each chunk is guaranteed full, or compute the actual count
by decoding the last chunk’s header/offset.lib/posting_list/src/view.rs (1)
24-30
: Type asymmetry forlast_id
between view and components is confusing
PostingListView
storeslast_id
asOption<PointOffsetType>
while
PostingListComponents
usesOption<U32>
, requiring ad-hoc conversions in
both directions. Aligning these types (e.g. usingU32
everywhere) would
simplify the API and prevent accidental loss of endianness guarantees.
lib/segment/src/index/field_index/full_text_index/inverted_index.rs
Outdated
Show resolved
Hide resolved
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs
Show resolved
Hide resolved
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs
Outdated
Show resolved
Hide resolved
lib/posting_list/src/view.rs
Outdated
let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN; | ||
BitPackerImpl::new().decompress_strictly_sorted( | ||
chunk.initial_id.checked_sub(1), | ||
&self.id_data[chunk.offset as usize..chunk.offset as usize + compressed_size], | ||
BitPackerImpl::new().decompress_sorted( | ||
chunk.initial_id.get(), | ||
&self.id_data | ||
[chunk.offset.get() as usize..chunk.offset.get() as usize + compressed_size], | ||
decompressed_chunk, | ||
chunk_bits as u8, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add overflow / range-check when recovering chunk_bits
chunk_bits
is reconstructed via integer arithmetic and then cast to u8
.
If the compressed size is corrupted or larger than expected, the computed
value may exceed u8::MAX
, leading to silent truncation and UB-adjacent
behaviour inside bitpacking
.
Consider guarding the cast in debug builds (or even release) to fail fast:
- let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
+ let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
+ debug_assert!(
+ chunk_bits <= u8::MAX as usize,
+ "bit width {} exceeds u8::MAX – corrupted posting list?",
+ chunk_bits
+ );
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN; | |
BitPackerImpl::new().decompress_strictly_sorted( | |
chunk.initial_id.checked_sub(1), | |
&self.id_data[chunk.offset as usize..chunk.offset as usize + compressed_size], | |
BitPackerImpl::new().decompress_sorted( | |
chunk.initial_id.get(), | |
&self.id_data | |
[chunk.offset.get() as usize..chunk.offset.get() as usize + compressed_size], | |
decompressed_chunk, | |
chunk_bits as u8, | |
); | |
let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN; | |
+ debug_assert!( | |
+ chunk_bits <= u8::MAX as usize, | |
+ "bit width {} exceeds u8::MAX – corrupted posting list?", | |
+ chunk_bits | |
+ ); | |
BitPackerImpl::new().decompress_sorted( | |
chunk.initial_id.get(), | |
&self.id_data | |
[chunk.offset.get() as usize..chunk.offset.get() as usize + compressed_size], | |
decompressed_chunk, | |
chunk_bits as u8, | |
); |
🤖 Prompt for AI Agents
In lib/posting_list/src/view.rs around lines 142 to 149, the calculation of
chunk_bits involves integer arithmetic and is cast to u8 without checks, risking
overflow and silent truncation if compressed_size is corrupted or too large. Add
a range check before casting chunk_bits to u8 to ensure it does not exceed
u8::MAX, and handle the error appropriately, such as by panicking in debug
builds or returning a Result error, to fail fast and avoid undefined behavior.
protects against the case where the iterator's current element hasn't been extracted yet
3566915
to
726975b
Compare
726975b
to
1b12a92
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
lib/posting_list/src/view.rs (1)
143-143
: Add overflow protection when calculating chunk_bits.This calculation can overflow and cause silent truncation if
compressed_size
is corrupted or unexpectedly large, leading to undefined behavior in bitpacking operations.The same overflow issue flagged in previous reviews still exists. Consider adding the suggested range check:
let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN; +debug_assert!( + chunk_bits <= u8::MAX as usize, + "bit width {} exceeds u8::MAX – corrupted posting list?", + chunk_bits +);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
lib/posting_list/src/iterator.rs
(2 hunks)lib/posting_list/src/posting_list.rs
(4 hunks)lib/posting_list/src/tests.rs
(9 hunks)lib/posting_list/src/view.rs
(9 hunks)lib/posting_list/src/visitor.rs
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- lib/posting_list/src/tests.rs
- lib/posting_list/src/posting_list.rs
- lib/posting_list/src/iterator.rs
- lib/posting_list/src/visitor.rs
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.482Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
lib/posting_list/src/view.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.482Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-low-resources
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: storage-compat-test
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: lint
- GitHub Check: test-consistency
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests
- GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (5)
lib/posting_list/src/view.rs (5)
69-79
: IntoIterator implementation looks well-designed.The IntoIterator implementation properly delegates to the visitor pattern and includes appropriate trait bounds requiring
H::Value: Clone
. This maintains consistency with the iterator design.
86-97
: Good design choice for to_owned method.The comment correctly explains why this isn't implemented as the
ToOwned
trait - the borrow relationship would be problematic. The implementation properly clones all components to create an ownedPostingList
.
145-153
: Correct usage of zerocopy types in decompress_chunk.The changes properly use
.get()
methods to access fields from zerocopy types (U32
), ensuring correct little-endian byte order handling. The slice bounds are calculated correctly.
163-174
: Improved type safety in ids_range method.The return type change to
RangeInclusive<u32>
provides better type safety. The logic correctly handles empty posting lists and properly extracts IDs using.get()
methods from zerocopy types.
184-219
: Consistent zerocopy type usage in search methods.All ID comparisons now properly use
.get()
to extract values fromU32
types, maintaining consistency with the zerocopy migration. The binary search logic remains correct.
|
||
/// A non-owning view of [`PostingList`]. | ||
#[derive(Debug, Clone)] | ||
#[derive(Debug)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
❓ Verification inconclusive
Verify impact of removing Clone derive from PostingListView.
The Clone
derive has been removed from PostingListView
. This is a breaking change that may affect code depending on cloning this type.
🏁 Script executed:
#!/bin/bash
# Description: Check for code that depends on cloning PostingListView
# Expected: Find any usage of .clone() on PostingListView instances
rg -A 3 -B 3 "PostingListView.*\.clone\(\)"
rg -A 3 -B 3 "\.clone\(\).*PostingListView"
ast-grep --pattern 'let $_ = $var.clone()' | grep -i postinglistview
Length of output: 162
No internal .clone()
usage found on PostingListView
We ran ripgrep and ast-grep across the repo to locate any .clone()
calls on PostingListView
—none were found. Removing Clone
from PostingListView
does not break internal code. However, since this type is part of the public API, this is a breaking change for downstream consumers:
- Bump the crate’s major version.
- Document the removal of
Clone
in the changelog or migration guide.
🤖 Prompt for AI Agents
In lib/posting_list/src/view.rs at line 15, the Clone derive was removed from
PostingListView, which breaks downstream code relying on cloning this type. To
fix this, update the crate's major version to reflect the breaking change and
add clear documentation about the removal of Clone in the changelog or migration
guide to inform users of the API change.
@@ -8,28 +8,75 @@ use crate::visitor::PostingVisitor; | |||
|
|||
pub struct PostingIterator<'a, H: ValueHandler> { | |||
visitor: PostingVisitor<'a, H>, | |||
current_id: Option<PointOffsetType>, | |||
current_elem: Option<PostingElement<H::Value>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change now requires Copy
. What the reason for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that we can cache the value and avoid going into the posting list if not necessary
/// Internal structure to store the remainder of the posting list. So that IntoBytes can be derived | ||
#[derive(Clone, Debug, FromBytes, Immutable, IntoBytes, KnownLayout)] | ||
#[repr(C)] | ||
pub struct RemainderPosting<S: Sized> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is internal, can it be not pub
? Or at elast pub(crate)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to be public because we need to construct a PostingListView from the specific mmap postings implementation. I guess the correct description would be
Structure to store (...) The difference with PostingElement is that this is zero-copy friendly
#[repr(C)] | ||
pub struct RemainderPosting<S: Sized> { | ||
pub id: U32, | ||
pub value: S, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably a nitpick, but why S
here and V
in PostingElement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is not necessarily the actual value for unsized values, but rather whatever is used as fixed-sized value (in case of unsized values, it is a u32 pointing into var-sized-data)
* integrate generic posting list in full-text index * improve intersection fn improve iterator intersection * make old module just for test * recover old implementation of MmapPostings for tests * add compatibility test * fix compression * clippy * nits * check if first id is already greater protects against the case where the iterator's current element hasn't been extracted yet * update current element when there is no greater or equal * reuse visitor in test * optimize alignment * bound checks and refactor is_in_range * review * edit comment --------- Co-authored-by: generall <andrey@vasnetsov.com>
builds on top of #6528
This PR:
todo