-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[phrase match] Generic mmap postings #6619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5391a3e
to
1ecfcf9
Compare
55a063e
to
9ef8a46
Compare
📝 Walkthrough""" WalkthroughThis change introduces several refactorings and extensions to the posting list and full text index infrastructure. The Possibly related PRs
Suggested reviewers
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
151-374
: Excellent generic refactoring of MmapPostings.The conversion to a generic structure is well-executed:
- Proper use of
PhantomData
to maintain type information- Methods correctly updated to handle both fixed and variable-sized data
- Maintains backward compatibility through the
JustIds
typeConsider documenting why 4-byte alignment was chosen (line 23) for future maintainers, as different architectures might have different optimal alignment requirements.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
lib/posting_list/src/lib.rs
(1 hunks)lib/posting_list/src/value_handler.rs
(5 hunks)lib/posting_list/src/view.rs
(1 hunks)lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs
(3 hunks)lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs
(11 hunks)lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs
(5 hunks)lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs
(1 hunks)lib/segment/src/index/field_index/full_text_index/mod.rs
(1 hunks)lib/segment/src/index/field_index/full_text_index/positions.rs
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
🧬 Code Graph Analysis (2)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (2)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
open
(341-359)lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs (1)
open
(221-238)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
create
(253-339)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-low-resources
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: storage-compat-test
🔇 Additional comments (19)
lib/segment/src/index/field_index/full_text_index/mod.rs (1)
6-6
: LGTM: Module addition aligns with generic posting value support.The addition of the
positions
module is consistent with the PR objective to makeMmapPostings
generic over stored values, enabling support for token positions.lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1)
41-42
: LGTM: Correct explicit type parameter for generic MmapPostings.The addition of the explicit type parameter
<()>
is necessary for the new genericMmapPostings
implementation. Using()
maintains compatibility with the previous behavior where postings contain only IDs without additional value data.lib/posting_list/src/lib.rs (1)
40-40
: LGTM: ValueHandler export supports generic posting abstraction.Adding
ValueHandler
to the public exports is appropriate for the generic posting value refactoring, enabling other modules to work with the trait for handling different posting value types.lib/posting_list/src/view.rs (2)
16-24
: LGTM: Well-designed generic posting list view abstraction.The generic
PostingListView<'a, V: PostingValue>
effectively replaces the specializedIdsPostingListView
, providing flexibility to handle different posting value types while maintaining type safety through proper constraints and phantom data usage.
86-103
: LGTM: Generic constructor maintains flexibility.The
from_components
constructor properly handles the generic value handler types, enabling creation of posting list views for different value types while maintaining the same interface pattern.lib/posting_list/src/value_handler.rs (1)
5-5
: LGTM! Improved cross-platform compatibility with little-endian encoding.The changes consistently replace primitive
u32
offsets withzerocopy::little_endian::U32
, ensuring deterministic byte ordering across platforms. All offset handling is properly updated to useU32::from()
for creation and.get()
for extraction while maintaining the original logic.Also applies to: 97-97, 106-106, 118-118, 123-123, 144-145
lib/segment/src/index/field_index/full_text_index/positions.rs (1)
1-29
: LGTM! Well-implemented variable-sized posting value type.The
Positions
struct is correctly implemented with:
- Proper
PostingValue
trait implementation usingUnsizedHandler
- Sound
UnsizedValue
implementation with correct length calculation- Safe serialization/deserialization using zerocopy traits
- Appropriate error handling with descriptive messages
The implementation integrates well with the generic posting list infrastructure.
lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (3)
5-5
: LGTM! Added import for generic posting value support.The addition of
PostingValue
import enables the generic parameterization of internal functions.
32-33
: LGTM! Proper generic parameterization of intersection function.The function is correctly generalized with
V: PostingValue
constraint, enabling support for different posting value types while maintaining the original logic.
69-70
: LGTM! Consistent generic parameterization of check intersection function.The function follows the same pattern as
intersection
, properly generalized withV: PostingValue
constraint.lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (5)
12-12
: LGTM! Added import for mmap posting value trait.The addition of
MmapPostingValue
import enables the generic parameterization ofMmapPostings
.
37-37
: LGTM! Explicit generic parameterization of MmapPostings field.The field type is correctly updated to
MmapPostings<()>
, explicitly specifying the unit type for postings without additional variable-sized data.
63-63
: LGTM! Updated method call with explicit generic parameter.The call to
MmapPostings::<()>::create
correctly specifies the generic type parameter to match the field type.
156-157
: LGTM! Proper generic parameterization of intersection function.The function is correctly generalized with
V: MmapPostingValue
constraint, parallel to the changes inimmutable_inverted_index.rs
, enabling support for different mmap posting value types.
200-201
: LGTM! Consistent generic parameterization of check intersection function.The function follows the same pattern as
intersection
, properly generalized withV: MmapPostingValue
constraint.lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (4)
27-50
: Well-designed trait abstraction for memory-mapped posting values.The
MmapPostingValue
trait provides a clean abstraction for handling both fixed-size (IDs only) and variable-size (IDs + positions) posting data. The trait bounds ensure type safety with zerocopy operations, and the required methods provide a clear interface for implementations.
52-78
: Clean implementation for the ID-only posting case.The
JustIds
implementation correctly handles the case where no variable-sized data is stored. The debug assertion provides good runtime validation in debug builds.
80-105
: Correct implementation for variable-sized position data.The
Positions
implementation properly handles variable-sized data with appropriate alignment calculations that account for both ID and position data.
119-149
: Header structure properly extended for variable-sized data support.The addition of
var_size_data_bytes_count
field and the adjustment of reserved bytes maintains the struct size while adding new functionality. The genericposting_size
method correctly calculates the total size including variable data.
impl MmapPostingValue for Positions
9ef8a46
to
999c924
Compare
> | ||
{ | ||
/// Length in bytes of the variable size data section. | ||
fn var_size_data_len(var_size_data: &<Self::Handler as ValueHandler>::VarSizeData) -> usize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understood from the call review, VarSizeData
can either be ()
or [u8]
, and if we always use [u8]
, the price is that we spend extra 16bytes per posting list. But I think it is worth paying, cause otherwise complexity of the code is quite significant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simplified in a06ab78
<[PostingChunk<()>]>::ref_from_prefix_with_elems(bytes, header.chunks_count as usize) | ||
.ok()?; | ||
let (data, bytes) = bytes.split_at(header.data_bytes_count as usize); | ||
<[PostingChunk<<V::Handler as ValueHandler>::Sized>]>::ref_from_prefix_with_elems( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, let's try to use type alias for V::Handler as ValueHandler>::Sized>
as it becomes bulky. Also, it would be nic eto mention in comments next to the alias definition an example of typical type which is actually going to be used there, as it simplifies intuition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also handled in a06ab78
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
lib/posting_list/src/posting_list.rs (1)
10-13
: Type alias usage improves code readability.The refactoring to use
SizedTypeFor<V>
type alias and standardizedVec<u8>
for variable-size data successfully addresses the complexity mentioned in past reviews. This makes the code more maintainable while preserving the same functionality.Also applies to: 30-32
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
237-237
: Update or remove misleading comment.The comment states "not used with just ids postings" but the code below (lines 256-258) does handle
var_size_data
when it's not empty. This comment appears outdated given the generic implementation.- var_size_data, // not used with just ids postings + var_size_data,
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
lib/posting_list/src/lib.rs
(2 hunks)lib/posting_list/src/posting_list.rs
(2 hunks)lib/posting_list/src/value_handler.rs
(7 hunks)lib/posting_list/src/view.rs
(6 hunks)lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs
(11 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- lib/posting_list/src/lib.rs
- lib/posting_list/src/value_handler.rs
- lib/posting_list/src/view.rs
🧰 Additional context used
🧠 Learnings (2)
lib/posting_list/src/posting_list.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: lint
- GitHub Check: test-consensus-compose
- GitHub Check: test-low-resources
- GitHub Check: integration-tests
- GitHub Check: test-consistency
- GitHub Check: integration-tests-consensus
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (4)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (4)
25-38
: Well-designed trait for mmap compatibility.The
MmapPostingValue
trait effectively constrains the generic parameter to ensure all necessary zerocopy traits are satisfied for mmap operations. The implementations for()
andPositions
provide the expected use cases.
51-68
: Header structure properly supports variable-sized data.The separation of byte counts for ID data and variable-sized data enables the generic postings design while maintaining backward compatibility (var_size_data_bytes_count = 0 for ID-only postings).
71-77
: Size calculation correctly accounts for all data components.The generic
posting_size
method properly includes variable-sized data in the total size calculation, ensuring accurate memory layout calculations.
121-162
: Data layout parsing correctly handles variable-sized data.The
get_view
method properly splits the mmap data into ID and variable-sized sections while maintaining alignment. The sequential parsing ensures correct memory layout interpretation.
* generic MmapPostings<V> impl MmapPostingValue for Positions * simplify associated types
Builds on top of #6617
Make
MmapPostings
generic over the stored value along with the ids.To achieve this, it was necessary to create a trait specific to full text index (
MmapPostingValue
), so that it is still compatible with previous implementation, but can also store positions if the right type is used.