Skip to content

Conversation

coszio
Copy link
Contributor

@coszio coszio commented May 30, 2025

Builds on top of #6617

Make MmapPostings generic over the stored value along with the ids.

To achieve this, it was necessary to create a trait specific to full text index (MmapPostingValue), so that it is still compatible with previous implementation, but can also store positions if the right type is used.

@coszio coszio changed the title Generic mmap postings [phrase match] Generic mmap postings May 30, 2025
@coszio coszio added this to the Phrase matching milestone May 30, 2025
@coszio coszio force-pushed the ParsedQuery-enum branch 2 times, most recently from 5391a3e to 1ecfcf9 Compare June 2, 2025 17:26
Base automatically changed from ParsedQuery-enum to dev June 2, 2025 18:26
@coszio coszio force-pushed the generic-mmap-postings branch from 55a063e to 9ef8a46 Compare June 2, 2025 18:59
@coszio coszio marked this pull request as ready for review June 2, 2025 19:01
@coszio coszio requested a review from generall June 2, 2025 19:01
Copy link
Contributor

coderabbitai bot commented Jun 2, 2025

📝 Walkthrough

"""

Walkthrough

This change introduces several refactorings and extensions to the posting list and full text index infrastructure. The ValueHandler trait is now publicly re-exported. The UnsizedHandler implementation is updated to use the U32 type from the zerocopy::little_endian module for offsets instead of primitive u32. The IdsPostingListView struct and its associated constructor are removed. The full text index and memory-mapped postings modules are refactored to be generic over posting value types, introducing the MmapPostingValue trait and a new Positions struct for storing token positions. A new positions module is added.

Possibly related PRs

Suggested reviewers

  • timvisee
  • generall
    """
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)

151-374: Excellent generic refactoring of MmapPostings.

The conversion to a generic structure is well-executed:

  • Proper use of PhantomData to maintain type information
  • Methods correctly updated to handle both fixed and variable-sized data
  • Maintains backward compatibility through the JustIds type

Consider documenting why 4-byte alignment was chosen (line 23) for future maintainers, as different architectures might have different optimal alignment requirements.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0525a3 and 9ef8a46.

📒 Files selected for processing (9)
  • lib/posting_list/src/lib.rs (1 hunks)
  • lib/posting_list/src/value_handler.rs (5 hunks)
  • lib/posting_list/src/view.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (11 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/positions.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
🧬 Code Graph Analysis (2)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (2)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
  • open (341-359)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs (1)
  • open (221-238)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
  • create (253-339)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-low-resources
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
🔇 Additional comments (19)
lib/segment/src/index/field_index/full_text_index/mod.rs (1)

6-6: LGTM: Module addition aligns with generic posting value support.

The addition of the positions module is consistent with the PR objective to make MmapPostings generic over stored values, enabling support for token positions.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1)

41-42: LGTM: Correct explicit type parameter for generic MmapPostings.

The addition of the explicit type parameter <()> is necessary for the new generic MmapPostings implementation. Using () maintains compatibility with the previous behavior where postings contain only IDs without additional value data.

lib/posting_list/src/lib.rs (1)

40-40: LGTM: ValueHandler export supports generic posting abstraction.

Adding ValueHandler to the public exports is appropriate for the generic posting value refactoring, enabling other modules to work with the trait for handling different posting value types.

lib/posting_list/src/view.rs (2)

16-24: LGTM: Well-designed generic posting list view abstraction.

The generic PostingListView<'a, V: PostingValue> effectively replaces the specialized IdsPostingListView, providing flexibility to handle different posting value types while maintaining type safety through proper constraints and phantom data usage.


86-103: LGTM: Generic constructor maintains flexibility.

The from_components constructor properly handles the generic value handler types, enabling creation of posting list views for different value types while maintaining the same interface pattern.

lib/posting_list/src/value_handler.rs (1)

5-5: LGTM! Improved cross-platform compatibility with little-endian encoding.

The changes consistently replace primitive u32 offsets with zerocopy::little_endian::U32, ensuring deterministic byte ordering across platforms. All offset handling is properly updated to use U32::from() for creation and .get() for extraction while maintaining the original logic.

Also applies to: 97-97, 106-106, 118-118, 123-123, 144-145

lib/segment/src/index/field_index/full_text_index/positions.rs (1)

1-29: LGTM! Well-implemented variable-sized posting value type.

The Positions struct is correctly implemented with:

  • Proper PostingValue trait implementation using UnsizedHandler
  • Sound UnsizedValue implementation with correct length calculation
  • Safe serialization/deserialization using zerocopy traits
  • Appropriate error handling with descriptive messages

The implementation integrates well with the generic posting list infrastructure.

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (3)

5-5: LGTM! Added import for generic posting value support.

The addition of PostingValue import enables the generic parameterization of internal functions.


32-33: LGTM! Proper generic parameterization of intersection function.

The function is correctly generalized with V: PostingValue constraint, enabling support for different posting value types while maintaining the original logic.


69-70: LGTM! Consistent generic parameterization of check intersection function.

The function follows the same pattern as intersection, properly generalized with V: PostingValue constraint.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (5)

12-12: LGTM! Added import for mmap posting value trait.

The addition of MmapPostingValue import enables the generic parameterization of MmapPostings.


37-37: LGTM! Explicit generic parameterization of MmapPostings field.

The field type is correctly updated to MmapPostings<()>, explicitly specifying the unit type for postings without additional variable-sized data.


63-63: LGTM! Updated method call with explicit generic parameter.

The call to MmapPostings::<()>::create correctly specifies the generic type parameter to match the field type.


156-157: LGTM! Proper generic parameterization of intersection function.

The function is correctly generalized with V: MmapPostingValue constraint, parallel to the changes in immutable_inverted_index.rs, enabling support for different mmap posting value types.


200-201: LGTM! Consistent generic parameterization of check intersection function.

The function follows the same pattern as intersection, properly generalized with V: MmapPostingValue constraint.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (4)

27-50: Well-designed trait abstraction for memory-mapped posting values.

The MmapPostingValue trait provides a clean abstraction for handling both fixed-size (IDs only) and variable-size (IDs + positions) posting data. The trait bounds ensure type safety with zerocopy operations, and the required methods provide a clear interface for implementations.


52-78: Clean implementation for the ID-only posting case.

The JustIds implementation correctly handles the case where no variable-sized data is stored. The debug assertion provides good runtime validation in debug builds.


80-105: Correct implementation for variable-sized position data.

The Positions implementation properly handles variable-sized data with appropriate alignment calculations that account for both ID and position data.


119-149: Header structure properly extended for variable-sized data support.

The addition of var_size_data_bytes_count field and the adjustment of reserved bytes maintains the struct size while adding new functionality. The generic posting_size method correctly calculates the total size including variable data.

impl MmapPostingValue for Positions
@coszio coszio force-pushed the generic-mmap-postings branch from 9ef8a46 to 999c924 Compare June 2, 2025 19:13
>
{
/// Length in bytes of the variable size data section.
fn var_size_data_len(var_size_data: &<Self::Handler as ValueHandler>::VarSizeData) -> usize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understood from the call review, VarSizeData can either be () or [u8], and if we always use [u8], the price is that we spend extra 16bytes per posting list. But I think it is worth paying, cause otherwise complexity of the code is quite significant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplified in a06ab78

<[PostingChunk<()>]>::ref_from_prefix_with_elems(bytes, header.chunks_count as usize)
.ok()?;
let (data, bytes) = bytes.split_at(header.data_bytes_count as usize);
<[PostingChunk<<V::Handler as ValueHandler>::Sized>]>::ref_from_prefix_with_elems(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, let's try to use type alias for V::Handler as ValueHandler>::Sized> as it becomes bulky. Also, it would be nic eto mention in comments next to the alias definition an example of typical type which is actually going to be used there, as it simplifies intuition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also handled in a06ab78

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
lib/posting_list/src/posting_list.rs (1)

10-13: Type alias usage improves code readability.

The refactoring to use SizedTypeFor<V> type alias and standardized Vec<u8> for variable-size data successfully addresses the complexity mentioned in past reviews. This makes the code more maintainable while preserving the same functionality.

Also applies to: 30-32

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)

237-237: Update or remove misleading comment.

The comment states "not used with just ids postings" but the code below (lines 256-258) does handle var_size_data when it's not empty. This comment appears outdated given the generic implementation.

-                var_size_data, // not used with just ids postings
+                var_size_data,
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 999c924 and a06ab78.

📒 Files selected for processing (5)
  • lib/posting_list/src/lib.rs (2 hunks)
  • lib/posting_list/src/posting_list.rs (2 hunks)
  • lib/posting_list/src/value_handler.rs (7 hunks)
  • lib/posting_list/src/view.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • lib/posting_list/src/lib.rs
  • lib/posting_list/src/value_handler.rs
  • lib/posting_list/src/view.rs
🧰 Additional context used
🧠 Learnings (2)
lib/posting_list/src/posting_list.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: lint
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (4)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (4)

25-38: Well-designed trait for mmap compatibility.

The MmapPostingValue trait effectively constrains the generic parameter to ensure all necessary zerocopy traits are satisfied for mmap operations. The implementations for () and Positions provide the expected use cases.


51-68: Header structure properly supports variable-sized data.

The separation of byte counts for ID data and variable-sized data enables the generic postings design while maintaining backward compatibility (var_size_data_bytes_count = 0 for ID-only postings).


71-77: Size calculation correctly accounts for all data components.

The generic posting_size method properly includes variable-sized data in the total size calculation, ensuring accurate memory layout calculations.


121-162: Data layout parsing correctly handles variable-sized data.

The get_view method properly splits the mmap data into ID and variable-sized sections while maintaining alignment. The sequential parsing ensures correct memory layout interpretation.

@coszio coszio requested a review from generall June 3, 2025 18:15
@coszio coszio merged commit 96ca31a into dev Jun 3, 2025
17 checks passed
@coszio coszio deleted the generic-mmap-postings branch June 3, 2025 19:21
generall pushed a commit that referenced this pull request Jul 17, 2025
* generic MmapPostings<V>

impl MmapPostingValue for Positions

* simplify associated types
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants