
Conversation

@coszio (Contributor) commented May 7, 2025

Currently, the null index only works for the is_empty=false and is_null=true cases.

By making the index aware of the available max_point_offset, we can also make it work for is_empty=true and is_null=false.

This PR also builds the null index even when the IndexSelector chooses a RocksDB-based index. The null index is now created regardless of the appendable and on_disk parameters.
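The mechanism described here can be sketched with a toy model (this is not the actual MmapNullIndex code; the names follow the PR text):

```rust
// Toy model of the null-index idea: a bitmap of "point has a value" flags,
// plus the segment's max_point_offset. Without the latter, only points the
// bitmap has seen can be returned, which is why is_empty=true did not work.
struct NullIndexSketch {
    has_values: Vec<bool>,   // one flag per point offset the index has seen
    max_point_offset: usize, // highest point offset present in the segment
}

impl NullIndexSketch {
    // is_empty = false: only points with a value qualify, so the bitmap alone suffices.
    fn non_empty(&self) -> Vec<usize> {
        (0..self.has_values.len())
            .filter(|&i| self.has_values[i])
            .collect()
    }

    // is_empty = true: we must also return points the bitmap never recorded,
    // which is only possible once the index knows max_point_offset.
    fn empty(&self) -> Vec<usize> {
        (0..=self.max_point_offset)
            .filter(|&i| !self.has_values.get(i).copied().unwrap_or(false))
            .collect()
    }
}

fn main() {
    let idx = NullIndexSketch {
        has_values: vec![true, false, true],
        max_point_offset: 4,
    };
    // Offsets 1, 3, and 4 have no value; 3 and 4 lie beyond the bitmap.
    println!("{:?}", idx.empty());
}
```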

Perf gains

In some local manual tests, I see about a 1.8x difference between dev and this branch. It is quite small (~2.5ms vs ~1.4ms) but noticeable.

# upload 1M points without payload
bfb -n 1000000 --dim 64

# this will apply int payloads to ~10% of points, randomly
bfb -n 100000 --dim 64 --skip-create --int-payloads 1000 --max-id 1000000

Then I created the index through the dashboard:

PUT /collections/benchmark/index
{
  "field_name": "c",
  "field_schema": {
    "type": "integer",
    "on_disk": true
  }
}

Then I ran this for each build, watching the server timings in the logs rather than the wall time registered by hyperfine:

hyperfine "curl -X POST -H \"Content-Type: application/json\" -d '{\"query\": 3, \"filter\": {\"must\": {\"key\": \"c\", \"is_empty\": true}}}' localhost:6333/collections/benchmark/points/query" --warmup 200

@coszio coszio requested review from timvisee and generall May 7, 2025 20:54

coderabbitai bot commented May 7, 2025

📝 Walkthrough

The changes introduce explicit and parameterized handling of "null indexes" in the indexing subsystem. Null index creation and builder logic are refactored to static methods requiring explicit directory, field, and point count parameters. The MmapNullIndex struct now tracks the total number of points indexed, and all related methods are updated to use this count for filtering and cardinality estimation. The StructPayloadIndex is updated to always attempt to load or build a null index for each field, ensuring null indexes are integrated into both loading and building phases. The FieldCondition constructors for "is_empty" and "is_null" now require an explicit boolean argument. Tests and internal logic are updated to reflect these changes, including filtering out deleted points in test queries.
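The constructor change mentioned in the walkthrough might take roughly this shape (a hypothetical sketch; the field and method names are assumptions, not the exact Qdrant signatures):

```rust
// Hypothetical illustration of requiring an explicit boolean in the
// "is_empty" condition constructor, as the walkthrough describes.
struct FieldCondition {
    key: String,
    is_empty: Option<bool>,
}

impl FieldCondition {
    // Before, a constructor like new_is_empty(key) implicitly meant
    // `is_empty = true`; now the caller must spell out which branch is wanted,
    // since the index can answer both.
    fn new_is_empty(key: impl Into<String>, is_empty: bool) -> Self {
        FieldCondition {
            key: key.into(),
            is_empty: Some(is_empty),
        }
    }
}

fn main() {
    let cond = FieldCondition::new_is_empty("c", true);
    println!("{} {:?}", cond.key, cond.is_empty);
}
```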

Suggested reviewers

  • generall



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73b2c33 and 5dcf1f8.

📒 Files selected for processing (1)
  • lib/segment/src/index/struct_payload_index.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/segment/src/index/struct_payload_index.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)

@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
lib/segment/src/index/struct_payload_index.rs (1)

242-245: Potential double-registration of the null builder

IndexSelector::index_builder() no longer inserts a null builder, so the extra push() is necessary now—good catch.
Just a note: if future refactors accidentally add it back inside index_builder, this line will silently create two null builders for the same field. A helper such as builders.retain(|b| !matches!(b, FieldIndexBuilder::NullIndex(_))) before the push would make the intent explicit.
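The defensive `retain` suggested here could look like the following (stand-in enum variants; the real FieldIndexBuilder has more cases):

```rust
// Sketch of guarding against double-registration of the null builder.
#[derive(Debug)]
enum FieldIndexBuilder {
    NullIndex(&'static str),
    Other(&'static str),
}

fn main() {
    let mut builders = vec![
        FieldIndexBuilder::Other("int"),
        // A future refactor of index_builder() might sneak this back in:
        FieldIndexBuilder::NullIndex("stale"),
    ];

    // Drop any null builder that was already registered...
    builders.retain(|b| !matches!(b, FieldIndexBuilder::NullIndex(_)));
    // ...so this push can never create two null builders for the same field.
    builders.push(FieldIndexBuilder::NullIndex("fresh"));

    println!("{builders:?}");
}
```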

lib/segment/src/index/field_index/index_selector.rs (1)

233-250: Public helpers for the null index are clear; consider naming symmetry

The new static helpers null_builder and new_null_index make the call-sites straightforward and remove RocksDB ambiguity—nice!
Two small suggestions:

  1. The prefix new_* is already used for builders elsewhere. Renaming to open_null_index (or load_null_index) would convey that this function opens an existing on-disk index.

  2. null_builder always creates the directory on disk (via builder()), even when the caller eventually decides not to build indexes. If that side-effect is undesirable, defer directory creation to init().

No blocking issues—just naming / side-effect clarity.

lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1)

277-279: Hardware-counter delta likely mis-estimated

incr_delta(self.has_values_slice.len() / u8::BITS as usize) uses the bitmap length (rounded to bytes), while the subsequent scan may iterate max_point_offset + 1 elements. After the off-by-one fix, the mismatch grows.

If the counter is intended to model actual element reads, changing it to (self.max_point_offset + 1) / u8::BITS as usize keeps the estimate aligned.

Also applies to: 296-298
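The arithmetic behind this suggestion, with illustrative values (the sizes below are made up for the example):

```rust
// If the scan covers 0..=max_point_offset but the delta is derived from the
// bitmap's allocated length, the hardware-counter estimate diverges whenever
// max_point_offset extends past the bitmap.
fn main() {
    let has_values_slice_len: usize = 1024; // bits allocated in the bitmap
    let max_point_offset: usize = 1_500;    // scan may reach beyond the bitmap

    // Current estimate: based on bitmap length (in bytes).
    let current_delta = has_values_slice_len / u8::BITS as usize;
    // Suggested estimate: based on the number of elements actually scanned.
    let aligned_delta = (max_point_offset + 1) / u8::BITS as usize;

    println!("{current_delta} vs {aligned_delta}"); // 128 vs 187
}
```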

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f0b48a and 45c6eb4.

📒 Files selected for processing (4)
  • lib/segment/src/index/field_index/index_selector.rs (4 hunks)
  • lib/segment/src/index/field_index/null_index/mmap_null_index.rs (9 hunks)
  • lib/segment/src/index/query_estimator.rs (1 hunks)
  • lib/segment/src/index/struct_payload_index.rs (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/segment/src/index/field_index/index_selector.rs (1)
lib/segment/src/index/field_index/null_index/mmap_null_index.rs (2)
  • builder (38-40)
  • open_if_exists (82-105)
🔇 Additional comments (3)
lib/segment/src/index/query_estimator.rs (1)

251-254: Robust under-flow protection looks good, but please double-check edge-case invariants

Replacing the raw subtraction with saturating_sub prevents underflow when estimation.* occasionally exceeds total.
One minor thing to keep an eye on: if estimation.max > total, the new min becomes 0, while exp may still be non-zero (e.g. exp = total - estimation.exp when estimation.exp ≤ total). The usual invariant min ≤ exp ≤ max still holds, but the relation min = total - estimation.max is now only approximate (clamped). If any downstream code relies on the exact algebraic symmetry, it might give slightly optimistic results.

No change requested; just ensure tests cover the estimation.* > total scenario.
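A minimal sketch of the clamping behavior discussed here, using a stand-in struct rather than the actual CardinalityEstimation type:

```rust
// Inverting an estimation against `total` with saturating_sub: each bound
// clamps at 0 instead of underflowing when the estimation exceeds total.
struct Estimation {
    min: usize,
    exp: usize,
    max: usize,
}

fn invert(total: usize, est: &Estimation) -> Estimation {
    Estimation {
        min: total.saturating_sub(est.max),
        exp: total.saturating_sub(est.exp),
        max: total.saturating_sub(est.min),
    }
}

fn main() {
    // est.max > total: min clamps to 0 while exp remains non-zero, so
    // min = total - est.max holds only approximately here.
    let inv = invert(100, &Estimation { min: 10, exp: 50, max: 120 });
    println!("{} {} {}", inv.min, inv.exp, inv.max); // 0 50 90
}
```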

lib/segment/src/index/field_index/index_selector.rs (1)

4-4: Import looks fine

PointOffsetType import is required by the new helper—✔️

lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1)

30-31: max_point_offset initialisation / maintenance deserves stricter guarantees

  1. The builder starts with max_point_offset = 0; fine, but for existing
    indexes loaded via open_if_exists, the caller must supply the correct
    value. A stale value will break the new range-based logic.
  2. remove_point doesn’t shrink max_point_offset, so range scans can extend
    far beyond live points after many deletions, wasting IO.

Consider:

  • deriving max_point_offset from DynamicMmapFlags::len() when loading, or
  • persisting it on disk alongside the bitmaps, or
  • tracking active_count instead and using bitmap iterators consistently.

Please double-check all open_* call-sites to ensure they now pass a
trustworthy max_point_offset. I can craft a script to grep for them if
helpful.

Also applies to: 153-155
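The first option — deriving max_point_offset from the flags' length at load time — could be sketched like this (a plain Vec<bool> stands in for DynamicMmapFlags; not the actual implementation):

```rust
// Deriving max_point_offset from the persisted bitmap length on load, so a
// stale caller-supplied value cannot break the range-based logic.
struct Flags(Vec<bool>);

impl Flags {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// len() counts allocated slots; the highest addressable offset is len - 1.
// An empty bitmap has no valid offset at all, hence the Option.
fn max_point_offset_on_load(flags: &Flags) -> Option<usize> {
    flags.len().checked_sub(1)
}

fn main() {
    let flags = Flags(vec![true, false, true, false]);
    println!("{:?}", max_point_offset_on_load(&flags)); // Some(3)
}
```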

Comment on lines 146 to 153
if let Some(null_index) =
    IndexSelector::new_null_index(&self.path, field, max_point_offset)?
{
    // todo: This means that null index will not be built if it is not loaded here.
    // Maybe we should set `is_loaded` to false to trigger index building.
    indexes.push(null_index);
}


🛠️ Refactor suggestion

Loading logic leaves the “newly created” null index in a half-built state

The TODO is spot-on: when the directory exists but the files do not, we push a freshly created null index and immediately mark is_loaded = true, so the build phase is skipped and the bitmaps stay empty.

A minimal fix is to forward the return value of load() and include it in the is_loaded aggregation:

-for ref mut index in indexes.iter_mut() {
-    if !index.load()? {
-        is_loaded = false;
-        break;
-    }
-}
+for index in indexes.iter_mut() {
+    if !index.load()? {
+        is_loaded = false;
+    }
+}

Then, if any constituent index (including the null one) fails to load, we’ll correctly rebuild the whole bundle.

@coszio (Contributor, Author) commented May 8, 2025

There are some errors to solve regarding cardinality estimation

Edit: not anymore

@generall (Member) left a comment

I am not sure those big changes are actually needed. Let's discuss in a call

@@ -477,7 +477,7 @@ fn build_test_segments_nested_payload(path_struct: &Path, path_plain: &Path) ->

 for (_field, indexes) in struct_segment.payload_index.borrow().field_indexes.iter() {
     for index in indexes {
-        assert!(index.count_indexed_points() < num_points as usize);
+        assert!(index.count_indexed_points() <= num_points as usize);
@coszio (Contributor, Author):

This is modified because the null index is now also built for on_disk=false, and the null index reports the max id that has ever had a value, even if that point was deleted (has_values_slice.len()).

@coszio coszio requested a review from generall May 13, 2025 19:04
@coszio coszio merged commit e2d1e64 into dev May 14, 2025
17 checks passed
@coszio coszio deleted the fix-null-index branch May 14, 2025 18:55
generall pushed a commit that referenced this pull request May 22, 2025
* take max_point_offset into consideration

* add null index even if index selector is for rocksdb

* fmt

* better cardinality estimation, check deleted inside filtered iter

* fix count indexed points, add primary conditions to estimator

* use len for count_indexed_points, adjust test

* remove deleted check from filtered iter

* cleaner removal
@coderabbitai bot mentioned this pull request Aug 5, 2025