
Conversation

@coszio (Contributor) commented May 7, 2025

Currently, the null index only works for the is_empty=false and is_null=true cases.

By making the index aware of the available max_point_offset, we can also make it work for is_empty=true and is_null=false.

This PR also builds the null index even when the IndexSelector chooses a RocksDB-based index. The null index is now created regardless of the appendable and on_disk parameters.
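The mechanism described here can be sketched with a toy model (this is not the actual MmapNullIndex code; the names follow the PR text):

```rust
// Toy model of the null-index idea: a bitmap of "point has a value" flags,
// plus the segment's max_point_offset. Without the latter, only points the
// bitmap has seen can be returned, which is why is_empty=true did not work.
struct NullIndexSketch {
    has_values: Vec<bool>,   // one flag per point offset the index has seen
    max_point_offset: usize, // highest point offset present in the segment
}

impl NullIndexSketch {
    // is_empty = false: only points with a value qualify, so the bitmap alone suffices.
    fn non_empty(&self) -> Vec<usize> {
        (0..self.has_values.len())
            .filter(|&i| self.has_values[i])
            .collect()
    }

    // is_empty = true: we must also return points the bitmap never recorded,
    // which is only possible once the index knows max_point_offset.
    fn empty(&self) -> Vec<usize> {
        (0..=self.max_point_offset)
            .filter(|&i| !self.has_values.get(i).copied().unwrap_or(false))
            .collect()
    }
}

fn main() {
    let idx = NullIndexSketch {
        has_values: vec![true, false, true],
        max_point_offset: 4,
    };
    // Offsets 1, 3, and 4 have no value; 3 and 4 lie beyond the bitmap.
    println!("{:?}", idx.empty());
}
```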

Perf gains

In some local manual tests, I see about a 1.8x difference between dev and this branch. It is quite small (~2.5ms vs ~1.4ms) but noticeable.

# upload 1M points without payload
bfb -n 1000000 --dim 64

# this will apply int payloads to ~10% of points, randomly
bfb -n 100000 --dim 64 --skip-create --int-payloads 1000 --max-id 1000000

Then I created the index through the dashboard:

PUT /collections/benchmark/index
{
  "field_name": "c",
  "field_schema": {
    "type": "integer",
    "on_disk": true
  }
}

Then I ran this for each build, watching the server timings in the logs rather than the wall time registered by hyperfine:

hyperfine "curl -X POST -H \"Content-Type: application/json\" -d '{\"query\": 3, \"filter\": {\"must\": {\"key\": \"c\", \"is_empty\": true}}}' localhost:6333/collections/benchmark/points/query" --warmup 200

@coszio coszio requested review from timvisee and generall May 7, 2025 20:54

coderabbitai bot commented May 7, 2025

📝 Walkthrough

The changes introduce explicit and parameterized handling of "null indexes" in the indexing subsystem. Null index creation and builder logic are refactored to static methods requiring explicit directory, field, and point count parameters. The MmapNullIndex struct now tracks the total number of points indexed, and all related methods are updated to use this count for filtering and cardinality estimation. The StructPayloadIndex is updated to always attempt to load or build a null index for each field, ensuring null indexes are integrated into both loading and building phases. The FieldCondition constructors for "is_empty" and "is_null" now require an explicit boolean argument. Tests and internal logic are updated to reflect these changes, including filtering out deleted points in test queries.
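The constructor change mentioned in the walkthrough might take roughly this shape (a hypothetical sketch; the field and method names are assumptions, not the exact Qdrant signatures):

```rust
// Hypothetical illustration of requiring an explicit boolean in the
// "is_empty" condition constructor, as the walkthrough describes.
struct FieldCondition {
    key: String,
    is_empty: Option<bool>,
}

impl FieldCondition {
    // Before, a constructor like new_is_empty(key) implicitly meant
    // `is_empty = true`; now the caller must spell out which branch is wanted,
    // since the index can answer both.
    fn new_is_empty(key: impl Into<String>, is_empty: bool) -> Self {
        FieldCondition {
            key: key.into(),
            is_empty: Some(is_empty),
        }
    }
}

fn main() {
    let cond = FieldCondition::new_is_empty("c", true);
    println!("{} {:?}", cond.key, cond.is_empty);
}
```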

Suggested reviewers

  • generall



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73b2c33 and 5dcf1f8.

📒 Files selected for processing (1)
  • lib/segment/src/index/struct_payload_index.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/segment/src/index/struct_payload_index.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)

@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
lib/segment/src/index/struct_payload_index.rs (1)

242-245: Potential double-registration of the null builder

IndexSelector::index_builder() no longer inserts a null builder, so the extra push() is necessary now—good catch.
Just a note: if future refactors accidentally add it back inside index_builder, this line will silently create two null builders for the same field. A helper such as builders.retain(|b| !matches!(b, FieldIndexBuilder::NullIndex(_))) before the push would make the intent explicit.
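The defensive `retain` suggested here could look like the following (stand-in enum variants; the real FieldIndexBuilder has more cases):

```rust
// Sketch of guarding against double-registration of the null builder.
#[derive(Debug)]
enum FieldIndexBuilder {
    NullIndex(&'static str),
    Other(&'static str),
}

fn main() {
    let mut builders = vec![
        FieldIndexBuilder::Other("int"),
        // A future refactor of index_builder() might sneak this back in:
        FieldIndexBuilder::NullIndex("stale"),
    ];

    // Drop any null builder that was already registered...
    builders.retain(|b| !matches!(b, FieldIndexBuilder::NullIndex(_)));
    // ...so this push can never create two null builders for the same field.
    builders.push(FieldIndexBuilder::NullIndex("fresh"));

    println!("{builders:?}");
}
```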

lib/segment/src/index/field_index/index_selector.rs (1)

233-250: Public helpers for the null index are clear; consider naming symmetry

The new static helpers null_builder and new_null_index make the call-sites straightforward and remove RocksDB ambiguity—nice!
Two small suggestions:

  1. The prefix new_* is already used for builders elsewhere. Renaming to open_null_index (or load_null_index) would convey that this function opens an existing on-disk index.

  2. null_builder always creates the directory on disk (via builder()), even when the caller eventually decides not to build indexes. If that side-effect is undesirable, defer directory creation to init().

No blocking issues—just naming / side-effect clarity.

lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1)

277-279: Hardware-counter delta likely mis-estimated

incr_delta(self.has_values_slice.len() / u8::BITS as usize) uses the bitmap length (rounded to bytes), while the subsequent scan may iterate max_point_offset + 1 elements. After the off-by-one fix, the mismatch grows.

If the counter is intended to model actual element reads, changing it to (self.max_point_offset + 1) / u8::BITS as usize keeps the estimate aligned.

Also applies to: 296-298
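The arithmetic behind this suggestion, with illustrative values (the sizes below are made up for the example):

```rust
// If the scan covers 0..=max_point_offset but the delta is derived from the
// bitmap's allocated length, the hardware-counter estimate diverges whenever
// max_point_offset extends past the bitmap.
fn main() {
    let has_values_slice_len: usize = 1024; // bits allocated in the bitmap
    let max_point_offset: usize = 1_500;    // scan may reach beyond the bitmap

    // Current estimate: based on bitmap length (in bytes).
    let current_delta = has_values_slice_len / u8::BITS as usize;
    // Suggested estimate: based on the number of elements actually scanned.
    let aligned_delta = (max_point_offset + 1) / u8::BITS as usize;

    println!("{current_delta} vs {aligned_delta}"); // 128 vs 187
}
```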

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f0b48a and 45c6eb4.

📒 Files selected for processing (4)
  • lib/segment/src/index/field_index/index_selector.rs (4 hunks)
  • lib/segment/src/index/field_index/null_index/mmap_null_index.rs (9 hunks)
  • lib/segment/src/index/query_estimator.rs (1 hunks)
  • lib/segment/src/index/struct_payload_index.rs (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/segment/src/index/field_index/index_selector.rs (1)
lib/segment/src/index/field_index/null_index/mmap_null_index.rs (2)
  • builder (38-40)
  • open_if_exists (82-105)
🔇 Additional comments (3)
lib/segment/src/index/query_estimator.rs (1)

251-254: Robust under-flow protection looks good, but please double-check edge-case invariants

Replacing the raw subtraction with saturating_sub prevents underflow when estimation.* occasionally exceeds total.
One minor thing to keep an eye on: if estimation.max > total, the new min becomes 0, while exp may still be non-zero (e.g. exp = total - estimation.exp when estimation.exp ≤ total). The usual invariant min ≤ exp ≤ max still holds, but the relation min = total - estimation.max is now only approximate (clamped). If any downstream code relies on the exact algebraic symmetry, it might give slightly optimistic results.

No change requested; just ensure tests cover the estimation.* > total scenario.
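A minimal sketch of the clamping behavior discussed here, using a stand-in struct rather than the actual CardinalityEstimation type:

```rust
// Inverting an estimation against `total` with saturating_sub: each bound
// clamps at 0 instead of underflowing when the estimation exceeds total.
struct Estimation {
    min: usize,
    exp: usize,
    max: usize,
}

fn invert(total: usize, est: &Estimation) -> Estimation {
    Estimation {
        min: total.saturating_sub(est.max),
        exp: total.saturating_sub(est.exp),
        max: total.saturating_sub(est.min),
    }
}

fn main() {
    // est.max > total: min clamps to 0 while exp remains non-zero, so
    // min = total - est.max holds only approximately here.
    let inv = invert(100, &Estimation { min: 10, exp: 50, max: 120 });
    println!("{} {} {}", inv.min, inv.exp, inv.max); // 0 50 90
}
```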

lib/segment/src/index/field_index/index_selector.rs (1)

4-4: Import looks fine

PointOffsetType import is required by the new helper—✔️

lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1)

30-31: max_point_offset initialisation / maintenance deserves stricter guarantees

  1. The builder starts with max_point_offset = 0; fine, but for existing
    indexes loaded via open_if_exists, the caller must supply the correct
    value. A stale value will break the new range-based logic.
  2. remove_point doesn’t shrink max_point_offset, so range scans can extend
    far beyond live points after many deletions, wasting IO.

Consider:

  • deriving max_point_offset from DynamicMmapFlags::len() when loading, or
  • persisting it on disk alongside the bitmaps, or
  • tracking active_count instead and using bitmap iterators consistently.

Please double-check all open_* call-sites to ensure they now pass a
trustworthy max_point_offset. I can craft a script to grep for them if
helpful.

Also applies to: 153-155
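The first option — deriving max_point_offset from the flags' length at load time — could be sketched like this (a plain Vec<bool> stands in for DynamicMmapFlags; not the actual implementation):

```rust
// Deriving max_point_offset from the persisted bitmap length on load, so a
// stale caller-supplied value cannot break the range-based logic.
struct Flags(Vec<bool>);

impl Flags {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// len() counts allocated slots; the highest addressable offset is len - 1.
// An empty bitmap has no valid offset at all, hence the Option.
fn max_point_offset_on_load(flags: &Flags) -> Option<usize> {
    flags.len().checked_sub(1)
}

fn main() {
    let flags = Flags(vec![true, false, true, false]);
    println!("{:?}", max_point_offset_on_load(&flags)); // Some(3)
}
```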

Comment on lines 146 to 153
if let Some(null_index) =
    IndexSelector::new_null_index(&self.path, field, max_point_offset)?
{
    // todo: This means that null index will not be built if it is not loaded here.
    // Maybe we should set `is_loaded` to false to trigger index building.
    indexes.push(null_index);
}


🛠️ Refactor suggestion

Loading logic leaves the “newly created” null index in a half-built state

The TODO is spot-on: when the directory exists but the files do not, we push a freshly created null index and immediately mark is_loaded = true, so the build phase is skipped and the bitmaps stay empty.

A minimal fix is to forward the return value of load() and include it in the is_loaded aggregation:

-for ref mut index in indexes.iter_mut() {
-    if !index.load()? {
-        is_loaded = false;
-        break;
-    }
-}
+for index in indexes.iter_mut() {
+    if !index.load()? {
+        is_loaded = false;
+    }
+}

Then, if any constituent index (including the null one) fails to load, we’ll correctly rebuild the whole bundle.

@coszio (Contributor, Author) commented May 8, 2025

There are some errors to solve regarding cardinality estimation

Edit: not anymore

@generall (Member) left a comment

I am not sure those big changes are actually needed. Let's discuss in a call

@@ -477,7 +477,7 @@ fn build_test_segments_nested_payload(path_struct: &Path, path_plain: &Path) ->

 for (_field, indexes) in struct_segment.payload_index.borrow().field_indexes.iter() {
     for index in indexes {
-        assert!(index.count_indexed_points() < num_points as usize);
+        assert!(index.count_indexed_points() <= num_points as usize);
@coszio (Contributor, Author):

This is modified because the null index is now also built for on_disk=false, and the null index reports the max id that has ever had a value, even if that point was deleted (has_values_slice.len()).

@coszio coszio requested a review from generall May 13, 2025 19:04
@coszio coszio merged commit e2d1e64 into dev May 14, 2025
17 checks passed
@coszio coszio deleted the fix-null-index branch May 14, 2025 18:55
generall pushed a commit that referenced this pull request May 22, 2025
* take max_point_offset into consideration

* add null index even if index selector is for rocksdb

* fmt

* better cardinality estimation, check deleted inside filtered iter

* fix count indexed points, add primary conditions to estimator

* use len for count_indexed_points, adjust test

* remove deleted check from filtered iter

* cleaner removal
@coderabbitai bot mentioned this pull request Aug 5, 2025