Use null index more #6504
Conversation
📝 Walkthrough

The changes introduce explicit and parameterized handling of "null indexes" in the indexing subsystem. Null index creation and builder logic are refactored into static methods requiring explicit directory, field, and point-count parameters. The MmapNullIndex struct now tracks the total number of points indexed, and all related methods use this count for filtering and cardinality estimation. The StructPayloadIndex now always attempts to load or build a null index for each field, ensuring null indexes are integrated into both the loading and building phases. The FieldCondition constructors for "is_empty" and "is_null" now require an explicit boolean argument. Tests and internal logic are updated to reflect these changes, including filtering out deleted points in test queries.
Actionable comments posted: 4
🧹 Nitpick comments (3)
lib/segment/src/index/struct_payload_index.rs (1)

242-245: Potential double-registration of the null builder

IndexSelector::index_builder() no longer inserts a null builder, so the extra push() is necessary now; good catch.

Just a note: if future refactors accidentally add it back inside index_builder, this line will silently create two null builders for the same field. A helper such as builders.retain(|b| !matches!(b, FieldIndexBuilder::NullIndex(_))) before the push would make the intent explicit.

lib/segment/src/index/field_index/index_selector.rs (1)
233-250: Public helpers for the null index are clear; consider naming symmetry

The new static helpers null_builder and new_null_index make the call-sites straightforward and remove RocksDB ambiguity, nice! Two small suggestions:

- The prefix new_* is already used for builders elsewhere. Renaming to open_null_index (or load_null_index) would convey that this function opens an existing on-disk index.
- null_builder always creates the directory on disk (via builder()), even when the caller eventually decides not to build indexes. If that side effect is undesirable, defer directory creation to init().

No blocking issues, just naming and side-effect clarity.
lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1)
277-279: Hardware-counter delta likely mis-estimated

incr_delta(self.has_values_slice.len() / u8::BITS as usize) uses the bitmap length (in rounded bytes) while the subsequent scan may iterate max_point_offset + 1 elements. After the off-by-one fix the mismatch grows. If the counter is intended to model actual element reads, changing it to (self.max_point_offset + 1) / u8::BITS as usize keeps the estimate aligned.

Also applies to: 296-298
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (4)

- lib/segment/src/index/field_index/index_selector.rs (4 hunks)
- lib/segment/src/index/field_index/null_index/mmap_null_index.rs (9 hunks)
- lib/segment/src/index/query_estimator.rs (1 hunks)
- lib/segment/src/index/struct_payload_index.rs (2 hunks)
🧰 Additional context used

🧬 Code Graph Analysis (1)

lib/segment/src/index/field_index/index_selector.rs uses, from lib/segment/src/index/field_index/null_index/mmap_null_index.rs: builder (38-40), open_if_exists (82-105)
🔇 Additional comments (3)
lib/segment/src/index/query_estimator.rs (1)

251-254: Robust under-flow protection looks good, but please double-check edge-case invariants

Replacing the raw subtraction with saturating_sub prevents under-flow when estimation.* occasionally exceeds total.

One minor thing to keep an eye on: if estimation.max > total, the new min becomes 0, while exp may still be non-zero (e.g. exp = total - estimation.exp when estimation.exp ≤ total). The usual invariant min ≤ exp ≤ max still holds, but the relation min = total - estimation.max is now only approximate (clamped). If any downstream code relies on the exact algebraic symmetry, it might give slightly optimistic results.

No change requested; just ensure tests cover the estimation.* > total scenario.

lib/segment/src/index/field_index/index_selector.rs (1)
4-4: Import looks fine

The PointOffsetType import is required by the new helper. ✔️

lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1)
30-31: max_point_offset initialisation / maintenance deserves stricter guarantees

- The builder starts with max_point_offset = 0; fine, but for existing indexes loaded via open_if_exists, the caller must supply the correct value. A stale value will break the new range-based logic.
- remove_point doesn't shrink max_point_offset, so range scans can extend far beyond live points after many deletions, wasting IO.

Consider:

- deriving max_point_offset from DynamicMmapFlags::len() when loading, or
- persisting it on disk alongside the bitmaps, or
- tracking active_count instead and using bitmap iterators consistently.

Please double-check all open_* call-sites to ensure they now pass a trustworthy max_point_offset. I can craft a script to grep for them if helpful.

Also applies to: 153-155
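As a toy illustration of the maintenance concern above (all names here are simplified stand-ins for the real MmapNullIndex, which keeps its flags in mmap'd files): if removal only clears a flag, the scan range stays pinned at the historical maximum offset.

```rust
/// Simplified in-memory model of the bitmap + max_point_offset bookkeeping.
struct NullIndexModel {
    has_values: Vec<bool>,   // stand-in for has_values_slice
    max_point_offset: usize, // highest offset ever indexed
}

impl NullIndexModel {
    fn new() -> Self {
        Self { has_values: Vec::new(), max_point_offset: 0 }
    }

    fn add_point(&mut self, offset: usize) {
        if offset >= self.has_values.len() {
            self.has_values.resize(offset + 1, false);
        }
        self.has_values[offset] = true;
        self.max_point_offset = self.max_point_offset.max(offset);
    }

    // Mirrors the concern: removal clears the flag but does not shrink
    // max_point_offset, so range scans keep covering the old range.
    fn remove_point(&mut self, offset: usize) {
        if let Some(flag) = self.has_values.get_mut(offset) {
            *flag = false;
        }
    }

    fn scan_range(&self) -> usize {
        self.max_point_offset + 1
    }
}

fn main() {
    let mut index = NullIndexModel::new();
    for offset in 0..1000 {
        index.add_point(offset);
    }
    for offset in 1..1000 {
        index.remove_point(offset);
    }
    // Only one live point remains, yet range scans still cover 1000 slots.
    assert_eq!(index.scan_range(), 1000);
    assert_eq!(index.has_values.iter().filter(|v| **v).count(), 1);
}
```

This is why the comment suggests either persisting the value, deriving it from the flags' length on load, or switching to bitmap iterators that skip cleared slots.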
```rust
if let Some(null_index) =
    IndexSelector::new_null_index(&self.path, field, max_point_offset)?
{
    // todo: This means that null index will not be built if it is not loaded here.
    // Maybe we should set `is_loaded` to false to trigger index building.
    indexes.push(null_index);
}
```
🛠️ Refactor suggestion

Loading logic leaves the "newly created" null index in a half-built state

The TODO is spot-on: when the directory exists but the files do not, we push a freshly created null index and immediately mark is_loaded = true, so the build phase is skipped and the bitmaps stay empty.

A minimal fix is to forward the return value of load() and include it in the is_loaded aggregation:

```diff
-for ref mut index in indexes.iter_mut() {
-    if !index.load()? {
-        is_loaded = false;
-        break;
-    }
-}
+for index in indexes.iter_mut() {
+    if !index.load()? {
+        is_loaded = false;
+    }
+}
```

Then, if any constituent index (including the null one) fails to load, we'll correctly rebuild the whole bundle.
There are some errors to solve regarding cardinality estimation. Edit: not anymore.
I am not sure those big changes are actually needed. Let's discuss in a call
```diff
@@ -477,7 +477,7 @@ fn build_test_segments_nested_payload(path_struct: &Path, path_plain: &Path) ->
 for (_field, indexes) in struct_segment.payload_index.borrow().field_indexes.iter() {
     for index in indexes {
-        assert!(index.count_indexed_points() < num_points as usize);
+        assert!(index.count_indexed_points() <= num_points as usize);
```
This is modified because the null index is now built for on_disk=false as well, and the null index reports the max id that has ever had a value, even if that point was deleted (has_values_slice.len()).
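A minimal sketch of why the assertion had to be relaxed (the slice name follows the comment above; the helper is illustrative, not the real trait method): a length-based count keeps counting slots whose flag was cleared by deletion.

```rust
/// Illustrative only: counting via the bitmap length reports the number of
/// slots ever allocated, including since-deleted points.
fn count_indexed_points(has_values_slice: &[bool]) -> usize {
    has_values_slice.len()
}

fn main() {
    let mut has_values_slice = vec![true; 10];
    // "Delete" a point: the flag is cleared but the slice keeps its length.
    has_values_slice[9] = false;
    assert_eq!(count_indexed_points(&has_values_slice), 10);
    // Hence `count_indexed_points() < num_points` no longer holds in general,
    // only `<= num_points`.
}
```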
Commits:

- take max_point_offset into consideration
- add null index even if index selector is for rocksdb
- fmt
- better cardinality estimation, check deleted inside filtered iter
- fix count indexed points, add primary conditions to estimator
- use len for count_indexed_points, adjust test
- remove deleted check from filtered iter
- cleaner removal
Currently the null index was only working for the is_empty=false and is_null=true cases. By making the index aware of the max_point_offset available, we can also make it work for is_empty=true and is_null=false.

This PR also builds the index even if the IndexSelector chooses a RocksDB-based index. Now the null index is chosen regardless of the appendable and on-disk parameters.

Perf gains
In some local manual tests, I can see about a 1.8x difference between dev and this branch. It is quite small (~2.5ms vs ~1.4ms) but noticeable.

I created the index through the dashboard, then ran this for each build, watching the server timings in the logs rather than the wall time registered by hyperfine:

```shell
hyperfine "curl -X POST -H \"Content-Type: application/json\" -d '{\"query\": 3, \"filter\": {\"must\": {\"key\": \"c\", \"is_empty\": true}}}' localhost:6333/collections/benchmark/points/query" --warmup 200
```
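The mechanism behind the negated conditions described in this PR can be sketched as follows (all names are illustrative assumptions, not the real MmapNullIndex API): once the index knows max_point_offset, a condition like is_null=false or is_empty=true becomes the complement of a bitmap over the full offset range, which was impossible to enumerate before.

```rust
/// Sketch: evaluate an `is_null` condition against a null-flag bitmap.
/// Knowing `max_point_offset` lets us enumerate the complement
/// (is_null=false) as well as the positive set (is_null=true).
fn filter_is_null(
    is_null_slice: &[bool],
    max_point_offset: usize,
    is_null: bool,
) -> Vec<usize> {
    (0..=max_point_offset)
        .filter(|&offset| {
            // Offsets beyond the bitmap have no recorded null value.
            let flagged = is_null_slice.get(offset).copied().unwrap_or(false);
            flagged == is_null
        })
        .collect()
}

fn main() {
    // Points 0..=4; points 1 and 3 hold an explicit null value.
    let is_null_slice = vec![false, true, false, true, false];
    let max_point_offset = 4;

    // Positive case, which worked before this PR:
    assert_eq!(filter_is_null(&is_null_slice, max_point_offset, true), vec![1, 3]);
    // Negated case: only enumerable because max_point_offset bounds the range.
    assert_eq!(filter_is_null(&is_null_slice, max_point_offset, false), vec![0, 2, 4]);
}
```

The same complement trick applies to is_empty=true against a has-values bitmap, which is why both negated conditions become index-accelerated once max_point_offset is tracked.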