Skip to content

Conversation

generall
Copy link
Member

@generall generall commented Mar 2, 2025

  • create initial strucutres
  • clippy
  • start field-query refactoring
  • start field-query refactoring (2/N)
  • start field-query refactoring (3/N): duplicate is_empty/null condiftions as field condition

Copy link
Contributor

coderabbitai bot commented Mar 2, 2025

📝 Walkthrough

Walkthrough

This pull request introduces two new fields, is_empty and is_null, across multiple components. The changes extend the functionality of the FieldCondition structures and related conversion, validation, and query mechanisms. Updates span documentation, OpenAPI specifications, gRPC proto files, and conversion logic, with corresponding modifications in indexing and query optimization modules. In addition, a new null index type along with associated builder and helper methods has been added to support efficient handling of null and empty conditions. Tests and API endpoints have been updated to verify correct behavior regarding these conditions. The changes ensure that filtering, indexing, and payload evaluation now explicitly account for cases where fields may be empty or null.

Suggested reviewers

  • xzfc
✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@generall generall changed the title WIP: IsEmpty/IsNull index IsEmpty/IsNull index Mar 2, 2025
@generall generall marked this pull request as ready for review March 2, 2025 22:12
for value in payload {
match value {
Value::Null => {
is_null = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is_null = true;
has_values = true;
is_null = true;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, we don't consider null as a value

@@ -8397,6 +8397,16 @@
"nullable": true
}
]
},
"is_empty": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both alternative syntaxes need an openapi integration test for sanity check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, found couple of problems with new tests

Co-authored-by: Tim Visée <tim+github@visee.me>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (12)
docs/grpc/docs.md (1)

2802-2803: New Boolean Fields Added for Field Condition Checks

The two optional boolean fields, is_empty and is_null, have been introduced in the FieldCondition schema. Their descriptions (“Check if field is empty” and “Check if field is null”) clearly correspond to the PR objectives of improving empty/null field handling. Please ensure that these additions are also reflected consistently in all related documentation and API specs (e.g., OpenAPI, proto files) so that users can accurately understand their intended usage.

lib/segment/src/vector_storage/dense/dynamic_mmap_flags.rs (1)

217-238: New convenience method for dynamic flag resizing.

The set_with_resize method is a valuable addition that provides amortized growth when needed. The implementation handles edge cases well:

  1. Only resizes when necessary (when the index is beyond current length)
  2. Only allocates new space when setting a true value (since false is the default)
  3. Returns the previous value correctly

This will simplify client code that needs to work with dynamically sized flags.

Consider adding a unit test specifically for the set_with_resize method to ensure its behavior is verified, especially for edge cases like:

  • Setting a value beyond current length
  • Setting false beyond current length (should not resize)
  • Verifying the return value is correct both for existing and new indices
#[test]
fn test_set_with_resize() {
    let dir = Builder::new().prefix("storage_dir").tempdir().unwrap();
    let mut dynamic_flags = DynamicMmapFlags::open(dir.path()).unwrap();
    
    // Initialize with length 10
    dynamic_flags.set_len(10).unwrap();
    
    // Test: Set true beyond current length
    let prev = dynamic_flags.set_with_resize(20, true).unwrap();
    assert_eq!(prev, false); // Previous value should be false
    assert!(dynamic_flags.len() >= 21); // Length should be extended
    assert!(dynamic_flags.get(20)); // New value should be true
    
    // Test: Set false beyond current length
    let current_len = dynamic_flags.len();
    let prev = dynamic_flags.set_with_resize(current_len + 10, false).unwrap();
    assert_eq!(prev, false); // Previous value should be false
    assert_eq!(dynamic_flags.len(), current_len); // Length should not change
    
    // Test: Toggle existing value
    dynamic_flags.set(5, true).unwrap();
    let prev = dynamic_flags.set_with_resize(5, false).unwrap();
    assert_eq!(prev, true); // Previous value should be true
    assert!(!dynamic_flags.get(5)); // New value should be false
}
lib/segment/tests/integration/payload_index_test.rs (1)

622-661: Great refactoring to improve code clarity and maintainability.

The refactored code improves readability by:

  1. Separating checks for each key (INT_KEY, INT_KEY_2, INT_KEY_3) into distinct blocks
  2. Using iterator methods with any() to check for specific index types
  3. Using clear assertions with meaningful error messages

This makes the test logic easier to understand and maintain, particularly when debugging failures.

lib/segment/src/index/field_index/index_selector.rs (4)

53-84: Consider extracting repeated logic in the match arms
The expanded match arms now manage a mutable indexes collection to push multiple index types, including a null index. While fully functional, this section is rather large. Extracting common logic into helper functions could improve maintainability.


86-88: Fallback to new_null_index is coherent
This approach gracefully adds a null index when available. As seen in the comment, a ToDo remains for the RocksDb case. Logging a warning or delegating a partial index creation could guide future development.


239-246: null_builder is straightforward but incomplete for RocksDb
The function currently returns None for RocksDb. This is acceptable as a first step, but keep in mind the “ToDo” comment and incorporate RocksDb support to align with other builder methods.


248-255: new_null_index mirrors the builder logic
This follows the same RocksDb/OnDisk distinction. The code is concise. Future expansions could add an appendable approach for RocksDb.

lib/segment/src/index/struct_payload_index.rs (1)

357-359: Fallback to full ID iteration if no index is found
Relying on a full scan when query_field returns None might have performance implications for large segments. Consider adding trace logging or user notification if repeated usage of this fallback is expected.

lib/segment/src/index/query_optimization/condition_converter.rs (2)

55-83: Dedicated null index usage for Condition::IsEmpty
Leveraging a pre-built null index significantly boosts efficiency for empty checks. Fallback mechanisms for payload inspection are sensible. You might add a debug log when defaulting to the payload-based approach.


415-435: get_is_empty_checker remains limited to NullIndex
As implemented, this only returns a checker for NullIndex and otherwise yields None. That’s appropriate for the current scope, but future expansions could incorporate boolean or other specialized indexes if needed.

lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1)

204-246: Consider partial iteration for non-empty conditions.
When filtering for non-empty or non-null fields, the logic returns None and expects external iteration. If many points exist, this approach might be less efficient. Consider providing an explicit iterator for all points not set, or discuss a fallback approach to reduce overhead.

lib/segment/src/types.rs (1)

2230-2242: Add coverage for new is_empty and is_null constructors.
The new builders are convenient but ensure they’re tested. Consider unit tests or integration tests verifying filters like { "is_empty": true } or { "is_null": true } to confirm query correctness.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fd3bba and d0fc25b.

📒 Files selected for processing (22)
  • docs/grpc/docs.md (1 hunks)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (4 hunks)
  • lib/api/src/grpc/proto/points.proto (1 hunks)
  • lib/api/src/grpc/qdrant.rs (1 hunks)
  • lib/api/src/grpc/validate.rs (2 hunks)
  • lib/segment/src/index/field_index/field_index_base.rs (18 hunks)
  • lib/segment/src/index/field_index/index_selector.rs (7 hunks)
  • lib/segment/src/index/field_index/mod.rs (2 hunks)
  • lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1 hunks)
  • lib/segment/src/index/field_index/null_index/mod.rs (1 hunks)
  • lib/segment/src/index/query_estimator.rs (2 hunks)
  • lib/segment/src/index/query_optimization/condition_converter.rs (9 hunks)
  • lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (6 hunks)
  • lib/segment/src/index/query_optimization/rescore_formula/value_retriever.rs (1 hunks)
  • lib/segment/src/index/struct_payload_index.rs (4 hunks)
  • lib/segment/src/payload_storage/condition_checker.rs (2 hunks)
  • lib/segment/src/problems/unindexed_field.rs (2 hunks)
  • lib/segment/src/segment_constructor/segment_builder.rs (1 hunks)
  • lib/segment/src/types.rs (10 hunks)
  • lib/segment/src/vector_storage/dense/dynamic_mmap_flags.rs (3 hunks)
  • lib/segment/tests/integration/payload_index_test.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • lib/segment/src/index/field_index/null_index/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
🔇 Additional comments (57)
lib/api/src/grpc/qdrant.rs (1)

6281-6286: LGTM - Properly implemented new field condition checks.

The addition of is_empty and is_null optional boolean fields to the FieldCondition message is implemented correctly with appropriate protobuf tags, types, and documentation. This enhancement aligns well with the PR objective of improving empty/null field handling in field queries.

From the PR objectives and AI summary, it appears these fields are consistently added across the codebase to support checking for empty or null field values, which is a useful enhancement to the filtering capabilities.

lib/segment/src/segment_constructor/segment_builder.rs (1)

244-244: LGTM! Correctly handles the new NullIndex case.

The match arm FieldIndex::NullIndex(_) => {} properly follows the existing pattern for index types that don't contribute to ordering values. This addition is necessary to support the new null indexing capability and is implemented consistently with other similar index types.

lib/api/src/grpc/proto/points.proto (1)

1048-1049: Good addition of empty/null field checks to enhance filtering capabilities.

These new optional boolean fields enable explicit filtering on empty and null field states, which is a valuable feature for queries. The implementation follows the protobuf best practice of using optional fields with descriptive comments.

lib/segment/src/index/query_optimization/rescore_formula/value_retriever.rs (1)

178-178: LGTM! Properly handles the new NullIndex case.

The handler correctly returns None for NullIndex, which is the expected behavior since null values don't have retrievable content. The comment also provides valuable context that this index type is expected to be used alongside other indices for the same field.

lib/segment/src/problems/unindexed_field.rs (2)

220-222: LGTM! Successfully extracted the new fields from the FieldCondition struct.

The addition of is_empty and is_null to the destructuring pattern correctly extracts these new fields from the FieldCondition struct for further processing.


259-262: LGTM! Enhanced condition to support new field checks.

The updated condition now correctly considers the new is_empty and is_null fields when determining if indexes should be inferred. This ensures that queries using these new field checks will have the opportunity to benefit from appropriate indexing.

lib/api/src/grpc/conversions.rs (4)

1376-1377: Proper integration of new field conditions.

The addition of is_empty and is_null fields to the FieldCondition struct is a good enhancement, allowing for more expressive query conditions to check for empty or null values.


1400-1401: Consistent field handling in type conversion.

The proper handling of is_empty and is_null fields in the try_from implementation ensures field conditions are correctly transferred from gRPC representations to internal types.


1416-1417: Appropriate field inclusion in struct destructuring.

The inclusion of the new fields in struct destructuring ensures they're properly captured during conversion.


1435-1436: Bidirectional conversion support for new fields.

The proper mapping of is_empty and is_null fields in the reverse conversion direction ensures consistency in data transformation between internal and API representations.

lib/segment/src/index/field_index/mod.rs (2)

5-5: Simplified type imports matching refactored condition types.

The import statement has been updated to remove IsEmptyCondition and IsNullCondition types, which aligns with the refactoring to integrate these condition types into the FieldCondition struct.


27-32:

✅ Verification successful

Simplified PrimaryCondition enum with integrated condition handling.

The PrimaryCondition enum has been simplified by removing dedicated variants for IsEmpty and IsNull conditions. This is a good refactoring that reduces duplication and simplifies the condition handling by integrating these checks into the FieldCondition struct.


🏁 Script executed:

#!/bin/bash
# Find usages of PrimaryCondition to confirm the refactoring is applied consistently
rg "PrimaryCondition" --type rust

Length of output: 7045


PrimaryCondition Refactor Verified and Approved

The updated PrimaryCondition enum shows a consistent application of refactoring by removing the dedicated IsEmpty and IsNull variants in favor of using FieldCondition for condition handling. The grep output confirms that all usages across tests and index implementations now rely on the Condition variant (as well as the Ids and HasVector variants), ensuring that the refactoring is applied uniformly without impacting the overall functionality.

  • The simplified enum in lib/segment/src/index/field_index/mod.rs (lines 27-32) is correctly reflected.
  • All references in integration tests and various index modules consistently use the PrimaryCondition::Condition variant.
  • The integration of condition handling into FieldCondition reduces duplication and streamlines the logic.
lib/api/src/grpc/validate.rs (2)

209-211: Thorough field extraction in validation.

The destructuring of the FieldCondition struct has been updated to include the new is_empty and is_null fields, ensuring they are properly accessible for validation.


219-221: Complete validation of all field condition types.

The all_fields_none check has been extended to include the new fields, ensuring that validation correctly identifies when no field conditions are specified. This maintains the requirement that at least one field condition must be specified.

lib/segment/src/vector_storage/dense/dynamic_mmap_flags.rs (1)

148-150: Enhanced documentation for set_len behavior.

The addition of documentation clarifying the capacity behavior is helpful. It clearly communicates that capacity can be up to twice the current length, which is important for understanding performance implications.

lib/segment/src/index/field_index/field_index_base.rs (12)

22-22: Appropriate import for the new null index types.

This import correctly brings in the necessary types for supporting null indexing functionality.


124-124: Good addition of NullIndex variant.

Adding the NullIndex variant to the FieldIndex enum is a clean way to extend the system's capability to handle null values explicitly.


140-140: Debug implementation for NullIndex properly added.

The Debug implementation correctly follows the established pattern for other index types.


181-182: Consistent implementation for special_check_condition.

The implementation returns None for NullIndex, following the same pattern as most other index types.


197-198: Correct delegation to payload_field_index in get_payload_field_index.

The implementation correctly returns the payload_field_index for the NullIndex variant.


213-214: Comprehensive implementation across all FieldIndex methods.

All necessary methods for handling the NullIndex variant are properly implemented, including load, cleanup, add_point, remove_point, get_telemetry_data, values_count, and values_is_empty. Each follows the established pattern for other index types.

Also applies to: 229-230, 297-299, 315-316, 331-332, 347-348, 363-364


378-380: Correct exclusion of NullIndex from numeric representation.

The NullIndex variant is correctly excluded from having a numeric representation, similar to FullTextIndex and other non-numeric indices.


394-396: Appropriate exclusion of NullIndex from facet index representation.

The NullIndex is correctly excluded from having a facet index representation, following the pattern for other index types that don't support this functionality.


444-444: Builder pattern correctly extended with NullIndex variant.

The FieldIndexBuilder enum is properly extended with the NullIndex variant to support the creation of null indices.


470-471: Complete implementation of builder methods for NullIndex.

All necessary methods (init, add_point, finalize) for the NullIndex builder are properly implemented, maintaining consistency with other index types.

Also applies to: 496-497, 520-521


22-521:

✅ Verification successful

Verify integration with the rest of the system.

The implementation looks solid, but it's important to ensure that the system correctly uses this new index type for null value queries. Make sure that:

  1. The null index is created when needed
  2. Queries properly utilize the null index when filtering for null values
  3. The indexing mechanism correctly identifies and indexes null values

🏁 Script executed:

#!/bin/bash
# Description: Verify how the NullIndex is used in the codebase

# Search for where NullIndex is created and built
echo "Checking for NullIndex creation:"
rg -A 5 "NullIndex|null_index" --type rust

# Check if there's code that decides when to create a NullIndex
echo -e "\nChecking index selection logic:"
rg -A 10 "index.*selector|select.*index" --type rust

# Check if null fields are properly handled in queries
echo -e "\nChecking query handling of null fields:"
rg -A 5 "is_null|null.*field|field.*null" --type rust

Length of output: 73216


Review Approved: NullIndex Integration Verified

The integration of the new NullIndex has been thoroughly verified across the codebase. The verification output confirms that:

  • The NullIndex is created when needed (e.g. in the index selector using MmapNullIndex::builder).
  • Query logic correctly handles null fields by explicitly checking for FieldIndex::NullIndex in condition converters and payload estimations.
  • The indexing mechanism properly identifies and indexes null values (via values_is_null and related functions).

No further action is required.


22-22:

❓ Verification inconclusive

Verify correctness of the null index implementation.

Let's make sure the imported MmapNullIndex and MmapNullIndexBuilder are properly implemented.


🏁 Script executed:

#!/bin/bash
# Description: Examine the implementation of null index

# Check the MmapNullIndex implementation
echo "Examining MmapNullIndex implementation:"
fd -e rs "null_index.*mmap_null_index\.rs" | xargs cat

# Check if there are tests for the null index functionality
echo -e "\nChecking for null index tests:"
rg -A 5 "test.*null_index|null_index.*test" --type rust

Length of output: 280


Action: Manually verify the null index implementation.

Our automated scans did not return any results for the definitions of MmapNullIndex and MmapNullIndexBuilder. Please ensure that these components are implemented correctly in the codebase and that the import path in lib/segment/src/index/field_index/field_index_base.rs is accurate. If the files exist under a different name or location, update the import accordingly.



🏁 Script executed:

#!/bin/bash
# Search for definitions of MmapNullIndex and MmapNullIndexBuilder in the repository

echo "Searching for 'struct MmapNullIndex' definition:"
rg -n "struct\s\+MmapNullIndex" --type rust

echo -e "\nSearching for 'struct MmapNullIndexBuilder' definition:"
rg -n "struct\s\+MmapNullIndexBuilder" --type rust

echo -e "\nSearching for file names containing 'mmap_null_index':"
fd -i 'mmap_null_index'

Length of output: 563


Action: Please manually verify the implementation.

Our automated checks confirm that the file exists at
lib/segment/src/index/field_index/null_index/mmap_null_index.rs,
but we couldn’t locate explicit definitions for MmapNullIndex or MmapNullIndexBuilder using automated searches. It’s recommended to manually inspect the file to ensure that these types (or their equivalents) are properly defined and that their behavior matches the intended functionality. Also, verify that corresponding tests exist to validate their correctness.

  • File to check: lib/segment/src/index/field_index/null_index/mmap_null_index.rs
  • Focus Areas:
    • Ensure that MmapNullIndex and MmapNullIndexBuilder are either explicitly defined or correctly re-exported.
    • Confirm that public interfaces and related tests cover the null index functionality.
docs/redoc/master/openapi.json (3)

8358-8358: Documentation improvement for geo_bounding_box description

This change correctly updates the terminology from "geo location" to "geolocation" for better consistency in the documentation.


8401-8405: Addition of is_empty field checking capability

This new field provides an alternative syntax for checking if a specified field is empty, enhancing the filtering capabilities. The description is clear and the field is correctly marked as nullable since it's optional.

As per a previous comment by agourlay, integration tests should be added for this alternative syntax to ensure it works as expected. Please verify that integration tests have been added for this feature.


8406-8410: Addition of is_null field checking capability

This new field adds the ability to check if a specified field is null using an alternative syntax. Similar to is_empty, this enhances the filtering capabilities and is well-documented.

As with the is_empty field, please ensure that integration tests are added for this alternative syntax as recommended by agourlay's previous comment to verify correct behavior in various scenarios.

lib/segment/src/index/query_estimator.rs (3)

286-287: Nice addition of the new fields to support empty and null checks.

These fields will allow for more flexible filtering capabilities in accordance with the PR objective of implementing IsEmpty/IsNull indexing.


331-333: Good refactoring of the IsEmpty condition implementation.

The change properly encapsulates the condition within a generic PrimaryCondition::Condition containing a boxed FieldCondition::new_is_empty(). This approach aligns with the PR's objective of creating a more consistent structure for field conditions.


339-341: Good refactoring of the IsNull condition implementation.

Similar to the IsEmpty change, this properly encapsulates the condition using PrimaryCondition::Condition with a boxed FieldCondition::new_is_null(), which maintains consistency in how conditions are represented.

lib/segment/src/payload_storage/condition_checker.rs (3)

46-47: Cleanly integrated the new fields in the struct destructuring.

The new is_empty and is_null fields are properly added to the destructuring pattern, ensuring they're available for use in the function body.


71-78: Well-implemented matching logic for the is_empty condition.

The implementation correctly handles different value types:

  • For Value::Null, it returns the is_empty value directly
  • For scalar types (Bool, Number, String, Object), it returns the negation of is_empty
  • For arrays, it checks if the array's emptiness matches the requested is_empty value

This aligns with the documented behavior where "this condition will match all records where the field reports either does not exist, or has null or [] value."


79-86: Well-implemented matching logic for the is_null condition.

The implementation correctly handles different value types:

  • For Value::Null, it returns the is_null value directly
  • For scalar types (Bool, Number, String, Object), it returns the negation of is_null
  • For arrays, it checks if any element is null and compares that to the is_null value

This provides a logical extension to the null checking capabilities.

lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (4)

56-56: Complete coverage of NullIndex in match_value_checker.

The additions properly integrate the new NullIndex type into the match patterns, ensuring it's handled appropriately for all value variant types (Bool, Integer, String). This ensures that value matching will work correctly with null indexes.

Also applies to: 66-66, 74-75


133-133: Complete coverage of NullIndex in match_any_checker.

The additions properly integrate the new NullIndex type into the match patterns for both AnyVariants::Integers and AnyVariants::Strings, ensuring that "any" matching will work correctly with null indexes.

Also applies to: 141-142


196-196: Complete coverage of NullIndex in match_except_checker.

The additions properly integrate the new NullIndex type into the match patterns for both AnyVariants::Strings and AnyVariants::Integers, ensuring that "except" matching will work correctly with null indexes.

Also applies to: 205-206


235-236: Complete coverage of NullIndex in match_text_checker.

The additions properly integrate both UuidMapIndex and NullIndex types into the match patterns, ensuring that text matching will work correctly with these index types.

lib/segment/src/index/field_index/index_selector.rs (2)

23-23: Use of MmapNullIndex import is valid
This import is necessary for the newly introduced null index functionality.


358-360: null_dir extension is consistent
This function correctly follows the pattern used in other directory helpers such as map_dir and numeric_dir.

lib/segment/src/index/struct_payload_index.rs (3)

40-41: Expanded imports for payload schema
The inclusion of PayloadContainer and similar items at these lines aligns with new usage in the cardinality and condition code paths.


253-257: IsEmpty condition bridging to FieldCondition
This snippet transforms an IsEmptyCondition into a full FieldCondition for estimation. The pattern is consistent with how other conditions are handled.


260-264: IsNull condition bridging to FieldCondition
Similarly, IsNullCondition is converted into a FieldCondition and carries through the same evaluation flow. This fosters uniformity in condition handling.

lib/segment/src/index/query_optimization/condition_converter.rs (6)

9-9: Added MmapNullIndex import
Required for indexing null/empty values. Import is correct.


85-107: Custom logic for Condition::IsNull
This logic cleanly attempts a null-index-based check first, then gracefully defaults to payload inspection. The approach is consistent and reduces code duplication in repeated checks.


269-270: Added NullIndex to the match arms returning None
For geo or numeric checks, NullIndex is correctly bypassed to the fallback because it doesn’t handle these operations. This explicit coverage prevents silent mismatches.

Also applies to: 291-292, 312-313, 345-346, 369-370


373-389: get_is_empty_indexes provides a primary/fallback strategy
Splitting out null-index-based checks from any other available index is clear. The function nicely returns a two-part tuple to guide the next step.


391-413: get_fallback_is_empty_checker merges index checks with payload
Combining index results with a final payload reference is pragmatic to ensure consistency. This approach is particularly valuable if payload data can differ from partially built indexes.


437-453: get_is_null_checker parallels get_is_empty_checker
The logic is nearly identical yet adapted for null checks. Implementation is consistent. Similar expansions could apply if other indexes begin tracking null states differently.

lib/segment/src/index/field_index/null_index/mmap_null_index.rs (4)

41-52: Ensure robust directory creation and error handling.
When creating new directories, consider race conditions under concurrent usage. For instance, if another thread/process is creating the same directory at the same time, handle AlreadyExists gracefully rather than treating it as a terminal error.


95-130: Validate correct handling for multi-valued payloads.
Within add_point, if payload contains multiple arrays or nested data, ensure has_values is set accurately (e.g., for non-empty arrays). Otherwise, legitimate values might be ignored.


112-116: Revisit array emptiness logic and consider adding a dedicated test.
A previous comment suggested marking has_values as true for non-empty arrays. If the intention is to treat non-empty arrays as 'has values,' make sure to update has_values. Also, adding a targeted unit test would help validate various array payload cases.


307-316: Check concurrency safety for builder usage.
While the builder approach is straightforward, if multiple threads use the same builder, there may be data races on the internal slices. Document or enforce single-thread usage if concurrency is not intended.

lib/segment/src/types.rs (2)

2122-2125: Clarify semantics of is_empty vs. is_null.
Having both options can lead to overlapping semantics (e.g., an array could be empty but not null). Provide guidance or checks to avoid confusion, especially if user expects “empty” to include or exclude null.


2319-2332: Streamline usage of From<JsonPath> for null/empty checks.
These conversions are handy, but verify that they work as intended with nested JSON paths. For instance, “key.subkey” for null vs. empty might behave differently if key or subkey is missing.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
lib/segment/src/index/field_index/null_index/mmap_null_index.rs (2)

95-133: Add additional defensive checks for payload handling

The add_point implementation could benefit from some additional defensive programming to handle edge cases.

Consider adding a check for empty payload:

pub fn add_point(&mut self, id: PointOffsetType, payload: &[&Value]) -> OperationResult<()> {
    let mut is_null = false;
    let mut has_values = false;
+   
+   // Empty payload is equivalent to no values and no nulls
+   if payload.is_empty() {
+       self.has_values_slice.set_with_resize(id, false)?;
+       self.is_null_slice.set_with_resize(id, false)?;
+       return Ok(());
+   }

    for value in payload {
        // Rest of the implementation...
    }
    // ...
}

99-124: Early return optimization in add_point match logic

In the current implementation, we continue processing elements even after determining that a value is null.

Consider breaking out of the loop earlier when both is_null and has_values are true:

for value in payload {
    match value {
        // existing implementation...
    }
-   if is_null {
+   if is_null && has_values {
        break;
    }
}

This optimization avoids unnecessary processing when we've already determined the complete state of the entry.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0fc25b and f22e1a2.

📒 Files selected for processing (3)
  • lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1 hunks)
  • lib/segment/src/payload_storage/condition_checker.rs (4 hunks)
  • tests/openapi/test_basic_retrieve_api.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: test (macos-latest)
  • GitHub Check: test
  • GitHub Check: test (windows-latest)
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (10)
lib/segment/src/payload_storage/condition_checker.rs (4)

46-48: Well-structured implementation of these new fields

The addition of is_empty and is_null fields to the FieldCondition struct is clean and follows the existing pattern of optional condition fields.


71-78: Comprehensive implementation of the is_empty condition checking

The implementation correctly handles different types of JSON values according to your documentation. For arrays, it checks if any non-null elements exist, which aligns with the expected behavior where an array is considered empty if it contains only null values or is an empty array.

According to the past discussion, this matches your documentation: "This condition will match all records where the field reports either does not exist, or has null or [] value."


79-86: Logical implementation of the is_null condition checking

The implementation for is_null properly handles all JSON value types. For arrays, it checks if any element in the array is null, which aligns with the expected behavior where an array is considered to have null if at least one element is null.


323-419: Comprehensive test coverage for the new condition checks

The test suite thoroughly validates the behavior of the is_empty and is_null conditions against various JSON value types, including edge cases like empty arrays, arrays with null values, and arrays with a mix of null and non-null values.

tests/openapi/test_basic_retrieve_api.py (3)

183-205: Additional alternative is_empty syntax test improves coverage

This test complements the existing is_empty condition test by verifying that the newer syntax (using "key": "city", "is_empty": true) produces the same results as the older syntax (using "is_empty": {"key": "city"}). This ensures backward compatibility while introducing the more intuitive syntax.


236-259: Comprehensive test for is_null condition with consistent result verification

Good test implementation that verifies both syntaxes for the is_null condition produce identical results. The assertion of identical IDs between the two approaches ensures consistency in your API.


294-317: Well-structured test for is_null negation

The test validates that the must_not/is_null combination is equivalent to using is_null: false directly. This is important as it demonstrates both approaches produce consistent results and helps users understand the relationship between these query patterns.

lib/segment/src/index/field_index/null_index/mmap_null_index.rs (3)

18-28: Well-designed index structure for empty/null conditions

The MmapNullIndex structure is well-designed with a clear separation of concerns between tracking values and nulls. The documentation clearly explains the purpose and expected usage of this specialized index.


207-250: Clear and efficient filter implementation

The filter implementation correctly handles the different scenarios for is_empty and is_null conditions. The implementation is efficient by leveraging the precomputed flags and the comments clearly explain the reasoning behind each branch.


334-428: Comprehensive test coverage for null index

The test for the null index is thorough and covers all important cases:

  1. Testing various combinations of values (null, array with null, empty, non-null)
  2. Validating both positive and negative conditions
  3. Verifying cardinality estimation

The test structure makes it easy to understand what's being tested and the expected outcomes.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/segment/src/payload_storage/condition_checker.rs (1)

40-49: Clarify logic in check_is_empty.

Using array.iter().any(|x| !x.is_null()) != is_empty is correct but slightly opaque. A short inline explanation or a more direct condition like .all(|x| x.is_null()) might improve clarity.

- Value::Array(array) => array.iter().any(|x| !x.is_null()) != is_empty,
+ Value::Array(array) => {
+     let has_non_null = array.iter().any(|x| !x.is_null());
+     has_non_null != is_empty
+ }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f22e1a2 and 82fb933.

📒 Files selected for processing (4)
  • lib/segment/src/common/utils.rs (1 hunks)
  • lib/segment/src/payload_storage/condition_checker.rs (5 hunks)
  • lib/segment/src/payload_storage/query_checker.rs (1 hunks)
  • lib/segment/src/types.rs (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/segment/src/types.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (8)
lib/segment/src/payload_storage/query_checker.rs (1)

193-195: Verify the early return behavior for empty fields.

When field_values is empty, the code returns immediately without applying potential field-index logic. If this behavior is intentional, no changes are required; otherwise, consider integrating a fallback check for indexes to ensure consistent results.

lib/segment/src/common/utils.rs (1)

21-21: Definition of empty arrays now includes all-null arrays.

Returning true for arrays whose elements are all null aligns with the documented requirement that null values are treated as non-data. This is consistent with the new semantics of “empty” in the codebase.

lib/segment/src/payload_storage/condition_checker.rs (6)

33-38: Default check_empty implementation may mask specialized behaviors.

By always returning false, custom types that implement ValueChecker won't detect emptiness unless overridden. If this is the intended fallback, it’s correct. Otherwise, you might consider documenting that implementers must override check_empty where appropriate.


51-59: Partial null checks in arrays.

array.iter().any(|x| x.is_null()) == is_null means an array with at least one null satisfies is_null = true. This matches the stated requirement that “null is not a value” and is consistent with your documentation.


73-75: Adding is_empty and is_null checks to FieldCondition.

Including these fields in the destructured logic and OR condition ensures that empty/null checks seamlessly integrate with existing match, range, and geo checks. The approach is sound.

Also applies to: 98-99


103-127: Order of condition checks can skip multi-condition scenarios.

When both values_count and is_empty/is_null are present, the function handles values_count first. If you intend to prioritize values_count over empty/null checks, this ordering is correct. Otherwise, consider making the logic more explicit about combining conditions.


129-148: check_empty favoring is_empty over is_null.

The code returns is_empty first if present; if not, it returns the inverse of is_null. This is logical but means specifying both fields simultaneously might lead to unexpected outcomes for users unless well-documented.


377-473: Extensive test coverage for null and empty conditions.

Good job adding thorough tests, including edge cases like [null] and partial-null arrays, ensuring robust verification of the new logic.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1)

95-127: 🛠️ Refactor suggestion

Improve array handling logic

The current implementation for handling arrays has a potential issue. According to best practices, an empty array should be considered as having no values.

Modify the array handling to properly distinguish between empty arrays and arrays with values:

                Value::Array(array) => {
-                    if array.iter().any(|v| v.is_null()) {
+                    if !array.is_empty() {
+                        if array.iter().any(|v| v.is_null()) {
+                            is_null = true;
+                        }
+                        if array.iter().any(|v| !v.is_null()) {
+                            has_values = true;
+                        }
+                    }
-                        is_null = true;
-                    }
-                    if array.iter().any(|v| !v.is_null()) {
-                        has_values = true;
-                    }
                }

This implementation considers empty arrays as neither having values nor nulls, which aligns with the expected behavior.

🧹 Nitpick comments (4)
lib/segment/src/index/field_index/null_index/mmap_null_index.rs (4)

141-143: Consider renaming method for clarity

The method name values_count is potentially misleading since it returns either 0 or 1 based on the presence of values, not an actual count of values.

-    pub fn values_count(&self, id: PointOffsetType) -> usize {
-        usize::from(self.has_values_slice.get(id))
-    }
+    pub fn has_values(&self, id: PointOffsetType) -> bool {
+        self.has_values_slice.get(id)
+    }

This makes the API more intuitive and consistent with the other boolean methods like values_is_empty and values_is_null.


145-151: Consistent method naming pattern

The method names values_is_empty and values_is_null follow a consistent pattern, which is good. However, consider standardizing the naming to improve readability.

-    pub fn values_is_empty(&self, id: PointOffsetType) -> bool {
+    pub fn is_empty(&self, id: PointOffsetType) -> bool {
         !self.has_values_slice.get(id)
     }
 
-    pub fn values_is_null(&self, id: PointOffsetType) -> bool {
+    pub fn has_null(&self, id: PointOffsetType) -> bool {
         self.is_null_slice.get(id)
     }

This makes the API more concise and aligns better with the filter condition naming.


163-176: Consider adding validation for cleanup

The cleanup method directly calls remove_dir_all without checking if the directory exists first. While this likely works fine in practice, it's a good practice to validate existence before removal.

    fn cleanup(self) -> OperationResult<()> {
-        std::fs::remove_dir_all(&self.base_dir)?;
+        if self.base_dir.exists() {
+            std::fs::remove_dir_all(&self.base_dir)?;
+        }
        Ok(())
    }

This prevents potential errors if the directory was already removed by another process.


185-192: Optimize the flusher function

The flusher function creates closures for both slices and then calls them one after another. For better error handling, consider the following approach:

    fn flusher(&self) -> Flusher {
        let Self {
            base_dir: _,
            has_values_slice,
            is_null_slice,
        } = self;

-        let is_empty_flusher = has_values_slice.flusher();
-        let is_null_flusher = is_null_slice.flusher();
+        let has_values_slice_cloned = has_values_slice.clone();
+        let is_null_slice_cloned = is_null_slice.clone();

        Box::new(move || {
-            is_empty_flusher()?;
-            is_null_flusher()?;
+            has_values_slice_cloned.flusher()()?;
+            is_null_slice_cloned.flusher()()?;
            Ok(())
        })
    }

This approach would only work if DynamicMmapFlags implements Clone. If it doesn't, the current implementation is fine.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82fb933 and 87e1c30.

📒 Files selected for processing (1)
  • lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consensus
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consistency
  • GitHub Check: test (windows-latest)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
🔇 Additional comments (4)
lib/segment/src/index/field_index/null_index/mmap_null_index.rs (4)

18-28: Well-designed specialized index structure

This implementation of a specialized index for null and empty values is a good approach to optimize query performance for these specific conditions. The design choice to keep separate flags for tracking values and nulls enables efficient filtering.


223-250: Well-implemented filtering logic

The filter implementation correctly handles the different filtering conditions based on is_empty and is_null. The approach to iterate only over known values when possible and return None when external iteration is required is a good optimization.


304-325: Builder pattern implementation

The builder pattern is properly implemented with clear initialization and finalization steps. The delegation to the underlying MmapNullIndex for add_point ensures consistency.


335-428:

❓ Verification inconclusive

Comprehensive test coverage

The test case thoroughly checks various scenarios including null values, null values in arrays, empty payloads, and non-null values. It validates the behavior of both filtering and cardinality estimation.

Consider adding a test for edge cases like large point IDs near the capacity of DynamicMmapFlags to ensure proper resizing behavior.

How does DynamicMmapFlags handle resizing for large point IDs? Is there a risk of performance degradation or memory issues?


🏁 Script executed:

#!/bin/bash
# Check how DynamicMmapFlags handles resizing
rg -A 10 "set_with_resize" lib/segment/src/vector_storage/dense/dynamic_mmap_flags.rs

Length of output: 511


Edge Case Test Recommendation

The current test case robustly checks filtering and cardinality logic for various payloads like null values, arrays with nulls, empty payloads, and non-null values. However, we should consider adding tests for edge cases that use very large point IDs.

Our verification shows that in lib/segment/src/vector_storage/dense/dynamic_mmap_flags.rs, the set_with_resize function dynamically resizes the index by setting the new length to key + 1 when a true value is provided, while it skips resizing for false values (given the default is false). Although this resizing strategy works as intended, it could lead to performance or memory issues if many high point IDs trigger frequent resizes.

Please add tests to:

  • Verify the resizing behavior for large point IDs.
  • Assess potential performance or memory overhead when resizing is triggered at scale.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
lib/segment/src/index/field_index/null_index/mmap_null_index.rs (4)

95-127: Consider reordering the Value pattern matching for better performance.

The current pattern matching implementation is correct, but you might get slight performance improvements by reordering the match arms to handle the most common cases first. Typically, strings, numbers and booleans are more common than arrays, objects and null values.

for value in payload {
    match value {
+       Value::String(_) => {
+           has_values = true;
+       }
+       Value::Number(_) => {
+           has_values = true;
+       }
+       Value::Bool(_) => {
+           has_values = true;
+       }
        Value::Null => {
            is_null = true;
        }
-       Value::Bool(_) => {
-           has_values = true;
-       }
-       Value::Number(_) => {
-           has_values = true;
-       }
-       Value::String(_) => {
-           has_values = true;
-       }
        Value::Array(array) => {
            if array.iter().any(|v| v.is_null()) {
                is_null = true;
            }
            if !array.is_empty() {
                has_values = true;
            }
        }
        Value::Object(_) => {
            has_values = true;
        }
    }
    if is_null && has_values {
        break;
    }
}

207-250: Add documentation explaining when None is returned from the filter method.

The method could benefit from more detailed documentation explaining when and why it returns None versus an iterator. Currently, a reader needs to carefully analyze the code to understand that None is returned when external filtering is needed.

    fn filter<'a>(
        &'a self,
        condition: &'a FieldCondition,
+       // Returns:
+       // - Some(iter) when the condition can be efficiently filtered using this index
+       // - None when the condition requires external filtering (e.g., for negative conditions
+       //   like !is_empty or !is_null where we don't have complete information)
    ) -> Option<Box<dyn Iterator<Item = PointOffsetType> + 'a>> {

252-292: Consider refactoring shared logic between filter and estimate_cardinality.

The logic in estimate_cardinality closely mirrors that in filter. This duplication could potentially lead to maintenance issues if one is updated but not the other. Consider extracting the shared condition evaluation logic into a private helper method.

For example:

// Private helper method to determine if the condition can be handled efficiently
fn can_handle_condition(&self, condition: &FieldCondition) -> Option<(bool, bool)> {
    if let Some(is_empty) = condition.is_empty {
        if is_empty {
            None // External filtering needed
        } else {
            Some((true, false)) // has_values, is_non_empty_condition
        }
    } else if let Some(is_null) = condition.is_null {
        if is_null {
            Some((false, true)) // is_null_slice, is_null_condition
        } else {
            None // External filtering needed
        }
    } else {
        None
    }
}

// Then use in both methods:
// if let Some((use_has_values, is_positive)) = self.can_handle_condition(condition) {
//     let slice = if use_has_values { &self.has_values_slice } else { &self.is_null_slice };
//     ...
// }

168-171: Consider documenting why the load method is a no-op.

The load method is implemented as a no-op but without explanation. It would be helpful to document why no loading is needed for this type of index, especially since it's implementing a trait where this method might be expected to do something.

    fn load(&mut self) -> OperationResult<bool> {
-       // Nothing needed
+       // No loading needed as the memory-mapped flags are loaded on construction
        Ok(true)
    }
lib/segment/src/payload_storage/condition_checker.rs (1)

102-127: Prioritization logic may need comments

The check method establishes a clear priority between checks: values_count first, then is_empty, then is_null, and finally the default check. This prioritization logic might be non-obvious to future maintainers.

Consider adding a comment explaining the rationale behind this prioritization:

  fn check(&self, payload: &Value) -> bool {
      let FieldCondition {
          r#match: _,
          range: _,
          geo_radius: _,
          geo_bounding_box: _,
          geo_polygon: _,
          values_count,
          key: _,
          is_empty,
          is_null,
      } = self;

+     // Priority order: values_count > is_empty > is_null > other conditions
      if values_count.is_some() {
          self.values_count
              .as_ref()
              .unwrap()
              .check_count_from(payload)
      } else if is_empty.is_some() {
          check_is_empty(is_empty.unwrap(), payload)
      } else if is_null.is_some() {
          check_is_null(is_null.unwrap(), payload)
      } else {
          self._check(payload)
      }
  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87e1c30 and 952d23f.

📒 Files selected for processing (2)
  • lib/segment/src/index/field_index/null_index/mmap_null_index.rs (1 hunks)
  • lib/segment/src/payload_storage/condition_checker.rs (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test (macos-latest)
  • GitHub Check: test
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test-consensus
🔇 Additional comments (11)
lib/segment/src/index/field_index/null_index/mmap_null_index.rs (3)

18-28: Well-designed index struct for optimizing null and empty value queries.

The MmapNullIndex struct is well-designed for its specialized purpose of optimizing IsNull and IsEmpty conditions. The two separate flags for tracking whether values exist (has_values_slice) and whether null values exist (is_null_slice) allow for efficient querying of both conditions independently.


112-119: The array handling logic correctly implements the design decision.

The logic for handling arrays correctly implements the design decision that null values don't count as values (based on previous review comments). The code correctly checks for nulls and non-empty arrays separately.


334-428: Comprehensive test with good coverage of different cases.

The test thoroughly verifies the behavior of the null index with various combinations of values (null, null in array, empty, non-null). It validates both the filtering functionality and cardinality estimation.

lib/segment/src/payload_storage/condition_checker.rs (8)

33-37: Strong addition of empty value handling to the ValueChecker trait

The addition of the check_empty method to the ValueChecker trait provides a clear way to determine if a condition should match when a field is empty or missing from a payload. The default implementation returning false is appropriate for most condition types.


51-59: Well-structured null checking logic

The implementation of check_is_null is clear and properly handles different value types. The approach for arrays (checking if any array element is null) is particularly useful for JSON data processing.


63-99: Effective integration of is_empty and is_null conditions into field checking

The updated check_match method properly integrates the new conditions into the existing logical structure, maintaining the clarity of the code while expanding its functionality.


129-148: Clarify relationship between empty and null

In the check_empty method, when is_empty is not specified but is_null is, the method returns !*is_null. This implies that "empty" is the opposite of "null", which might not be immediately intuitive.

Is this relationship (empty being the opposite of null) aligned with the intended semantics of these conditions? This is especially important since the documentation seems to treat null and empty arrays as similar concepts.


296-298: Elegant implementation of check_empty for ValuesCount

Using check_count(0) to determine emptiness for ValuesCount is a clean and elegant approach that leverages existing functionality.


377-473: Comprehensive test coverage for new functionality

The test suite thoroughly covers all combinations of value types and conditions, providing excellent verification of the new functionality. The tests clearly demonstrate the expected behavior for each combination of payload value and condition.

The test cases confirm that:

  1. Empty arrays ([]) match when is_empty=true
  2. Arrays with null values ([null]) match when is_null=true, not when is_empty=true
  3. Arrays with both values and nulls match is_null=true if they contain at least one null
  4. Non-array values behave consistently with their expected emptiness/nullness

456-464:

❓ Verification inconclusive

Verify array null behavior matches requirements

The test shows that an array containing both a value and null ([true, null]) matches is_null: true. This means any array containing at least one null element is treated as "null" for filtering purposes.

Is this behavior (matching arrays with any null as opposed to only nulls) aligned with user expectations? If a user filters with is_null: true, they might expect to get only fields that are exclusively null, not fields that contain some non-null values.


🏁 Script executed:

#!/bin/bash
# Check for any documentation or comments about how is_null should behave with arrays
rg -A 5 -B 5 "is_null.*?array" --type rust
rg -A 5 -B 5 "array.*?is_null" --type rust

Length of output: 3398


Action Required: Confirm the Intended “is_null” Behavior for Arrays

The current implementation determines an array as null if it contains any null element (via array.iter().any(|x| x.is_null()) == is_null). The tests in lib/segment/src/payload_storage/condition_checker.rs and similar logic in the null index module confirm this behavior. However, note that a user filtering with is_null: true might expect to retrieve only arrays made up exclusively of null values. Please verify that this “any null” matching is indeed the intended design for filtering.

  • File Checked: lib/segment/src/payload_storage/condition_checker.rs
  • Key Logic: Value::Array(array) => array.iter().any(|x| x.is_null()) == is_null
  • Implication: Arrays containing both null and non-null values (e.g. [true, null]) are considered null.

40-49:

❓ Verification inconclusive

Verify if array with null values is considered empty

The check_is_empty function treats array emptiness based on length, but there's a distinction in the implementation compared to the documentation reference: an array with null values ([null]) is not considered empty here.

According to the past review comment referencing your documentation:

This condition will match all records where the field reports either does not exist, or has null or [] value.

Is this intentional that an array containing only null ([null]) is not considered empty, while an empty array ([]) is?


🏁 Script executed:

#!/bin/bash
# Looking for other usages of is_empty to verify consistency
rg -A 3 -B 3 "is_empty.*?true" --type rust

Length of output: 4521


Review: Clarify Handling of Arrays with Null Values

The current implementation in lib/segment/src/payload_storage/condition_checker.rs at lines 40–49 treats arrays by checking their length (array.is_empty() == is_empty). This means an array like [null] is not considered empty, even though the associated documentation states that a field with a null or empty array ([]) should be treated as empty. We should verify whether this behavior is intentional. If the intention is to treat an array containing only null as empty, the function might need to be updated to inspect array elements (or adjust the documentation if the current behavior is desired).

  • File: lib/segment/src/payload_storage/condition_checker.rs (Lines 40–49)
  • Current behavior: For Value::Array(array), emptiness is determined solely by the length of the array.
  • Documentation expectation: Fields with either a null value or an empty array should be considered empty, implying that [null] might also qualify.

Please confirm if this discrepancy is by design or if additional checks (e.g., treating arrays containing only null as empty) should be implemented.

Copy link
Member

@agourlay agourlay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding the tests 👍

@generall generall merged commit ea6dc0a into dev Mar 4, 2025
17 checks passed
@generall generall deleted the null-index branch March 4, 2025 12:19
timvisee added a commit that referenced this pull request Mar 21, 2025
* create initial strucutres

* clippy

* start field-query refactoring

* start field-query refactoring (2/N)

* start field-query refactoring (3/N): duplicate is_empty/null condiftions as field condition

* start field-query refactoring (4/N): re-instate is_empty fallback in case new index is not built yet

* filter for is_empty/is_null

* implement add/remove point

* upd schema

* open and create of null-index

* create null-index

* fix test

* Update lib/segment/src/index/query_optimization/condition_converter.rs

Co-authored-by: Tim Visée <tim+github@visee.me>

* unit test for null-index

* more unit tests

* add openapi tests

* fmt

* fix for integartion tests

* rabbit review fix

* make [null] non-empty

---------

Co-authored-by: Tim Visée <tim+github@visee.me>
@timvisee timvisee mentioned this pull request Mar 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants