Skip to content

Conversation

xzfc
Copy link
Contributor

@xzfc xzfc commented Apr 4, 2025

This PR implements incremental HNSW building for the append-only case.

A new feature flag is added, incremental_hnsw_building, disabled by default.

When enabled, it will try to reuse old graph as a starting point for building the new graph. When reusing, point levels and old graph links are copied as-is (but with reassigned IDs) before the regular HNSW point insertion process starts.

What is not supported as of now:

  • GPU building. Only point levels are copied.
  • Point/vector removal/overwriting. If a point or vector is removed, the old graph is discarded. Same if vector value is changed.

When old index is reused, the following log messages are printed:

DEBUG segment::index::hnsw_index::hnsw: building HNSW for 518500 vectors with 8 CPUs
DEBUG segment::index::hnsw_index::hnsw: Reusing 457700 points from the old index

Benchmarked using bfb --indexing-threshold 400 --dim 32 --num-vectors 2000000 --throttle 500.
Upload time: 40 seconds.
Upload+index (incremental): 40+17 = 57 seconds.
Upload+index (non-incremental): 40+84 = 124 seconds.

@xzfc xzfc requested review from timvisee and generall April 4, 2025 11:40
Copy link
Contributor

coderabbitai bot commented Apr 4, 2025

📝 Walkthrough

Walkthrough

This pull request introduces updates that integrate a new feature flag for incremental HNSW building and propagates its usage across several modules. A boolean field, incremental_hnsw_building, is added to the common feature flags struct with corresponding updates in initialization and emptiness checking methods. The flag is consistently passed as a parameter in HNSW index construction functions across benchmarking, indexing, segment building, and various integration test modules. New methods are added to manage point linking and to support incremental indexing by reusing old indices through newly defined helper structs and evaluation methods. Additionally, a new workspace dependency is included in Cargo.toml, and adjustments are made to the visibility of certain components. A new integration test file focusing on incremental HNSW index building is also incorporated to validate the enhancements.

Suggested reviewers

  • timvisee
  • agourlay
  • generall

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b69303 and 1a7be5a.

📒 Files selected for processing (1)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: lint
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (2)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)

406-407: Enhance robustness with validation for point readiness.

The addition of debug_assert!(!was_ready) helps detect logic errors where a point might be marked as ready twice, preventing potential index corruption.


415-432: LGTM: Well-implemented method for incremental HNSW building.

This new method add_new_point is a key component for enabling incremental HNSW indexing - allowing reuse of existing links when rebuilding the index. The implementation correctly:

  1. Validates that the provided levels match the point's level structure
  2. Efficiently replaces links at each level with the provided neighbors
  3. Marks the point as ready and adds it to entry points

The mutual exclusivity with link_new_point is properly documented, making the usage pattern clear.

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/segment/tests/integration/exact_search_test.rs (1)

1-219: Consider adding test cases for incremental HNSW building

While the feature flag integration looks good, it would be valuable to add specific test cases that enable the incremental_hnsw_building flag to verify the functionality works as expected. This would help ensure the new incremental building path is properly tested.

lib/segment/src/index/hnsw_index/hnsw.rs (2)

244-256: Check for old index reusability conditionally on feature_flags.
This block properly:

  • Ensures incremental_hnsw_building is enabled before attempting reuse,
  • Evaluates prospective old indexes,
  • Selects the one yielding the maximum valid_points.

Consider logging a warning or returning earlier if multiple old indexes are passed in, clarifying that only one is reused.


373-384: Selective reuse of old links.
Reusing old links if found, or falling back to fresh insertion matches the PR’s append-only approach. Consider adding a debug log entry for each link reuse for better visibility in production.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 289f9b5 and 031be83.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
  • lib/common/common/src/flags.rs (3 hunks)
  • lib/segment/Cargo.toml (1 hunks)
  • lib/segment/benches/multi_vector_search.rs (2 hunks)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/graph_links.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/hnsw.rs (7 hunks)
  • lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs (2 hunks)
  • lib/segment/src/segment_constructor/segment_builder.rs (1 hunks)
  • lib/segment/src/segment_constructor/segment_constructor_base.rs (2 hunks)
  • lib/segment/tests/integration/batch_search_test.rs (2 hunks)
  • lib/segment/tests/integration/byte_storage_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/byte_storage_quantization_test.rs (2 hunks)
  • lib/segment/tests/integration/exact_search_test.rs (2 hunks)
  • lib/segment/tests/integration/filtrable_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/hnsw_discover_test.rs (3 hunks)
  • lib/segment/tests/integration/hnsw_incremental_build.rs (1 hunks)
  • lib/segment/tests/integration/hnsw_quantized_search_test.rs (4 hunks)
  • lib/segment/tests/integration/main.rs (1 hunks)
  • lib/segment/tests/integration/multivector_filtrable_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/multivector_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/multivector_quantization_test.rs (2 hunks)
🧰 Additional context used
🧬 Code Definitions (7)
lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs (2)
src/settings.rs (2)
  • default (96-102)
  • default (124-132)
lib/common/common/src/budget.rs (1)
  • default (239-243)
lib/segment/tests/integration/byte_storage_quantization_test.rs (2)
lib/segment/src/types.rs (5)
  • default (1109-1118)
  • default (1122-1124)
  • default (1529-1531)
  • default (2737-2739)
  • default (2770-2772)
lib/common/common/src/budget.rs (1)
  • default (239-243)
lib/segment/src/segment_constructor/segment_builder.rs (1)
lib/common/common/src/flags.rs (1)
  • feature_flags (89-97)
lib/segment/src/segment_constructor/segment_constructor_base.rs (1)
lib/common/common/src/flags.rs (1)
  • feature_flags (89-97)
lib/segment/tests/integration/main.rs (1)
lib/segment/tests/integration/hnsw_incremental_build.rs (1)
  • hnsw_incremental_build (39-95)
lib/segment/tests/integration/hnsw_incremental_build.rs (2)
lib/segment/tests/integration/hnsw_quantized_search_test.rs (3)
  • hnsw_quantized_search_test (44-181)
  • check_matches (183-222)
  • query_vectors (191-200)
lib/segment/src/index/hnsw_index/hnsw.rs (2)
  • new (99-109)
  • build (184-545)
lib/segment/src/index/hnsw_index/hnsw.rs (2)
lib/segment/src/index/hnsw_index/graph_links.rs (2)
  • links (126-128)
  • point_level (130-132)
lib/segment/src/index/hnsw_index/graph_layers.rs (1)
  • point_level (200-202)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consensus
  • GitHub Check: test-consistency
  • GitHub Check: test (macos-latest)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test
  • GitHub Check: test (windows-latest)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test-consensus-compose
  • GitHub Check: test (ubuntu-latest)
🔇 Additional comments (59)
lib/segment/tests/integration/main.rs (1)

13-13: LGTM! The new test module for incremental HNSW building aligns with PR objectives.

The addition of the hnsw_incremental_build module is a good approach to testing the new incremental HNSW building feature described in the PR objectives.

lib/segment/Cargo.toml (1)

98-98: LGTM! Adding the tap crate as a dependency.

The tap crate is a utility library that allows for performing side effects on values within a method chain, which can be useful for incremental building operations.

lib/segment/tests/integration/byte_storage_hnsw_test.rs (2)

6-6: LGTM! Necessary import for new feature flags.

This import is required to use the FeatureFlags structure which will control the incremental HNSW building functionality.


241-241: LGTM! Adding feature flags parameter to HNSW build arguments.

This change passes the feature flags to the HNSW index builder, which enables controlling the incremental building behavior via configuration rather than hard-coding it.

lib/segment/benches/multi_vector_search.rs (2)

7-7: LGTM! Importing FeatureFlags for benchmark configuration.

The import is necessary to pass feature flags to the HNSW builder.


97-97: LGTM! Adding feature flags to benchmark HNSW index construction.

This change ensures that benchmarks use the same builder interface as production code, maintaining consistency across the codebase with the new feature flag parameter.

lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs (2)

6-6: LGTM: Added the FeatureFlags import.

This import is necessary to support the new feature_flags parameter for HNSW building.


83-83: LGTM: Added feature_flags parameter.

This change properly passes the feature flags configuration to the HNSW index builder. Using the default value ensures backwards compatibility in tests while allowing the new incremental HNSW building feature to be toggled.

lib/common/common/src/flags.rs (3)

41-43: LGTM: Added the incremental_hnsw_building flag.

This new feature flag is appropriately documented and follows the established pattern with the #[serde(default)] attribute to ensure it defaults to false when deserializing.


54-54: LGTM: Updated is_empty method for the new flag.

The is_empty method is correctly updated to check the new incremental_hnsw_building flag.

Also applies to: 59-59


71-71: LGTM: Updated init_feature_flags to handle the new flag.

The init_feature_flags function correctly sets the new incremental_hnsw_building flag to true when the all flag is activated.

Also applies to: 79-79

lib/segment/tests/integration/multivector_hnsw_test.rs (2)

7-7: LGTM: Added the FeatureFlags import.

This import enables the use of feature flags in the multivector HNSW tests.


152-152: LGTM: Added feature_flags parameter.

This change properly passes the feature flags configuration to the HNSW index builder in the multivector test. Using the default value ensures the test behaves consistently with the current implementation while allowing the incremental HNSW building feature to be toggled.

lib/segment/tests/integration/multivector_filtrable_hnsw_test.rs (2)

6-6: LGTM: Added the FeatureFlags import.

This import enables the use of feature flags in the multivector filterable HNSW tests.


209-209: LGTM: Added feature_flags parameter.

This change properly passes the feature flags configuration to the HNSW index builder in the multivector filterable test. Using the default value ensures backward compatibility while allowing the new incremental HNSW building feature to be toggled when needed.

lib/segment/tests/integration/multivector_quantization_test.rs (2)

8-8: Import of FeatureFlags for incremental HNSW building

Adding the FeatureFlags import to support the new incremental HNSW building feature flag.


343-343: Added feature flags parameter to HNSW index build

The feature_flags parameter is properly added with the default value, which means incremental HNSW building will be disabled by default in tests. This matches the PR's design where incremental building is opt-in.

lib/segment/tests/integration/hnsw_discover_test.rs (3)

6-6: Import of FeatureFlags for incremental HNSW building

Adding the FeatureFlags import to support the new incremental HNSW building feature flag.


118-118: Added feature flags parameter in hnsw_discover_precision test

The feature_flags parameter is properly added with the default value, ensuring consistent behavior with the rest of the test suite.


242-242: Added feature flags parameter in filtered_hnsw_discover_precision test

The feature_flags parameter is properly added with the default value to the second test function, maintaining consistency throughout the file.

lib/segment/tests/integration/filtrable_hnsw_test.rs (2)

7-7: Import of FeatureFlags for incremental HNSW building

Adding the FeatureFlags import to support the new incremental HNSW building feature flag.


204-204: Added feature flags parameter to HNSW index build

The feature_flags parameter is correctly added with the default value. This ensures consistency across test files and properly tests the non-incremental building path by default.

lib/segment/tests/integration/exact_search_test.rs (2)

7-7: Import of FeatureFlags for incremental HNSW building

Adding the FeatureFlags import to support the new incremental HNSW building feature flag.


145-145: Added feature flags parameter to HNSW index build

The feature_flags parameter is properly added with the default value, which matches the implementation in other test files and ensures consistent testing behavior.

lib/segment/tests/integration/batch_search_test.rs (2)

6-6: Good addition of the FeatureFlags import.

This import is necessary to support the new incremental HNSW building feature flag parameter that's being passed to the HNSWIndex::build function.


166-166: Good addition of the feature_flags parameter.

The feature_flags parameter is properly passed to HNSWIndex::build with default values, which aligns with the PR's implementation of incremental HNSW building (disabled by default).

lib/segment/src/segment_constructor/segment_builder.rs (1)

583-583: Good addition of the feature_flags parameter.

This change properly passes the feature flags to the build_vector_index function, which is necessary for the incremental HNSW building feature. Using the feature_flags() function ensures that the globally configured flags are used.

lib/segment/tests/integration/byte_storage_quantization_test.rs (2)

7-7: Good addition of the FeatureFlags import.

This import is consistent with the other test files and enables the use of feature flags in the test.


372-372: Good addition of the feature_flags parameter.

The feature_flags parameter is correctly passed to HNSWIndex::build with default values, which is consistent with the other tests and ensures the incremental HNSW building feature is properly tested (disabled by default).

lib/segment/src/segment_constructor/segment_constructor_base.rs (2)

10-10: Good update to the import statement.

The import statement is correctly updated to include both FeatureFlags and feature_flags, which is necessary for defining the new field in the VectorIndexBuildArgs struct.


388-388: Good addition of the feature_flags field.

Adding the feature_flags field to the VectorIndexBuildArgs struct is a well-structured approach that enables passing the incremental HNSW building flag to the index builder. This aligns with the PR objective of implementing incremental HNSW building for the append-only case.

lib/segment/src/index/hnsw_index/graph_links.rs (1)

17-18: Clean modularization of exports

The change improves the module's organization by clearly separating public exports from internal usage. LinksIterator is now publicly re-exported, while CompressionInfo and GraphLinksView are kept as internal imports.

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (3)

406-407: Improved state validation with assertion

The modification to assert that points aren't double-marked as ready is excellent defensive programming. This validates that the index state is consistent, which is crucial for the incremental building feature where points from old indices must be tracked carefully.


415-432: Key functionality for incremental HNSW building

This new method is central to the incremental HNSW building feature. It allows adding a point with pre-existing links directly, rather than computing new links. The implementation correctly:

  1. Verifies the level structure matches expectations
  2. Preserves links from the old graph
  3. Maintains state consistency with the ready list
  4. Sets up entry points correctly

This is essential for reusing graph structures from previous indices.


434-440: Useful optimization for completing indexing

This helper method allows marking a point as ready and adding it to entry points without requiring a scorer. This is an optimization that avoids unnecessary computation when we already know a point's connections from a previous index.

lib/segment/tests/integration/hnsw_incremental_build.rs (5)

1-29: Comprehensive imports for integration test

The imports cover all the necessary components for testing incremental HNSW building. Good practice to import specific types rather than using wildcard imports.


30-37: Well-defined test constants

The constants clearly define the test parameters, making it easy to adjust them for different test scenarios.


38-95: Thorough integration test for incremental building

This test effectively validates the incremental HNSW building functionality by:

  1. Creating sequential segments with increasing data points
  2. Building indices with incremental reuse
  3. Verifying search accuracy at each step

The use of last_index to track and reuse previous indices simulates a real-world incremental update scenario.


97-117: Well-structured segment creation helper

Good separation of concerns with this helper function that handles segment creation. The shuffling of IDs helps ensure the test is robust against order dependencies.


119-160: Key test function enabling incremental building

This function is crucial for testing, as it:

  1. Creates an HNSW index with appropriate configuration
  2. Explicitly enables the incremental_hnsw_building feature flag
  3. Passes previous indices for potential reuse

This validates the core functionality described in the PR objectives.

lib/segment/tests/integration/hnsw_quantized_search_test.rs (4)

9-9: Required feature flag import

Adding the import for FeatureFlags is necessary for the updated function signature.


36-36: Exposed utility function for test reuse

Making sames_count public allows it to be reused in other test files, specifically in the new hnsw_incremental_build.rs test.


139-139: Added feature flags parameter

Adding the default feature flags parameter maintains compatibility with the updated VectorIndexBuildArgs signature. This ensures the test works with the updated API without changing behavior.


183-183: Exposed test function for reuse

Making check_matches public allows it to be reused in other test files, specifically in the new hnsw_incremental_build.rs test.

lib/segment/src/index/hnsw_index/hnsw.rs (15)

8-8: No issues with the atomic reference usage.
Importing AtomicRef and AtomicRefCell is appropriate for shared mutable data in this context.


16-16: Good use of EitherOrBoth for merged iteration.
This import simplifies handling merged lists of points from both new and old indexes.


46-46: Essential imports for indexing logic.
Needed for referencing PayloadIndex, VectorIndex, and VectorIndexEnum in new code paths.


209-209: New old_indices parameter.
Introducing the old_indices parameter is consistent with enabling incremental HNSW index reuse.


212-212: New feature_flags parameter.
Including feature_flags in the build function is a clean way to check if incremental building is enabled.


318-318: Reusing the old index if available.
The concise use of .map(...) to create an OldIndex instance neatly allows fallback to fresh construction when None.


324-324: No functional update.
This blank line adds slight readability; no further remark needed.


325-329: Graceful fallback to randomly assigned level.
Using old_index's point_level if present, otherwise calling get_random_layer, elegantly balances reuse with new generation.


386-389: Splitting off a subset of points for single-thread building.
Retaining the single-thread strategy for a final batch is consistent with the existing HNSW build approach and can help avoid graph fragmentation.


424-424: No functional change.
Blank line presumably for readability.


425-425: Closure-based insertion logic remains clear.
Defining the closure here is straightforward and keeps the insertion logic modular.


1361-1371: OldIndexCandidate struct definition.
Storing the borrowed HNSWIndex, mapping array, and valid_points provides a clear anchor for deciding index reusability.


1373-1383: OldIndex struct definition.
Maintaining both forward and reverse mappings (old→new and new→old) is crucial for link transformations in incremental building.


1490-1511: Implementation of OldIndex.
These methods provide:

  • point_level: Maps a new ID back to the old ID and retrieves its level.
  • get_links: Retrieves old links, truncates them per m or m0, and remaps them to new IDs.

Both methods cleanly handle partial usage of the old graph while respecting updated capacity constraints.


1385-1488:

❓ Verification inconclusive

Implementation of OldIndexCandidate.

  1. The evaluate method:
    • Correctly checks for config mismatches and bailouts if old vectors mismatch or are absent in the new set.
    • Appropriately defers any changed or removed vector scenario by returning None.
  2. The reuse method:
    • Constructs reverse mappings and logs the reuse count.
    • Seamlessly transitions OldIndexCandidate into a fully realized OldIndex.

Potential dimension mismatch check
Currently, dimension differences do not appear to be validated within evaluate. If multiple vector storages with different dimensions were used, it may lead to silent issues.

Below is a shell script to locate any dimension checks in the codebase. If none are found, consider adding an explicit dimension check:


🏁 Script executed:

#!/bin/bash
rg -i "dimension" --context 5

Length of output: 57662


Revisit Dimension Consistency in the OldIndexCandidate Evaluation

The implementation in OldIndexCandidate follows the intended logic by comparing configuration parameters and ensuring that only unchanged vectors are reused. However, note that there is no explicit check for vector dimensionality within the evaluate method. Although the overall codebase performs dimension validations elsewhere (for example, in lib/segment/src/common/mod.rs and the vector creation routines), if vector storages from differing sources could potentially be mixed, a missing early dimension check here might lead to silent mismatches.

  • The current equality check between old_vector and new_vector will reject vectors that differ (including due to a dimension mismatch), but the lack of an explicit dimension verification means the issue might not be reported clearly.
  • Consider whether an early, explicit check confirming that the vectors’ dimensions match (using the expected dimension from the configuration) could provide more precise error reporting or safety in cases where vector storages might have diverging dimensions.

@xzfc xzfc changed the title Pass FeatureFlags into VectorIndexBuildArgs Incremental HNSW index building: append-only case Apr 4, 2025
@xzfc xzfc force-pushed the incremental-hnsw-append branch from 031be83 to 1457a98 Compare April 4, 2025 12:56
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

434-440: Consider adding more detailed documentation.

The link_new_point_complete_indexing method provides a way to mark a point as ready without requiring a scorer. However, it could benefit from more detailed documentation explaining its specific use case and when it should be used instead of the other methods.

-    pub fn link_new_point_complete_indexing(&self, point_id: PointOffsetType) {
+    /// Marks a point as ready and adds it to entry points without connecting links or requiring a scorer.
+    /// This is used in incremental HNSW building after a point's links have been added separately.
+    pub fn link_new_point_complete_indexing(&self, point_id: PointOffsetType) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 031be83 and 1457a98.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • lib/common/common/src/flags.rs (3 hunks)
  • lib/segment/Cargo.toml (1 hunks)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/graph_links.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/hnsw.rs (7 hunks)
  • lib/segment/tests/integration/gpu_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/hnsw_incremental_build.rs (1 hunks)
  • lib/segment/tests/integration/hnsw_quantized_search_test.rs (4 hunks)
  • lib/segment/tests/integration/main.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • lib/segment/tests/integration/main.rs
  • lib/segment/Cargo.toml
  • lib/common/common/src/flags.rs
  • lib/segment/src/index/hnsw_index/graph_links.rs
  • lib/segment/tests/integration/hnsw_incremental_build.rs
  • lib/segment/tests/integration/hnsw_quantized_search_test.rs
  • lib/segment/src/index/hnsw_index/hnsw.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: test (macos-latest)
  • GitHub Check: test
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (4)
lib/segment/tests/integration/gpu_hnsw_test.rs (2)

5-6: Import added for new feature flag support.

The addition of FeatureFlags import is necessary to support the incremental HNSW building feature in the test file.


145-145: Feature flags correctly added to the index build arguments.

The feature_flags parameter is now properly passed to the HNSW index build function, using the default value which disables incremental building. This is appropriate since the PR mentions GPU building isn't fully supported with incremental HNSW.

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)

407-408: Improved state management with assertion check.

The code now verifies that the point wasn't already marked as ready, which helps catch potential bugs where a point might be added twice. This is good defensive programming.


415-433: Well-implemented method for adding points with pre-existing links.

This new method is the core of the incremental HNSW building feature. It correctly:

  1. Takes a point ID and pre-existing links
  2. Verifies the levels match the expected level
  3. Populates the links for each level
  4. Marks the point as ready and adds it to entry points

The documentation appropriately mentions mutual exclusivity with link_new_point.

@generall
Copy link
Member

generall commented Apr 4, 2025

@xzfc please resolve conflict

@xzfc xzfc force-pushed the incremental-hnsw-append branch from 1457a98 to 0942ff7 Compare April 4, 2025 15:01
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

415-432: New method for incremental HNSW building.

This new method allows adding a point with pre-existing links, which is essential for incremental HNSW index building. It appropriately validates that the provided levels match the expected point level and handles the link assignments correctly.

However, consider adding error handling instead of using assertions for input validation in this public method to make it more robust in production environments.

pub fn add_new_point(&self, point_id: PointOffsetType, levels: Vec<Vec<PointOffsetType>>) -> OperationResult<()> {
    let level = self.get_point_level(point_id);
    if levels.len() != level + 1 {
-       assert_eq!(levels.len(), level + 1);
+       return Err(OperationError::service_error(format!(
+           "Invalid levels length: expected {}, got {}",
+           level + 1,
+           levels.len()
+       )));
    }

    // Rest of the method...
    
    Ok(())
}
lib/segment/tests/integration/hnsw_incremental_build.rs (2)

38-95: Well-structured test for incremental HNSW building.

This test effectively validates the incremental HNSW index building functionality by iteratively adding more data points and ensuring search quality remains acceptable. The test properly enables the incremental_hnsw_building flag and validates results using check_matches.

Consider adding assertions to verify that incremental building is actually being used, such as logging the number of reused points and asserting that it's greater than zero in subsequent iterations.

let index = build_hnsw_index(&index_path, &segment, &old_indices);

+ // Only check for reuse after the first iteration
+ if i > 1 {
+     // This assumes some kind of logging or metric about reused points
+     // The actual implementation would need to expose this information
+     assert!(index.get_reused_points_count() > 0, 
+             "No points were reused in iteration {}", i);
+ }

let ef = 64;

119-160: Proper HNSW index building with feature flag enabled.

The implementation correctly sets up the HNSW index build configuration and enables the incremental building feature flag. Using a constant for CPU resources via num_rayon_threads is a good practice for testing.

It would be beneficial to also test the case where incremental_hnsw_building is set to false to ensure the system falls back to traditional building correctly.

#[test]
fn hnsw_incremental_build() {
    // ... existing test ...
}

+#[test]
+fn hnsw_non_incremental_build() {
+    // Same setup as hnsw_incremental_build but with:
+    flags.incremental_hnsw_building = false;
+    // ... and then verify results are still correct
+}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1457a98 and 0942ff7.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (22)
  • lib/common/common/src/flags.rs (3 hunks)
  • lib/segment/Cargo.toml (1 hunks)
  • lib/segment/benches/multi_vector_search.rs (2 hunks)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/graph_links.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/hnsw.rs (7 hunks)
  • lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs (2 hunks)
  • lib/segment/src/segment_constructor/segment_builder.rs (1 hunks)
  • lib/segment/src/segment_constructor/segment_constructor_base.rs (2 hunks)
  • lib/segment/tests/integration/batch_search_test.rs (2 hunks)
  • lib/segment/tests/integration/byte_storage_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/byte_storage_quantization_test.rs (2 hunks)
  • lib/segment/tests/integration/exact_search_test.rs (2 hunks)
  • lib/segment/tests/integration/filtrable_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/gpu_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/hnsw_discover_test.rs (3 hunks)
  • lib/segment/tests/integration/hnsw_incremental_build.rs (1 hunks)
  • lib/segment/tests/integration/hnsw_quantized_search_test.rs (4 hunks)
  • lib/segment/tests/integration/main.rs (1 hunks)
  • lib/segment/tests/integration/multivector_filtrable_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/multivector_hnsw_test.rs (2 hunks)
  • lib/segment/tests/integration/multivector_quantization_test.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (20)
  • lib/segment/tests/integration/byte_storage_hnsw_test.rs
  • lib/segment/tests/integration/main.rs
  • lib/segment/benches/multi_vector_search.rs
  • lib/segment/tests/integration/filtrable_hnsw_test.rs
  • lib/segment/src/segment_constructor/segment_constructor_base.rs
  • lib/segment/src/index/hnsw_index/graph_links.rs
  • lib/segment/Cargo.toml
  • lib/segment/src/segment_constructor/segment_builder.rs
  • lib/segment/tests/integration/gpu_hnsw_test.rs
  • lib/segment/tests/integration/multivector_filtrable_hnsw_test.rs
  • lib/segment/tests/integration/batch_search_test.rs
  • lib/segment/tests/integration/multivector_quantization_test.rs
  • lib/segment/tests/integration/exact_search_test.rs
  • lib/segment/tests/integration/multivector_hnsw_test.rs
  • lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs
  • lib/segment/tests/integration/byte_storage_quantization_test.rs
  • lib/common/common/src/flags.rs
  • lib/segment/tests/integration/hnsw_discover_test.rs
  • lib/segment/tests/integration/hnsw_quantized_search_test.rs
  • lib/segment/src/index/hnsw_index/hnsw.rs
🧰 Additional context used
🧬 Code Definitions (1)
lib/segment/tests/integration/hnsw_incremental_build.rs (2)
lib/segment/tests/integration/hnsw_quantized_search_test.rs (3)
  • hnsw_quantized_search_test (44-181)
  • check_matches (183-222)
  • query_vectors (191-200)
lib/segment/src/index/hnsw_index/hnsw.rs (2)
  • new (99-109)
  • build (184-545)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
🔇 Additional comments (3)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)

406-407: Added safety check to prevent double linking.

This change ensures that link_new_point cannot be called twice on the same point by asserting that the point wasn't already ready. This helps maintain the data structure's integrity.


434-440: Added helper method for completing indexing.

This function provides a streamlined way to mark a point as indexed without needing a scorer, which is useful for the incremental building workflow.

lib/segment/tests/integration/hnsw_incremental_build.rs (1)

97-117: Good segment creation helper.

The function properly randomizes the insertion order of vectors, which helps test resilience against different insertion patterns. Using hw_counter for hardware resource tracking is a good practice.

@timvisee
Copy link
Member

timvisee commented Apr 9, 2025

I was only able to partially test this locally. So thanks for adding those tests! The results actually look fantastic. I wonder how this would affect bulk upsertions without disabling HNSW building.

@@ -403,14 +403,42 @@ impl GraphLayersBuilder {
// New point is a new empty entry (for this filter, at least)
// We can't do much here, so just quit
}
self.ready_list.write().set(point_id as usize, true);
let was_ready = self.ready_list.write().replace(point_id as usize, true);
assert!(!was_ready);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug assert or do we want to crash here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, debug assert + log::error if we really care

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with debug_assert.

let old_index = old_indices
.iter()
.filter_map(|old_index| {
feature_flags.incremental_hnsw_building.then_some(())?;
Copy link
Member

@agourlay agourlay Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the feature flag repeatedly checked?
Can it be changed without restarting the service?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though it's cheap (just a bool) I agree it might be better to not iterate at all if the flag is not set.

Can it be changed without restarting the service?

No, at least not in the duration of this function.

Copy link
Contributor Author

@xzfc xzfc Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's optimized for easy removal. We might want to make it unconditional soon. So, removing this check would be an one-line change.

@agourlay
Copy link
Member

agourlay commented Apr 9, 2025

A new feature flag is added, incremental_hnsw_building, disabled by default.

Where you able to run the all CI suite with the flag enabled?

.take(SINGLE_THREADED_HNSW_BUILD_THRESHOLD)
.collect();
let ids: Vec<_> = ids_iterator.collect();
let first_few_ids = ids.split_off(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first_few_ids is only relevant if the graph is empty. Potentially we can skip this step if we have already added enough points with add_new_point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (3)

406-407: Clarify handling if point is already ready.

Currently, debug_assert!(!was_ready); ensures in debug mode that the point wasn’t previously marked as ready. In release mode, however, this condition won’t be enforced. If there is a possibility of re-linking or partial insertion in the future, consider logging or error handling to prevent unexpected double-insertion scenarios.


415-432: Validate neighbor IDs when adding pre-existing links.

While this function is convenient for incremental or bulk additions, consider validating that each neighbor ID is valid and marked as “ready” (or ensuring they will be set ready) before assignment. This reduces the risk of missing or invalid links when reusing partial graphs.


434-440: Ensure naming clarity for finalizing point indexing.

link_new_point_complete_indexing simply sets the point as ready and adds an entry point without performing link logic. Consider renaming it (e.g., mark_point_indexed) or merging its functionality with add_new_point if future usage patterns are similar, to avoid confusion.

lib/segment/src/index/hnsw_index/hnsw.rs (2)

244-255: Incremental index selection logic.

Reusing the best old index candidate by selecting the one with the highest valid_points is effective. Consider logging the number of reused vs. total indexed points to provide more visibility into performance gains or fallback scenarios when old index reuse isn’t possible.

Also applies to: 318-318


373-374: Efficiently combining reused links with fresh insertions.

The approach splits vectors that can reuse old links vs. those inserted anew in single-thread or parallel modes. This design is sensible; however, if the first-few-IDs logic grows or changes, you might consider unifying it with the old index logic for clarity.

Also applies to: 376-390

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0942ff7 and 8b69303.

📒 Files selected for processing (2)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/hnsw.rs (8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/segment/src/index/hnsw_index/hnsw.rs (9)
lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
  • total_vector_count (103-105)
lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (1)
  • total_vector_count (156-158)
lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (1)
  • total_vector_count (348-350)
lib/segment/src/vector_storage/vector_storage_base.rs (2)
  • total_vector_count (47-47)
  • total_vector_count (519-545)
lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
  • total_vector_count (185-187)
lib/segment/src/index/hnsw_index/graph_links.rs (2)
  • links (126-128)
  • point_level (130-132)
lib/segment/src/id_tracker/id_tracker_base.rs (2)
  • total_point_count (109-109)
  • total_point_count (332-339)
lib/segment/src/segment/segment_ops.rs (1)
  • total_point_count (652-654)
lib/segment/src/index/hnsw_index/graph_layers.rs (1)
  • point_level (200-202)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: lint
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (5)
lib/segment/src/index/hnsw_index/hnsw.rs (5)

8-8: Imports for incremental functionality look valid.

The newly introduced imports (atomic_refcell, itertools::EitherOrBoth, VectorIndexEnum) align with appended logic for incremental HNSW building. No immediate issues noted.

Also applies to: 16-16, 46-46


209-209: New build parameters enable incremental HNSW building support.

Adding old_indices and feature_flags to VectorIndexBuildArgs appears correct. Ensure existing code paths that don’t supply these fields either set them to defaults or handle them gracefully.

Also applies to: 212-212


325-328: Fallback to random level is consistent.

When old index information isn’t available, defaulting to a random level is logical. The approach is straightforward, and no issues stand out with the fallback path.


432-433: Explicitly dropping old_index.

Dropping old_index if the main graph is already built on the GPU is clear. This is minor but maintains clarity that old index usage is no longer needed. No further concerns here.


1364-1525: Comprehensive incremental reuse structures.

OldIndexCandidate and OldIndex provide a solid foundation for append-only reuse. The logic to map old IDs to new IDs is straightforward and addresses only unchanged, undeleted points, which suits an append-only model. Verify test coverage for edge cases (e.g., partial reuse, extremely large indexes) to confirm the robustness of the approach.

Would you like a shell script to locate or run existing integration and unit tests for these new code paths, ensuring full coverage of both nominal and edge scenarios?

Copy link
Member

@generall generall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please make sure to enable immediate indexing on CI-benchmarks

@xzfc xzfc merged commit 89d399e into dev Apr 9, 2025
17 checks passed
@xzfc xzfc deleted the incremental-hnsw-append branch April 9, 2025 18:56
@xzfc
Copy link
Contributor Author

xzfc commented Apr 9, 2025

@agourlay

Where you able to run the all CI suite with the flag enabled?

Checked in a temporary branch (look for 17 / 17 in the GitHub UI to see CI): 42cfdb1

Will enable it by default in a separate PR.

@coderabbitai coderabbitai bot mentioned this pull request Apr 14, 2025
9 tasks
pull bot pushed a commit to kp-forks/qdrant that referenced this pull request Apr 21, 2025
* Pass FeatureFlags into VectorIndexBuildArgs

* Incremental HNSW index building: append-only case

* Use debug_assert

* first_few_ids

* Check deleted_point_count

* Drop unused method
@coderabbitai coderabbitai bot mentioned this pull request May 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants