-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Incremental HNSW index building: append-only case #6325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughThis pull request introduces updates that integrate a new feature flag for incremental HNSW building and propagates its usage across several modules. A boolean field, Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms (13)
🔇 Additional comments (2)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/segment/tests/integration/exact_search_test.rs (1)
1-219
: Consider adding test cases for incremental HNSW buildingWhile the feature flag integration looks good, it would be valuable to add specific test cases that enable the
incremental_hnsw_building
flag to verify the functionality works as expected. This would help ensure the new incremental building path is properly tested.lib/segment/src/index/hnsw_index/hnsw.rs (2)
244-256
: Check for old index reusability conditionally onfeature_flags
.
This block properly:
- Ensures
incremental_hnsw_building
is enabled before attempting reuse,- Evaluates prospective old indexes,
- Selects the one yielding the maximum
valid_points
.Consider logging a warning or returning earlier if multiple old indexes are passed in, clarifying that only one is reused.
373-384
: Selective reuse of old links.
Reusing old links if found, or falling back to fresh insertion matches the PR’s append-only approach. Consider adding a debug log entry for each link reuse for better visibility in production.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (21)
lib/common/common/src/flags.rs
(3 hunks)lib/segment/Cargo.toml
(1 hunks)lib/segment/benches/multi_vector_search.rs
(2 hunks)lib/segment/src/index/hnsw_index/graph_layers_builder.rs
(1 hunks)lib/segment/src/index/hnsw_index/graph_links.rs
(1 hunks)lib/segment/src/index/hnsw_index/hnsw.rs
(7 hunks)lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs
(2 hunks)lib/segment/src/segment_constructor/segment_builder.rs
(1 hunks)lib/segment/src/segment_constructor/segment_constructor_base.rs
(2 hunks)lib/segment/tests/integration/batch_search_test.rs
(2 hunks)lib/segment/tests/integration/byte_storage_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/byte_storage_quantization_test.rs
(2 hunks)lib/segment/tests/integration/exact_search_test.rs
(2 hunks)lib/segment/tests/integration/filtrable_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/hnsw_discover_test.rs
(3 hunks)lib/segment/tests/integration/hnsw_incremental_build.rs
(1 hunks)lib/segment/tests/integration/hnsw_quantized_search_test.rs
(4 hunks)lib/segment/tests/integration/main.rs
(1 hunks)lib/segment/tests/integration/multivector_filtrable_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/multivector_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/multivector_quantization_test.rs
(2 hunks)
🧰 Additional context used
🧬 Code Definitions (7)
lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs (2)
src/settings.rs (2)
default
(96-102)default
(124-132)lib/common/common/src/budget.rs (1)
default
(239-243)
lib/segment/tests/integration/byte_storage_quantization_test.rs (2)
lib/segment/src/types.rs (5)
default
(1109-1118)default
(1122-1124)default
(1529-1531)default
(2737-2739)default
(2770-2772)lib/common/common/src/budget.rs (1)
default
(239-243)
lib/segment/src/segment_constructor/segment_builder.rs (1)
lib/common/common/src/flags.rs (1)
feature_flags
(89-97)
lib/segment/src/segment_constructor/segment_constructor_base.rs (1)
lib/common/common/src/flags.rs (1)
feature_flags
(89-97)
lib/segment/tests/integration/main.rs (1)
lib/segment/tests/integration/hnsw_incremental_build.rs (1)
hnsw_incremental_build
(39-95)
lib/segment/tests/integration/hnsw_incremental_build.rs (2)
lib/segment/tests/integration/hnsw_quantized_search_test.rs (3)
hnsw_quantized_search_test
(44-181)check_matches
(183-222)query_vectors
(191-200)lib/segment/src/index/hnsw_index/hnsw.rs (2)
new
(99-109)build
(184-545)
lib/segment/src/index/hnsw_index/hnsw.rs (2)
lib/segment/src/index/hnsw_index/graph_links.rs (2)
links
(126-128)point_level
(130-132)lib/segment/src/index/hnsw_index/graph_layers.rs (1)
point_level
(200-202)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consensus
- GitHub Check: test-consistency
- GitHub Check: test (macos-latest)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test
- GitHub Check: test (windows-latest)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test-consensus-compose
- GitHub Check: test (ubuntu-latest)
🔇 Additional comments (59)
lib/segment/tests/integration/main.rs (1)
13-13
: LGTM! The new test module for incremental HNSW building aligns with PR objectives.The addition of the
hnsw_incremental_build
module is a good approach to testing the new incremental HNSW building feature described in the PR objectives.lib/segment/Cargo.toml (1)
98-98
: LGTM! Adding the tap crate as a dependency.The
tap
crate is a utility library that allows for performing side effects on values within a method chain, which can be useful for incremental building operations.lib/segment/tests/integration/byte_storage_hnsw_test.rs (2)
6-6
: LGTM! Necessary import for new feature flags.This import is required to use the FeatureFlags structure which will control the incremental HNSW building functionality.
241-241
: LGTM! Adding feature flags parameter to HNSW build arguments.This change passes the feature flags to the HNSW index builder, which enables controlling the incremental building behavior via configuration rather than hard-coding it.
lib/segment/benches/multi_vector_search.rs (2)
7-7
: LGTM! Importing FeatureFlags for benchmark configuration.The import is necessary to pass feature flags to the HNSW builder.
97-97
: LGTM! Adding feature flags to benchmark HNSW index construction.This change ensures that benchmarks use the same builder interface as production code, maintaining consistency across the codebase with the new feature flag parameter.
lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs (2)
6-6
: LGTM: Added the FeatureFlags import.This import is necessary to support the new feature_flags parameter for HNSW building.
83-83
: LGTM: Added feature_flags parameter.This change properly passes the feature flags configuration to the HNSW index builder. Using the default value ensures backwards compatibility in tests while allowing the new incremental HNSW building feature to be toggled.
lib/common/common/src/flags.rs (3)
41-43
: LGTM: Added the incremental_hnsw_building flag.This new feature flag is appropriately documented and follows the established pattern with the
#[serde(default)]
attribute to ensure it defaults to false when deserializing.
54-54
: LGTM: Updated is_empty method for the new flag.The is_empty method is correctly updated to check the new incremental_hnsw_building flag.
Also applies to: 59-59
71-71
: LGTM: Updated init_feature_flags to handle the new flag.The init_feature_flags function correctly sets the new incremental_hnsw_building flag to true when the all flag is activated.
Also applies to: 79-79
lib/segment/tests/integration/multivector_hnsw_test.rs (2)
7-7
: LGTM: Added the FeatureFlags import.This import enables the use of feature flags in the multivector HNSW tests.
152-152
: LGTM: Added feature_flags parameter.This change properly passes the feature flags configuration to the HNSW index builder in the multivector test. Using the default value ensures the test behaves consistently with the current implementation while allowing the incremental HNSW building feature to be toggled.
lib/segment/tests/integration/multivector_filtrable_hnsw_test.rs (2)
6-6
: LGTM: Added the FeatureFlags import.This import enables the use of feature flags in the multivector filterable HNSW tests.
209-209
: LGTM: Added feature_flags parameter.This change properly passes the feature flags configuration to the HNSW index builder in the multivector filterable test. Using the default value ensures backward compatibility while allowing the new incremental HNSW building feature to be toggled when needed.
lib/segment/tests/integration/multivector_quantization_test.rs (2)
8-8
: Import of FeatureFlags for incremental HNSW buildingAdding the
FeatureFlags
import to support the new incremental HNSW building feature flag.
343-343
: Added feature flags parameter to HNSW index buildThe
feature_flags
parameter is properly added with the default value, which means incremental HNSW building will be disabled by default in tests. This matches the PR's design where incremental building is opt-in.lib/segment/tests/integration/hnsw_discover_test.rs (3)
6-6
: Import of FeatureFlags for incremental HNSW buildingAdding the
FeatureFlags
import to support the new incremental HNSW building feature flag.
118-118
: Added feature flags parameter in hnsw_discover_precision testThe
feature_flags
parameter is properly added with the default value, ensuring consistent behavior with the rest of the test suite.
242-242
: Added feature flags parameter in filtered_hnsw_discover_precision testThe
feature_flags
parameter is properly added with the default value to the second test function, maintaining consistency throughout the file.lib/segment/tests/integration/filtrable_hnsw_test.rs (2)
7-7
: Import of FeatureFlags for incremental HNSW buildingAdding the
FeatureFlags
import to support the new incremental HNSW building feature flag.
204-204
: Added feature flags parameter to HNSW index buildThe
feature_flags
parameter is correctly added with the default value. This ensures consistency across test files and properly tests the non-incremental building path by default.lib/segment/tests/integration/exact_search_test.rs (2)
7-7
: Import of FeatureFlags for incremental HNSW buildingAdding the
FeatureFlags
import to support the new incremental HNSW building feature flag.
145-145
: Added feature flags parameter to HNSW index buildThe
feature_flags
parameter is properly added with the default value, which matches the implementation in other test files and ensures consistent testing behavior.lib/segment/tests/integration/batch_search_test.rs (2)
6-6
: Good addition of the FeatureFlags import.This import is necessary to support the new incremental HNSW building feature flag parameter that's being passed to the HNSWIndex::build function.
166-166
: Good addition of the feature_flags parameter.The feature_flags parameter is properly passed to HNSWIndex::build with default values, which aligns with the PR's implementation of incremental HNSW building (disabled by default).
lib/segment/src/segment_constructor/segment_builder.rs (1)
583-583
: Good addition of the feature_flags parameter.This change properly passes the feature flags to the build_vector_index function, which is necessary for the incremental HNSW building feature. Using the feature_flags() function ensures that the globally configured flags are used.
lib/segment/tests/integration/byte_storage_quantization_test.rs (2)
7-7
: Good addition of the FeatureFlags import.This import is consistent with the other test files and enables the use of feature flags in the test.
372-372
: Good addition of the feature_flags parameter.The feature_flags parameter is correctly passed to HNSWIndex::build with default values, which is consistent with the other tests and ensures the incremental HNSW building feature is properly tested (disabled by default).
lib/segment/src/segment_constructor/segment_constructor_base.rs (2)
10-10
: Good update to the import statement.The import statement is correctly updated to include both FeatureFlags and feature_flags, which is necessary for defining the new field in the VectorIndexBuildArgs struct.
388-388
: Good addition of the feature_flags field.Adding the feature_flags field to the VectorIndexBuildArgs struct is a well-structured approach that enables passing the incremental HNSW building flag to the index builder. This aligns with the PR objective of implementing incremental HNSW building for the append-only case.
lib/segment/src/index/hnsw_index/graph_links.rs (1)
17-18
: Clean modularization of exportsThe change improves the module's organization by clearly separating public exports from internal usage.
LinksIterator
is now publicly re-exported, whileCompressionInfo
andGraphLinksView
are kept as internal imports.lib/segment/src/index/hnsw_index/graph_layers_builder.rs (3)
406-407
: Improved state validation with assertionThe modification to assert that points aren't double-marked as ready is excellent defensive programming. This validates that the index state is consistent, which is crucial for the incremental building feature where points from old indices must be tracked carefully.
415-432
: Key functionality for incremental HNSW buildingThis new method is central to the incremental HNSW building feature. It allows adding a point with pre-existing links directly, rather than computing new links. The implementation correctly:
- Verifies the level structure matches expectations
- Preserves links from the old graph
- Maintains state consistency with the ready list
- Sets up entry points correctly
This is essential for reusing graph structures from previous indices.
434-440
: Useful optimization for completing indexingThis helper method allows marking a point as ready and adding it to entry points without requiring a scorer. This is an optimization that avoids unnecessary computation when we already know a point's connections from a previous index.
lib/segment/tests/integration/hnsw_incremental_build.rs (5)
1-29
: Comprehensive imports for integration testThe imports cover all the necessary components for testing incremental HNSW building. Good practice to import specific types rather than using wildcard imports.
30-37
: Well-defined test constantsThe constants clearly define the test parameters, making it easy to adjust them for different test scenarios.
38-95
: Thorough integration test for incremental buildingThis test effectively validates the incremental HNSW building functionality by:
- Creating sequential segments with increasing data points
- Building indices with incremental reuse
- Verifying search accuracy at each step
The use of
last_index
to track and reuse previous indices simulates a real-world incremental update scenario.
97-117
: Well-structured segment creation helperGood separation of concerns with this helper function that handles segment creation. The shuffling of IDs helps ensure the test is robust against order dependencies.
119-160
: Key test function enabling incremental buildingThis function is crucial for testing, as it:
- Creates an HNSW index with appropriate configuration
- Explicitly enables the
incremental_hnsw_building
feature flag- Passes previous indices for potential reuse
This validates the core functionality described in the PR objectives.
lib/segment/tests/integration/hnsw_quantized_search_test.rs (4)
9-9
: Required feature flag importAdding the import for FeatureFlags is necessary for the updated function signature.
36-36
: Exposed utility function for test reuseMaking
sames_count
public allows it to be reused in other test files, specifically in the newhnsw_incremental_build.rs
test.
139-139
: Added feature flags parameterAdding the default feature flags parameter maintains compatibility with the updated
VectorIndexBuildArgs
signature. This ensures the test works with the updated API without changing behavior.
183-183
: Exposed test function for reuseMaking
check_matches
public allows it to be reused in other test files, specifically in the newhnsw_incremental_build.rs
test.lib/segment/src/index/hnsw_index/hnsw.rs (15)
8-8
: No issues with the atomic reference usage.
ImportingAtomicRef
andAtomicRefCell
is appropriate for shared mutable data in this context.
16-16
: Good use ofEitherOrBoth
for merged iteration.
This import simplifies handling merged lists of points from both new and old indexes.
46-46
: Essential imports for indexing logic.
Needed for referencingPayloadIndex
,VectorIndex
, andVectorIndexEnum
in new code paths.
209-209
: Newold_indices
parameter.
Introducing theold_indices
parameter is consistent with enabling incremental HNSW index reuse.
212-212
: Newfeature_flags
parameter.
Includingfeature_flags
in the build function is a clean way to check if incremental building is enabled.
318-318
: Reusing the old index if available.
The concise use of.map(...)
to create anOldIndex
instance neatly allows fallback to fresh construction whenNone
.
324-324
: No functional update.
This blank line adds slight readability; no further remark needed.
325-329
: Graceful fallback to randomly assigned level.
Usingold_index
'spoint_level
if present, otherwise callingget_random_layer
, elegantly balances reuse with new generation.
386-389
: Splitting off a subset of points for single-thread building.
Retaining the single-thread strategy for a final batch is consistent with the existing HNSW build approach and can help avoid graph fragmentation.
424-424
: No functional change.
Blank line presumably for readability.
425-425
: Closure-based insertion logic remains clear.
Defining the closure here is straightforward and keeps the insertion logic modular.
1361-1371
:OldIndexCandidate
struct definition.
Storing the borrowedHNSWIndex
, mapping array, andvalid_points
provides a clear anchor for deciding index reusability.
1373-1383
:OldIndex
struct definition.
Maintaining both forward and reverse mappings (old→new and new→old) is crucial for link transformations in incremental building.
1490-1511
: Implementation ofOldIndex
.
These methods provide:
point_level
: Maps a new ID back to the old ID and retrieves its level.get_links
: Retrieves old links, truncates them perm
orm0
, and remaps them to new IDs.Both methods cleanly handle partial usage of the old graph while respecting updated capacity constraints.
1385-1488
:❓ Verification inconclusive
Implementation of
OldIndexCandidate
.
- The
evaluate
method:
- Correctly checks for config mismatches and bailouts if old vectors mismatch or are absent in the new set.
- Appropriately defers any changed or removed vector scenario by returning
None
.- The
reuse
method:
- Constructs reverse mappings and logs the reuse count.
- Seamlessly transitions
OldIndexCandidate
into a fully realizedOldIndex
.Potential dimension mismatch check
Currently, dimension differences do not appear to be validated withinevaluate
. If multiple vector storages with different dimensions were used, it may lead to silent issues.Below is a shell script to locate any dimension checks in the codebase. If none are found, consider adding an explicit dimension check:
🏁 Script executed:
#!/bin/bash rg -i "dimension" --context 5Length of output: 57662
Revisit Dimension Consistency in the OldIndexCandidate Evaluation
The implementation in
OldIndexCandidate
follows the intended logic by comparing configuration parameters and ensuring that only unchanged vectors are reused. However, note that there is no explicit check for vector dimensionality within theevaluate
method. Although the overall codebase performs dimension validations elsewhere (for example, inlib/segment/src/common/mod.rs
and the vector creation routines), if vector storages from differing sources could potentially be mixed, a missing early dimension check here might lead to silent mismatches.
- The current equality check between
old_vector
andnew_vector
will reject vectors that differ (including due to a dimension mismatch), but the lack of an explicit dimension verification means the issue might not be reported clearly.- Consider whether an early, explicit check confirming that the vectors’ dimensions match (using the expected dimension from the configuration) could provide more precise error reporting or safety in cases where vector storages might have diverging dimensions.
031be83
to
1457a98
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)
434-440
: Consider adding more detailed documentation.The
link_new_point_complete_indexing
method provides a way to mark a point as ready without requiring a scorer. However, it could benefit from more detailed documentation explaining its specific use case and when it should be used instead of the other methods.- pub fn link_new_point_complete_indexing(&self, point_id: PointOffsetType) { + /// Marks a point as ready and adds it to entry points without connecting links or requiring a scorer. + /// This is used in incremental HNSW building after a point's links have been added separately. + pub fn link_new_point_complete_indexing(&self, point_id: PointOffsetType) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (9)
lib/common/common/src/flags.rs
(3 hunks)lib/segment/Cargo.toml
(1 hunks)lib/segment/src/index/hnsw_index/graph_layers_builder.rs
(1 hunks)lib/segment/src/index/hnsw_index/graph_links.rs
(1 hunks)lib/segment/src/index/hnsw_index/hnsw.rs
(7 hunks)lib/segment/tests/integration/gpu_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/hnsw_incremental_build.rs
(1 hunks)lib/segment/tests/integration/hnsw_quantized_search_test.rs
(4 hunks)lib/segment/tests/integration/main.rs
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- lib/segment/tests/integration/main.rs
- lib/segment/Cargo.toml
- lib/common/common/src/flags.rs
- lib/segment/src/index/hnsw_index/graph_links.rs
- lib/segment/tests/integration/hnsw_incremental_build.rs
- lib/segment/tests/integration/hnsw_quantized_search_test.rs
- lib/segment/src/index/hnsw_index/hnsw.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: test (macos-latest)
- GitHub Check: test
- GitHub Check: test (windows-latest)
- GitHub Check: test-consensus
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
- GitHub Check: test
🔇 Additional comments (4)
lib/segment/tests/integration/gpu_hnsw_test.rs (2)
5-6
: Import added for new feature flag support.The addition of
FeatureFlags
import is necessary to support the incremental HNSW building feature in the test file.
145-145
: Feature flags correctly added to the index build arguments.The
feature_flags
parameter is now properly passed to the HNSW index build function, using the default value which disables incremental building. This is appropriate since the PR mentions GPU building isn't fully supported with incremental HNSW.lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)
407-408
: Improved state management with assertion check.The code now verifies that the point wasn't already marked as ready, which helps catch potential bugs where a point might be added twice. This is good defensive programming.
415-433
: Well-implemented method for adding points with pre-existing links.This new method is the core of the incremental HNSW building feature. It correctly:
- Takes a point ID and pre-existing links
- Verifies the levels match the expected level
- Populates the links for each level
- Marks the point as ready and adds it to entry points
The documentation appropriately mentions mutual exclusivity with
link_new_point
.
@xzfc please resolve conflict |
1457a98
to
0942ff7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)
415-432
: New method for incremental HNSW building.This new method allows adding a point with pre-existing links, which is essential for incremental HNSW index building. It appropriately validates that the provided levels match the expected point level and handles the link assignments correctly.
However, consider adding error handling instead of using assertions for input validation in this public method to make it more robust in production environments.
pub fn add_new_point(&self, point_id: PointOffsetType, levels: Vec<Vec<PointOffsetType>>) -> OperationResult<()> { let level = self.get_point_level(point_id); if levels.len() != level + 1 { - assert_eq!(levels.len(), level + 1); + return Err(OperationError::service_error(format!( + "Invalid levels length: expected {}, got {}", + level + 1, + levels.len() + ))); } // Rest of the method... Ok(()) }lib/segment/tests/integration/hnsw_incremental_build.rs (2)
38-95
: Well-structured test for incremental HNSW building.This test effectively validates the incremental HNSW index building functionality by iteratively adding more data points and ensuring search quality remains acceptable. The test properly enables the
incremental_hnsw_building
flag and validates results usingcheck_matches
.Consider adding assertions to verify that incremental building is actually being used, such as logging the number of reused points and asserting that it's greater than zero in subsequent iterations.
let index = build_hnsw_index(&index_path, &segment, &old_indices); + // Only check for reuse after the first iteration + if i > 1 { + // This assumes some kind of logging or metric about reused points + // The actual implementation would need to expose this information + assert!(index.get_reused_points_count() > 0, + "No points were reused in iteration {}", i); + } let ef = 64;
119-160
: Proper HNSW index building with feature flag enabled.The implementation correctly sets up the HNSW index build configuration and enables the incremental building feature flag. Using a constant for CPU resources via
num_rayon_threads
is a good practice for testing.It would be beneficial to also test the case where
incremental_hnsw_building
is set tofalse
to ensure the system falls back to traditional building correctly.#[test] fn hnsw_incremental_build() { // ... existing test ... } +#[test] +fn hnsw_non_incremental_build() { + // Same setup as hnsw_incremental_build but with: + flags.incremental_hnsw_building = false; + // ... and then verify results are still correct +}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (22)
lib/common/common/src/flags.rs
(3 hunks)lib/segment/Cargo.toml
(1 hunks)lib/segment/benches/multi_vector_search.rs
(2 hunks)lib/segment/src/index/hnsw_index/graph_layers_builder.rs
(1 hunks)lib/segment/src/index/hnsw_index/graph_links.rs
(1 hunks)lib/segment/src/index/hnsw_index/hnsw.rs
(7 hunks)lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs
(2 hunks)lib/segment/src/segment_constructor/segment_builder.rs
(1 hunks)lib/segment/src/segment_constructor/segment_constructor_base.rs
(2 hunks)lib/segment/tests/integration/batch_search_test.rs
(2 hunks)lib/segment/tests/integration/byte_storage_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/byte_storage_quantization_test.rs
(2 hunks)lib/segment/tests/integration/exact_search_test.rs
(2 hunks)lib/segment/tests/integration/filtrable_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/gpu_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/hnsw_discover_test.rs
(3 hunks)lib/segment/tests/integration/hnsw_incremental_build.rs
(1 hunks)lib/segment/tests/integration/hnsw_quantized_search_test.rs
(4 hunks)lib/segment/tests/integration/main.rs
(1 hunks)lib/segment/tests/integration/multivector_filtrable_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/multivector_hnsw_test.rs
(2 hunks)lib/segment/tests/integration/multivector_quantization_test.rs
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (20)
- lib/segment/tests/integration/byte_storage_hnsw_test.rs
- lib/segment/tests/integration/main.rs
- lib/segment/benches/multi_vector_search.rs
- lib/segment/tests/integration/filtrable_hnsw_test.rs
- lib/segment/src/segment_constructor/segment_constructor_base.rs
- lib/segment/src/index/hnsw_index/graph_links.rs
- lib/segment/Cargo.toml
- lib/segment/src/segment_constructor/segment_builder.rs
- lib/segment/tests/integration/gpu_hnsw_test.rs
- lib/segment/tests/integration/multivector_filtrable_hnsw_test.rs
- lib/segment/tests/integration/batch_search_test.rs
- lib/segment/tests/integration/multivector_quantization_test.rs
- lib/segment/tests/integration/exact_search_test.rs
- lib/segment/tests/integration/multivector_hnsw_test.rs
- lib/segment/src/index/hnsw_index/tests/test_graph_connectivity.rs
- lib/segment/tests/integration/byte_storage_quantization_test.rs
- lib/common/common/src/flags.rs
- lib/segment/tests/integration/hnsw_discover_test.rs
- lib/segment/tests/integration/hnsw_quantized_search_test.rs
- lib/segment/src/index/hnsw_index/hnsw.rs
🧰 Additional context used
🧬 Code Definitions (1)
lib/segment/tests/integration/hnsw_incremental_build.rs (2)
lib/segment/tests/integration/hnsw_quantized_search_test.rs (3)
hnsw_quantized_search_test
(44-181)check_matches
(183-222)query_vectors
(191-200)lib/segment/src/index/hnsw_index/hnsw.rs (2)
new
(99-109)build
(184-545)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: test
- GitHub Check: test (windows-latest)
- GitHub Check: test-consensus
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
🔇 Additional comments (3)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)
406-407
: Added safety check to prevent double linking.This change ensures that
link_new_point
cannot be called twice on the same point by asserting that the point wasn't already ready. This helps maintain the data structure's integrity.
434-440
: Added helper method for completing indexing.This function provides a streamlined way to mark a point as indexed without needing a scorer, which is useful for the incremental building workflow.
lib/segment/tests/integration/hnsw_incremental_build.rs (1)
97-117
: Good segment creation helper.The function properly randomizes the insertion order of vectors, which helps test resilience against different insertion patterns. Using
hw_counter
for hardware resource tracking is a good practice.
I was only able to partially test this locally. So thanks for adding those tests! The results actually look fantastic. I wonder how this would affect bulk upsertions without disabling HNSW building. |
@@ -403,14 +403,42 @@ impl GraphLayersBuilder { | |||
// New point is a new empty entry (for this filter, at least) | |||
// We can't do much here, so just quit | |||
} | |||
self.ready_list.write().set(point_id as usize, true); | |||
let was_ready = self.ready_list.write().replace(point_id as usize, true); | |||
assert!(!was_ready); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug assert or do we want to crash here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, debug assert + log::error if we really care
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced with debug_assert
.
let old_index = old_indices | ||
.iter() | ||
.filter_map(|old_index| { | ||
feature_flags.incremental_hnsw_building.then_some(())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the feature flag repeatedly checked?
Can it be changed without restarting the service?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though it's cheap (just a bool
) I agree it might be better to not iterate at all if the flag is not set.
Can it be changed without restarting the service?
No, at least not in the duration of this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's optimized for easy removal. We might want to make it unconditional soon. So, removing this check would be an one-line change.
Where you able to run the all CI suite with the flag enabled? |
.take(SINGLE_THREADED_HNSW_BUILD_THRESHOLD) | ||
.collect(); | ||
let ids: Vec<_> = ids_iterator.collect(); | ||
let first_few_ids = ids.split_off( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first_few_ids
is only relevant if the graph is empty. Potentially we can skip this step if we have already added enough points with add_new_point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (3)
406-407
: Clarify handling if point is already ready.Currently,
debug_assert!(!was_ready);
ensures in debug mode that the point wasn’t previously marked as ready. In release mode, however, this condition won’t be enforced. If there is a possibility of re-linking or partial insertion in the future, consider logging or error handling to prevent unexpected double-insertion scenarios.
415-432
: Validate neighbor IDs when adding pre-existing links.While this function is convenient for incremental or bulk additions, consider validating that each neighbor ID is valid and marked as “ready” (or ensuring they will be set ready) before assignment. This reduces the risk of missing or invalid links when reusing partial graphs.
434-440
: Ensure naming clarity for finalizing point indexing.
link_new_point_complete_indexing
simply sets the point as ready and adds an entry point without performing link logic. Consider renaming it (e.g.,mark_point_indexed
) or merging its functionality withadd_new_point
if future usage patterns are similar, to avoid confusion.lib/segment/src/index/hnsw_index/hnsw.rs (2)
244-255
: Incremental index selection logic.Reusing the best old index candidate by selecting the one with the highest
valid_points
is effective. Consider logging the number of reused vs. total indexed points to provide more visibility into performance gains or fallback scenarios when old index reuse isn’t possible.Also applies to: 318-318
373-374
: Efficiently combining reused links with fresh insertions.The approach splits vectors that can reuse old links vs. those inserted anew in single-thread or parallel modes. This design is sensible; however, if the first-few-IDs logic grows or changes, you might consider unifying it with the old index logic for clarity.
Also applies to: 376-390
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs
(1 hunks)lib/segment/src/index/hnsw_index/hnsw.rs
(8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/segment/src/index/hnsw_index/hnsw.rs (9)
lib/segment/src/vector_storage/dense/appendable_dense_vector_storage.rs (1)
total_vector_count
(103-105)lib/segment/src/vector_storage/multi_dense/appendable_mmap_multi_dense_vector_storage.rs (1)
total_vector_count
(156-158)lib/segment/src/vector_storage/multi_dense/simple_multi_dense_vector_storage.rs (1)
total_vector_count
(348-350)lib/segment/src/vector_storage/vector_storage_base.rs (2)
total_vector_count
(47-47)total_vector_count
(519-545)lib/segment/src/vector_storage/sparse/simple_sparse_vector_storage.rs (1)
total_vector_count
(185-187)lib/segment/src/index/hnsw_index/graph_links.rs (2)
links
(126-128)point_level
(130-132)lib/segment/src/id_tracker/id_tracker_base.rs (2)
total_point_count
(109-109)total_point_count
(332-339)lib/segment/src/segment/segment_ops.rs (1)
total_point_count
(652-654)lib/segment/src/index/hnsw_index/graph_layers.rs (1)
point_level
(200-202)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus
- GitHub Check: test (windows-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: lint
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
- GitHub Check: test
🔇 Additional comments (5)
lib/segment/src/index/hnsw_index/hnsw.rs (5)
8-8
: Imports for incremental functionality look valid.The newly introduced imports (
atomic_refcell
,itertools::EitherOrBoth
,VectorIndexEnum
) align with appended logic for incremental HNSW building. No immediate issues noted.Also applies to: 16-16, 46-46
209-209
: New build parameters enable incremental HNSW building support.Adding
old_indices
andfeature_flags
toVectorIndexBuildArgs
appears correct. Ensure existing code paths that don’t supply these fields either set them to defaults or handle them gracefully.Also applies to: 212-212
325-328
: Fallback to random level is consistent.When old index information isn’t available, defaulting to a random level is logical. The approach is straightforward, and no issues stand out with the fallback path.
432-433
: Explicitly dropping old_index.Dropping
old_index
if the main graph is already built on the GPU is clear. This is minor but maintains clarity that old index usage is no longer needed. No further concerns here.
1364-1525
: Comprehensive incremental reuse structures.
OldIndexCandidate
andOldIndex
provide a solid foundation for append-only reuse. The logic to map old IDs to new IDs is straightforward and addresses only unchanged, undeleted points, which suits an append-only model. Verify test coverage for edge cases (e.g., partial reuse, extremely large indexes) to confirm the robustness of the approach.Would you like a shell script to locate or run existing integration and unit tests for these new code paths, ensuring full coverage of both nominal and edge scenarios?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, please make sure to enable immediate indexing on CI-benchmarks
* Pass FeatureFlags into VectorIndexBuildArgs * Incremental HNSW index building: append-only case * Use debug_assert * first_few_ids * Check deleted_point_count * Drop unused method
This PR implements incremental HNSW building for the append-only case.
A new feature flag is added,
incremental_hnsw_building
, disabled by default.When enabled, it will try to reuse old graph as a starting point for building the new graph. When reusing, point levels and old graph links are copied as-is (but with reassigned IDs) before the regular HNSW point insertion process starts.
What is not supported as of now:
When old index is reused, the following log messages are printed:
Benchmarked using
bfb --indexing-threshold 400 --dim 32 --num-vectors 2000000 --throttle 500
.Upload time: 40 seconds.
Upload+index (incremental): 40+17 = 57 seconds.
Upload+index (non-incremental): 40+84 = 124 seconds.