Attempt to speedup additional links construction by measuring sub-graph connectivity first #6479

generall · 2025-05-01T14:53:54Z

H&M benchmark: https://github.com/qdrant/vector-db-benchmark/actions/runs/14785548595

ToDo:

Check that this feature doesn't break multitenancy
Check speedup / accuracy on other datasets, ideally for each datatype
Estimate indexing overhead for cases, where no shortcuts were possible

More benchmarks in #6571

* Improve connectivity check in HNSW index * Add budget for checks * Properly set visited flag * Change budget approach * review --------- Co-authored-by: generall <andrey@vasnetsov.com>

coderabbitai · 2025-05-30T13:01:11Z

📝 Walkthrough

Walkthrough

The changes introduce new methods and logic to improve the efficiency and flexibility of HNSW graph construction and analysis. A method for randomly selecting an entry point based on a filter is added to the EntryPoints struct. The GraphLayersBuilder struct gains a method to estimate the connectivity of a subgraph, simulating random edge failures. In the main HNSW index logic, a connectivity-based check is implemented to decide whether to skip building additional subgraphs for payload-indexed fields if they are already sufficiently connected, based on percolation theory. The StructPayloadIndex struct now includes a method to check if a field is marked as a tenant.

Suggested reviewers

generall
IvanPleshkov

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

lib/segment/src/index/hnsw_index/entry_points.rs (1)

114-147: Remove redundant clone() calls to improve performance.

The implementation correctly adds randomized entry point selection, but there are unnecessary clone operations that can be removed.

Apply this diff to remove redundant clones:

 pub fn get_random_entry_point<F, R: Rng + ?Sized>(
     &self,
     rnd: &mut R,
     checker: F,
 ) -> Option<EntryPoint>
 where
     F: Fn(PointOffsetType) -> bool,
 {
     let filtered_entry_points: Vec<_> = self
         .entry_points
         .iter()
         .filter(|entry| checker(entry.point_id))
         .cloned()
         .collect();

     if !filtered_entry_points.is_empty() {
         let random_index = rnd.random_range(0..filtered_entry_points.len());
-        return Some(filtered_entry_points[random_index].clone());
+        return Some(filtered_entry_points[random_index]);
     }

     let filtered_extra_entry_points: Vec<_> = self
         .extra_entry_points
         .iter_unsorted()
         .filter(|entry| checker(entry.point_id))
         .cloned()
         .collect();

     if !filtered_extra_entry_points.is_empty() {
         let random_index = rnd.random_range(0..filtered_extra_entry_points.len());
-        return Some(filtered_extra_entry_points[random_index].clone());
+        return Some(filtered_extra_entry_points[random_index]);
     }

     None
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8fd25c and 1e818e4.

📒 Files selected for processing (5)

lib/common/common/src/ext.rs (2 hunks)
lib/segment/src/index/hnsw_index/entry_points.rs (2 hunks)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2 hunks)
lib/segment/src/index/hnsw_index/hnsw.rs (6 hunks)
lib/segment/src/index/struct_payload_index.rs (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (1)

lib/segment/src/index/struct_payload_index.rs (1)

lib/segment/src/types.rs (1)

is_tenant (1860-1865)

⏰ Context from checks skipped due to timeout of 90000ms (12)

GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-consistency
GitHub Check: test-low-resources
GitHub Check: test-consensus-compose
GitHub Check: integration-tests-consensus
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: integration-tests
GitHub Check: rust-tests (macos-latest)
GitHub Check: rust-tests (ubuntu-latest)
GitHub Check: rust-tests (windows-latest)
GitHub Check: lint

🔇 Additional comments (9)

lib/common/common/src/ext.rs (1)

26-47: LGTM! Efficient bit manipulation implementation.

The set_all method provides an efficient way to set all bits in a BitVec by operating directly on the underlying raw storage. This is a good foundation for the connectivity computations in the HNSW index modules.

lib/segment/src/index/struct_payload_index.rs (1)

381-387: LGTM! Clean implementation of tenant field checking.

The method correctly checks if a field is marked as a tenant by looking it up in the indexed fields configuration and delegating to the schema's is_tenant() method.

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

75-76: LGTM! Reasonable budget constant for connectivity estimation.

The value of 64 provides a good balance between accuracy and performance for the connectivity approximation.

lib/segment/src/index/hnsw_index/hnsw.rs (6)

19-19: LGTM!

The addition of trace logging is appropriate for the detailed connectivity debugging added in this PR.

457-461: LGTM!

Proper initialization of payload_m0 with fallback and early exit optimization when additional graph building is not needed.

469-500: Excellent implementation of connectivity estimation!

The percolation theory-based approach is well-implemented:

Smart choice to sample at 2/K instead of 1/K for stability

Taking the maximum of 3 samples effectively handles noise

The threshold check (>= 4) ensures meaningful measurements

Clear documentation explains the mathematical reasoning

529-572: Well-structured connectivity-based optimization!

The implementation correctly:

Always builds graphs for tenant fields (multitenancy requirement)

Skips the first field from connectivity checks

Uses trace logging for debugging the optimization decisions

Properly handles the optional connectivity threshold

653-678: Good refactoring to improve code modularity!

Extracting the point filtering logic into a separate function improves code organization and reusability. The implementation correctly filters by condition and excludes deleted vectors.

680-695: Clean separation of concerns!

The refactoring properly separates point filtering from graph building logic. Accepting pre-computed points_to_index makes the function more focused and testable.

coderabbitai · 2025-05-30T13:01:14Z

lib/segment/src/index/hnsw_index/graph_layers_builder.rs

+    /// For a given sub-graph defined by points, returns connectivity estimation.
+    /// How it works:
+    ///  - Select entry point, it would be a point with the highest level. If there are several, pick first one.
+    ///  - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not.
+    ///  - Count number of nodes reachable from the entry point.
+    ///  - Use visited points as entry points for the next layer below and repeat until layer 0 has reached.
+    ///  - Return the fraction of reachable nodes to the total number of nodes in the sub-graph.
+    ///
+    /// Coin probability `q` is a parameter of this function. By default, it is 0.5.
+    pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 {


🛠️ Refactor suggestion

Add parameter validation and improve documentation for the probability parameter.

The connectivity estimation algorithm is well-implemented, but the q parameter should be validated and better documented.

Apply this diff to add validation and improve documentation:

/// For a given sub-graph defined by points, returns connectivity estimation. /// How it works: /// - Select entry point, it would be a point with the highest level. If there are several, pick first one. /// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not. /// - Count number of nodes reachable from the entry point. /// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached. /// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph. /// -/// Coin probability `q` is a parameter of this function. By default, it is 0.5. +/// # Parameters +/// - `points`: The subset of points defining the subgraph +/// - `q`: Edge retention probability (0.0 to 1.0). Each edge is kept with probability `q` during BFS traversal. +/// Higher values result in more optimistic connectivity estimates. pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 { + debug_assert!( + (0.0..=1.0).contains(&q), + "Edge retention probability q must be between 0.0 and 1.0, got {q}" + ); if points.is_empty() { return 1.0; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

/// For a given sub-graph defined by points, returns connectivity estimation.

/// How it works:

/// - Select entry point, it would be a point with the highest level. If there are several, pick first one.

/// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not.

/// - Count number of nodes reachable from the entry point.

/// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached.

/// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph.

///

/// Coin probability `q` is a parameter of this function. By default, it is 0.5.

pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 {

/// For a given sub-graph defined by points, returns connectivity estimation.

/// How it works:

/// - Select entry point, it would be a point with the highest level. If there are several, pick first one.

/// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not.

/// - Count number of nodes reachable from the entry point.

/// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached.

/// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph.

///

/// # Parameters

/// - `points`: The subset of points defining the subgraph

/// - `q`: Edge retention probability (0.0 to 1.0). Each edge is kept with probability `q` during BFS traversal.

/// Higher values result in more optimistic connectivity estimates.

pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 {

debug_assert!(

(0.0..=1.0).contains(&q),

"Edge retention probability q must be between 0.0 and 1.0, got {q}"

);

if points.is_empty() {

return 1.0;

}

// ... rest of implementation ...

}

🤖 Prompt for AI Agents

In lib/segment/src/index/hnsw_index/graph_layers_builder.rs around lines 83 to 92, the subgraph_connectivity function's parameter q, representing the coin flip probability, lacks validation and clear documentation. Update the function documentation to explicitly state that q must be between 0.0 and 1.0 inclusive, describing its role as the probability of keeping an edge during BFS. Add a validation check at the start of the function to ensure q is within this range, and if not, handle the error appropriately, such as by returning a default value or panicking with a clear message.

lib/common/common/src/ext.rs

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

83-92: Address past review feedback and fix parameter documentation.

This duplicates the previous review comment about parameter validation and documentation. The q parameter still lacks proper validation and clear documentation.

🧹 Nitpick comments (2)

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)
99-101: Consider memory efficiency for sparse point ID ranges.

Creating bit vectors with size max_point_id + 1 could be memory-intensive if point IDs are sparse (e.g., if points are [1, 1000000], it allocates for 1M+ elements but only uses 2).

Consider using a HashSet<PointOffsetType> instead of BitVec for sparse point ID ranges, or add a comment explaining the memory trade-off assumption.

98-98: Add safety comment for unwrap usage.

While the unwrap() is safe due to the empty check at line 93, adding a comment would improve code clarity.
-    let max_point_id = *points.iter().max().unwrap();
+    let max_point_id = *points.iter().max().unwrap(); // Safe: points is not empty

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e818e4 and cf009e2.

📒 Files selected for processing (1)

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: test-consensus-compose
GitHub Check: integration-tests-consensus
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test-consistency
GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-low-resources
GitHub Check: integration-tests
GitHub Check: storage-compat-test
GitHub Check: rust-tests (windows-latest)
GitHub Check: rust-tests (ubuntu-latest)
GitHub Check: rust-tests (macos-latest)
GitHub Check: lint

🔇 Additional comments (2)

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)

7-7: LGTM: Import addition supports bit vector operations.

The new import use common::ext::BitSliceExt; is needed for the get_bit method calls used in the subgraph_connectivity method.

75-76: LGTM: Well-documented constant for search budget.

The constant SUBGRAPH_CONNECTIVITY_SEARCH_BUDGET is appropriately named and documented for controlling the minimum search effort in connectivity estimation.

lib/segment/src/index/hnsw_index/graph_layers_builder.rs

* refactor build_filtered_graph * utils * fmt * implement connectivity estimation * Improve connectivity check in HNSW index (#6571) * Use BitVec::fill instead of our custom function

github-actions bot mentioned this pull request May 1, 2025

Flaky test hnsw_discover_test::hnsw_discover_precision #2973

Open

JojiiOfficial and others added 4 commits May 22, 2025 09:10

refactor build_filtered_graph

f9815e7

utils

85ca7e0

fmt

f452fe5

implement connectivity estimation

9b6dcaf

JojiiOfficial force-pushed the payload-hnsw-links-connectivity-check branch from 370cfb3 to 9b6dcaf Compare May 22, 2025 07:16

Improve connectivity check in HNSW index (#6571)

1e818e4

* Improve connectivity check in HNSW index * Add budget for checks * Properly set visited flag * Change budget approach * review --------- Co-authored-by: generall <andrey@vasnetsov.com>

generall marked this pull request as ready for review May 30, 2025 12:57

generall requested a review from JojiiOfficial May 30, 2025 12:58

JojiiOfficial approved these changes May 30, 2025

View reviewed changes

coderabbitai bot reviewed May 30, 2025

View reviewed changes

timvisee requested changes May 30, 2025

View reviewed changes

lib/common/common/src/ext.rs Outdated Show resolved Hide resolved

Use BitVec::fill instead of our custom function

cf009e2

JojiiOfficial requested a review from timvisee May 30, 2025 13:18

coderabbitai bot reviewed May 30, 2025

View reviewed changes

lib/segment/src/index/hnsw_index/graph_layers_builder.rs Show resolved Hide resolved

lib/segment/src/index/hnsw_index/graph_layers_builder.rs Show resolved Hide resolved

IvanPleshkov self-requested a review June 4, 2025 10:03

timvisee approved these changes Jun 4, 2025

View reviewed changes

generall merged commit bc5b131 into dev Jun 4, 2025
17 checks passed

generall deleted the payload-hnsw-links-connectivity-check branch June 4, 2025 14:30

generall added a commit that referenced this pull request Jul 17, 2025

Merge pull request #6479

52ca4ff

* refactor build_filtered_graph * utils * fmt * implement connectivity estimation * Improve connectivity check in HNSW index (#6571) * Use BitVec::fill instead of our custom function

This was referenced Jul 22, 2025

[map index] use roaring bitmap in mutable map index #6926

Merged

Do no delete payload index on point deletion to avoid data inconsistency #6930

Closed

Clean up GraphLayers::search_entry_on_level a bit #6946

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Attempt to speedup additional links construction by measuring sub-graph connectivity first #6479

Attempt to speedup additional links construction by measuring sub-graph connectivity first #6479

Uh oh!

generall commented May 1, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented May 30, 2025 •

edited

Loading

Walkthrough

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot May 30, 2025

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Attempt to speedup additional links construction by measuring sub-graph connectivity first #6479

Attempt to speedup additional links construction by measuring sub-graph connectivity first #6479

Uh oh!

Conversation

generall commented May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented May 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot May 30, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

generall commented May 1, 2025 •

edited

Loading

coderabbitai bot commented May 30, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)