Skip to content

Conversation

generall
Copy link
Member

@generall generall commented May 1, 2025

H&M benchmark: https://github.com/qdrant/vector-db-benchmark/actions/runs/14785548595


ToDo:

  • Check that this feature doesn't break multitenancy
  • Check speedup / accuracy on other datasets, ideally for each datatype
  • Estimate indexing overhead for cases, where no shortcuts were possible

More benchmarks in #6571

@JojiiOfficial JojiiOfficial force-pushed the payload-hnsw-links-connectivity-check branch from 370cfb3 to 9b6dcaf Compare May 22, 2025 07:16
* Improve connectivity check in HNSW index

* Add budget for checks

* Properly set visited flag

* Change budget approach

* review

---------

Co-authored-by: generall <andrey@vasnetsov.com>
@generall generall marked this pull request as ready for review May 30, 2025 12:57
@generall generall requested a review from JojiiOfficial May 30, 2025 12:58
Copy link
Contributor

coderabbitai bot commented May 30, 2025

📝 Walkthrough

Walkthrough

The changes introduce new methods and logic to improve the efficiency and flexibility of HNSW graph construction and analysis. A method for randomly selecting an entry point based on a filter is added to the EntryPoints struct. The GraphLayersBuilder struct gains a method to estimate the connectivity of a subgraph, simulating random edge failures. In the main HNSW index logic, a connectivity-based check is implemented to decide whether to skip building additional subgraphs for payload-indexed fields if they are already sufficiently connected, based on percolation theory. The StructPayloadIndex struct now includes a method to check if a field is marked as a tenant.

Suggested reviewers

  • generall
  • IvanPleshkov
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
lib/segment/src/index/hnsw_index/entry_points.rs (1)

114-147: Remove redundant clone() calls to improve performance.

The implementation correctly adds randomized entry point selection, but there are unnecessary clone operations that can be removed.

Apply this diff to remove redundant clones:

 pub fn get_random_entry_point<F, R: Rng + ?Sized>(
     &self,
     rnd: &mut R,
     checker: F,
 ) -> Option<EntryPoint>
 where
     F: Fn(PointOffsetType) -> bool,
 {
     let filtered_entry_points: Vec<_> = self
         .entry_points
         .iter()
         .filter(|entry| checker(entry.point_id))
         .cloned()
         .collect();

     if !filtered_entry_points.is_empty() {
         let random_index = rnd.random_range(0..filtered_entry_points.len());
-        return Some(filtered_entry_points[random_index].clone());
+        return Some(filtered_entry_points[random_index]);
     }

     let filtered_extra_entry_points: Vec<_> = self
         .extra_entry_points
         .iter_unsorted()
         .filter(|entry| checker(entry.point_id))
         .cloned()
         .collect();

     if !filtered_extra_entry_points.is_empty() {
         let random_index = rnd.random_range(0..filtered_extra_entry_points.len());
-        return Some(filtered_extra_entry_points[random_index].clone());
+        return Some(filtered_extra_entry_points[random_index]);
     }

     None
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8fd25c and 1e818e4.

📒 Files selected for processing (5)
  • lib/common/common/src/ext.rs (2 hunks)
  • lib/segment/src/index/hnsw_index/entry_points.rs (2 hunks)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2 hunks)
  • lib/segment/src/index/hnsw_index/hnsw.rs (6 hunks)
  • lib/segment/src/index/struct_payload_index.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/segment/src/index/struct_payload_index.rs (1)
lib/segment/src/types.rs (1)
  • is_tenant (1860-1865)
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-low-resources
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests-consensus
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: lint
🔇 Additional comments (9)
lib/common/common/src/ext.rs (1)

26-47: LGTM! Efficient bit manipulation implementation.

The set_all method provides an efficient way to set all bits in a BitVec by operating directly on the underlying raw storage. This is a good foundation for the connectivity computations in the HNSW index modules.

lib/segment/src/index/struct_payload_index.rs (1)

381-387: LGTM! Clean implementation of tenant field checking.

The method correctly checks if a field is marked as a tenant by looking it up in the indexed fields configuration and delegating to the schema's is_tenant() method.

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

75-76: LGTM! Reasonable budget constant for connectivity estimation.

The value of 64 provides a good balance between accuracy and performance for the connectivity approximation.

lib/segment/src/index/hnsw_index/hnsw.rs (6)

19-19: LGTM!

The addition of trace logging is appropriate for the detailed connectivity debugging added in this PR.


457-461: LGTM!

Proper initialization of payload_m0 with fallback and early exit optimization when additional graph building is not needed.


469-500: Excellent implementation of connectivity estimation!

The percolation theory-based approach is well-implemented:

  • Smart choice to sample at 2/K instead of 1/K for stability
  • Taking the maximum of 3 samples effectively handles noise
  • The threshold check (>= 4) ensures meaningful measurements
  • Clear documentation explains the mathematical reasoning

529-572: Well-structured connectivity-based optimization!

The implementation correctly:

  • Always builds graphs for tenant fields (multitenancy requirement)
  • Skips the first field from connectivity checks
  • Uses trace logging for debugging the optimization decisions
  • Properly handles the optional connectivity threshold

653-678: Good refactoring to improve code modularity!

Extracting the point filtering logic into a separate function improves code organization and reusability. The implementation correctly filters by condition and excludes deleted vectors.


680-695: Clean separation of concerns!

The refactoring properly separates point filtering from graph building logic. Accepting pre-computed points_to_index makes the function more focused and testable.

Comment on lines +83 to +92
/// For a given sub-graph defined by points, returns connectivity estimation.
/// How it works:
/// - Select entry point, it would be a point with the highest level. If there are several, pick first one.
/// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not.
/// - Count number of nodes reachable from the entry point.
/// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached.
/// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph.
///
/// Coin probability `q` is a parameter of this function. By default, it is 0.5.
pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add parameter validation and improve documentation for the probability parameter.

The connectivity estimation algorithm is well-implemented, but the q parameter should be validated and better documented.

Apply this diff to add validation and improve documentation:

 /// For a given sub-graph defined by points, returns connectivity estimation.
 /// How it works:
 ///  - Select entry point, it would be a point with the highest level. If there are several, pick first one.
 ///  - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not.
 ///  - Count number of nodes reachable from the entry point.
 ///  - Use visited points as entry points for the next layer below and repeat until layer 0 has reached.
 ///  - Return the fraction of reachable nodes to the total number of nodes in the sub-graph.
 ///
-/// Coin probability `q` is a parameter of this function. By default, it is 0.5.
+/// # Parameters
+/// - `points`: The subset of points defining the subgraph
+/// - `q`: Edge retention probability (0.0 to 1.0). Each edge is kept with probability `q` during BFS traversal.
+///        Higher values result in more optimistic connectivity estimates.
 pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 {
+    debug_assert!(
+        (0.0..=1.0).contains(&q),
+        "Edge retention probability q must be between 0.0 and 1.0, got {q}"
+    );
     if points.is_empty() {
         return 1.0;
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// For a given sub-graph defined by points, returns connectivity estimation.
/// How it works:
/// - Select entry point, it would be a point with the highest level. If there are several, pick first one.
/// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not.
/// - Count number of nodes reachable from the entry point.
/// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached.
/// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph.
///
/// Coin probability `q` is a parameter of this function. By default, it is 0.5.
pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 {
/// For a given sub-graph defined by points, returns connectivity estimation.
/// How it works:
/// - Select entry point, it would be a point with the highest level. If there are several, pick first one.
/// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not.
/// - Count number of nodes reachable from the entry point.
/// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached.
/// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph.
///
/// # Parameters
/// - `points`: The subset of points defining the subgraph
/// - `q`: Edge retention probability (0.0 to 1.0). Each edge is kept with probability `q` during BFS traversal.
/// Higher values result in more optimistic connectivity estimates.
pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 {
debug_assert!(
(0.0..=1.0).contains(&q),
"Edge retention probability q must be between 0.0 and 1.0, got {q}"
);
if points.is_empty() {
return 1.0;
}
// ... rest of implementation ...
}
🤖 Prompt for AI Agents
In lib/segment/src/index/hnsw_index/graph_layers_builder.rs around lines 83 to
92, the subgraph_connectivity function's parameter q, representing the coin flip
probability, lacks validation and clear documentation. Update the function
documentation to explicitly state that q must be between 0.0 and 1.0 inclusive,
describing its role as the probability of keeping an edge during BFS. Add a
validation check at the start of the function to ensure q is within this range,
and if not, handle the error appropriately, such as by returning a default value
or panicking with a clear message.

@JojiiOfficial JojiiOfficial requested a review from timvisee May 30, 2025 13:18
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

83-92: Address past review feedback and fix parameter documentation.

This duplicates the previous review comment about parameter validation and documentation. The q parameter still lacks proper validation and clear documentation.

🧹 Nitpick comments (2)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)

99-101: Consider memory efficiency for sparse point ID ranges.

Creating bit vectors with size max_point_id + 1 could be memory-intensive if point IDs are sparse (e.g., if points are [1, 1000000], it allocates for 1M+ elements but only uses 2).

Consider using a HashSet<PointOffsetType> instead of BitVec for sparse point ID ranges, or add a comment explaining the memory trade-off assumption.


98-98: Add safety comment for unwrap usage.

While the unwrap() is safe due to the empty check at line 93, adding a comment would improve code clarity.

-    let max_point_id = *points.iter().max().unwrap();
+    let max_point_id = *points.iter().max().unwrap(); // Safe: points is not empty
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e818e4 and cf009e2.

📒 Files selected for processing (1)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consistency
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: lint
🔇 Additional comments (2)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)

7-7: LGTM: Import addition supports bit vector operations.

The new import use common::ext::BitSliceExt; is needed for the get_bit method calls used in the subgraph_connectivity method.


75-76: LGTM: Well-documented constant for search budget.

The constant SUBGRAPH_CONNECTIVITY_SEARCH_BUDGET is appropriately named and documented for controlling the minimum search effort in connectivity estimation.

@IvanPleshkov IvanPleshkov self-requested a review June 4, 2025 10:03
@generall generall merged commit bc5b131 into dev Jun 4, 2025
17 checks passed
@generall generall deleted the payload-hnsw-links-connectivity-check branch June 4, 2025 14:30
generall added a commit that referenced this pull request Jul 17, 2025
* refactor build_filtered_graph

* utils

* fmt

* implement connectivity estimation

* Improve connectivity check in HNSW index (#6571)

* Use BitVec::fill instead of our custom function
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants