-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Attempt to speedup additional links construction by measuring sub-graph connectivity first #6479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
370cfb3
to
9b6dcaf
Compare
* Improve connectivity check in HNSW index * Add budget for checks * Properly set visited flag * Change budget approach * review --------- Co-authored-by: generall <andrey@vasnetsov.com>
📝 WalkthroughWalkthroughThe changes introduce new methods and logic to improve the efficiency and flexibility of HNSW graph construction and analysis. A method for randomly selecting an entry point based on a filter is added to the Suggested reviewers
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
lib/segment/src/index/hnsw_index/entry_points.rs (1)
114-147
: Remove redundant clone() calls to improve performance.The implementation correctly adds randomized entry point selection, but there are unnecessary clone operations that can be removed.
Apply this diff to remove redundant clones:
pub fn get_random_entry_point<F, R: Rng + ?Sized>( &self, rnd: &mut R, checker: F, ) -> Option<EntryPoint> where F: Fn(PointOffsetType) -> bool, { let filtered_entry_points: Vec<_> = self .entry_points .iter() .filter(|entry| checker(entry.point_id)) .cloned() .collect(); if !filtered_entry_points.is_empty() { let random_index = rnd.random_range(0..filtered_entry_points.len()); - return Some(filtered_entry_points[random_index].clone()); + return Some(filtered_entry_points[random_index]); } let filtered_extra_entry_points: Vec<_> = self .extra_entry_points .iter_unsorted() .filter(|entry| checker(entry.point_id)) .cloned() .collect(); if !filtered_extra_entry_points.is_empty() { let random_index = rnd.random_range(0..filtered_extra_entry_points.len()); - return Some(filtered_extra_entry_points[random_index].clone()); + return Some(filtered_extra_entry_points[random_index]); } None }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
lib/common/common/src/ext.rs
(2 hunks)lib/segment/src/index/hnsw_index/entry_points.rs
(2 hunks)lib/segment/src/index/hnsw_index/graph_layers_builder.rs
(2 hunks)lib/segment/src/index/hnsw_index/hnsw.rs
(6 hunks)lib/segment/src/index/struct_payload_index.rs
(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/segment/src/index/struct_payload_index.rs (1)
lib/segment/src/types.rs (1)
is_tenant
(1860-1865)
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test-low-resources
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests-consensus
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: lint
🔇 Additional comments (9)
lib/common/common/src/ext.rs (1)
26-47
: LGTM! Efficient bit manipulation implementation.The
set_all
method provides an efficient way to set all bits in aBitVec
by operating directly on the underlying raw storage. This is a good foundation for the connectivity computations in the HNSW index modules.lib/segment/src/index/struct_payload_index.rs (1)
381-387
: LGTM! Clean implementation of tenant field checking.The method correctly checks if a field is marked as a tenant by looking it up in the indexed fields configuration and delegating to the schema's
is_tenant()
method.lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)
75-76
: LGTM! Reasonable budget constant for connectivity estimation.The value of 64 provides a good balance between accuracy and performance for the connectivity approximation.
lib/segment/src/index/hnsw_index/hnsw.rs (6)
19-19
: LGTM!The addition of
trace
logging is appropriate for the detailed connectivity debugging added in this PR.
457-461
: LGTM!Proper initialization of
payload_m0
with fallback and early exit optimization when additional graph building is not needed.
469-500
: Excellent implementation of connectivity estimation!The percolation theory-based approach is well-implemented:
- Smart choice to sample at 2/K instead of 1/K for stability
- Taking the maximum of 3 samples effectively handles noise
- The threshold check (
>= 4
) ensures meaningful measurements- Clear documentation explains the mathematical reasoning
529-572
: Well-structured connectivity-based optimization!The implementation correctly:
- Always builds graphs for tenant fields (multitenancy requirement)
- Skips the first field from connectivity checks
- Uses trace logging for debugging the optimization decisions
- Properly handles the optional connectivity threshold
653-678
: Good refactoring to improve code modularity!Extracting the point filtering logic into a separate function improves code organization and reusability. The implementation correctly filters by condition and excludes deleted vectors.
680-695
: Clean separation of concerns!The refactoring properly separates point filtering from graph building logic. Accepting pre-computed
points_to_index
makes the function more focused and testable.
/// For a given sub-graph defined by points, returns connectivity estimation. | ||
/// How it works: | ||
/// - Select entry point, it would be a point with the highest level. If there are several, pick first one. | ||
/// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not. | ||
/// - Count number of nodes reachable from the entry point. | ||
/// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached. | ||
/// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph. | ||
/// | ||
/// Coin probability `q` is a parameter of this function. By default, it is 0.5. | ||
pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add parameter validation and improve documentation for the probability parameter.
The connectivity estimation algorithm is well-implemented, but the q
parameter should be validated and better documented.
Apply this diff to add validation and improve documentation:
/// For a given sub-graph defined by points, returns connectivity estimation.
/// How it works:
/// - Select entry point, it would be a point with the highest level. If there are several, pick first one.
/// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not.
/// - Count number of nodes reachable from the entry point.
/// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached.
/// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph.
///
-/// Coin probability `q` is a parameter of this function. By default, it is 0.5.
+/// # Parameters
+/// - `points`: The subset of points defining the subgraph
+/// - `q`: Edge retention probability (0.0 to 1.0). Each edge is kept with probability `q` during BFS traversal.
+/// Higher values result in more optimistic connectivity estimates.
pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 {
+ debug_assert!(
+ (0.0..=1.0).contains(&q),
+ "Edge retention probability q must be between 0.0 and 1.0, got {q}"
+ );
if points.is_empty() {
return 1.0;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
/// For a given sub-graph defined by points, returns connectivity estimation. | |
/// How it works: | |
/// - Select entry point, it would be a point with the highest level. If there are several, pick first one. | |
/// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not. | |
/// - Count number of nodes reachable from the entry point. | |
/// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached. | |
/// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph. | |
/// | |
/// Coin probability `q` is a parameter of this function. By default, it is 0.5. | |
pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 { | |
/// For a given sub-graph defined by points, returns connectivity estimation. | |
/// How it works: | |
/// - Select entry point, it would be a point with the highest level. If there are several, pick first one. | |
/// - Start Breadth-First Search (BFS) from the entry point, on each edge flip a coin to decide if the edge is removed or not. | |
/// - Count number of nodes reachable from the entry point. | |
/// - Use visited points as entry points for the next layer below and repeat until layer 0 has reached. | |
/// - Return the fraction of reachable nodes to the total number of nodes in the sub-graph. | |
/// | |
/// # Parameters | |
/// - `points`: The subset of points defining the subgraph | |
/// - `q`: Edge retention probability (0.0 to 1.0). Each edge is kept with probability `q` during BFS traversal. | |
/// Higher values result in more optimistic connectivity estimates. | |
pub fn subgraph_connectivity(&self, points: &[PointOffsetType], q: f32) -> f32 { | |
debug_assert!( | |
(0.0..=1.0).contains(&q), | |
"Edge retention probability q must be between 0.0 and 1.0, got {q}" | |
); | |
if points.is_empty() { | |
return 1.0; | |
} | |
// ... rest of implementation ... | |
} |
🤖 Prompt for AI Agents
In lib/segment/src/index/hnsw_index/graph_layers_builder.rs around lines 83 to
92, the subgraph_connectivity function's parameter q, representing the coin flip
probability, lacks validation and clear documentation. Update the function
documentation to explicitly state that q must be between 0.0 and 1.0 inclusive,
describing its role as the probability of keeping an edge during BFS. Add a
validation check at the start of the function to ensure q is within this range,
and if not, handle the error appropriately, such as by returning a default value
or panicking with a clear message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)
83-92
: Address past review feedback and fix parameter documentation.This duplicates the previous review comment about parameter validation and documentation. The
q
parameter still lacks proper validation and clear documentation.
🧹 Nitpick comments (2)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)
99-101
: Consider memory efficiency for sparse point ID ranges.Creating bit vectors with size
max_point_id + 1
could be memory-intensive if point IDs are sparse (e.g., if points are [1, 1000000], it allocates for 1M+ elements but only uses 2).Consider using a
HashSet<PointOffsetType>
instead ofBitVec
for sparse point ID ranges, or add a comment explaining the memory trade-off assumption.
98-98
: Add safety comment for unwrap usage.While the
unwrap()
is safe due to the empty check at line 93, adding a comment would improve code clarity.- let max_point_id = *points.iter().max().unwrap(); + let max_point_id = *points.iter().max().unwrap(); // Safe: points is not empty
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests-consensus
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consistency
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: integration-tests
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: lint
🔇 Additional comments (2)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)
7-7
: LGTM: Import addition supports bit vector operations.The new import
use common::ext::BitSliceExt;
is needed for theget_bit
method calls used in thesubgraph_connectivity
method.
75-76
: LGTM: Well-documented constant for search budget.The constant
SUBGRAPH_CONNECTIVITY_SEARCH_BUDGET
is appropriately named and documented for controlling the minimum search effort in connectivity estimation.
* refactor build_filtered_graph * utils * fmt * implement connectivity estimation * Improve connectivity check in HNSW index (#6571) * Use BitVec::fill instead of our custom function
H&M benchmark: https://github.com/qdrant/vector-db-benchmark/actions/runs/14785548595
ToDo:
More benchmarks in #6571