
Conversation

generall (Member)

Apply the same under-sampling as we do for segments, but on the shard level.
Motivation: large-scale deployments with high-limit queries can benefit from this optimization, because data transfer between shards becomes a bottleneck.

Some considerations:

  • Unlike segments, shards are guaranteed to have randomly and independently distributed data
  • Re-requests to shards are too expensive, so we don't make them
  • Without this optimization, high-limit queries cannot scale horizontally efficiently (see the sketch below)
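
A rough sketch of the idea (hypothetical code, not the PR's implementation; the confidence cutoff is an assumption): the number of true top-`limit` hits that land in any one shard is approximately Poisson-distributed with λ = limit / num_shards, inflated by a small safety factor, so each shard only needs to be asked for a high quantile of that distribution rather than the full limit.

/// Safety factor named in this PR; the cutoff probability below is an
/// illustrative assumption, not the value used by Qdrant's
/// find_search_sampling_over_point_distribution.
const MORE_ENSURANCE_FACTOR: f64 = 1.2;
const ASSUMED_CONFIDENCE: f64 = 0.999;

/// Smallest per-shard limit m with P(Poisson(lambda) <= m) >= ASSUMED_CONFIDENCE.
fn undersampled_shard_limit(total_limit: usize, num_shards: usize) -> usize {
    let lambda = total_limit as f64 / num_shards as f64 * MORE_ENSURANCE_FACTOR;
    // Walk the Poisson CDF: pmf(0) = e^-lambda, pmf(m) = pmf(m-1) * lambda / m.
    let mut pmf = (-lambda).exp();
    let mut cdf = pmf;
    let mut m = 0usize;
    while cdf < ASSUMED_CONFIDENCE && m < total_limit {
        m += 1;
        pmf *= lambda / m as f64;
        cdf += pmf;
    }
    m // never exceeds the original limit, so sampling can only reduce transfer
}

fn main() {
    // PR example: limit = 1000, 10 shards -> lambda = 120.
    // With the assumed 0.999 cutoff this yields ~154 per shard; the PR's
    // worked example arrives at 171, so the upstream cutoff is evidently stricter.
    println!("{}", undersampled_shard_limit(1000, 10));
}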


coderabbitai bot commented Feb 24, 2025

📝 Walkthrough

The changes introduce two new constants and a method in the collection query logic to manage undersampling when high limits are requested across multiple shards. The method checks conditions such as the number of shards and auto-sharding, adjusting batch query requests using a probabilistic sampling function if necessary. Additionally, the visibility of a module related to probabilistic search sampling has been updated from private to public, and the corresponding import path in the segments searcher file has been modified to reflect the renamed module.
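
A hypothetical distillation of those guard conditions (the constant name follows the PR; the real method rewrites batch requests rather than returning a bool, and this signature is assumed):

/// Threshold named in this PR.
const SHARD_QUERY_SUBSAMPLING_LIMIT: usize = 128;

/// Sketch of when undersampling applies.
fn should_undersample(limit: usize, num_shards: usize, is_auto_sharding: bool) -> bool {
    limit > SHARD_QUERY_SUBSAMPLING_LIMIT // small limits are not worth sampling
        && num_shards > 1                 // a single shard gets the full limit anyway
        && is_auto_sharding               // only random placement justifies sampling
}

fn main() {
    assert!(should_undersample(1_000, 10, true));
    assert!(!should_undersample(50, 10, true)); // below the threshold
}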

Changes

File(s) and change summary:

  • lib/.../collection/query.rs — Added constants SHARD_QUERY_SUBSAMPLING_LIMIT and MORE_ENSURANCE_FACTOR, and new method modify_shard_query_for_undersampling_limits to adjust batch requests for undersampling. Integrated this method into batch_query_shards_concurrently.
  • lib/.../collection_manager/mod.rs, lib/.../collection_manager/segments_searcher.rs — Updated module declaration from probabilistic_segment_search_sampling (private) to probabilistic_search_sampling (public) and changed the import path for find_search_sampling_over_point_distribution accordingly.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Collection as Collection
    participant Modifier as modify_shard_query_for_undersampling_limits
    participant Shards as Shard Queries

    Client->>Collection: batch_query_shards_concurrently(batch_request, num_shards, auto_sharding)
    Collection->>Modifier: Invoke modify_shard_query_for_undersampling_limits(...)
    Modifier-->>Collection: Return modified batch request
    Collection->>Shards: Dispatch modified queries concurrently


Comment on lines +153 to +157
let batch_request = Self::modify_shard_query_for_undersampling_limits(
    batch_request,
    target_shards.len(),
    is_auto_sharding,
);
generall (Member, Author)

I was trying to find a better place for this modification, but was not able to, because:

  • Higher-level functions need to know the original limit and offset of the query for the proper merge operation, so we can't modify the original query higher in the call stack
  • Modifying the limit lower in the call stack is also problematic, as we need to know "context" that depends on the shard selection, and we should not know about the shard selector inside a shard

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
lib/collection/src/collection/query.rs (2)

64-64: Consider making the subsampling limit configurable.

SHARD_QUERY_SUBSAMPLING_LIMIT is currently a constant (128). If different deployments require tuning for optimal performance, consider making it a parameter in configuration or environment variables, rather than a hard-coded constant.


143-148: Check potential overhead of counting unique shard keys.

Using iter().unique().count() is appropriate here but introduces a small overhead. Since this runs frequently, ensure performance is acceptable for large numbers of shards.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ec07b6 and ef1c6ca.

📒 Files selected for processing (3)
  • lib/collection/src/collection/query.rs (3 hunks)
  • lib/collection/src/collection_manager/mod.rs (1 hunks)
  • lib/collection/src/collection_manager/segments_searcher.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • lib/collection/src/collection_manager/segments_searcher.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consistency
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test
🔇 Additional comments (4)
lib/collection/src/collection_manager/mod.rs (1)

6-6: Make sure public module visibility is intentional.

Changing the module from private to public allows external crates or modules to directly reference its contents. Verify that this increased visibility is desired, and consider adding module-level documentation for clarity.

lib/collection/src/collection/query.rs (3)

18-18: Consistent import path rename.

Importing find_search_sampling_over_point_distribution from the renamed probabilistic_search_sampling module aligns with the new public module declaration.


66-128: Ensure offset handling remains correct under subsampling.

This function resets the offset to zero, assuming offsets are handled at the collection level. Confirm that the final result set accurately respects user-specified offsets after merging shards.


151-158: Auto-sharding logic clarity.

is_auto_sharding is set to true when there is exactly one unique shard key. This condition works for typical auto-sharding scenarios, but ensure it also covers corner cases (e.g., partial user-specified shard subsets).


generall commented Feb 24, 2025

How to test:

Create a collection with 3 shards (they can be local) and 10,000 points. Disable the HNSW index.

Compare exact search vs. non-exact search with limit > 128; the results should agree to within 0.999 accuracy (roughly 1 error per 100 requests with a 10-shard distribution).


I am not sure how, or whether, to build an automated test, since it would require moving high volumes of fixture data.
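
One cheap way to sanity-check the statistics without large fixtures would be a pure simulation; a hypothetical sketch (the per-shard limit of 460 is an assumed value, in the ballpark a Poisson quantile would give for limit = 1000 over 3 shards):

// Simulates how often a shard holds more of the true top-`limit` hits than
// the undersampled per-shard request would retrieve. Uses an inline LCG
// instead of the `rand` crate to stay dependency-free.
fn main() {
    let (shards, limit, shard_limit, trials) = (3usize, 1_000usize, 460usize, 1_000usize);
    let mut seed: u64 = 42;
    let mut next = move || {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        seed >> 33
    };
    let mut misses = 0;
    for _ in 0..trials {
        // Assign each top-`limit` rank to a uniformly random shard.
        let mut per_shard = vec![0usize; shards];
        for _ in 0..limit {
            per_shard[(next() % shards as u64) as usize] += 1;
        }
        // The trial fails if any shard held more top hits than we asked it for.
        if per_shard.iter().any(|&c| c > shard_limit) {
            misses += 1;
        }
    }
    println!("miss rate: {}", misses as f64 / trials as f64);
}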

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
lib/collection/src/collection/query.rs (2)

120-120: Consider making the ensurance factor a constant.

The magic number 1.2 would be better as a named constant to improve maintainability and make it easier to tune if needed.

+    const MORE_ENSURANCE_FACTOR: f64 = 1.2;
-    let more_ensurance_factor = 1.2;

152-160: Consider extracting auto-sharding detection logic.

The auto-sharding detection logic and its explanation in comments would be clearer as a separate method with a descriptive name.

+    fn is_auto_sharding_enabled(target_shards: &[(Arc<Shard>, Option<ShardKey>)]) -> bool {
+        // Auto-sharding happens when we are only querying shards with the same shard key.
+        // This occurs either when:
+        // 1. Querying a specific shard key, or
+        // 2. Querying all shards with no shard keys specified
+        target_shards
+            .iter()
+            .map(|(_, shard_key)| shard_key)
+            .unique()
+            .count() == 1
+    }
+
-    let num_unique_shard_keys = target_shards
-        .iter()
-        .map(|(_, shard_key)| shard_key)
-        .unique()
-        .count();
-    // Auto-sharding happens when we are only querying shards with _the_same_ shard key.
-    // It either might be when we are querying a specific shard key
-    // OR when we are querying all shards with no shard keys specified.
-    let is_auto_sharding = num_unique_shard_keys == 1;
+    let is_auto_sharding = Self::is_auto_sharding_enabled(&target_shards);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef1c6ca and 1afba85.

📒 Files selected for processing (1)
  • lib/collection/src/collection/query.rs (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
🔇 Additional comments (4)
lib/collection/src/collection/query.rs (4)

18-18: LGTM!

The import is correctly placed and aligns with the PR's objective of implementing shard-level limit undersampling.


66-78: LGTM! Clear and well-documented rationale.

The documentation clearly explains the optimization strategy and its benefits over segment-level sampling, particularly noting the guaranteed random distribution in auto-sharding.


131-133:

✅ Verification successful

Verify offset handling implications.

The comment states that "Offset is handled on the collection level", but we should verify this doesn't affect result accuracy:

  1. Does resetting offset to 0 maintain result ordering?
  2. Could this affect pagination in edge cases?

🏁 Script executed:

#!/bin/bash
# Search for offset handling in collection-level code
rg -A 5 "offset.*collection level" --type rust

Length of output: 1685


Offset Handling Implementation Validated

The investigation confirms that resetting new_request.offset to 0 in lib/collection/src/collection/query.rs is intentional. The code in lib/collection/src/operations/universal_query/planned_query.rs further clarifies that final offset adjustments occur at the collection level. This design ensures that result ordering and pagination are preserved without adverse side effects.
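
To make the collection-level offset handling concrete, a minimal hypothetical illustration (not Qdrant's actual types): each shard returns its best limit + offset hits with its own offset reset to 0, and the user's offset is applied only after the global merge.

fn merge_with_offset(shard_results: Vec<Vec<f32>>, limit: usize, offset: usize) -> Vec<f32> {
    let mut all: Vec<f32> = shard_results.into_iter().flatten().collect();
    // Higher score is better: sort descending, then apply the global page.
    all.sort_by(|a, b| b.partial_cmp(a).unwrap());
    all.into_iter().skip(offset).take(limit).collect()
}

fn main() {
    // Each shard already returned its own top (limit + offset) = 3 scores.
    let shards = vec![vec![0.9, 0.7, 0.3], vec![0.8, 0.6, 0.5]];
    // Global page with offset 1, limit 2 -> [0.8, 0.7].
    println!("{:?}", merge_with_offset(shards, 2, 1));
}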


64-64:

❓ Verification inconclusive

Verify the subsampling limit threshold.

The threshold of 128 seems reasonable, but it would be helpful to validate this choice:

  1. Is this value derived from empirical testing?
  2. How does it compare to similar thresholds in segment-level sampling?

🌐 Web query:

What is a typical threshold for query result sampling in distributed search systems?

💡 Result:

Threshold estimation for query result sampling in distributed search systems often involves trade-offs between accuracy, computational cost, and overestimate risks. Common approaches include:

1. Sample Rate Thresholds

  • Random sampling typically uses rates between 1%–10% of the collection[1][2]. For example:
    • A 1% sample might be used for preliminary estimates in large-scale systems[1]
    • Higher rates (5–10%) improve accuracy at increased computational cost[2][4]

2. Statistical Confidence Thresholds
Methods like QBS-PL (Query-Based Sampling with Predictive Likelihood) use improvement thresholds (ε) to determine when to stop sampling:

  • ε = 0.05–0.10 is common for judging sufficient convergence[4]
  • This means sampling continues until additional iterations improve estimates by <5–10%

3. Hybrid Quantile-Sampling Methods
Modern approaches combine:

  • Quantile thresholds (precomputed top-k scores for term combinations)[1][2]
  • Adaptive sampling budgets (e.g., retrieving 80% of target documents from sample prefixes)[1][2]

For top-k processing algorithms like WAND/MaxScore:

# Typical parameterization from research
sample_rate = 0.01  # 1% sampling
k_prime = max(100, int(1.5 * k))  # Adjusted sample target
epsilon = 0.05      # 5% improvement threshold[4]

Key considerations:

  • Lower thresholds (ε=0.01) increase accuracy but require 2–3× more samples[4]
  • Systems handling diverse queries often use dynamic thresholds based on query complexity[1][2]
  • State-of-the-art methods achieve mean underprediction fractions (MUF) of 0.15–0.25 compared to ideal thresholds[2]

Current research shows optimal results with hybrid approaches combining quantile-based lower bounds with controlled random sampling[2][4].



Confirm the rationale behind the subsampling threshold.

The constant SHARD_QUERY_SUBSAMPLING_LIMIT = 128 is set as an absolute limit for when undersampling is applied. However, typical distributed search systems often base such thresholds on dynamic, percentage-based, or empirically derived values—with sample rates usually ranging from 1% to 10% or using adaptive mechanisms for convergence. Please clarify:

  • Is the value 128 derived from empirical testing or performance benchmarks?
  • How does this fixed threshold compare to similar values (or adaptive thresholds) used in segment-level sampling?
  • Would a dynamic or configurable threshold be more appropriate given the trade-offs between accuracy and computational cost?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
lib/collection/src/collection/query.rs (2)

73-139: Well-structured implementation with room for documentation improvements.

The implementation is solid with appropriate early returns and clear logic flow. Consider these documentation improvements:

 /// Unlike segments, however, the cost of re-requesting the data is much higher for shards.
-/// So we "accept" the risk of not getting enough results.
+/// So we accept a small probability of getting fewer results than requested,
+/// as the cost of re-requesting data from shards outweighs the benefit of perfect accuracy.
 
-/// Example: 1000 limit, 10 shards
-/// 1.0 / 10 * 1.2 = 0.12
-/// lambda = 0.12 * 1000 = 120
-/// Which is equal to 171 limit per shard
+/// Example calculation:
+/// - Input: limit = 1000, num_shards = 10
+/// - Base probability: 1.0 / num_shards = 0.1
+/// - Adjusted probability: 0.1 * MORE_ENSURANCE_FACTOR = 0.12
+/// - Lambda: 0.12 * limit = 120
+/// - Result: undersample_limit = 171 per shard
+/// This ensures we get enough results while reducing data transfer.
 
+/// Note: Offset is handled at the collection level to ensure correct
+/// global ordering of results across all shards.

154-168: Consider clarifying the auto-sharding detection logic.

The auto-sharding detection is correct but could benefit from more explicit documentation.

-        // Auto-sharding happens when we are only querying shards with _the_same_ shard key.
-        // It either might be when we are querying a specific shard key
-        // OR when we are querying all shards with no shard keys specified.
+        // Auto-sharding is detected when all target shards have the same shard key.
+        // This occurs in two scenarios:
+        // 1. When explicitly querying a specific shard key
+        // 2. When querying all shards with no shard key specified (default case)
+        // In both cases, we can safely apply undersampling as the data distribution
+        // is guaranteed to be random across these shards.
         let is_auto_sharding = num_unique_shard_keys == 1;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1afba85 and 46f2379.

📒 Files selected for processing (1)
  • lib/collection/src/collection/query.rs (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: test (macos-latest)
  • GitHub Check: test
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (2)
lib/collection/src/collection/query.rs (2)

18-18: LGTM!

The import is correctly placed and aligns with the PR's objective of implementing shard-level undersampling.


64-71: LGTM! Constants are well-chosen and documented.

The values align with the PR objectives:

  • SHARD_QUERY_SUBSAMPLING_LIMIT = 128 matches the testing threshold mentioned in the PR
  • MORE_ENSURANCE_FACTOR = 1.2 provides a reasonable 20% safety margin

Please verify these constants through the testing approach mentioned in the PR:

  1. Create a collection with 3 shards and 10,000 points
  2. Compare exact vs non-exact search results with limit > 128
  3. Confirm difference is within 0.999 (max 1 error per 100 requests)

@timvisee timvisee left a comment

I have two high-level questions:

Comment on lines +84 to +85
/// Unlike segments, however, the cost of re-requesting the data is much higher for shards.
/// So we "accept" the risk of not getting enough results.
timvisee (Member)

If it's so unlikely, can we still re-request in case we don't get enough results?

I imagine a latency spike to be a bit better than getting fewer results. Getting fewer results might make the user incorrectly think they have 'reached the end' or 'exhausted all'.

If we basically never expect this to happen, a latency spike might be a 'safer' option.

Wdyt?

generall (Member, Author)

We can, but it's a bit annoying to implement with our batching structure; it would require at least twice as many changes.

timvisee (Member)

Fair enough. And a user would still be able to use exact=true to bypass this.


for request in batch_request.iter() {
    let mut new_request = request.clone();
    let request_limit = new_request.limit + new_request.offset;
timvisee (Member)

Should we not go lower than the offset, so as to not destabilize results before our offset?

I understand that this wasn't 100% accurate or deterministic before, but I wonder if this might make it worse, amplifying inaccuracies and making the results all over the place.

generall (Member, Author)

To compensate for the accuracy loss, the user might increase hnsw_ef; it has the same effect as a higher limit, but won't saturate network/IO as badly.

agourlay (Member)

Do you maybe have some performance numbers to share based on a large realistic test deployment?

generall (Member, Author)

Do you maybe have some performance numbers to share based on a large realistic test deployment?

Yes: on a 10-machine cluster, a request with a 10k limit takes 2 seconds, while with a 1k limit it takes 0.1 seconds, with everything in RAM (including payload).

@@ -60,6 +61,83 @@ impl Collection {
        Ok(results.into_iter().next().unwrap())
    }

    /// If the query limit is above this value, it will be subject to undersampling.
    const SHARD_QUERY_SUBSAMPLING_LIMIT: usize = 128;
Contributor

Why specifically this number? For segments we use params.hnsw_ef

generall (Member, Author)

no specific reason

generall (Member, Author)

hnsw_ef was the right choice for segments, because we wanted to minimize HNSW overhead for large limits. This is not relevant at the shard level, where the bottleneck is the networking.

generall (Member, Author)

@agourlay added a debug message

@generall generall merged commit ff9a21f into dev Feb 26, 2025
17 checks passed
@generall generall deleted the poisson-shard-search branch February 26, 2025 09:14
timvisee pushed a commit that referenced this pull request Mar 21, 2025
* poisson-based undersampling for shard level search

* fmt

* give a bit more ensurance to undersampling

* review fixes

* add debug log in case undersampling can lead to accuracy loss
@timvisee timvisee mentioned this pull request Mar 21, 2025