Shard-level limit undersampling #6056
Conversation
📝 Walkthrough
The changes introduce two new constants and a method in the collection query logic to manage undersampling when high limits are requested across multiple shards. The method checks conditions such as the number of shards and auto-sharding, adjusting batch query requests with a probabilistic sampling function when necessary. Additionally, the visibility of the module for probabilistic search sampling has been changed from private to public, and the corresponding import path in the segments searcher file has been updated to reflect the renamed module.
Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Collection as Collection
    participant Modifier as modify_shard_query_for_undersampling_limits
    participant Shards as Shard Queries
    Client->>Collection: batch_query_shards_concurrently(batch_request, num_shards, auto_sharding)
    Collection->>Modifier: Invoke modify_shard_query_for_undersampling_limits(...)
    Modifier-->>Collection: Return modified batch request
    Collection->>Shards: Dispatch modified queries concurrently
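To make the summary above concrete, below is a minimal, self-contained sketch of the described flow. It is not the actual Qdrant code: the `ShardQueryRequest` struct, the function signature, and the `sampling_limit` helper (a normal approximation standing in for `find_search_sampling_over_point_distribution`) are simplified assumptions; only the two constants and the `limit + offset` handling follow the PR diff discussed below.

```rust
const SHARD_QUERY_SUBSAMPLING_LIMIT: usize = 128; // constant introduced by this PR
const MORE_ENSURANCE_FACTOR: f64 = 1.2; // safety margin introduced by this PR

struct ShardQueryRequest {
    limit: usize,
    offset: usize,
}

/// Stand-in for `find_search_sampling_over_point_distribution`: given the expected
/// number of relevant points per shard (lambda), pick a per-shard limit that is
/// only exceeded with negligible probability (normal approximation of a high
/// Poisson quantile; the real helper may compute this differently).
fn sampling_limit(lambda: f64) -> usize {
    (lambda + 4.0 * lambda.sqrt()).ceil() as usize
}

fn modify_shard_query_for_undersampling_limits(
    mut batch_request: Vec<ShardQueryRequest>,
    num_shards: usize,
    is_auto_sharding: bool,
) -> Vec<ShardQueryRequest> {
    // Undersampling only pays off with several shards and a random point
    // distribution (auto-sharding); otherwise requests are left untouched.
    if num_shards <= 1 || !is_auto_sharding {
        return batch_request;
    }
    for request in batch_request.iter_mut() {
        let requested = request.limit + request.offset;
        if requested <= SHARD_QUERY_SUBSAMPLING_LIMIT {
            continue; // small limits are not worth undersampling
        }
        let probability = (1.0 / num_shards as f64) * MORE_ENSURANCE_FACTOR;
        let lambda = probability * requested as f64;
        request.limit = sampling_limit(lambda);
        request.offset = 0; // the original offset is re-applied at the collection level
    }
    batch_request
}

fn main() {
    let modified = modify_shard_query_for_undersampling_limits(
        vec![ShardQueryRequest { limit: 1000, offset: 0 }],
        10,
        true,
    );
    println!("per-shard limit: {}", modified[0].limit); // ~164 with this approximation
}
```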
let batch_request = Self::modify_shard_query_for_undersampling_limits(
    batch_request,
    target_shards.len(),
    is_auto_sharding,
);
I was trying to find a better place for this modification, but was not able to, because:
- Higher-level functions need to know the original limit and offset of the query for the proper merge operation, so we can't modify the original query higher in the call stack.
- Modifying the limit lower in the call stack is also problematic, as we need to know the "context", which depends on the shard selection. And a shard should not know about the shard selector.
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/collection/src/collection/query.rs (2)
64-64: Consider making the subsampling limit configurable.
SHARD_QUERY_SUBSAMPLING_LIMIT is currently a constant (128). If different deployments require tuning for optimal performance, consider making it a configuration parameter or environment variable rather than a hard-coded constant.
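If this suggestion were adopted, one possible shape is sketched below; this is not part of the PR, and the environment variable name is hypothetical.

```rust
use std::env;
use std::sync::OnceLock;

// Sketch of the reviewer's suggestion: keep 128 as the default, but allow an
// override. The variable name below is hypothetical, not an existing setting.
fn shard_query_subsampling_limit() -> usize {
    static LIMIT: OnceLock<usize> = OnceLock::new();
    *LIMIT.get_or_init(|| {
        env::var("QDRANT__SHARD_QUERY_SUBSAMPLING_LIMIT")
            .ok()
            .and_then(|value| value.parse().ok())
            .unwrap_or(128)
    })
}
```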
143-148: Check the potential overhead of counting unique shard keys.
Using iter().unique().count() is appropriate here but introduces a small overhead. Since this runs frequently, ensure performance is acceptable for large numbers of shards.
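If that overhead ever shows up in profiles, a short-circuiting check is possible. The sketch below uses Itertools::all_equal with simplified types; unlike the `== 1` count, it returns true for an empty iterator, so that case would need a separate guard.

```rust
use itertools::Itertools;

// Stops at the first differing key instead of hashing and counting every key.
fn has_single_shard_key<K: PartialEq>(shard_keys: impl IntoIterator<Item = K>) -> bool {
    shard_keys.into_iter().all_equal()
}
```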
📒 Files selected for processing (3)
- lib/collection/src/collection/query.rs (3 hunks)
- lib/collection/src/collection_manager/mod.rs (1 hunks)
- lib/collection/src/collection_manager/segments_searcher.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- lib/collection/src/collection_manager/segments_searcher.rs
🔇 Additional comments (4)
lib/collection/src/collection_manager/mod.rs (1)
6-6: Make sure public module visibility is intentional.
Changing the module from private to public allows external crates or modules to directly reference its contents. Verify that this increased visibility is desired, and consider adding module-level documentation for clarity.
lib/collection/src/collection/query.rs (3)
18-18: Consistent import path rename.
Importing find_search_sampling_over_point_distribution from the renamed probabilistic_search_sampling module aligns with the new public module declaration.
66-128: Ensure offset handling remains correct under subsampling.
This function resets the offset to zero, assuming offsets are handled at the collection level. Confirm that the final result set accurately respects user-specified offsets after merging shards.
151-158: Auto-sharding logic clarity.
is_auto_sharding is set to true when there is exactly one unique shard key. This condition works for typical auto-sharding scenarios, but ensure it also covers corner cases (e.g., partial user-specified shard subsets).
How to test: create a collection with 3 shards (can be local) and 10000 points. Disable the HNSW index. Compare exact search vs non-exact search with limit > 128; the difference should be within 0.999 (1 error per 100 requests with a 10-shard distribution). I am not sure how, or whether, to build an automated test, since it would require moving high volumes of fixture data.
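As a rough illustration of the comparison described above, the accuracy check could measure how much of the exact top-k survives undersampling. This is a sketch only; creating the collection and issuing the exact and non-exact searches are omitted.

```rust
use std::collections::HashSet;

// Fraction of the exact top-k ids that the undersampled search also returned.
fn recall_at_k(exact_ids: &[u64], sampled_ids: &[u64]) -> f64 {
    let sampled: HashSet<&u64> = sampled_ids.iter().collect();
    let hits = exact_ids.iter().filter(|id| sampled.contains(id)).count();
    hits as f64 / exact_ids.len() as f64
}

fn main() {
    // Simulate a search with limit 200 where one point was missed.
    let exact: Vec<u64> = (0..200).collect();
    let mut sampled = exact.clone();
    sampled.remove(57);
    println!("recall: {}", recall_at_k(&exact, &sampled)); // 0.995
}
```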
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/collection/src/collection/query.rs (2)
120-120: Consider making the ensurance factor a constant.
The magic number 1.2 would be better as a named constant to improve maintainability and make it easier to tune if needed.

+ const MORE_ENSURANCE_FACTOR: f64 = 1.2;
- let more_ensurance_factor = 1.2;
152-160: Consider extracting auto-sharding detection logic.
The auto-sharding detection logic and its explanation in comments would be clearer as a separate method with a descriptive name.

+ fn is_auto_sharding_enabled(target_shards: &[(Arc<Shard>, Option<ShardKey>)]) -> bool {
+     // Auto-sharding happens when we are only querying shards with the same shard key.
+     // This occurs either when:
+     // 1. Querying a specific shard key, or
+     // 2. Querying all shards with no shard keys specified
+     target_shards
+         .iter()
+         .map(|(_, shard_key)| shard_key)
+         .unique()
+         .count() == 1
+ }
+
- let num_unique_shard_keys = target_shards
-     .iter()
-     .map(|(_, shard_key)| shard_key)
-     .unique()
-     .count();
- // Auto-sharding happens when we are only querying shards with _the_same_ shard key.
- // It either might be when we are querying a specific shard key
- // OR when we are querying all shards with no shard keys specified.
- let is_auto_sharding = num_unique_shard_keys == 1;
+ let is_auto_sharding = Self::is_auto_sharding_enabled(&target_shards);
📒 Files selected for processing (1)
- lib/collection/src/collection/query.rs (3 hunks)
🔇 Additional comments (4)
lib/collection/src/collection/query.rs (4)
18-18: LGTM!
The import is correctly placed and aligns with the PR's objective of implementing shard-level limit undersampling.
66-78: LGTM! Clear and well-documented rationale.
The documentation clearly explains the optimization strategy and its benefits over segment-level sampling, particularly noting the guaranteed random distribution in auto-sharding.
131-133: ✅ Verification successful
Verify offset handling implications.
The comment states that "Offset is handled on the collection level", but we should verify this doesn't affect result accuracy:
- Does resetting offset to 0 maintain result ordering?
- Could this affect pagination in edge cases?
🏁 Script executed:
#!/bin/bash
# Search for offset handling in collection-level code
rg -A 5 "offset.*collection level" --type rust

Length of output: 1685
Offset Handling Implementation Validated
The investigation confirms that resetting new_request.offset to 0 in lib/collection/src/collection/query.rs is intentional. The code in lib/collection/src/operations/universal_query/planned_query.rs further clarifies that final offset adjustments occur at the collection level. This design ensures that result ordering and pagination are preserved without adverse side effects.
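To illustrate why this is safe, here is a minimal sketch (assumed structure, not the actual Qdrant merge code) of the offset being applied only after the per-shard results are merged:

```rust
// Sketch: each shard returns its top (limit + offset) candidates; the collection
// merges them by score and only then applies the user's offset and limit globally,
// so a per-shard offset of 0 does not change the final page of results.
fn merge_with_offset(per_shard: Vec<Vec<(u64, f32)>>, offset: usize, limit: usize) -> Vec<(u64, f32)> {
    let mut all: Vec<(u64, f32)> = per_shard.into_iter().flatten().collect();
    all.sort_by(|a, b| b.1.total_cmp(&a.1)); // higher score first
    all.into_iter().skip(offset).take(limit).collect()
}

fn main() {
    let shard_a = vec![(1, 0.9), (3, 0.7)];
    let shard_b = vec![(2, 0.8), (4, 0.6)];
    // Page 2 with page size 2: points 3 and 4, regardless of which shard returned them.
    println!("{:?}", merge_with_offset(vec![shard_a, shard_b], 2, 2));
}
```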
64-64: ❓ Verification inconclusive
Verify the subsampling limit threshold.
The threshold of 128 seems reasonable, but it would be helpful to validate this choice:
- Is this value derived from empirical testing?
- How does it compare to similar thresholds in segment-level sampling?
🌐 Web query:
What is a typical threshold for query result sampling in distributed search systems?
💡 Result:
Threshold estimation for query result sampling in distributed search systems often involves trade-offs between accuracy, computational cost, and overestimate risks. Common approaches include:
1. Sample Rate Thresholds
- Random sampling typically uses rates between 1%–10% of the collection[1][2]. For example:
- A 1% sample might be used for preliminary estimates in large-scale systems[1]
- Higher rates (5–10%) improve accuracy at increased computational cost[2][4]
2. Statistical Confidence Thresholds
Methods like QBS-PL (Query-Based Sampling with Predictive Likelihood) use improvement thresholds (ε) to determine when to stop sampling:
- ε = 0.05–0.10 is common for judging sufficient convergence[4]
- This means sampling continues until additional iterations improve estimates by <5–10%
3. Hybrid Quantile-Sampling Methods
Modern approaches combine:
- Quantile thresholds (precomputed top-k scores for term combinations)[1][2]
- Adaptive sampling budgets (e.g., retrieving 80% of target documents from sample prefixes)[1][2]
For top-k processing algorithms like WAND/MaxScore:
# Typical parameterization from research
sample_rate = 0.01  # 1% sampling
k_prime = max(100, int(1.5 * k))  # Adjusted sample target
epsilon = 0.05  # 5% improvement threshold [4]

Key considerations:
- Lower thresholds (ε=0.01) increase accuracy but require 2–3× more samples[4]
- Systems handling diverse queries often use dynamic thresholds based on query complexity[1][2]
- State-of-the-art methods achieve mean underprediction fractions (MUF) of 0.15–0.25 compared to ideal thresholds[2]
Current research shows optimal results with hybrid approaches combining quantile-based lower bounds with controlled random sampling[2][4].
Citations:
- 1: http://engineering.nyu.edu/~suel/papers/thresh-cikm20.pdf
- 2: https://arxiv.org/html/2412.10701v1
- 3: https://hal.science/hal-04373202/document
- 4: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=6fefbc6c9602c089dce4112f8abb859f9603fe33
- 5: https://community.splunk.com/t5/Monitoring-Splunk/Is-it-possible-to-get-a-quot-sample-quot-set-of-search-results/m-p/15697
- 6: https://www.ibm.com/docs/bg/SSEV5M_6.0.0/com.ibm.iis.gnm.searchingnames.doc/topics/gnr_nh_api_compparms_name_threshold.html
- 7: https://academic.oup.com/comjnl/article-pdf/34/6/551/984688/34-6-551.pdf
- 8: http://www.cs.iit.edu/~yee/classes/cs595f03/100903_3.html
Confirm the rationale behind the subsampling threshold.
The constant SHARD_QUERY_SUBSAMPLING_LIMIT = 128 is set as an absolute limit for when undersampling is applied. However, typical distributed search systems often base such thresholds on dynamic, percentage-based, or empirically derived values, with sample rates usually ranging from 1% to 10% or using adaptive mechanisms for convergence. Please clarify:
- Is the value 128 derived from empirical testing or performance benchmarks?
- How does this fixed threshold compare to similar values (or adaptive thresholds) used in segment-level sampling?
- Would a dynamic or configurable threshold be more appropriate given the trade-offs between accuracy and computational cost?
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/collection/src/collection/query.rs (2)
73-139: Well-structured implementation with room for documentation improvements.
The implementation is solid with appropriate early returns and clear logic flow. Consider these documentation improvements:

 /// Unlike segments, however, the cost of re-requesting the data is much higher for shards.
-/// So we "accept" the risk of not getting enough results.
+/// So we accept a small probability of getting fewer results than requested,
+/// as the cost of re-requesting data from shards outweighs the benefit of perfect accuracy.
-/// Example: 1000 limit, 10 shards
-/// 1.0 / 10 * 1.2 = 0.12
-/// lambda = 0.12 * 1000 = 120
-/// Which is equal to 171 limit per shard
+/// Example calculation:
+/// - Input: limit = 1000, num_shards = 10
+/// - Base probability: 1.0 / num_shards = 0.1
+/// - Adjusted probability: 0.1 * MORE_ENSURANCE_FACTOR = 0.12
+/// - Lambda: 0.12 * limit = 120
+/// - Result: undersample_limit = 171 per shard
+/// This ensures we get enough results while reducing data transfer.
+/// Note: Offset is handled at the collection level to ensure correct
+/// global ordering of results across all shards.
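To sanity-check the numbers in this example, the Poisson quantile idea can be sketched as below. This is not the actual find_search_sampling_over_point_distribution helper, and the confidence levels are assumptions; for lambda = 120 it produces per-shard limits roughly in the 150–175 range, the same ballpark as the 171 quoted above.

```rust
// Smallest n such that a Poisson(lambda) variable is <= n with the given confidence.
fn poisson_quantile(lambda: f64, confidence: f64) -> usize {
    let mut pmf = (-lambda).exp(); // P(X = 0)
    let mut cdf = pmf;
    let mut n = 0usize;
    while cdf < confidence {
        n += 1;
        pmf *= lambda / n as f64; // P(X = n) from P(X = n - 1)
        cdf += pmf;
    }
    n
}

fn main() {
    // limit = 1000, 10 shards, ensurance factor 1.2 => lambda = 120
    let lambda = 1000.0 * (1.0 / 10.0) * 1.2;
    for confidence in [0.999, 0.9999, 0.999_999] {
        println!("confidence {confidence}: per-shard limit {}", poisson_quantile(lambda, confidence));
    }
}
```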
154-168: Consider clarifying the auto-sharding detection logic.
The auto-sharding detection is correct but could benefit from more explicit documentation.

- // Auto-sharding happens when we are only querying shards with _the_same_ shard key.
- // It either might be when we are querying a specific shard key
- // OR when we are querying all shards with no shard keys specified.
+ // Auto-sharding is detected when all target shards have the same shard key.
+ // This occurs in two scenarios:
+ // 1. When explicitly querying a specific shard key
+ // 2. When querying all shards with no shard key specified (default case)
+ // In both cases, we can safely apply undersampling as the data distribution
+ // is guaranteed to be random across these shards.
  let is_auto_sharding = num_unique_shard_keys == 1;
📒 Files selected for processing (1)
- lib/collection/src/collection/query.rs (3 hunks)
🔇 Additional comments (2)
lib/collection/src/collection/query.rs (2)
18-18: LGTM!
The import is correctly placed and aligns with the PR's objective of implementing shard-level undersampling.
64-71: LGTM! Constants are well-chosen and documented.
The values align with the PR objectives:
- SHARD_QUERY_SUBSAMPLING_LIMIT = 128 matches the testing threshold mentioned in the PR
- MORE_ENSURANCE_FACTOR = 1.2 provides a reasonable 20% safety margin

Please verify these constants through the testing approach mentioned in the PR:
- Create a collection with 3 shards and 10,000 points
- Compare exact vs non-exact search results with limit > 128
- Confirm difference is within 0.999 (max 1 error per 100 requests)
I have two high-level questions:
/// Unlike segments, however, the cost of re-requesting the data is much higher for shards.
/// So we "accept" the risk of not getting enough results.
If it's so unlikely, can we still re-request in case we don't get enough results?
I imagine a latency spike would be a bit better than getting fewer results. Getting fewer results might make the user incorrectly think they have 'reached the end' or 'exhausted all'.
If we basically never expect this to happen, a latency spike might be the 'safer' option.
Wdyt?
We can; it is a bit annoying to implement with our batching structure, and it would be at least twice as many changes.
Fair enough. And, a user would still be able to use exact=true to bypass this.
for request in batch_request.iter() {
    let mut new_request = request.clone();
    let request_limit = new_request.limit + new_request.offset;
Should we avoid going lower than the offset, so that we don't destabilize results before our offset?
I understand that this wasn't 100% accurate or deterministic before, but I wonder if this might make it worse, amplifying inaccuracies, making the results all over the place.
To compensate for the accuracy, the user might increase hnsw_ef; it will have the same effect as higher limits, but won't cause network/IO saturation as badly as a higher limit does.
Do you maybe have some performance numbers to share, based on a large realistic test deployment?
Yes: on a 10-machine cluster, a request with a 10k limit takes 2 seconds, while with a 1k limit it takes 0.1 seconds, everything in RAM (including payload).
@@ -60,6 +61,83 @@ impl Collection {
    Ok(results.into_iter().next().unwrap())
}

/// If the query limit above this value, it will be a subject to undersampling.
const SHARD_QUERY_SUBSAMPLING_LIMIT: usize = 128;
Why specifically this number? For segments we use params.hnsw_ef
no specific reason
hnsw_ef was the right choice for segments, because we wanted to minimize HNSW overhead for large limits. This is not relevant at the shard level, where the bottleneck is the networking.
@agourlay added a debug message
* poisson-based undersampling for shard level search
* fmt
* give a bit more ensurance to undersampling
* review fixes
* add debug log in case undersampling can lead to accuracy loss
Apply the same under-sampling as we do for segments, but on the shard level.
Motivation: large-scale deployments with high-limit queries could benefit from this optimization, as data transfer between shards becomes a bottleneck.
Some considerations: