do not skip empty histogram values + remove too small buckets #7073
Actionable comments posted: 0
🔭 Outside diff range comments (2)
lib/segment/src/common/operation_time_statistics.rs (2)
337-339: Doc no longer matches behavior: `convert_histogram` now emits dense output. The function comment states “sparse” and “omit repeated values”, but the new logic emits a pair for every boundary (including zero-count ranges). Please update the docs to prevent confusion.
Apply this doc update:
```diff
-/// Convert a fixed-size non-cumulative histogram to a sparse cumulative histogram.
-/// Omit repeated values to reduce the size of the histogram.
+/// Convert a fixed-size non-cumulative histogram to a dense cumulative histogram.
+/// Emit (boundary, cumulative_count) for every boundary, including zero-count ranges.
```
431-454: Tests expect sparse behavior; update them to match dense output. With the new dense histogram behavior, these assertions will fail. Update the expected values to include every boundary.
Apply the following updates:
```diff
 #[test]
 fn test_convert_histogram() {
     // With all zeroes
     assert_eq!(
-        convert_histogram(&[0., 1., 2., 3., 4., 5.], &[0, 0, 0, 0, 0, 0], 0),
-        vec![],
+        convert_histogram(&[0., 1., 2., 3., 4., 5.], &[0, 0, 0, 0, 0, 0], 0),
+        vec![(0., 0), (1., 0), (2., 0), (3., 0), (4., 0), (5., 0)],
     );
     // With all zeroes except the total count
     assert_eq!(
-        convert_histogram(&[0., 1., 2., 3., 4., 5.], &[0, 0, 0, 0, 0, 0], 100),
-        vec![(5., 0)],
+        convert_histogram(&[0., 1., 2., 3., 4., 5.], &[0, 0, 0, 0, 0, 0], 100),
+        vec![(0., 0), (1., 0), (2., 0), (3., 0), (4., 0), (5., 0)],
     );
     // Full
     assert_eq!(
         convert_histogram(&[0., 1., 2., 3.], &[1, 20, 300, 4000], 5000),
         vec![(0., 1), (1., 21), (2., 321), (3., 4321)],
     );
     // Sparse
     assert_eq!(
-        convert_histogram(&[0., 1., 2., 3., 4., 5., 6.], &[0, 0, 1, 0, 0, 1, 0], 1),
-        vec![(1.0, 0), (2.0, 1), (4.0, 1), (5.0, 2), (6.0, 2)],
+        convert_histogram(&[0., 1., 2., 3., 4., 5., 6.], &[0, 0, 1, 0, 0, 1, 0], 1),
+        vec![(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (4.0, 1), (5.0, 2), (6.0, 2)],
     );
 }
```
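To make the dense semantics concrete, here is a minimal standalone sketch of a dense conversion matching what these updated tests expect: one `(boundary, cumulative_count)` pair per boundary, zero-count buckets included. This is illustrative only; the crate's actual function may differ (e.g. in its `total_count` parameter).

```rust
// Dense cumulative conversion: one (boundary, cumulative_count) pair per
// boundary, including zero-count buckets. Sketch, not the crate's exact code.
fn convert_histogram(le_boundaries: &[f32], counts: &[usize]) -> Vec<(f32, usize)> {
    let mut result = Vec::with_capacity(le_boundaries.len());
    let mut cumulative_count = 0;
    for (idx, &le) in le_boundaries.iter().enumerate() {
        // A missing count (shorter `counts` slice) is treated as zero.
        cumulative_count += counts.get(idx).copied().unwrap_or(0);
        result.push((le, cumulative_count));
    }
    result
}

fn main() {
    // All-zero input still yields one entry per boundary.
    assert_eq!(
        convert_histogram(&[0., 1., 2.], &[0, 0, 0]),
        vec![(0., 0), (1., 0), (2., 0)],
    );
    // Cumulative sums: 1, 1 + 20, 1 + 20 + 300.
    assert_eq!(
        convert_histogram(&[0., 1., 2.], &[1, 20, 300]),
        vec![(0., 1), (1., 21), (2., 321)],
    );
    println!("ok");
}
```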
🧹 Nitpick comments (3)
lib/segment/src/common/operation_time_statistics.rs (3)
344-366: Clean up dead code and fix the capacity estimation for dense histograms.

- `prev` is never set to `Some`, so both `if let Some(prev)` branches are dead.
- The capacity estimate is tuned for sparse output; with dense output it should be `le_boundaries.len()` to avoid reallocations.

Apply this refactor to simplify and make the capacity accurate:
```diff
-fn convert_histogram(
-    le_boundaries: &[f32],
-    counts: &[usize],
-    total_count: usize,
-) -> Vec<(f32, usize)> {
-    let rough_len_estimation = std::cmp::min(
-        le_boundaries.len(),
-        counts.iter().filter(|&&c| c != 0).count() * 2,
-    );
-    let mut result = Vec::with_capacity(rough_len_estimation);
-    let mut cumulative_count = 0;
-    let mut prev = None;
-    for (idx, &le) in le_boundaries.iter().enumerate() {
-        let count = counts.get(idx).copied().unwrap_or(0);
-        if let Some(prev) = prev {
-            result.push((prev, cumulative_count));
-        }
-        cumulative_count += count;
-        result.push((le, cumulative_count));
-        prev = None;
-    }
-    if let Some(prev) = prev
-        && cumulative_count != total_count
-    {
-        result.push((prev, cumulative_count));
-    }
-    result
-}
+fn convert_histogram(
+    le_boundaries: &[f32],
+    counts: &[usize],
+    _total_count: usize,
+) -> Vec<(f32, usize)> {
+    // Dense output: one (le, cumulative) pair per boundary.
+    let mut result = Vec::with_capacity(le_boundaries.len());
+    let mut cumulative_count = 0;
+    for (idx, &le) in le_boundaries.iter().enumerate() {
+        let count = counts.get(idx).copied().unwrap_or(0);
+        cumulative_count += count;
+        result.push((le, cumulative_count));
+    }
+    result
+}
```
368-372: Consider updating the `merge_histograms` docs to drop the “sparse”-only phrasing. The implementation works for dense histograms too. Updating the comment improves clarity and keeps it aligned with `convert_histogram`.
```diff
-/// Merge two sparse cumulative histograms, summing the counts of the same boundaries.
+/// Merge two cumulative histograms (dense or sparse), summing the counts of the same boundaries.
 /// If one boundary is missing in one of the vectors, assume its value to be the same as the next
 /// boundary in the same vector. NOTE: This assumption should be correct when merging histograms
-/// produced by `convert_histogram` with the same set of boundaries, but it's not always the case.
+/// produced by `convert_histogram` with the same set of boundaries, but it's not always the case.
```
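For context, the merge rule described in that doc comment can be sketched as a standalone function. This is a hypothetical reconstruction under stated assumptions: when a boundary is missing from one vector, that vector's count at the boundary is taken from its next entry, and when the vector is exhausted, its total count is used (the `total_a`/`total_b` parameters here are illustrative; the real `merge_histograms` signature may differ).

```rust
// Merge two cumulative histograms, summing counts at matching boundaries.
// Missing boundary in one vector: use that vector's next entry's count.
// Vector exhausted: fall back to its total count. Sketch only.
fn merge_histograms(
    a: &[(f32, usize)],
    total_a: usize,
    b: &[(f32, usize)],
    total_b: usize,
) -> Vec<(f32, usize)> {
    let mut result = Vec::with_capacity(a.len().max(b.len()));
    let (mut i, mut j) = (0, 0);
    while i < a.len() || j < b.len() {
        // Next boundary to emit: the smaller of the two current boundaries.
        let le = match (a.get(i), b.get(j)) {
            (Some(&(x, _)), Some(&(y, _))) => x.min(y),
            (Some(&(x, _)), None) => x,
            (None, Some(&(y, _))) => y,
            (None, None) => break,
        };
        // Cumulative count contributed by each side at `le`.
        let ca = a.get(i).map_or(total_a, |&(_, c)| c);
        let cb = b.get(j).map_or(total_b, |&(_, c)| c);
        result.push((le, ca + cb));
        // Advance whichever side actually had an entry at this boundary.
        if a.get(i).is_some_and(|&(x, _)| x == le) { i += 1; }
        if b.get(j).is_some_and(|&(x, _)| x == le) { j += 1; }
    }
    result
}

fn main() {
    // Dense inputs with identical boundaries simply sum pointwise.
    let a = [(1.0_f32, 2), (2.0, 5)];
    let b = [(1.0_f32, 1), (2.0, 1)];
    assert_eq!(merge_histograms(&a, 5, &b, 1), vec![(1.0, 3), (2.0, 6)]);
    println!("ok");
}
```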
282-295: Drop the unused precomputation in `get_statistics` to avoid extra work. The block builds a dense histogram and then discards it in favor of `convert_histogram`. With the new dense behavior, you can just call `convert_histogram` directly. Apply this simplification:
```diff
-        let duration_micros_histogram = if detail.histograms {
-            let mut duration_micros_histogram =
-                Vec::with_capacity(DEFAULT_BUCKET_BOUNDARIES_MICROS.len());
-            let mut cumulative_count = 0;
-            for (&count, &le) in self.buckets.iter().zip(&DEFAULT_BUCKET_BOUNDARIES_MICROS) {
-                cumulative_count += count;
-                duration_micros_histogram.push((le, cumulative_count));
-            }
-            convert_histogram(
-                &DEFAULT_BUCKET_BOUNDARIES_MICROS,
-                &self.buckets,
-                self.ok_count,
-            )
-        } else {
-            Vec::new()
-        };
+        let duration_micros_histogram = if detail.histograms {
+            convert_histogram(
+                &DEFAULT_BUCKET_BOUNDARIES_MICROS,
+                &self.buckets,
+                self.ok_count,
+            )
+        } else {
+            Vec::new()
+        };
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
lib/segment/src/common/operation_time_statistics.rs (2 hunks)
🔇 Additional comments (1)
lib/segment/src/common/operation_time_statistics.rs (1)
77-95: Bucket boundaries reduction verified; no downstream issues found. All references to `DEFAULT_BUCKET_BOUNDARIES_MICROS` use its `.len()`, so no code or tests assume 16 buckets. The only residual hard-coded “16” is in the inline capacity:

- lib/segment/src/common/operation_time_statistics.rs:74: `buckets: SmallVec<[usize; 16]>,`

This is harmless but can be tightened to `[usize; 14]` to match the new default and reduce inline storage.
Actionable comments posted: 0
🧹 Nitpick comments (5)
lib/segment/src/common/operation_time_statistics.rs (5)
279-287: Remove dead/duplicated cumulative building; use `convert_histogram` only. The manual loop builds a cumulative histogram and then discards it by returning `convert_histogram(...)` instead. This wastes CPU and triggers an “unused variable” warning. Remove the redundant code.
```diff
-            let mut duration_micros_histogram =
-                Vec::with_capacity(DEFAULT_BUCKET_BOUNDARIES_MICROS.len());
-            let mut cumulative_count = 0;
-            for (&count, &le) in self.buckets.iter().zip(&DEFAULT_BUCKET_BOUNDARIES_MICROS) {
-                cumulative_count += count;
-                duration_micros_histogram.push((le, cumulative_count));
-            }
-            convert_histogram(&DEFAULT_BUCKET_BOUNDARIES_MICROS, &self.buckets)
+            convert_histogram(&DEFAULT_BUCKET_BOUNDARIES_MICROS, &self.buckets)
```
329-333: Clarify the docstring: the function now emits a dense cumulative histogram. The doc should explicitly state it emits (upper_boundary, cumulative_count) for every boundary.

```diff
-/// Convert a fixed-size non-cumulative histogram to a cumulative histogram.
+/// Convert a fixed-size non-cumulative histogram to a dense cumulative histogram.
+/// Emits (upper_boundary, cumulative_count) for every boundary in `le_boundaries`.
```
331-351: Simplify `convert_histogram`; remove the unused `prev` and correct the pre-allocation.

- `prev` is never set to `Some(...)`, so the flush branches never execute.
- `rough_len_estimation` can under-allocate (e.g., all zeros), causing unnecessary reallocations. With dense output, we can pre-allocate exactly `le_boundaries.len()`.

```diff
 fn convert_histogram(le_boundaries: &[f32], counts: &[usize]) -> Vec<(f32, usize)> {
-    let rough_len_estimation = std::cmp::min(
-        le_boundaries.len(),
-        counts.iter().filter(|&&c| c != 0).count() * 2,
-    );
-    let mut result = Vec::with_capacity(rough_len_estimation);
-    let mut cumulative_count = 0;
-    let mut prev = None;
-    for (idx, &le) in le_boundaries.iter().enumerate() {
-        let count = counts.get(idx).copied().unwrap_or(0);
-        if let Some(prev) = prev {
-            result.push((prev, cumulative_count));
-        }
-        cumulative_count += count;
-        result.push((le, cumulative_count));
-        prev = None;
-    }
-    if let Some(prev) = prev {
-        result.push((prev, cumulative_count));
-    }
-    result
+    let mut result = Vec::with_capacity(le_boundaries.len());
+    let mut cumulative_count = 0usize;
+    for (idx, &le) in le_boundaries.iter().enumerate() {
+        cumulative_count += counts.get(idx).copied().unwrap_or(0);
+        result.push((le, cumulative_count));
+    }
+    result
 }
```
422-426: Stale comment: “except the total count” no longer applies. There is no `total_count` parameter anymore, and this block duplicates the previous case. Either update the comment or drop the duplicate.

```diff
-    // With all zeroes except the total count
+    // With all zeroes (no total count parameter anymore)
```
59-73: Public API change: bucket boundaries reduced from 16 to 11. Internal usage always derives the bucket count via `DEFAULT_BUCKET_BOUNDARIES_MICROS.len()` (no hard-coded “16” checks were found). The `SmallVec<[usize; 16]>` inline capacity in `OperationTimeStatistics` is still ≥ 11, so it continues to work but leaves unused headroom.

- Please confirm no downstream code relies on a 16-element histogram or specific bucket indices.
- Optional: align the `SmallVec` capacity to 11, or add a comment explaining the extra space.
📒 Files selected for processing (1)
lib/segment/src/common/operation_time_statistics.rs (6 hunks)
🔇 Additional comments (4)
lib/segment/src/common/operation_time_statistics.rs (4)
418-420: LGTM: asserts dense output for the all-zero case. The test now expects entries for every boundary even when counts are zero, matching the new semantics.

430-431: LGTM: the “Full” case matches dense cumulative semantics. The cumulative sums per boundary are correct.

436-446: LGTM: “Sparse” input now yields dense cumulative output. Good coverage ensuring zero-count boundaries are included.

494-496: LGTM: `convert_histogram` calls updated post-signature change. Consistent with the new signature, and the merge test still ensures convert↔merge commutativity.
Before:

```
GET /metrics
# TYPE grpc_responses_duration_seconds histogram
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.001"} 0
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.005"} 565
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.01"} 865
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.05"} 1000
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="+Inf"} 1000
```

After:

```
GET /metrics
# TYPE grpc_responses_duration_seconds histogram
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.001"} 0
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.005"} 764
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.01"} 984
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.02"} 1000
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.05"} 1000
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.1"} 1000
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="0.5"} 1000
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="1"} 1000
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="5"} 1000
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="10"} 1000
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="50"} 1000
grpc_responses_duration_seconds_bucket{endpoint="/qdrant.Points/Upsert",le="+Inf"} 1000
```
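The dense output maps directly onto Prometheus's text exposition format, where every bucket carries the cumulative count and the `+Inf` bucket must equal the total number of observations. A minimal sketch of that rendering (the `render_buckets` name and signature are illustrative, not from the PR):

```rust
// Render a dense cumulative histogram as Prometheus text-format bucket lines.
// Illustrative sketch; the repo's actual exporter code may differ.
fn render_buckets(metric: &str, endpoint: &str, hist: &[(f64, usize)], total: usize) -> String {
    let mut out = String::new();
    for &(le, count) in hist {
        out.push_str(&format!(
            "{metric}_bucket{{endpoint=\"{endpoint}\",le=\"{le}\"}} {count}\n"
        ));
    }
    // Prometheus requires a final +Inf bucket equal to the total observation count.
    out.push_str(&format!(
        "{metric}_bucket{{endpoint=\"{endpoint}\",le=\"+Inf\"}} {total}\n"
    ));
    out
}

fn main() {
    let hist = [(0.001, 0), (0.005, 764), (0.01, 984)];
    let text = render_buckets(
        "grpc_responses_duration_seconds",
        "/qdrant.Points/Upsert",
        &hist,
        1000,
    );
    assert!(text.ends_with("le=\"+Inf\"} 1000\n"));
    print!("{text}");
}
```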
LGTM. Suggested a small change.
```diff
     // With all zeroes except the total count
     assert_eq!(
-        convert_histogram(&[0., 1., 2., 3., 4., 5.], &[0, 0, 0, 0, 0, 0], 100),
-        vec![(5., 0)],
+        convert_histogram(&[0., 1., 2., 3., 4., 5.], &[0, 0, 0, 0, 0, 0]),
+        vec![(0., 0), (1., 0), (2., 0), (3., 0), (4., 0), (5., 0)],
     );
```
There's no difference between this test case and the previous one. Please drop this one.
Actionable comments posted: 0
♻️ Duplicate comments (1)
lib/segment/src/common/operation_time_statistics.rs (1)
485-490: Duplicate test case: drop it or make it meaningfully distinct. The second and fourth `case(...)` invocations only differ in `total_a` (256 vs 0), while both inputs are produced by `convert_histogram`, which yields dense vectors of equal length. In this scenario, `merge_histograms` never uses `total_*`, so both cases are equivalent. This was already raised; let's address it.

Apply this diff to drop the duplicate:

```diff
 case(&[00, 00, 00, 00, 00], &[86, 50, 47, 84, 52], 256, 319);
 case(&[00, 23, 00, 00, 00], &[00, 00, 00, 84, 00], 30, 90);
-case(&[00, 00, 00, 00, 00], &[86, 50, 47, 84, 52], 0, 319);
```

Alternatively, if you want to exercise the `total_*` fallback path, feed `merge_histograms` intentionally sparse cumulative vectors of different lengths (not produced via `convert_histogram`).
🧹 Nitpick comments (2)
lib/segment/src/common/operation_time_statistics.rs (2)
279-287: Avoid double work: remove the manual cumulative build and return `convert_histogram` directly. Inside the histograms branch you precompute a cumulative histogram into a local `duration_micros_histogram` and then discard it in favor of `convert_histogram(...)`. This extra work is unnecessary.

Apply this diff:

```diff
         let duration_micros_histogram = if detail.histograms {
-            let mut duration_micros_histogram =
-                Vec::with_capacity(DEFAULT_BUCKET_BOUNDARIES_MICROS.len());
-            let mut cumulative_count = 0;
-            for (&count, &le) in self.buckets.iter().zip(&DEFAULT_BUCKET_BOUNDARIES_MICROS) {
-                cumulative_count += count;
-                duration_micros_histogram.push((le, cumulative_count));
-            }
             convert_histogram(&DEFAULT_BUCKET_BOUNDARIES_MICROS, &self.buckets)
         } else {
             Vec::new()
         };
```
329-351: Simplify `convert_histogram` and fix the capacity underestimation.

- `prev` is never set to `Some`, so the related branch and the final flush are dead code.
- `rough_len_estimation` underestimates capacity when many buckets are zero (e.g., an all-zero input reserves 0 and reallocates repeatedly). We already know the exact output length equals `le_boundaries.len()`.

Apply this diff:

```diff
-/// Convert a fixed-size non-cumulative histogram to a cumulative histogram.
-fn convert_histogram(le_boundaries: &[f32], counts: &[usize]) -> Vec<(f32, usize)> {
-    let rough_len_estimation = std::cmp::min(
-        le_boundaries.len(),
-        counts.iter().filter(|&&c| c != 0).count() * 2,
-    );
-    let mut result = Vec::with_capacity(rough_len_estimation);
-    let mut cumulative_count = 0;
-    let mut prev = None;
-    for (idx, &le) in le_boundaries.iter().enumerate() {
-        let count = counts.get(idx).copied().unwrap_or(0);
-        if let Some(prev) = prev {
-            result.push((prev, cumulative_count));
-        }
-        cumulative_count += count;
-        result.push((le, cumulative_count));
-        prev = None;
-    }
-    if let Some(prev) = prev {
-        result.push((prev, cumulative_count));
-    }
-    result
-}
+/// Convert a fixed-size non-cumulative histogram to a dense cumulative histogram.
+fn convert_histogram(le_boundaries: &[f32], counts: &[usize]) -> Vec<(f32, usize)> {
+    let mut result = Vec::with_capacity(le_boundaries.len());
+    let mut cumulative_count = 0;
+    for (idx, &le) in le_boundaries.iter().enumerate() {
+        cumulative_count += counts.get(idx).copied().unwrap_or(0);
+        result.push((le, cumulative_count));
+    }
+    result
+}
```
📒 Files selected for processing (1)
lib/segment/src/common/operation_time_statistics.rs (6 hunks)
🔇 Additional comments (2)
lib/segment/src/common/operation_time_statistics.rs (2)
414-441: LGTM: tests validate dense semantics across zero/full/sparse inputs. The updated expectations ensure we emit an entry per boundary (including zero-count buckets), matching the new behavior.

59-74: `DEFAULT_BUCKET_BOUNDARIES_MICROS` length change verified.

- No occurrences of `[f32; 16]`, `len() == 16`, or `SmallVec<[usize; 16]>` were found elsewhere in the repo.
- All internal uses derive the length via `DEFAULT_BUCKET_BOUNDARIES_MICROS.len()`, so the new size (11) propagates dynamically.
- The `SmallVec` inline capacity remains at 16 for headroom; you may optionally reduce it to 11 to match the boundary array, but it's not required.
* do not skip empty histogram values + remove too small buckets
* fix tests and remove some more values from the histogram
* remove redundant test
No description provided.