
Conversation


@generall generall commented Apr 15, 2025

Main idea: for automated tools like the Cluster Manager, it should be enough to work with Level3. Level4 is quite expensive in terms of request size and should only be required for in-depth performance debugging.

  • Allow shard states in anonymized telemetry (with hashed peer IDs)
  • Split shard and segment detail levels
  • Make sure the Cluster Manager can use Level3 telemetry

@generall generall marked this pull request as ready for review April 16, 2025 09:08

coderabbitai bot commented Apr 16, 2025

📝 Walkthrough

This change set introduces significant enhancements and refactoring to the telemetry infrastructure across the collection, shard, and segment layers. The telemetry detail system is expanded from three to five levels, enabling more granular telemetry data retrieval. Several telemetry schemas, including OperationDurationStatistics, LocalShardTelemetry, and OptimizerTelemetry, are updated to make certain fields optional and nullable, with conditional inclusion based on the detail level. New approximate size and count metrics are added to shard telemetry.

Telemetry-related methods are reorganized into dedicated modules and extended with new methods for aggregated telemetry, optimization status, and size statistics across the various shard types and the collection. Anonymization logic is improved by adding utilities for hashing and for anonymizing collections keyed by u64. The OpenAPI schema and tests are updated accordingly, reflecting the new telemetry detail levels and fields.

The /metrics endpoint now requests telemetry data at a higher detail level. Additionally, aggregated telemetry retrieval methods are added to the content manager. Overall, these changes enhance telemetry data richness, modularity, and privacy handling.

Suggested reviewers

  • generall
  • agourlay


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69faf8d and 839fb04.

📒 Files selected for processing (3)
  • docs/redoc/master/openapi.json (7 hunks)
  • lib/collection/src/collection/mod.rs (4 hunks)
  • lib/collection/src/telemetry.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/collection/src/collection/mod.rs
  • lib/collection/src/telemetry.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
🔇 Additional comments (7)
docs/redoc/master/openapi.json (7)

11401-11406: Add required init_time_ms to telemetry schema
This aligns the OpenAPI schema with the Rust CollectionTelemetry struct, which now declares init_time_ms as a mandatory field. Please verify that init_time_ms is correctly defined under properties with the appropriate type and description.


11419-11445: Make shard, transfer, resharding, and clean-task arrays nullable
By adding "nullable": true to these properties, the schema now allows JSON null or omission for shards, transfers, resharding, and shard_clean_tasks, matching the Rust Option types.


11573-11600: Introduce approximate size and count metrics for shards
The new fields (vectors_size_bytes, payloads_size_bytes, num_points, num_vectors) are defined as nullable, non-negative integers with clear descriptions. This correctly mirrors the recent Rust additions to LocalShardTelemetry.


12121-12127: Restrict required properties in OperationDurationStatistics
Updating "required" to only include count reflects the Rust change that makes other fields optional. This ensures the schema remains consistent with the updated OperationDurationStatistics struct.


12132-12137: Make fail_count nullable with non-negative constraint
Adding "minimum": 0 and "nullable": true for fail_count aligns with its conversion to Option<u64> in the Rust telemetry code.


12157-12162: Make total_duration_micros nullable with non-negative constraint
The "minimum": 0 and "nullable": true settings for total_duration_micros correctly correspond to its new Option status in Rust.


12219-12223: Allow optimizer log arrays to be null
By marking the tracker telemetry array as "nullable": true, the schema now permits omitting or nulling the log in OptimizerTelemetry, matching the Rust implementation where the log is optional.
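Taken together, the schema comments above describe fields that gain `"nullable": true` and a `"minimum": 0` constraint. A fragment in the shape the review describes for `fail_count` might look like the following; the surrounding structure and formats are assumptions for illustration, not copied from the actual openapi.json:

```json
{
  "fail_count": {
    "type": "integer",
    "format": "uint",
    "minimum": 0,
    "nullable": true
  }
}
```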




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
lib/collection/src/collection/mod.rs (1)

808-835: New method for aggregated telemetry improves efficiency and granularity.

The implementation of get_aggregated_telemetry_data effectively aggregates key metrics across all shards, providing a lightweight summary for scenarios where detailed shard-level telemetry isn't needed. Taking the maximum (worst-case) optimizer status is a good way to represent the overall health.

However, consider adding some form of error handling for get_size_stats().await.num_vectors to handle potential failures more gracefully.

-            vectors += shard.get_size_stats().await.num_vectors;
+            if let Ok(stats) = shard.get_size_stats().await {
+                vectors += stats.num_vectors;
+            }
lib/collection/src/shards/replica_set/telemetry.rs (1)

1-49: Consider adding documentation comments to the public methods.

While the implementation is clear, adding documentation comments would help explain the purpose of each method and the meaning of the telemetry detail parameter.

+    /// Collects telemetry data from local and remote shards at the specified detail level.
+    /// Returns a ReplicaSetTelemetry structure containing the shard id, key, and replica states.
+    ///
+    /// # Arguments
+    /// * `detail` - Controls the level of detail included in the telemetry data
     pub(crate) async fn get_telemetry_data(&self, detail: TelemetryDetail) -> ReplicaSetTelemetry {

+    /// Retrieves the optimization status from the local shard if it exists.
+    /// Returns None if there is no local shard.
     pub(crate) async fn get_optimization_status(&self) -> Option<OptimizersStatus> {

+    /// Retrieves size statistics from the local shard.
+    /// Returns default SizeStats if there is no local shard.
     pub(crate) async fn get_size_stats(&self) -> SizeStats {
src/common/telemetry_ops/collections_telemetry.rs (1)

26-45: Consider extracting detail level threshold to a constant.

The detail level threshold (Level2) is hardcoded in the if condition. Consider extracting this to a named constant for better readability and easier maintenance.

+    // Threshold for collecting full telemetry data
+    const FULL_TELEMETRY_THRESHOLD: DetailsLevel = DetailsLevel::Level2;
+
     pub async fn collect(detail: TelemetryDetail, access: &Access, toc: &TableOfContent) -> Self {
         let number_of_collections = toc.all_collections(access).await.len();
         let collections = if detail.level >= DetailsLevel::Level1 {
-            let telemetry_data = if detail.level >= DetailsLevel::Level2 {
+            let telemetry_data = if detail.level >= Self::FULL_TELEMETRY_THRESHOLD {
                 toc.get_telemetry_data(detail, access)
                     .await
                     .into_iter()
lib/collection/src/shards/local_shard/telemetry.rs (2)

11-66: Good implementation of telemetry collection with a suggestion for optimization

The get_telemetry_data method properly collects telemetry information based on the requested detail level. However, there's a potential performance improvement opportunity.

You're acquiring a read lock on segments twice - once directly at line 13 and again through get_size_stats() at line 47. Consider refactoring to use a single lock acquisition:

pub fn get_telemetry_data(&self, detail: TelemetryDetail) -> LocalShardTelemetry {
    let segments_read_guard = self.segments.read();

    let segments: Vec<_> = if detail.level >= DetailsLevel::Level4 {
        segments_read_guard
            .iter()
            .map(|(_id, segment)| segment.get().read().get_telemetry_data(detail))
            .collect()
    } else {
        vec![]
    };

    let optimizer_status = match &segments_read_guard.optimizer_errors {
        None => OptimizersStatus::Ok,
        Some(error) => OptimizersStatus::Error(error.to_string()),
    };
    
+    // Calculate size stats while we have the lock
+    let mut size_stats = SizeStats::default();
+    for (_segment_id, segment) in segments_read_guard.iter() {
+        let segment_info = segment.get().read().info();
+        size_stats.num_vectors += segment_info.num_vectors;
+        size_stats.num_points += segment_info.num_points;
+        size_stats.vectors_size_bytes += segment_info.vectors_size_bytes;
+        size_stats.payloads_size_bytes += segment_info.payloads_size_bytes;
+    }
    
    drop(segments_read_guard);
    let optimizations = self
        .optimizers
        .iter()
        .map(|optimizer| {
            optimizer
                .get_telemetry_counter()
                .lock()
                .get_statistics(detail)
        })
        .fold(Default::default(), |acc, x| acc + x);

    let total_optimized_points = self.total_optimized_points.load(Ordering::Relaxed);

-    let SizeStats {
-        num_vectors,
-        vectors_size_bytes,
-        payloads_size_bytes,
-        num_points,
-    } = self.get_size_stats();
+    let SizeStats {
+        num_vectors,
+        vectors_size_bytes,
+        payloads_size_bytes,
+        num_points,
+    } = size_stats;

    LocalShardTelemetry {
        variant_name: None,
        status: None,
        total_optimized_points,
        vectors_size_bytes: Some(vectors_size_bytes),
        payloads_size_bytes: Some(payloads_size_bytes),
        num_points: Some(num_points),
        num_vectors: Some(num_vectors),
        segments,
        optimizations: OptimizerTelemetry {
            status: optimizer_status,
            optimizations,
            log: (detail.level >= DetailsLevel::Level4)
                .then(|| self.optimizers_log.lock().to_telemetry()),
        },
        async_scorer: Some(get_async_scorer()),
    }
}

61-62: Consider centralizing telemetry level constants

The detail level check is duplicated between segment collection and optimizer log inclusion.

To improve maintainability, consider defining a constant or moving this level check to a helper method:

-            log: (detail.level >= DetailsLevel::Level4)
-                .then(|| self.optimizers_log.lock().to_telemetry()),
+            log: should_include_detailed_logs(detail)
+                .then(|| self.optimizers_log.lock().to_telemetry()),

And add a helper method:

fn should_include_detailed_logs(detail: TelemetryDetail) -> bool {
    detail.level >= DetailsLevel::Level4
}

This way, if the level requirement changes in the future, you only need to update it in one place.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d0da2e and 2f1fad4.

📒 Files selected for processing (22)
  • docs/redoc/master/openapi.json (3 hunks)
  • lib/collection/src/collection/mod.rs (5 hunks)
  • lib/collection/src/shards/dummy_shard.rs (2 hunks)
  • lib/collection/src/shards/forward_proxy_shard.rs (3 hunks)
  • lib/collection/src/shards/local_shard/mod.rs (1 hunks)
  • lib/collection/src/shards/local_shard/telemetry.rs (1 hunks)
  • lib/collection/src/shards/proxy_shard.rs (3 hunks)
  • lib/collection/src/shards/queue_proxy_shard.rs (3 hunks)
  • lib/collection/src/shards/replica_set/mod.rs (3 hunks)
  • lib/collection/src/shards/replica_set/telemetry.rs (1 hunks)
  • lib/collection/src/shards/shard.rs (2 hunks)
  • lib/collection/src/shards/telemetry.rs (4 hunks)
  • lib/collection/src/telemetry.rs (2 hunks)
  • lib/common/common/src/types.rs (2 hunks)
  • lib/segment/src/common/anonymize.rs (1 hunks)
  • lib/segment/src/common/operation_time_statistics.rs (4 hunks)
  • lib/segment/src/types.rs (1 hunks)
  • lib/storage/src/content_manager/toc/mod.rs (2 hunks)
  • src/actix/api/service_api.rs (1 hunks)
  • src/common/metrics.rs (1 hunks)
  • src/common/telemetry_ops/collections_telemetry.rs (2 hunks)
  • tests/openapi/test_service.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (6)
tests/openapi/test_service.py (1)
tests/openapi/helpers/helpers.py (1)
  • request_with_validation (39-93)
lib/collection/src/shards/forward_proxy_shard.rs (6)
lib/collection/src/shards/dummy_shard.rs (2)
  • get_optimization_status (75-77)
  • get_size_stats (79-81)
lib/collection/src/shards/local_shard/telemetry.rs (2)
  • get_optimization_status (68-76)
  • get_size_stats (78-96)
lib/collection/src/shards/proxy_shard.rs (2)
  • get_optimization_status (151-153)
  • get_size_stats (155-157)
lib/collection/src/shards/replica_set/telemetry.rs (2)
  • get_optimization_status (33-38)
  • get_size_stats (40-48)
lib/collection/src/shards/queue_proxy_shard.rs (2)
  • get_optimization_status (191-195)
  • get_size_stats (197-199)
lib/collection/src/shards/shard.rs (2)
  • get_optimization_status (91-99)
  • get_size_stats (101-109)
lib/collection/src/shards/shard.rs (5)
lib/collection/src/shards/proxy_shard.rs (2)
  • get_optimization_status (151-153)
  • get_size_stats (155-157)
lib/collection/src/shards/forward_proxy_shard.rs (2)
  • get_optimization_status (348-350)
  • get_size_stats (352-354)
lib/collection/src/shards/queue_proxy_shard.rs (2)
  • get_optimization_status (191-195)
  • get_size_stats (197-199)
lib/collection/src/shards/dummy_shard.rs (2)
  • get_optimization_status (75-77)
  • get_size_stats (79-81)
lib/collection/src/shards/local_shard/telemetry.rs (2)
  • get_optimization_status (68-76)
  • get_size_stats (78-96)
lib/collection/src/shards/proxy_shard.rs (6)
lib/collection/src/shards/dummy_shard.rs (2)
  • get_optimization_status (75-77)
  • get_size_stats (79-81)
lib/collection/src/shards/local_shard/telemetry.rs (2)
  • get_optimization_status (68-76)
  • get_size_stats (78-96)
lib/collection/src/shards/forward_proxy_shard.rs (2)
  • get_optimization_status (348-350)
  • get_size_stats (352-354)
lib/collection/src/shards/replica_set/telemetry.rs (2)
  • get_optimization_status (33-38)
  • get_size_stats (40-48)
lib/collection/src/shards/queue_proxy_shard.rs (2)
  • get_optimization_status (191-195)
  • get_size_stats (197-199)
lib/collection/src/shards/shard.rs (2)
  • get_optimization_status (91-99)
  • get_size_stats (101-109)
lib/collection/src/shards/queue_proxy_shard.rs (6)
lib/collection/src/shards/dummy_shard.rs (2)
  • get_optimization_status (75-77)
  • get_size_stats (79-81)
lib/collection/src/shards/local_shard/telemetry.rs (2)
  • get_optimization_status (68-76)
  • get_size_stats (78-96)
lib/collection/src/shards/proxy_shard.rs (2)
  • get_optimization_status (151-153)
  • get_size_stats (155-157)
lib/collection/src/shards/forward_proxy_shard.rs (2)
  • get_optimization_status (348-350)
  • get_size_stats (352-354)
lib/collection/src/shards/replica_set/telemetry.rs (2)
  • get_optimization_status (33-38)
  • get_size_stats (40-48)
lib/collection/src/shards/shard.rs (2)
  • get_optimization_status (91-99)
  • get_size_stats (101-109)
lib/collection/src/shards/local_shard/telemetry.rs (8)
lib/segment/src/vector_storage/common.rs (1)
  • get_async_scorer (11-13)
lib/collection/src/shards/dummy_shard.rs (3)
  • get_telemetry_data (60-73)
  • get_optimization_status (75-77)
  • get_size_stats (79-81)
lib/collection/src/shards/proxy_shard.rs (3)
  • get_telemetry_data (147-149)
  • get_optimization_status (151-153)
  • get_size_stats (155-157)
lib/collection/src/shards/forward_proxy_shard.rs (3)
  • get_telemetry_data (344-346)
  • get_optimization_status (348-350)
  • get_size_stats (352-354)
lib/collection/src/collection/mod.rs (1)
  • get_telemetry_data (776-806)
lib/collection/src/shards/queue_proxy_shard.rs (3)
  • get_telemetry_data (185-189)
  • get_optimization_status (191-195)
  • get_size_stats (197-199)
lib/collection/src/shards/shard.rs (3)
  • get_telemetry_data (73-89)
  • get_optimization_status (91-99)
  • get_size_stats (101-109)
lib/common/common/src/types.rs (1)
  • default (68-73)
🔇 Additional comments (60)
lib/segment/src/types.rs (1)

421-427: Clean implementation of size statistics struct.

The SizeStats struct provides a standardized way to collect and report size-related metrics across different shard types. The fields chosen cover the essential metrics for monitoring vector storage usage.
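A minimal sketch of such a struct, using the field names mentioned in this review; the field types and the `Add` impl here are assumptions for illustration, not the actual Qdrant code:

```rust
// Hypothetical sketch of the size-statistics struct described in the review.
// Field names follow the review; types and the Add impl are assumed.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct SizeStats {
    pub num_vectors: usize,
    pub num_points: usize,
    pub vectors_size_bytes: usize,
    pub payloads_size_bytes: usize,
}

impl std::ops::Add for SizeStats {
    type Output = SizeStats;
    // Aggregating per-shard stats reduces to a field-wise sum.
    fn add(self, other: SizeStats) -> SizeStats {
        SizeStats {
            num_vectors: self.num_vectors + other.num_vectors,
            num_points: self.num_points + other.num_points,
            vectors_size_bytes: self.vectors_size_bytes + other.vectors_size_bytes,
            payloads_size_bytes: self.payloads_size_bytes + other.payloads_size_bytes,
        }
    }
}

fn main() {
    let a = SizeStats { num_vectors: 10, num_points: 5, vectors_size_bytes: 4096, payloads_size_bytes: 512 };
    // Default (all-zero) stats are the identity element for aggregation.
    assert_eq!(a + SizeStats::default(), a);
    println!("{:?}", a + SizeStats::default());
}
```

A `Default` impl like this is what lets shard types without local data (e.g. a dummy shard) return empty stats uniformly.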

lib/segment/src/common/anonymize.rs (2)

95-99: Good implementation of u64 hashing.

The hash_u64 function provides a clean way to anonymize u64 values (like peer IDs) using Rust's default hasher, which is appropriate for telemetry anonymization purposes.


101-111: Well-implemented anonymization for u64-keyed collections.

This function follows the pattern of existing anonymization utilities and enables anonymization of collections with u64 keys, which is needed for including hashed peer IDs in telemetry data while maintaining privacy.
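The two utilities above can be sketched roughly as follows; this is an illustrative reconstruction, and the real implementation in lib/segment/src/common/anonymize.rs may differ in signatures and hasher choice:

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a u64 value (e.g. a peer id) with Rust's default hasher.
// DefaultHasher::new() is documented to produce the same hashes across
// instances, so the mapping is stable within a build.
fn hash_u64(value: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

// Re-key a u64-keyed map (e.g. per-peer shard states) with hashed keys,
// preserving the values while hiding the original ids.
fn anonymize_keys<V>(map: HashMap<u64, V>) -> HashMap<u64, V> {
    map.into_iter().map(|(k, v)| (hash_u64(k), v)).collect()
}

fn main() {
    let mut peers = HashMap::new();
    peers.insert(1001_u64, "Active");
    let anon = anonymize_keys(peers);
    // Same number of entries, but the original peer id is no longer a key.
    assert_eq!(anon.len(), 1);
    assert_eq!(hash_u64(42), hash_u64(42));
}
```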

src/actix/api/service_api.rs (1)

82-82: Increased telemetry detail level for metrics endpoint.

Changing from Level1 to Level3 aligns with the PR objective of providing more granular telemetry data. This will expose additional statistics including shard states with hashed peer IDs.

lib/collection/src/shards/forward_proxy_shard.rs (4)

15-15: Updated imports to include SizeStats.

Added SizeStats to the import list from segment types to support the new get_size_stats method.


29-30: Updated operation types imports.

Added OptimizersStatus to support the new get_optimization_status method and organized imports alphabetically.


348-350: Added method to expose optimization status.

This method properly delegates to the wrapped local shard, maintaining the proxy pattern and ensuring consistent behavior across different shard types.


352-354: Added method to expose size statistics.

This method properly delegates to the wrapped local shard, consistent with the implementation in other shard types like ProxyShard and QueueProxyShard.

lib/collection/src/shards/proxy_shard.rs (3)

17-17: New import for SizeStats looks good

The addition of SizeStats to the imports is necessary for the new get_size_stats method implementation.


31-32: Adding OptimizersStatus is correct and well-placed

The import of OptimizersStatus has been appropriately added to the existing imports block for operation types.


151-157: Implementation of telemetry methods follows delegation pattern correctly

The new methods get_optimization_status and get_size_stats properly delegate to the corresponding methods on the wrapped shard, maintaining the proxy pattern consistently. This matches the implementation pattern seen in other proxy shard types (ForwardProxyShard, QueueProxyShard) and ensures consistent telemetry data gathering across all shard types.

lib/collection/src/shards/shard.rs (3)

10-10: SizeStats import added correctly

The import for SizeStats is appropriately added to the existing segment types import.


15-15: OptimizersStatus import added appropriately

The import for OptimizersStatus is correctly added to the existing operations types import.


91-109: Shard enum methods follow consistent implementation pattern

The new methods get_optimization_status and get_size_stats correctly match on each shard variant and delegate to the corresponding method implementation. This follows the same pattern as other methods in the Shard enum (like get_telemetry_data), ensuring a unified interface across all shard types.

This implementation enables higher-level code to access optimization status and size statistics without needing to know the specific shard implementation, furthering the abstraction.
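The delegation pattern described here can be sketched with a toy enum; the variants and the `Stats` type below are placeholders, not the actual Qdrant shard types:

```rust
// Placeholder stats type standing in for SizeStats.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct Stats { num_points: usize }

struct LocalShard { points: usize }
struct DummyShard;

impl LocalShard {
    fn get_size_stats(&self) -> Stats { Stats { num_points: self.points } }
}
impl DummyShard {
    // A shard without local data returns empty default stats.
    fn get_size_stats(&self) -> Stats { Stats::default() }
}

enum Shard {
    Local(LocalShard),
    Dummy(DummyShard),
}

impl Shard {
    // The enum method matches on each variant and delegates, so callers
    // see one uniform interface regardless of the shard implementation.
    fn get_size_stats(&self) -> Stats {
        match self {
            Shard::Local(s) => s.get_size_stats(),
            Shard::Dummy(s) => s.get_size_stats(),
        }
    }
}

fn main() {
    let shards = vec![Shard::Local(LocalShard { points: 7 }), Shard::Dummy(DummyShard)];
    let total: usize = shards.iter().map(|s| s.get_size_stats().num_points).sum();
    assert_eq!(total, 7);
}
```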

src/common/metrics.rs (1)

441-441: Correctly handles nullable total_duration_micros

This change adapts the metrics generation to handle the now-optional total_duration_micros field by using unwrap_or(0). This prevents potential panics if the field is None and aligns with similar changes for other duration fields seen in the code.

This reflects the structural changes to the OperationDurationStatistics where fields are now optional, likely to support more granular telemetry detail levels.

tests/openapi/test_service.py (2)

44-44: Test updated to verify avg_duration_micros field presence

The test now correctly verifies the presence of the avg_duration_micros field in the telemetry response, which aligns with the updated telemetry structure.


46-61: Good addition of detailed telemetry test

This new test properly validates telemetry at a higher detail level by using the details_level=10 parameter. It verifies the response contains expected data including collection information and request metrics.

The test provides good coverage for the new telemetry detail level functionality and ensures that expanded telemetry data is properly returned.

lib/storage/src/content_manager/toc/mod.rs (2)

29-29: Well-structured import addition.
No issues found with adding CollectionsAggregatedTelemetry; it aligns with the new usage further down.


479-491:

✅ Verification successful

Consider adding error handling for asynchronous calls.
Currently, the method does not handle any potential errors from collection.get_aggregated_telemetry_data(). If there's a scenario where it can fail, consider returning a Result<Vec<CollectionsAggregatedTelemetry>> or other form of error propagation. Otherwise, if it never fails, this is fine.


🏁 Script executed:

#!/bin/bash
# Searching for the definition to confirm if get_aggregated_telemetry_data() can fail.
rg -A 10 "async fn get_aggregated_telemetry_data"

Length of output: 1875


Refined Direction on Telemetry Error Handling

The verification shows that the collection.get_aggregated_telemetry_data() function (in lib/collection/src/collection/mod.rs) always returns a concrete CollectionsAggregatedTelemetry. It internally uses a fallback (unwrap_or(OptimizersStatus::Ok)) when awaiting shard.get_optimization_status(), which indicates that errors are intentionally absorbed rather than propagated. Given this design choice, additional error handling in the caller (i.e., in the method within lib/storage/src/content_manager/toc/mod.rs) isn’t necessary. However, it may be beneficial to document this design decision so that future maintainers understand the rationale.
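The two aggregation choices discussed here, absorbing per-shard failures with a fallback and taking the maximum (worst-case) status, can be sketched together; the enum below is a simplified stand-in for the real OptimizersStatus:

```rust
// Simplified stand-in for the real OptimizersStatus. With derived Ord,
// Ok sorts before Error, so `max` yields the worst observed status.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum OptimizersStatus {
    Ok,
    Error(String),
}

// Absorb missing statuses (unwrap_or(Ok)) and report the worst case,
// mirroring the design choice described in the verification above.
fn aggregate(statuses: Vec<Option<OptimizersStatus>>) -> OptimizersStatus {
    statuses
        .into_iter()
        .map(|s| s.unwrap_or(OptimizersStatus::Ok))
        .max()
        .unwrap_or(OptimizersStatus::Ok)
}

fn main() {
    let all_ok = aggregate(vec![Some(OptimizersStatus::Ok), None]);
    assert_eq!(all_ok, OptimizersStatus::Ok);
    let worst = aggregate(vec![
        Some(OptimizersStatus::Ok),
        Some(OptimizersStatus::Error("oom".to_string())),
    ]);
    assert_eq!(worst, OptimizersStatus::Error("oom".to_string()));
}
```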

lib/collection/src/shards/queue_proxy_shard.rs (4)

17-18: Expanded imports for queue proxy shard.
The newly included SizeStats import correlates with the subsequent usage in get_size_stats().


31-32: Importing necessary result and status types.
These added imports align with the newly introduced telemetry and optimization functionalities.


191-195: Delegated optimization status retrieval.
Forwarding the call to the underlying shard is consistent with the proxy design.


197-199: Delegated size statistics retrieval.
Straightforward proxy to the underlying local shard’s size stats.

lib/collection/src/shards/local_shard/mod.rs (2)

9-9: Telemetry module extraction.
Introducing mod telemetry; cleanly separates telemetry-related logic from the rest of the shard code for better maintainability.


15-15: Atomic concurrency utilities.
The addition of AtomicBool and AtomicUsize looks appropriate for concurrent shard workflows.

lib/collection/src/shards/replica_set/mod.rs (3)

7-7: Good move to encapsulate shard telemetry in a dedicated module.
This separation improves code organization and maintainability by isolating all telemetry-related functionality within telemetry.rs.


20-20: Importing the Anonymize trait looks correct.
As you continue developing anonymized telemetry, verify that all sensitive fields are protected and tested thoroughly.


1191-1193: Adding the Anonymize derive to ReplicaState is consistent with anonymization goals.
Ensure that you provide test coverage for all variant fields, confirming they’re properly handled.

lib/collection/src/telemetry.rs (3)

10-10: New types OptimizersStatus, ReshardingInfo, and ShardTransferInfo are appropriately imported.
No issues spotted with this extension of telemetry data references.


25-25: Verify necessity of removing the default anonymization for shard_clean_tasks.
Double-check whether these task details may contain sensitive or identifiable information. If so, consider preserving or adjusting anonymization.


29-34: CollectionsAggregatedTelemetry struct is a clear and concise addition.
Its fields align with the goals of aggregated telemetry compliance, and the usage of Anonymize is consistent.

lib/common/common/src/types.rs (5)

40-43: Doc comments for Level0 enhance clarity on minimal telemetry.
The added explanations help users understand what data is included at this level.


45-52: Expanded documentation for Level1.
Nicely clarifies the scope of common info that is captured at this level.


54-58: Added commentary for Level2 underscores consensus details.
This change is consistent with capturing more advanced cluster and collection configuration info.


59-64: Introducing Level3 and Level4 with descriptive doc comments.
Helpful for explicitly distinguishing advanced telemetry, such as optimizer info and segment-level insights.


81-84: Extended mapping from usize to DetailsLevel is valid.
Correctly defaults all values above 4 to Level4, ensuring graceful handling of higher user inputs.
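The saturating conversion described here can be sketched as follows; the variant names match the review, while the `From` body is an illustrative reconstruction:

```rust
// Five telemetry detail levels; deriving Ord lets callers compare with
// `detail.level >= DetailsLevel::Level3` as shown elsewhere in this review.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum DetailsLevel {
    Level0,
    Level1,
    Level2,
    Level3,
    Level4,
}

impl From<usize> for DetailsLevel {
    fn from(level: usize) -> Self {
        match level {
            0 => DetailsLevel::Level0,
            1 => DetailsLevel::Level1,
            2 => DetailsLevel::Level2,
            3 => DetailsLevel::Level3,
            // Anything above 4 saturates to the most detailed level.
            _ => DetailsLevel::Level4,
        }
    }
}

fn main() {
    assert_eq!(DetailsLevel::from(3), DetailsLevel::Level3);
    // e.g. the details_level=10 used in the updated openapi test
    // clamps to Level4 rather than failing.
    assert_eq!(DetailsLevel::from(10), DetailsLevel::Level4);
}
```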

lib/collection/src/collection/mod.rs (2)

56-58: Imports refactored for enhanced telemetry capabilities.

Good organization of telemetry-related imports, which helps maintain clarity as telemetry features expand.


777-787: Improved telemetry data handling with conditional collection.

The conditional collection of shard telemetry based on the detail level is an excellent optimization. By only collecting detailed shard telemetry when necessary (level >= Level3), you reduce unnecessary processing and data transfer for lower detail levels.

lib/collection/src/shards/dummy_shard.rs (3)

13-15: Updated imports to support telemetry enhancements.

Properly updated imports to include SizeStats which aligns with the telemetry interface enhancements.


60-73: Extended telemetry data structure with optional fields.

The LocalShardTelemetry structure has been enhanced with new optional fields for more detailed telemetry. The use of None for these values in the dummy shard implementation is appropriate as it correctly indicates that these statistics aren't available for this type of shard.


75-82: Consistent telemetry interface across shard types.

Adding the get_optimization_status and get_size_stats methods to the DummyShard ensures a consistent interface across all shard types, which simplifies the collection layer's telemetry aggregation logic. The default return values are appropriate for this dummy implementation.

The methods correctly align with the interface expected by the new get_aggregated_telemetry_data method in the Collection struct.

docs/redoc/master/openapi.json (2)

12095-12133: Correctly made total_duration_micros optional and nullable in OperationDurationStatistics.

The field total_duration_micros is now nullable and not required, matching the Rust struct's use of Option<u64>. This change is correct and ensures the OpenAPI schema accurately reflects the backend data model. No issues found.


12189-12195: Correctly made log optional and nullable in OptimizerTelemetry.

The log field is now nullable and not required, aligning with the Rust struct's use of Option<Vec<...>>. This is a correct and necessary change for accurate API documentation and client compatibility.

lib/collection/src/shards/replica_set/telemetry.rs (4)

1-7: Imports look good.

The imports are well-organized and include all necessary dependencies for the telemetry implementation.


8-31: Implementation of get_telemetry_data is well structured.

The implementation efficiently collects telemetry data from both local and remote shards, properly handling the case where local shard might be absent. The code correctly uses async/await patterns and thread-safe access through read locks.


33-38: LGTM: Concise implementation for retrieving optimization status.

This method elegantly maps the Option through a closure to retrieve the optimization status only when a local shard exists.


40-48: Good handling of default case in get_size_stats.

The method correctly retrieves size statistics from the local shard when available, or returns default SizeStats when no local shard exists, which is a robust approach.
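The fallback pattern the review describes can be sketched as follows. `SizeStats` here is a stand-in with illustrative fields, not the actual qdrant type:

```rust
// Minimal sketch of "use the local shard's stats or fall back to defaults".
// Struct fields are illustrative, not the real qdrant definitions.
#[derive(Default, Debug, PartialEq, Clone)]
struct SizeStats {
    num_vectors: usize,
    num_points: usize,
}

fn get_size_stats(local: Option<&SizeStats>) -> SizeStats {
    // When no local shard exists, report zeroed defaults rather than failing.
    local.cloned().unwrap_or_default()
}
```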

lib/segment/src/common/operation_time_statistics.rs (5)

5-6: Good update to imports for detail level.

The import of Level1 from DetailsLevel enables cleaner code when checking telemetry detail levels.


43-45: Appropriate change to make total_duration_micros optional.

Making this field optional with serialization skipping if None allows for more control over telemetry detail levels.


137-142: Well-implemented handling of optional total_duration_micros in Add trait.

The pattern matching handles all possible combinations of the optional values correctly.
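The exhaustive match over the two optionals looks roughly like this sketch (not the exact qdrant code, just the pattern the review refers to):

```rust
// Illustrative sketch of adding two optional duration totals, as used
// when summing operation statistics; field handling only, no real types.
fn merge_total(a: Option<u64>, b: Option<u64>) -> Option<u64> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x + y),
        (Some(x), None) | (None, Some(x)) => Some(x),
        (None, None) => None,
    }
}
```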


300-310: Good implementation of conditional field inclusion based on detail level.

The code now conditionally includes telemetry fields based on the detail level, using the then_some and flatten pattern effectively. This is a clean approach to implementing the detail levels.
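The `then_some`/`flatten` gating reads like the following sketch; the enum and function names are assumptions for illustration, not the actual qdrant definitions:

```rust
// Gate an optional telemetry field on the detail level.
#[derive(PartialEq, PartialOrd)]
enum DetailsLevel {
    Level0,
    Level1,
}

fn report_total(level: DetailsLevel, total: Option<u64>) -> Option<u64> {
    // `then_some` yields Option<Option<u64>>; `flatten` collapses it,
    // so the field is emitted only at Level1 and above.
    (level >= DetailsLevel::Level1).then_some(total).flatten()
}
```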


43-45: ✅ Verification successful

Verify backward compatibility with API consumers.

The change from required u64 to optional Option<u64> for total_duration_micros might affect consumers of this API. Ensure that any code directly accessing this field has been updated to handle the Option type.

Also applies to: 137-142, 305-310


🏁 Script executed:

#!/bin/bash
# Search for direct references to total_duration_micros field
rg -A 2 "\.total_duration_micros"

Length of output: 591


API Consumer Backward Compatibility Verified

The recent changes converting total_duration_micros from a required u64 to an optional Option<u64> have been internally updated. Our search confirms the following:

  • In src/common/metrics.rs:
    The field is accessed via stat.total_duration_micros.unwrap_or(0), which correctly defaults to 0 when the value is None.

  • In lib/segment/src/common/operation_time_statistics.rs:
    The code now uses pattern matching on (self.total_duration_micros, other.total_duration_micros) to handle all Option cases appropriately.

While these updates ensure proper handling within the codebase, please ensure that any external API consumers are aware of the change in serialization (i.e., the absence of the key when the value is None) and have updated their integrations accordingly.

lib/collection/src/shards/telemetry.rs (4)

4-5: Improved anonymization import for better flexibility.

The explicit import of anonymize_collection_with_u64_hashable_key enables more specific anonymization of collections with u64 keys.


22-23: Enhanced anonymization for replica states.

Using anonymize_collection_with_u64_hashable_key for the replicate_states field provides better anonymization while still maintaining the ability to identify related peers by their hashed IDs.
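The idea behind hashing u64 map keys can be sketched as below: peer ids are replaced by hashes, so related entries stay correlated across the report without exposing the real ids. This is an illustration of the approach, not qdrant's actual implementation:

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Replace each u64 key with its hash while keeping the values intact.
fn anonymize_u64_keys<V>(map: HashMap<u64, V>) -> HashMap<u64, V> {
    map.into_iter()
        .map(|(key, value)| {
            let mut hasher = DefaultHasher::new();
            key.hash(&mut hasher);
            (hasher.finish(), value)
        })
        .collect()
}
```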


43-60: Well-documented addition of size metrics to LocalShardTelemetry.

The new optional fields provide valuable size and count metrics with appropriate warnings about their approximate nature. The documentation clearly indicates these values should not be strictly relied upon.


71-72: Good change to make optimizer logs optional.

Making the log field optional in OptimizerTelemetry with serialization skipping if None helps reduce response size when detailed logs aren't needed.

src/common/telemetry_ops/collections_telemetry.rs (2)

1-1: Updated import to use new aggregated telemetry structure.

Importing CollectionsAggregatedTelemetry from the collection telemetry module is a good refactoring approach.


28-40: Improved conditional telemetry collection based on detail level.

The implementation now properly branches based on the telemetry detail level, using the appropriate method for each level. For Level2 and above, it collects full telemetry data, while for lower levels it fetches aggregated data. This approach is more efficient than always fetching full telemetry and then converting.
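The branching described above can be sketched roughly as follows; the enum and payload types are hypothetical stand-ins, not the real qdrant types:

```rust
// Pick the telemetry path by detail level: full data at Level2 and above,
// the cheaper aggregated form below that.
#[derive(Debug)]
enum DetailsLevel { Level0, Level1, Level2, Level3, Level4 }

#[derive(Debug)]
enum CollectionsTelemetry {
    Full(Vec<String>),       // stand-in for per-collection full telemetry
    Aggregated(Vec<String>), // stand-in for the aggregated form
}

fn collect(level: &DetailsLevel) -> CollectionsTelemetry {
    match level {
        DetailsLevel::Level2 | DetailsLevel::Level3 | DetailsLevel::Level4 => {
            CollectionsTelemetry::Full(vec!["collection-1".to_string()])
        }
        _ => CollectionsTelemetry::Aggregated(vec!["collection-1".to_string()]),
    }
}
```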

lib/collection/src/shards/local_shard/telemetry.rs (2)

68-76: Well-implemented optimizer status retrieval method

This method correctly abstracts the optimizer status logic that was previously embedded in the get_telemetry_data method. The lock handling is appropriate, acquiring the lock only for the duration needed to check the optimizer errors.


78-96: Good implementation of size statistics aggregation

The get_size_stats method efficiently aggregates size statistics across all segments. The lock handling is appropriate, and the method returns accumulated statistics correctly.
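Folding per-segment statistics into one shard-level total can be sketched like this; the struct fields are illustrative, not the actual qdrant `SizeStats`:

```rust
// Accumulate size statistics across all segments of a shard.
#[derive(Default, Debug, PartialEq)]
struct SizeStats {
    num_vectors: usize,
    vectors_size_bytes: usize,
}

fn aggregate_segments(segments: &[SizeStats]) -> SizeStats {
    segments.iter().fold(SizeStats::default(), |mut acc, seg| {
        acc.num_vectors += seg.num_vectors;
        acc.vectors_size_bytes += seg.vectors_size_bytes;
        acc
    })
}
```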

Comment on lines +40 to +64
/// Minimal information level
/// - app info
/// - minimal telemetry by endpoint
/// - cluster status
Level0,
/// Detailed common info level
/// - app info details
/// - system info
/// - hardware flags
/// - hardware usage per collection
/// - RAM usage
/// - cluster basic details
/// - collections basic info
Level1,
/// Detailed consensus info - peers info
/// Collections:
/// - detailed config
/// - Shards - basic config
Level2,
/// Shards:
/// - detailed config
/// - Optimizers info
Level3,
/// Segment level telemetry
Level4,
Member Author:

Break-down is here

Contributor:

Might it also make sense to include the dev feature flags into Level1?

@timvisee timvisee self-requested a review April 17, 2025 14:07
Contributor @coszio left a comment:

The only relevant change I'd like is to not output an empty list for "shards" when details_level < 3.

I made a PR (#6398) to fix this and the same for "resharding" and "transfers". I didn't commit directly in case we want to keep these two others as basic shards info.

@generall generall requested a review from coszio April 17, 2025 21:31
@generall generall merged commit c57d748 into dev Apr 17, 2025
17 checks passed
@generall generall deleted the telemetry-levels branch April 17, 2025 21:32
@KShivendu KShivendu mentioned this pull request Apr 18, 2025
pull bot pushed a commit to kp-forks/qdrant that referenced this pull request Apr 21, 2025
* allow shard states in anonymize telemetry (with hashed peer ids)

* introduce level3 and level4 for telemetry, where level4 = everything, incl. segments info

* upd openapi

* fix tests

* expose vector count & size stats on shard level to avoid reading of segments

* fix spelling

* upd schema

* fix tests

* Use unwrap_or_default

* [qdrant#6390] skip serializing shard details in Level2 (qdrant#6398)

* skip serializing shard details in Level2

* upd openapi

---------

Co-authored-by: generall <andrey@vasnetsov.com>

---------

Co-authored-by: Tim Visée <tim+github@visee.me>
Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>