Telemetry improvements #6390

Conversation
📝 Walkthrough

This change set introduces significant enhancements and refactoring to the telemetry infrastructure across the collection, shard, and segment layers. The telemetry detail system is expanded from three to five levels, enabling more granular telemetry data retrieval, and several telemetry schemas are updated accordingly.
Actionable comments posted: 0
🧹 Nitpick comments (5)
lib/collection/src/collection/mod.rs (1)
808-835: New method for aggregated telemetry improves efficiency and granularity.

The implementation of `get_aggregated_telemetry_data` effectively aggregates key metrics across all shards, providing a lightweight summary for scenarios where detailed shard-level telemetry isn't needed. Taking the maximum optimizer status (the worst case) is a good way to represent overall health. However, consider adding some form of error handling for `get_size_stats().await.num_vectors` to handle potential failures more gracefully.

```diff
-            vectors += shard.get_size_stats().await.num_vectors;
+            if let Ok(stats) = shard.get_size_stats().await {
+                vectors += stats.num_vectors;
+            }
```
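For readers skimming the thread, here is a small, self-contained sketch of the "worst case wins" aggregation strategy the comment above praises: per-shard optimizer statuses are folded into a single status by taking the maximum. The types and names below are stand-ins for illustration, not the actual Qdrant definitions.

```rust
// Stand-in for the optimizer status type: `Ok` is declared first, so any
// `Error(_)` compares greater and wins when taking the maximum.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum OptimizersStatus {
    Ok,
    Error(String),
}

// Fold many per-shard statuses into one aggregate: the "worst" status wins.
fn aggregate_status(statuses: impl IntoIterator<Item = OptimizersStatus>) -> OptimizersStatus {
    statuses.into_iter().max().unwrap_or(OptimizersStatus::Ok)
}

fn main() {
    let per_shard = vec![
        OptimizersStatus::Ok,
        OptimizersStatus::Error("too many segments".to_string()),
        OptimizersStatus::Ok,
    ];
    // Any error outranks Ok, so the aggregate reflects the unhealthiest shard.
    assert_eq!(
        aggregate_status(per_shard),
        OptimizersStatus::Error("too many segments".to_string())
    );
}
```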
lib/collection/src/shards/replica_set/telemetry.rs (1)

1-49: Consider adding documentation comments to the public methods.

While the implementation is clear, adding documentation comments would help explain the purpose of each method and the meaning of the telemetry detail parameter.

```diff
+    /// Collects telemetry data from local and remote shards at the specified detail level.
+    /// Returns a ReplicaSetTelemetry structure containing the shard id, key, and replica states.
+    ///
+    /// # Arguments
+    /// * `detail` - Controls the level of detail included in the telemetry data
     pub(crate) async fn get_telemetry_data(&self, detail: TelemetryDetail) -> ReplicaSetTelemetry {

+    /// Retrieves the optimization status from the local shard if it exists.
+    /// Returns None if there is no local shard.
     pub(crate) async fn get_optimization_status(&self) -> Option<OptimizersStatus> {

+    /// Retrieves size statistics from the local shard.
+    /// Returns default SizeStats if there is no local shard.
     pub(crate) async fn get_size_stats(&self) -> SizeStats {
```

src/common/telemetry_ops/collections_telemetry.rs (1)
26-45: Consider extracting the detail level threshold to a constant.

The detail level threshold (Level2) is hardcoded in the if condition. Consider extracting this to a named constant for better readability and easier maintenance.

```diff
+    // Threshold for collecting full telemetry data
+    const FULL_TELEMETRY_THRESHOLD: DetailsLevel = DetailsLevel::Level2;
+
     pub async fn collect(detail: TelemetryDetail, access: &Access, toc: &TableOfContent) -> Self {
         let number_of_collections = toc.all_collections(access).await.len();
         let collections = if detail.level >= DetailsLevel::Level1 {
-            let telemetry_data = if detail.level >= DetailsLevel::Level2 {
+            let telemetry_data = if detail.level >= Self::FULL_TELEMETRY_THRESHOLD {
                 toc.get_telemetry_data(detail, access)
                     .await
                     .into_iter()
```

lib/collection/src/shards/local_shard/telemetry.rs (2)
11-66: Good implementation of telemetry collection, with a suggestion for optimization.

The `get_telemetry_data` method properly collects telemetry information based on the requested detail level. However, there is a potential performance improvement: a read lock on segments is acquired twice, once directly at line 13 and again through `get_size_stats()` at line 47. Consider refactoring to use a single lock acquisition:

```diff
 pub fn get_telemetry_data(&self, detail: TelemetryDetail) -> LocalShardTelemetry {
     let segments_read_guard = self.segments.read();
     let segments: Vec<_> = if detail.level >= DetailsLevel::Level4 {
         segments_read_guard
             .iter()
             .map(|(_id, segment)| segment.get().read().get_telemetry_data(detail))
             .collect()
     } else {
         vec![]
     };

     let optimizer_status = match &segments_read_guard.optimizer_errors {
         None => OptimizersStatus::Ok,
         Some(error) => OptimizersStatus::Error(error.to_string()),
     };
+    // Calculate size stats while we have the lock
+    let mut size_stats = SizeStats::default();
+    for (_segment_id, segment) in segments_read_guard.iter() {
+        let segment_info = segment.get().read().info();
+        size_stats.num_vectors += segment_info.num_vectors;
+        size_stats.num_points += segment_info.num_points;
+        size_stats.vectors_size_bytes += segment_info.vectors_size_bytes;
+        size_stats.payloads_size_bytes += segment_info.payloads_size_bytes;
+    }
     drop(segments_read_guard);

     let optimizations = self
         .optimizers
         .iter()
         .map(|optimizer| {
             optimizer
                 .get_telemetry_counter()
                 .lock()
                 .get_statistics(detail)
         })
         .fold(Default::default(), |acc, x| acc + x);

     let total_optimized_points = self.total_optimized_points.load(Ordering::Relaxed);

-    let SizeStats {
-        num_vectors,
-        vectors_size_bytes,
-        payloads_size_bytes,
-        num_points,
-    } = self.get_size_stats();
+    let SizeStats {
+        num_vectors,
+        vectors_size_bytes,
+        payloads_size_bytes,
+        num_points,
+    } = size_stats;

     LocalShardTelemetry {
         variant_name: None,
         status: None,
         total_optimized_points,
         vectors_size_bytes: Some(vectors_size_bytes),
         payloads_size_bytes: Some(payloads_size_bytes),
         num_points: Some(num_points),
         num_vectors: Some(num_vectors),
         segments,
         optimizations: OptimizerTelemetry {
             status: optimizer_status,
             optimizations,
             log: (detail.level >= DetailsLevel::Level4)
                 .then(|| self.optimizers_log.lock().to_telemetry()),
         },
         async_scorer: Some(get_async_scorer()),
     }
 }
```
61-62: Consider centralizing telemetry level constants.

The detail level check is duplicated between segment collection and optimizer log inclusion. To improve maintainability, consider defining a constant or moving this level check to a helper method:

```diff
-            log: (detail.level >= DetailsLevel::Level4)
-                .then(|| self.optimizers_log.lock().to_telemetry()),
+            log: should_include_detailed_logs(detail)
+                .then(|| self.optimizers_log.lock().to_telemetry()),
```

And add a helper method:

```rust
fn should_include_detailed_logs(detail: TelemetryDetail) -> bool {
    detail.level >= DetailsLevel::Level4
}
```

This way, if the level requirement changes in the future, you only need to update it in one place.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)

- docs/redoc/master/openapi.json (3 hunks)
- lib/collection/src/collection/mod.rs (5 hunks)
- lib/collection/src/shards/dummy_shard.rs (2 hunks)
- lib/collection/src/shards/forward_proxy_shard.rs (3 hunks)
- lib/collection/src/shards/local_shard/mod.rs (1 hunks)
- lib/collection/src/shards/local_shard/telemetry.rs (1 hunks)
- lib/collection/src/shards/proxy_shard.rs (3 hunks)
- lib/collection/src/shards/queue_proxy_shard.rs (3 hunks)
- lib/collection/src/shards/replica_set/mod.rs (3 hunks)
- lib/collection/src/shards/replica_set/telemetry.rs (1 hunks)
- lib/collection/src/shards/shard.rs (2 hunks)
- lib/collection/src/shards/telemetry.rs (4 hunks)
- lib/collection/src/telemetry.rs (2 hunks)
- lib/common/common/src/types.rs (2 hunks)
- lib/segment/src/common/anonymize.rs (1 hunks)
- lib/segment/src/common/operation_time_statistics.rs (4 hunks)
- lib/segment/src/types.rs (1 hunks)
- lib/storage/src/content_manager/toc/mod.rs (2 hunks)
- src/actix/api/service_api.rs (1 hunks)
- src/common/metrics.rs (1 hunks)
- src/common/telemetry_ops/collections_telemetry.rs (2 hunks)
- tests/openapi/test_service.py (1 hunks)
🧰 Additional context used

🧬 Code Graph Analysis (6)

tests/openapi/test_service.py (1)
- tests/openapi/helpers/helpers.py (1): request_with_validation (39-93)

lib/collection/src/shards/forward_proxy_shard.rs (6)
- lib/collection/src/shards/dummy_shard.rs (2): get_optimization_status (75-77), get_size_stats (79-81)
- lib/collection/src/shards/local_shard/telemetry.rs (2): get_optimization_status (68-76), get_size_stats (78-96)
- lib/collection/src/shards/proxy_shard.rs (2): get_optimization_status (151-153), get_size_stats (155-157)
- lib/collection/src/shards/replica_set/telemetry.rs (2): get_optimization_status (33-38), get_size_stats (40-48)
- lib/collection/src/shards/queue_proxy_shard.rs (2): get_optimization_status (191-195), get_size_stats (197-199)
- lib/collection/src/shards/shard.rs (2): get_optimization_status (91-99), get_size_stats (101-109)

lib/collection/src/shards/shard.rs (5)
- lib/collection/src/shards/proxy_shard.rs (2): get_optimization_status (151-153), get_size_stats (155-157)
- lib/collection/src/shards/forward_proxy_shard.rs (2): get_optimization_status (348-350), get_size_stats (352-354)
- lib/collection/src/shards/queue_proxy_shard.rs (2): get_optimization_status (191-195), get_size_stats (197-199)
- lib/collection/src/shards/dummy_shard.rs (2): get_optimization_status (75-77), get_size_stats (79-81)
- lib/collection/src/shards/local_shard/telemetry.rs (2): get_optimization_status (68-76), get_size_stats (78-96)

lib/collection/src/shards/proxy_shard.rs (6)
- lib/collection/src/shards/dummy_shard.rs (2): get_optimization_status (75-77), get_size_stats (79-81)
- lib/collection/src/shards/local_shard/telemetry.rs (2): get_optimization_status (68-76), get_size_stats (78-96)
- lib/collection/src/shards/forward_proxy_shard.rs (2): get_optimization_status (348-350), get_size_stats (352-354)
- lib/collection/src/shards/replica_set/telemetry.rs (2): get_optimization_status (33-38), get_size_stats (40-48)
- lib/collection/src/shards/queue_proxy_shard.rs (2): get_optimization_status (191-195), get_size_stats (197-199)
- lib/collection/src/shards/shard.rs (2): get_optimization_status (91-99), get_size_stats (101-109)

lib/collection/src/shards/queue_proxy_shard.rs (6)
- lib/collection/src/shards/dummy_shard.rs (2): get_optimization_status (75-77), get_size_stats (79-81)
- lib/collection/src/shards/local_shard/telemetry.rs (2): get_optimization_status (68-76), get_size_stats (78-96)
- lib/collection/src/shards/proxy_shard.rs (2): get_optimization_status (151-153), get_size_stats (155-157)
- lib/collection/src/shards/forward_proxy_shard.rs (2): get_optimization_status (348-350), get_size_stats (352-354)
- lib/collection/src/shards/replica_set/telemetry.rs (2): get_optimization_status (33-38), get_size_stats (40-48)
- lib/collection/src/shards/shard.rs (2): get_optimization_status (91-99), get_size_stats (101-109)

lib/collection/src/shards/local_shard/telemetry.rs (8)
- lib/segment/src/vector_storage/common.rs (1): get_async_scorer (11-13)
- lib/collection/src/shards/dummy_shard.rs (3): get_telemetry_data (60-73), get_optimization_status (75-77), get_size_stats (79-81)
- lib/collection/src/shards/proxy_shard.rs (3): get_telemetry_data (147-149), get_optimization_status (151-153), get_size_stats (155-157)
- lib/collection/src/shards/forward_proxy_shard.rs (3): get_telemetry_data (344-346), get_optimization_status (348-350), get_size_stats (352-354)
- lib/collection/src/collection/mod.rs (1): get_telemetry_data (776-806)
- lib/collection/src/shards/queue_proxy_shard.rs (3): get_telemetry_data (185-189), get_optimization_status (191-195), get_size_stats (197-199)
- lib/collection/src/shards/shard.rs (3): get_telemetry_data (73-89), get_optimization_status (91-99), get_size_stats (101-109)
- lib/common/common/src/types.rs (1): default (68-73)
🔇 Additional comments (60)
lib/segment/src/types.rs (1)
421-427: Clean implementation of the size statistics struct.

The `SizeStats` struct provides a standardized way to collect and report size-related metrics across different shard types. The chosen fields cover the essential metrics for monitoring vector storage usage.
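As a rough, hypothetical sketch of what such a struct could look like: the field names are taken from the destructuring shown elsewhere in this review, while the integer widths and derives are assumptions.

```rust
// Illustrative stand-in for a size-statistics struct; not the actual Qdrant definition.
#[derive(Clone, Copy, Debug, Default)]
pub struct SizeStats {
    pub num_vectors: usize,
    pub vectors_size_bytes: usize,
    pub payloads_size_bytes: usize,
    pub num_points: usize,
}
```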
lib/segment/src/common/anonymize.rs (2)

95-99: Good implementation of u64 hashing.

The `hash_u64` function provides a clean way to anonymize u64 values (like peer IDs) using Rust's default hasher, which is appropriate for telemetry anonymization purposes.
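A minimal, self-contained sketch of the idea, hashing a u64 peer ID through Rust's default hasher, is shown below; the real helper's exact signature is not reproduced here and may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a u64 (e.g. a peer ID) to an opaque u64 digest for anonymized telemetry.
fn hash_u64(value: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let peer_id: u64 = 123_456_789;
    println!("anonymized peer id: {}", hash_u64(peer_id));
}
```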
101-111: Well-implemented anonymization for u64-keyed collections.

This function follows the pattern of existing anonymization utilities and enables anonymization of collections with u64 keys, which is needed for including hashed peer IDs in telemetry data while maintaining privacy.

src/actix/api/service_api.rs (1)

82-82: Increased telemetry detail level for the metrics endpoint.

Changing from Level1 to Level3 aligns with the PR objective of providing more granular telemetry data. This will expose additional statistics, including shard states with hashed peer IDs.
lib/collection/src/shards/forward_proxy_shard.rs (4)
15-15: Updated imports to include SizeStats.

Added `SizeStats` to the import list from segment types to support the new `get_size_stats` method.

29-30: Updated operation types imports.

Added `OptimizersStatus` to support the new `get_optimization_status` method and organized imports alphabetically.

348-350: Added method to expose optimization status.

This method properly delegates to the wrapped local shard, maintaining the proxy pattern and ensuring consistent behavior across different shard types.

352-354: Added method to expose size statistics.

This method properly delegates to the wrapped local shard, consistent with the implementation in other shard types like `ProxyShard` and `QueueProxyShard`.

lib/collection/src/shards/proxy_shard.rs (3)
17-17: New import for SizeStats looks good.

The addition of `SizeStats` to the imports is necessary for the new `get_size_stats` method implementation.

31-32: Adding OptimizersStatus is correct and well placed.

The import of `OptimizersStatus` has been appropriately added to the existing imports block for operation types.

151-157: Implementation of telemetry methods follows the delegation pattern correctly.

The new methods `get_optimization_status` and `get_size_stats` properly delegate to the corresponding methods on the wrapped shard, maintaining the proxy pattern consistently. This matches the implementation pattern seen in other proxy shard types (ForwardProxyShard, QueueProxyShard) and ensures consistent telemetry data gathering across all shard types.

lib/collection/src/shards/shard.rs (3)

10-10: SizeStats import added correctly.

The import for `SizeStats` is appropriately added to the existing segment types import.

15-15: OptimizersStatus import added appropriately.

The import for `OptimizersStatus` is correctly added to the existing operations types import.

91-109: Shard enum methods follow a consistent implementation pattern.

The new methods `get_optimization_status` and `get_size_stats` correctly match on each shard variant and delegate to the corresponding method implementation. This follows the same pattern as other methods in the `Shard` enum (like `get_telemetry_data`), ensuring a unified interface across all shard types. This implementation enables higher-level code to access optimization status and size statistics without needing to know the specific shard implementation, furthering the abstraction.
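To make the enum-dispatch idea concrete, here is a self-contained toy sketch of the pattern described above, using stand-in types: the real `Shard` enum has more variants and different method signatures.

```rust
// Stand-in size statistics for the sketch.
#[derive(Clone, Copy, Default)]
struct SizeStats {
    num_vectors: usize,
    num_points: usize,
}

struct LocalShard {
    stats: SizeStats,
}

struct DummyShard;

// A simplified shard enum: each variant wraps a concrete shard type.
enum Shard {
    Local(LocalShard),
    Dummy(DummyShard),
}

impl Shard {
    // The enum-level method matches on the variant and delegates, so callers
    // never need to know which concrete shard they are talking to.
    fn get_size_stats(&self) -> SizeStats {
        match self {
            Shard::Local(shard) => shard.stats,
            Shard::Dummy(_) => SizeStats::default(),
        }
    }
}

fn main() {
    let shard = Shard::Local(LocalShard {
        stats: SizeStats { num_vectors: 10, num_points: 5 },
    });
    assert_eq!(shard.get_size_stats().num_vectors, 10);

    let dummy = Shard::Dummy(DummyShard);
    assert_eq!(dummy.get_size_stats().num_points, 0);
}
```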
src/common/metrics.rs (1)
441-441: Correctly handles the nullable total_duration_micros.

This change adapts the metrics generation to handle the now-optional `total_duration_micros` field by using `unwrap_or(0)`. This prevents potential panics if the field is `None` and aligns with similar changes for other duration fields seen in the code. It reflects the structural changes to `OperationDurationStatistics`, where fields are now optional, likely to support more granular telemetry detail levels.
44-44
: Test updated to verify avg_duration_micros field presenceThe test now correctly verifies the presence of the
avg_duration_micros
field in the telemetry response, which aligns with the updated telemetry structure.
46-61
: Good addition of detailed telemetry testThis new test properly validates telemetry at a higher detail level by using the
details_level=10
parameter. It verifies the response contains expected data including collection information and request metrics.The test provides good coverage for the new telemetry detail level functionality and ensures that expanded telemetry data is properly returned.
lib/storage/src/content_manager/toc/mod.rs (2)
29-29
: Well-structured import addition.
No issues found with addingCollectionsAggregatedTelemetry
; it aligns with the new usage further down.
479-491
:✅ Verification successful
Consider adding error handling for asynchronous calls.
Currently, the method does not handle any potential errors fromcollection.get_aggregated_telemetry_data()
. If there's a scenario where it can fail, consider returning aResult<Vec<CollectionsAggregatedTelemetry>>
or other form of error propagation. Otherwise, if it never fails, this is fine.
🏁 Script executed:
#!/bin/bash # Searching for the definition to confirm if get_aggregated_telemetry_data() can fail. rg -A 10 "async fn get_aggregated_telemetry_data"Length of output: 1875
Refined Direction on Telemetry Error Handling
The verification shows that the
collection.get_aggregated_telemetry_data()
function (inlib/collection/src/collection/mod.rs
) always returns a concreteCollectionsAggregatedTelemetry
. It internally uses a fallback (unwrap_or(OptimizersStatus::Ok)
) when awaitingshard.get_optimization_status()
, which indicates that errors are intentionally absorbed rather than propagated. Given this design choice, additional error handling in the caller (i.e., in the method withinlib/storage/src/content_manager/toc/mod.rs
) isn’t necessary. However, it may be beneficial to document this design decision so that future maintainers understand the rationale.lib/collection/src/shards/queue_proxy_shard.rs (4)
17-18
: Expanded imports for queue proxy shard.
The newly includedSizeStats
import correlates with the subsequent usage inget_size_stats()
.
31-32
: Importing necessary result and status types.
These added imports align with the newly introduced telemetry and optimization functionalities.
191-195
: Delegated optimization status retrieval.
Forwarding the call to the underlying shard is consistent with the proxy design.
197-199
: Delegated size statistics retrieval.
Straightforward proxy to the underlying local shard’s size stats.lib/collection/src/shards/local_shard/mod.rs (2)
9-9
: Telemetry module extraction.
Introducingmod telemetry;
cleanly separates telemetry-related logic from the rest of the shard code for better maintainability.
15-15
: Atomic concurrency utilities.
The addition ofAtomicBool
andAtomicUsize
looks appropriate for concurrent shard workflows.lib/collection/src/shards/replica_set/mod.rs (3)
7-7
: Good move to encapsulate shard telemetry in a dedicated module.
This separation improves code organization and maintainability by isolating all telemetry-related functionality withintelemetry.rs
.
20-20
: Importing theAnonymize
trait looks correct.
As you continue developing anonymized telemetry, verify that all sensitive fields are protected and tested thoroughly.
1191-1193
: Adding theAnonymize
derive toReplicaState
is consistent with anonymization goals.
Ensure that you provide test coverage for all variant fields, confirming they’re properly handled.lib/collection/src/telemetry.rs (3)
10-10
: New typesOptimizersStatus
,ReshardingInfo
, andShardTransferInfo
are appropriately imported.
No issues spotted with this extension of telemetry data references.
25-25
: Verify necessity of removing the default anonymization forshard_clean_tasks
.
Double-check whether these task details may contain sensitive or identifiable information. If so, consider preserving or adjusting anonymization.
29-34
:CollectionsAggregatedTelemetry
struct is a clear and concise addition.
Its fields align with the goals of aggregated telemetry compliance, and the usage ofAnonymize
is consistent.lib/common/common/src/types.rs (5)
40-43
: Doc comments forLevel0
enhance clarity on minimal telemetry.
The added explanations help users understand what data is included at this level.
45-52
: Expanded documentation forLevel1
.
Nicely clarifies the scope of common info that is captured at this level.
54-58
: Added commentary forLevel2
underscores consensus details.
This change is consistent with capturing more advanced cluster and collection configuration info.
59-64
: IntroducingLevel3
andLevel4
with descriptive doc comments.
Helpful for explicitly distinguishing advanced telemetry, such as optimizer info and segment-level insights.
81-84
: Extended mapping fromusize
toDetailsLevel
is valid.
Correctly defaults all values above 4 toLevel4
, ensuring graceful handling of higher user inputs.lib/collection/src/collection/mod.rs (2)
56-58
: Imports refactored for enhanced telemetry capabilities.Good organization of telemetry-related imports, which helps maintain clarity as telemetry features expand.
777-787
: Improved telemetry data handling with conditional collection.The conditional collection of shard telemetry based on the detail level is an excellent optimization. By only collecting detailed shard telemetry when necessary (level >= Level3), you reduce unnecessary processing and data transfer for lower detail levels.
lib/collection/src/shards/dummy_shard.rs (3)
13-15
: Updated imports to support telemetry enhancements.Properly updated imports to include
SizeStats
which aligns with the telemetry interface enhancements.
60-73
: Extended telemetry data structure with optional fields.The
LocalShardTelemetry
structure has been enhanced with new optional fields for more detailed telemetry. The use ofNone
for these values in the dummy shard implementation is appropriate as it correctly indicates that these statistics aren't available for this type of shard.
75-82
: Consistent telemetry interface across shard types.Adding the
get_optimization_status
andget_size_stats
methods to theDummyShard
ensures a consistent interface across all shard types, which simplifies the collection layer's telemetry aggregation logic. The default return values are appropriate for this dummy implementation.The methods correctly align with the interface expected by the new
get_aggregated_telemetry_data
method in the Collection struct.docs/redoc/master/openapi.json (2)
12095-12133
: Correctly madetotal_duration_micros
optional and nullable inOperationDurationStatistics
.The field
total_duration_micros
is now nullable and not required, matching the Rust struct's use ofOption<u64>
. This change is correct and ensures the OpenAPI schema accurately reflects the backend data model. No issues found.
12189-12195
: Correctly madelog
optional and nullable inOptimizerTelemetry
.The
log
field is now nullable and not required, aligning with the Rust struct's use ofOption<Vec<...>>
. This is a correct and necessary change for accurate API documentation and client compatibility.lib/collection/src/shards/replica_set/telemetry.rs (4)
1-7
: Imports look good.The imports are well-organized and include all necessary dependencies for the telemetry implementation.
8-31
: Implementation of get_telemetry_data is well structured.The implementation efficiently collects telemetry data from both local and remote shards, properly handling the case where local shard might be absent. The code correctly uses async/await patterns and thread-safe access through read locks.
33-38
: LGTM: Concise implementation for retrieving optimization status.This method elegantly maps the Option through a closure to retrieve the optimization status only when a local shard exists.
40-48
: Good handling of default case in get_size_stats.The method correctly retrieves size statistics from the local shard when available, or returns default SizeStats when no local shard exists, which is a robust approach.
lib/segment/src/common/operation_time_statistics.rs (5)
5-6: Good update to imports for the detail level.

The import of Level1 from DetailsLevel enables cleaner code when checking telemetry detail levels.

43-45: Appropriate change to make total_duration_micros optional.

Making this field optional, with serialization skipped if it is None, allows for more control over telemetry detail levels.

137-142: Well-implemented handling of the optional total_duration_micros in the Add trait.

The pattern matching handles all possible combinations of the optional values correctly.
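As an illustration of what "all possible combinations" means here, a self-contained sketch of summing two optional counters (not the actual Add impl) is:

```rust
// Combine two optional totals: present values add up, a single present value
// is kept as-is, and two missing values stay missing.
fn add_optional(a: Option<u64>, b: Option<u64>) -> Option<u64> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x + y),
        (Some(x), None) | (None, Some(x)) => Some(x),
        (None, None) => None,
    }
}

fn main() {
    assert_eq!(add_optional(Some(2), Some(3)), Some(5));
    assert_eq!(add_optional(Some(2), None), Some(2));
    assert_eq!(add_optional(None, None), None);
}
```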
300-310: Good implementation of conditional field inclusion based on detail level.

The code now conditionally includes telemetry fields based on the detail level, using the `then_some` and `flatten` pattern effectively. This is a clean approach to implementing the detail levels.
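For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of gating an optional statistic on a detail level with `then_some(..).flatten()`; the function and parameter names are made up for illustration.

```rust
// Return the value only when the requested level is high enough; otherwise None.
fn gated_value(level: usize, min_level: usize, value: Option<u64>) -> Option<u64> {
    (level >= min_level).then_some(value).flatten()
}

fn main() {
    assert_eq!(gated_value(3, 1, Some(42)), Some(42)); // included at high detail
    assert_eq!(gated_value(0, 1, Some(42)), None);     // suppressed at low detail
}
```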
43-45: ✅ Verification successful

Verify backward compatibility with API consumers. The change from a required `u64` to an optional `Option<u64>` for `total_duration_micros` might affect consumers of this API. Ensure that any code directly accessing this field has been updated to handle the `Option` type.

Also applies to: 137-142, 305-310
🏁 Script executed:
 
```bash
#!/bin/bash
# Search for direct references to total_duration_micros field
rg -A 2 "\.total_duration_micros"
```

Length of output: 591
API Consumer Backward Compatibility Verified
The recent changes converting `total_duration_micros` from a required `u64` to an optional `Option<u64>` have been handled internally. Our search confirms the following:

- In src/common/metrics.rs: the field is accessed via `stat.total_duration_micros.unwrap_or(0)`, which correctly defaults to 0 when the value is None.
- In lib/segment/src/common/operation_time_statistics.rs: the code now uses pattern matching on `(self.total_duration_micros, other.total_duration_micros)` to handle all Option cases appropriately.

While these updates ensure proper handling within the codebase, please ensure that any external API consumers are aware of the change in serialization (i.e., the absence of the key when the value is None) and have updated their integrations accordingly.

lib/collection/src/shards/telemetry.rs (4)
4-5: Improved anonymization import for better flexibility.

The explicit import of `anonymize_collection_with_u64_hashable_key` enables more specific anonymization of collections with u64 keys.

22-23: Enhanced anonymization for replica states.

Using `anonymize_collection_with_u64_hashable_key` for the replicate_states field provides better anonymization while still maintaining the ability to identify related peers by their hashed IDs.

43-60: Well-documented addition of size metrics to LocalShardTelemetry.

The new optional fields provide valuable size and count metrics, with appropriate warnings about their approximate nature. The documentation clearly indicates these values should not be strictly relied upon.

71-72: Good change to make optimizer logs optional.

Making the `log` field optional in OptimizerTelemetry, with serialization skipped if it is None, helps reduce response size when detailed logs aren't needed.
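A small, self-contained sketch of the serialization behavior being praised, an optional field that is omitted from the JSON entirely when it is None, is shown below; the struct is a stand-in rather than the real OptimizerTelemetry, and it assumes the serde and serde_json crates.

```rust
use serde::Serialize;

#[derive(Serialize)]
struct TelemetrySketch {
    status: String,
    // When `log` is None, the key is absent from the serialized output
    // instead of appearing as `"log": null`.
    #[serde(skip_serializing_if = "Option::is_none")]
    log: Option<Vec<String>>,
}

fn main() {
    let t = TelemetrySketch {
        status: "ok".to_string(),
        log: None,
    };
    // Prints {"status":"ok"} with no `log` key at all.
    println!("{}", serde_json::to_string(&t).unwrap());
}
```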
src/common/telemetry_ops/collections_telemetry.rs (2)
1-1: Updated import to use the new aggregated telemetry structure.

Importing CollectionsAggregatedTelemetry from the collection telemetry module is a good refactoring approach.

28-40: Improved conditional telemetry collection based on detail level.

The implementation now properly branches based on the telemetry detail level, using the appropriate method for each level. For Level2 and above, it collects full telemetry data, while for lower levels it fetches aggregated data. This approach is more efficient than always fetching full telemetry and then converting.
lib/collection/src/shards/local_shard/telemetry.rs (2)
68-76: Well-implemented optimizer status retrieval method.

This method correctly abstracts the optimizer status logic that was previously embedded in the `get_telemetry_data` method. The lock handling is appropriate, acquiring the lock only for the duration needed to check the optimizer errors.

78-96: Good implementation of size statistics aggregation.

The `get_size_stats` method efficiently aggregates size statistics across all segments. The lock handling is appropriate, and the method returns accumulated statistics correctly.
```rust
/// Minimal information level
/// - app info
/// - minimal telemetry by endpoint
/// - cluster status
Level0,
/// Detailed common info level
/// - app info details
/// - system info
/// - hardware flags
/// - hardware usage per collection
/// - RAM usage
/// - cluster basic details
/// - collections basic info
Level1,
/// Detailed consensus info - peers info
/// Collections:
/// - detailed config
/// - Shards - basic config
Level2,
/// Shards:
/// - detailed config
/// - Optimizers info
Level3,
/// Segment level telemetry
Level4,
```
Break-down is here
Might it also make sense to include the dev feature flags into Level1?
Force-pushed from d4f95a6 to 69faf8d.
The only relevant change I'd like is to not output an empty list for "shards" when details_level < 3. I made a PR (#6398) to fix this, and the same for "resharding" and "transfers". I didn't commit directly in case we want to keep these two others as basic shards info.
Squashed commit message:

- allow shard states in anonymize telemetry (with hashed peer ids)
- introduce level3 and level4 for telemetry, where level4 = everything, incl. segments info
- upd openapi
- fix tests
- expose vector count & size stats on shard level to avoid reading of segments
- fix spelling
- upd schema
- fix tests
- Use unwrap_or_default
- [qdrant#6390] skip serializing shard details in Level2 (qdrant#6398)
  - skip serializing shard details in Level2
  - upd openapi

Co-authored-by: generall <andrey@vasnetsov.com>
Co-authored-by: Tim Visée <tim+github@visee.me>
Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
Main idea: for automated tools like the Cluster Manager, it should be enough to work with level 3. Level 4 is quite expensive in terms of request size and should only be required for in-depth performance debugging.