-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Anonymize ShardKey in ReplicaSetTelemetry #6115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…y and CollectionsAggregatedTelemetry)
… ClusterStatusTelemetry)
📝 WalkthroughWalkthroughThis pull request standardizes data anonymization by implementing and updating the ✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/common/telemetry_ops/hardware.rs (1)
41-43
: Improved code conciseness by leveraging HashMap's anonymize implementationThe implementation has been simplified to use the
anonymize
method on thecollection_data
HashMap directly, rather than manually iterating and collecting.This change improves code readability and maintainability by leveraging a pre-existing implementation. The behavior should be equivalent to the previous implementation (which likely mapped over each key-value pair, anonymizing keys and possibly cloning values).
However, since we're now using
anonymize()
on the entire HashMap, I should verify whether the values (HardwareUsage
) are also being anonymized properly:#!/bin/bash # Check if HashMap<K, V> has an Anonymize implementation where both K and V are anonymized rg -A 10 "impl.*Anonymize.*for.*HashMap" --type rust # Also check how HardwareUsage is anonymized in other places rg -A 5 "HardwareUsage.*anonymize" --type rustsrc/common/telemetry_ops/cluster_telemetry.rs (2)
127-128
: Code simplification improves maintainability.The code has been simplified by directly calling
anonymize()
on the Option fields instead of using the previous approach withclone().map(|x| x.anonymize())
. This is a cleaner implementation that relies on the fact thatOption<T>
implementsAnonymize
whenT
does.
145-145
: Direct anonymization of consensus_thread_status eliminates unnecessary cloning.The change avoids redundant cloning by directly anonymizing the field rather than cloning it first. This is a good practice as it improves performance by reducing unnecessary memory allocations, especially for potentially large or complex structures.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
lib/api/src/rest/models.rs
(2 hunks)lib/collection/src/collection_manager/optimizers/mod.rs
(2 hunks)lib/collection/src/operations/types.rs
(2 hunks)lib/collection/src/shards/telemetry.rs
(3 hunks)lib/collection/src/telemetry.rs
(1 hunks)lib/segment/src/common/operation_time_statistics.rs
(1 hunks)lib/segment/src/types.rs
(7 hunks)lib/storage/src/types.rs
(2 hunks)src/common/telemetry_ops/cluster_telemetry.rs
(2 hunks)src/common/telemetry_ops/collections_telemetry.rs
(1 hunks)src/common/telemetry_ops/hardware.rs
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: test-consensus
- GitHub Check: test (macos-latest)
- GitHub Check: test (windows-latest)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
🔇 Additional comments (19)
lib/api/src/rest/models.rs (1)
94-104
: Implementation is appropriate for hardware metricsThe "no-op" implementation of
Anonymize
forHardwareUsage
is consistent with the PR's stated goal of ensuring that structures which currently do not require anonymization are still explicitly defined with theAnonymize
trait.This implementation correctly follows the pattern mentioned in the PR objectives, placing the responsibility of anonymization within the structure itself rather than relying on external holders. Hardware usage metrics like CPU and I/O operations don't contain sensitive information, so a simple clone is appropriate.
src/common/telemetry_ops/collections_telemetry.rs (1)
103-103
: Properly delegates anonymization to the optimizers_status fieldThe change properly replaces a direct clone with a call to the
anonymize
method on theoptimizers_status
field.This change ensures proper anonymization of the
optimizers_status
field by delegating to its ownAnonymize
implementation rather than just cloning it. This is consistent with the overall anonymization strategy being implemented across the codebase.lib/segment/src/common/operation_time_statistics.rs (1)
105-108
: Implementation now properly handles numeric statistics for anonymization.This change improves clarity by explicitly including the numeric statistics fields in the anonymized result rather than relying on the spread operator. Since these are numerical statistics without sensitive information (averages, minimums, maximums, and totals), direct copying is appropriate.
lib/collection/src/telemetry.rs (2)
52-52
: Properly anonymize collection parameters.Good change replacing direct cloning with anonymization. This ensures that any potentially sensitive information in collection parameters (like vector names or custom metadata) gets properly anonymized.
57-57
: Properly anonymize strict mode configuration.Good change applying proper anonymization to the strict mode configuration instead of direct cloning, ensuring any sensitive settings are protected.
lib/storage/src/types.rs (2)
277-277
: Consistent application of anonymization.Good improvement using the newly implemented anonymize method for the consensus thread status instead of cloning. This maintains consistency with the anonymization approach across the codebase.
234-244
:✅ Verification successful
Appropriate implementation of Anonymize for ConsensusThreadStatus.
This implementation correctly handles different variants of the ConsensusThreadStatus enum:
- For Working: Preserves the timestamp which is not sensitive
- For Stopped: Simple pass-through as there's no data to anonymize
- For StoppedWithErr: Clones the error message
Note that error messages might contain sensitive information like file paths, but this implementation preserves them. Consider whether error messages should be further anonymized in the future.
🏁 Script executed:
#!/bin/bash # Check for instances where error messages might contain sensitive information # Look for cases where file paths might be included in error messages rg -g "*.rs" "StoppedWithErr.*\{.*err.*path|StoppedWithErr.*\{.*err.*file"Length of output: 76
Implementation Verified – No Sensitive Data Found
The code for
ConsensusThreadStatus::anonymize
appears appropriate. Our additional search for file path or filename mentions within error messages produced no results, which confirms that there aren’t any obvious leaks of sensitive file paths. However, please note that cloning error messages might still preserve other sensitive information. Consider this factor when future requirements for anonymization arise.
- The
Working
variant correctly preserves non-sensitive timestamp data.- The
Stopped
variant appropriately remains unchanged.- The
StoppedWithErr
variant clones error messages without modification, and our check did not reveal exposure of file paths or filenames.lib/collection/src/shards/telemetry.rs (3)
54-54
: Properly anonymize optimizer status.Good change replacing direct cloning with proper anonymization for the optimizer status field. This provides more comprehensive protection for potentially sensitive information.
79-79
: Properly anonymize tracker status.Good improvement applying anonymization to the tracker status instead of directly cloning. This ensures consistent handling of potentially sensitive status information.
101-101
: Properly anonymize ShardKey as mentioned in PR objectives.This change directly implements the PR goal of anonymizing string shard keys. ShardKeys can contain user-defined values that should be protected when telemetry data is collected or transmitted.
lib/collection/src/operations/types.rs (2)
126-133
: Good implementation of the Anonymize trait for OptimizersStatus.The implementation correctly handles both variants while preserving the error information. This follows best practices by properly implementing anonymization for error messages, which might contain sensitive information.
824-825
: Improved anonymization for the VectorsConfig's Single variant.The change properly propagates anonymization to the parameters object rather than just cloning it. This ensures that any sensitive information in vector parameters is properly anonymized.
lib/segment/src/types.rs (7)
27-27
: Clean import addition.Added the
Anonymize
trait import to implement anonymization functionality throughout this file.
714-720
: Good implementation of theAnonymize
trait forStrictModeSparse
.The implementation correctly creates a new instance with the same
max_length
value. Sincemax_length
is a numeric type (Option<usize>
), it doesn't need special anonymization handling.
738-744
: Good implementation of theAnonymize
trait forStrictModeSparseConfig
.The implementation properly delegates to the
anonymize
method of theconfig
field, ensuring that vector names in the BTreeMap are anonymized.
756-762
: Good implementation of theAnonymize
trait forStrictModeMultivector
.Similar to
StrictModeSparse
, the implementation creates a new instance with the samemax_vectors
value, which doesn't require anonymization.
781-787
: Good implementation of theAnonymize
trait forStrictModeMultivectorConfig
.The implementation properly delegates to the
anonymize
method of theconfig
field, ensuring that any sensitive information in the BTreeMap is anonymized.
918-941
: Comprehensive implementation of theAnonymize
trait forStrictModeConfig
.The implementation creates a new instance with all primitive fields copied directly, while properly anonymizing the nested config fields (
multivector_config
andsparse_config
). This ensures that only fields containing sensitive information are anonymized while preserving the structure.
4181-4188
: Correct implementation of theAnonymize
trait forShardKey
.The implementation properly handles both variants:
- For
Keyword
variant - Calls theanonymize
method on the string content, ensuring sensitive data is properly anonymized- For
Number
variant - Simply copies the numeric value, as numbers typically don't contain sensitive informationThis implementation aligns with the PR objective of adding missing anonymizations for string shard keys.
impl Anonymize for TrackerStatus { | ||
fn anonymize(&self) -> Self { | ||
match self { | ||
Self::Optimizing => Self::Optimizing, | ||
Self::Done => Self::Done, | ||
Self::Cancelled(e) => Self::Cancelled(e.clone()), | ||
Self::Error(e) => Self::Error(e.clone()), | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
❓ Verification inconclusive
Consider stronger anonymization for error messages
The current implementation simply clones the error messages without any actual anonymization. However, as mentioned in the PR objectives, error messages might contain sensitive information such as file paths.
Given that the PR aims to enhance anonymization processes, you might want to implement a more comprehensive anonymization strategy for the error strings in TrackerStatus::Cancelled
and TrackerStatus::Error
variants. For example:
impl Anonymize for TrackerStatus {
fn anonymize(&self) -> Self {
match self {
Self::Optimizing => Self::Optimizing,
Self::Done => Self::Done,
- Self::Cancelled(e) => Self::Cancelled(e.clone()),
- Self::Error(e) => Self::Error(e.clone()),
+ Self::Cancelled(e) => Self::Cancelled(anonymize_error_message(e)),
+ Self::Error(e) => Self::Error(anonymize_error_message(e)),
}
}
}
// Helper function to anonymize potential sensitive info in error messages
fn anonymize_error_message(message: &str) -> String {
// Implement logic to remove file paths, credentials, etc.
// For example, using regex to replace patterns like /home/user/... with [REDACTED_PATH]
message.clone()
}
Let me verify if there's a standard way of handling error message anonymization in the codebase:
🏁 Script executed:
#!/bin/bash
# Look for examples of error message anonymization in the codebase
rg -A 3 -B 3 "anonymize.*error|error.*anonymize" --type rust
Length of output: 62
Stronger Error Message Anonymization Required
The current implementation for TrackerStatus
simply clones error messages for the Cancelled
and Error
variants, which risks exposing sensitive details (e.g., file paths). As our PR aims to improve anonymization processes, please implement a dedicated mechanism to scrub sensitive data from these messages. It appears that no standard anonymization function exists elsewhere in the codebase, so a new helper (e.g., anonymize_error_message
) must be defined and used.
-
Location:
lib/collection/src/collection_manager/optimizers/mod.rs
(lines 180–189) -
Suggested Change:
impl Anonymize for TrackerStatus { fn anonymize(&self) -> Self { match self { Self::Optimizing => Self::Optimizing, Self::Done => Self::Done, - Self::Cancelled(e) => Self::Cancelled(e.clone()), - Self::Error(e) => Self::Error(e.clone()), + Self::Cancelled(e) => Self::Cancelled(anonymize_error_message(e)), + Self::Error(e) => Self::Error(anonymize_error_message(e)), } } } // Helper function for scrubbing sensitive data from error messages fn anonymize_error_message(message: &str) -> String { // TODO: Implement logic (e.g., regex replacement of file paths, credentials, etc.) message.to_string() // Placeholder }
Please ensure the new helper robustly removes any sensitive information and aligns with our overall anonymization strategy.
* Anonymize ShardKey in ReplicaSetTelemetry * Anonymize vector names in CollectionConfigInternal * no-op: impl HardwareUsage for HardwareUsage (used in HardwareTelemetry) * no-op: impl Anonymize for OptimizersStatus (used in OptimizerTelemetry and CollectionsAggregatedTelemetry) * no-op: impl Anonymize for TrackerStatus in (used in TrackerTelemetry) * no-op: Anonymize VectorParams in VectorsConfig * no-op: Copy fields explicitly in OperationDurationStatistics * no-op: `.clone().map(|x| x.anonymize())` -> `.anonymize()` * no-op: impl Anonymize for ConsensusThreadStatus (used in PeerInfo and ClusterStatusTelemetry)
I've audited all
Anonymize
implementations and added a few missing anonymizations:CollectionConfigInternal::{params, strict_mode_config}
.No-op implementations
Also, in this PR, I've added a few no-op
Anonymize
implementations. The reasoning is explained below.In our codebase, we just call
.clone()
for structures that do not require anonymization, like this:However, even if we don't need to anonymize the contents of
SomeStructure
right now, it is possible that we could add some sensitive field in the future. Such additions could be slip by unnoticed during code review.A better approach is to implement
Anonymize
explicitly, even if the result is a clone. Let the anonymization (or lack thereof) be the responsibility of the structure itself, not of the holders of that structure, like this:In this PR, commits with such changes are prefixed as
no-op
.And I suggest to treat every
.clone()
insideAnonymize
as suspicious.Error messages
These no-op changes revealed that we do not anonymize error messages. Potentially, some sensitive strings could be included in them (e.g., file paths containing collection names). I'm not sure whether we should anonymize/clear them, or keep them as is. However, if we keep them, I believe, it should be stated explicitly. In this PR, they're explicitly cloned (but behavior is unchanged).