Anonymize ShardKey in ReplicaSetTelemetry #6115

xzfc · 2025-03-05T17:11:53Z

I've audited all Anonymize implementations and added a few missing anonymizations:

Vector names inside CollectionConfigInternal::{params, strict_mode_config}.
String shard keys.

No-op implementations

Also, in this PR, I've added a few no-op Anonymize implementations. The reasoning is explained below.

In our codebase, we just call .clone() for structures that do not require anonymization, like this:

struct Foo {
   some_structure: SomeStructure,
   …
}
impl Anonymize for Foo {
   fn anonymize(&self) -> Self {
      Self {
          some_structure: self.some_structure.clone(), // !
          …
      }
   }
}

However, even if we don't need to anonymize the contents of SomeStructure right now, it is possible that we could add some sensitive field in the future. Such additions could be slip by unnoticed during code review.

A better approach is to implement Anonymize explicitly, even if the result is a clone. Let the anonymization (or lack thereof) be the responsibility of the structure itself, not of the holders of that structure, like this:

impl Anonymize for Foo {
   fn anonymize(&self) -> Self {
      Self {
          some_structure: self.some_structure.anonymize(), // !
          …
      }
   }
}

impl Anonymize for SomeStructure {
   fn anonymize(&self) -> Self {
      // no-op
      Self {
          field1: self.field1,
          field2: self.field2,
          …
      }
   }
}

In this PR, commits with such changes are prefixed as no-op.
And I suggest to treat every .clone() inside Anonymize as suspicious.

Error messages

These no-op changes revealed that we do not anonymize error messages. Potentially, some sensitive strings could be included in them (e.g., file paths containing collection names). I'm not sure whether we should anonymize/clear them, or keep them as is. However, if we keep them, I believe, it should be stated explicitly. In this PR, they're explicitly cloned (but behavior is unchanged).

…y and CollectionsAggregatedTelemetry)

… ClusterStatusTelemetry)

coderabbitai · 2025-03-05T17:15:34Z

📝 Walkthrough

Walkthrough

This pull request standardizes data anonymization by implementing and updating the Anonymize trait across multiple modules. New implementations are added for various structs and enums such as HardwareUsage, TrackerStatus, OptimizersStatus, and several telemetry and configuration types. In many cases, fields that were previously cloned or spread are now explicitly anonymized via direct calls to the anonymize method. Adjustments are made to ensure that sensitive fields—ranging from error messages and timestamps to configuration parameters—are processed consistently. These changes are applied in different parts of the project including API models, collection managers, telemetry structures, segment configurations, and storage types. The overall modification reflects a uniform approach to handling sensitive data by replacing basic cloning with a structured anonymization procedure.

✨ Finishing Touches

📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

src/common/telemetry_ops/hardware.rs (1)
41-43: Improved code conciseness by leveraging HashMap's anonymize implementation

The implementation has been simplified to use the anonymize method on the collection_data HashMap directly, rather than manually iterating and collecting.

This change improves code readability and maintainability by leveraging a pre-existing implementation. The behavior should be equivalent to the previous implementation (which likely mapped over each key-value pair, anonymizing keys and possibly cloning values).

However, since we're now using anonymize() on the entire HashMap, I should verify whether the values (HardwareUsage) are also being anonymized properly:
#!/bin/bash
# Check if HashMap<K, V> has an Anonymize implementation where both K and V are anonymized
rg -A 10 "impl.*Anonymize.*for.*HashMap" --type rust

# Also check how HardwareUsage is anonymized in other places
rg -A 5 "HardwareUsage.*anonymize" --type rust
src/common/telemetry_ops/cluster_telemetry.rs (2)

127-128: Code simplification improves maintainability.

The code has been simplified by directly calling anonymize() on the Option fields instead of using the previous approach with clone().map(|x| x.anonymize()). This is a cleaner implementation that relies on the fact that Option<T> implements Anonymize when T does.

145-145: Direct anonymization of consensus_thread_status eliminates unnecessary cloning.

The change avoids redundant cloning by directly anonymizing the field rather than cloning it first. This is a good practice as it improves performance by reducing unnecessary memory allocations, especially for potentially large or complex structures.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ced9dbe and 7b60ce6.

📒 Files selected for processing (11)

lib/api/src/rest/models.rs (2 hunks)
lib/collection/src/collection_manager/optimizers/mod.rs (2 hunks)
lib/collection/src/operations/types.rs (2 hunks)
lib/collection/src/shards/telemetry.rs (3 hunks)
lib/collection/src/telemetry.rs (1 hunks)
lib/segment/src/common/operation_time_statistics.rs (1 hunks)
lib/segment/src/types.rs (7 hunks)
lib/storage/src/types.rs (2 hunks)
src/common/telemetry_ops/cluster_telemetry.rs (2 hunks)
src/common/telemetry_ops/collections_telemetry.rs (1 hunks)
src/common/telemetry_ops/hardware.rs (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-consistency
GitHub Check: test-consensus-compose
GitHub Check: test-consensus
GitHub Check: test (macos-latest)
GitHub Check: test (windows-latest)
GitHub Check: test
GitHub Check: test
GitHub Check: test (ubuntu-latest)
GitHub Check: test

🔇 Additional comments (19)

lib/api/src/rest/models.rs (1)

94-104: Implementation is appropriate for hardware metrics

The "no-op" implementation of Anonymize for HardwareUsage is consistent with the PR's stated goal of ensuring that structures which currently do not require anonymization are still explicitly defined with the Anonymize trait.

This implementation correctly follows the pattern mentioned in the PR objectives, placing the responsibility of anonymization within the structure itself rather than relying on external holders. Hardware usage metrics like CPU and I/O operations don't contain sensitive information, so a simple clone is appropriate.

src/common/telemetry_ops/collections_telemetry.rs (1)

103-103: Properly delegates anonymization to the optimizers_status field

The change properly replaces a direct clone with a call to the anonymize method on the optimizers_status field.

This change ensures proper anonymization of the optimizers_status field by delegating to its own Anonymize implementation rather than just cloning it. This is consistent with the overall anonymization strategy being implemented across the codebase.

lib/segment/src/common/operation_time_statistics.rs (1)

105-108: Implementation now properly handles numeric statistics for anonymization.

This change improves clarity by explicitly including the numeric statistics fields in the anonymized result rather than relying on the spread operator. Since these are numerical statistics without sensitive information (averages, minimums, maximums, and totals), direct copying is appropriate.

lib/collection/src/telemetry.rs (2)

52-52: Properly anonymize collection parameters.

Good change replacing direct cloning with anonymization. This ensures that any potentially sensitive information in collection parameters (like vector names or custom metadata) gets properly anonymized.

57-57: Properly anonymize strict mode configuration.

Good change applying proper anonymization to the strict mode configuration instead of direct cloning, ensuring any sensitive settings are protected.
lib/storage/src/types.rs (2)

277-277: Consistent application of anonymization.

Good improvement using the newly implemented anonymize method for the consensus thread status instead of cloning. This maintains consistency with the anonymization approach across the codebase.

234-244:
✅ Verification successful

Appropriate implementation of Anonymize for ConsensusThreadStatus.

This implementation correctly handles different variants of the ConsensusThreadStatus enum:

For Working: Preserves the timestamp which is not sensitive

For Stopped: Simple pass-through as there's no data to anonymize

For StoppedWithErr: Clones the error message

Note that error messages might contain sensitive information like file paths, but this implementation preserves them. Consider whether error messages should be further anonymized in the future.

🏁 Script executed:
#!/bin/bash
# Check for instances where error messages might contain sensitive information

# Look for cases where file paths might be included in error messages
rg -g "*.rs" "StoppedWithErr.*\{.*err.*path|StoppedWithErr.*\{.*err.*file"
Length of output: 76
Implementation Verified – No Sensitive Data Found

The code for ConsensusThreadStatus::anonymize appears appropriate. Our additional search for file path or filename mentions within error messages produced no results, which confirms that there aren’t any obvious leaks of sensitive file paths. However, please note that cloning error messages might still preserve other sensitive information. Consider this factor when future requirements for anonymization arise.

The Working variant correctly preserves non-sensitive timestamp data.

The Stopped variant appropriately remains unchanged.

The StoppedWithErr variant clones error messages without modification, and our check did not reveal exposure of file paths or filenames.
lib/collection/src/shards/telemetry.rs (3)

54-54: Properly anonymize optimizer status.

Good change replacing direct cloning with proper anonymization for the optimizer status field. This provides more comprehensive protection for potentially sensitive information.

79-79: Properly anonymize tracker status.

Good improvement applying anonymization to the tracker status instead of directly cloning. This ensures consistent handling of potentially sensitive status information.

101-101: Properly anonymize ShardKey as mentioned in PR objectives.

This change directly implements the PR goal of anonymizing string shard keys. ShardKeys can contain user-defined values that should be protected when telemetry data is collected or transmitted.

lib/collection/src/operations/types.rs (2)

126-133: Good implementation of the Anonymize trait for OptimizersStatus.

The implementation correctly handles both variants while preserving the error information. This follows best practices by properly implementing anonymization for error messages, which might contain sensitive information.

824-825: Improved anonymization for the VectorsConfig's Single variant.

The change properly propagates anonymization to the parameters object rather than just cloning it. This ensures that any sensitive information in vector parameters is properly anonymized.

lib/segment/src/types.rs (7)

27-27: Clean import addition.

Added the Anonymize trait import to implement anonymization functionality throughout this file.

714-720: Good implementation of the Anonymize trait for StrictModeSparse.

The implementation correctly creates a new instance with the same max_length value. Since max_length is a numeric type (Option<usize>), it doesn't need special anonymization handling.

738-744: Good implementation of the Anonymize trait for StrictModeSparseConfig.

The implementation properly delegates to the anonymize method of the config field, ensuring that vector names in the BTreeMap are anonymized.

756-762: Good implementation of the Anonymize trait for StrictModeMultivector.

Similar to StrictModeSparse, the implementation creates a new instance with the same max_vectors value, which doesn't require anonymization.

781-787: Good implementation of the Anonymize trait for StrictModeMultivectorConfig.

The implementation properly delegates to the anonymize method of the config field, ensuring that any sensitive information in the BTreeMap is anonymized.

918-941: Comprehensive implementation of the Anonymize trait for StrictModeConfig.

The implementation creates a new instance with all primitive fields copied directly, while properly anonymizing the nested config fields (multivector_config and sparse_config). This ensures that only fields containing sensitive information are anonymized while preserving the structure.

4181-4188: Correct implementation of the Anonymize trait for ShardKey.

The implementation properly handles both variants:

For Keyword variant - Calls the anonymize method on the string content, ensuring sensitive data is properly anonymized

For Number variant - Simply copies the numeric value, as numbers typically don't contain sensitive information

This implementation aligns with the PR objective of adding missing anonymizations for string shard keys.

coderabbitai · 2025-03-05T17:15:37Z

lib/collection/src/collection_manager/optimizers/mod.rs

+impl Anonymize for TrackerStatus {
+    fn anonymize(&self) -> Self {
+        match self {
+            Self::Optimizing => Self::Optimizing,
+            Self::Done => Self::Done,
+            Self::Cancelled(e) => Self::Cancelled(e.clone()),
+            Self::Error(e) => Self::Error(e.clone()),
+        }
+    }
+}


💡 Verification agent

❓ Verification inconclusive

Consider stronger anonymization for error messages

The current implementation simply clones the error messages without any actual anonymization. However, as mentioned in the PR objectives, error messages might contain sensitive information such as file paths.

Given that the PR aims to enhance anonymization processes, you might want to implement a more comprehensive anonymization strategy for the error strings in TrackerStatus::Cancelled and TrackerStatus::Error variants. For example:

impl Anonymize for TrackerStatus { fn anonymize(&self) -> Self { match self { Self::Optimizing => Self::Optimizing, Self::Done => Self::Done, - Self::Cancelled(e) => Self::Cancelled(e.clone()), - Self::Error(e) => Self::Error(e.clone()), + Self::Cancelled(e) => Self::Cancelled(anonymize_error_message(e)), + Self::Error(e) => Self::Error(anonymize_error_message(e)), } } } // Helper function to anonymize potential sensitive info in error messages fn anonymize_error_message(message: &str) -> String { // Implement logic to remove file paths, credentials, etc. // For example, using regex to replace patterns like /home/user/... with [REDACTED_PATH] message.clone() }

Let me verify if there's a standard way of handling error message anonymization in the codebase:

🏁 Script executed:

#!/bin/bash # Look for examples of error message anonymization in the codebase rg -A 3 -B 3 "anonymize.*error|error.*anonymize" --type rust

Length of output: 62

Stronger Error Message Anonymization Required

The current implementation for TrackerStatus simply clones error messages for the Cancelled and Error variants, which risks exposing sensitive details (e.g., file paths). As our PR aims to improve anonymization processes, please implement a dedicated mechanism to scrub sensitive data from these messages. It appears that no standard anonymization function exists elsewhere in the codebase, so a new helper (e.g., anonymize_error_message) must be defined and used.

Location: lib/collection/src/collection_manager/optimizers/mod.rs (lines 180–189)

Suggested Change:

impl Anonymize for TrackerStatus { fn anonymize(&self) -> Self { match self { Self::Optimizing => Self::Optimizing, Self::Done => Self::Done, - Self::Cancelled(e) => Self::Cancelled(e.clone()), - Self::Error(e) => Self::Error(e.clone()), + Self::Cancelled(e) => Self::Cancelled(anonymize_error_message(e)), + Self::Error(e) => Self::Error(anonymize_error_message(e)), } } } // Helper function for scrubbing sensitive data from error messages fn anonymize_error_message(message: &str) -> String { // TODO: Implement logic (e.g., regex replacement of file paths, credentials, etc.) message.to_string() // Placeholder }

Please ensure the new helper robustly removes any sensitive information and aligns with our overall anonymization strategy.

* Anonymize ShardKey in ReplicaSetTelemetry * Anonymize vector names in CollectionConfigInternal * no-op: impl HardwareUsage for HardwareUsage (used in HardwareTelemetry) * no-op: impl Anonymize for OptimizersStatus (used in OptimizerTelemetry and CollectionsAggregatedTelemetry) * no-op: impl Anonymize for TrackerStatus in (used in TrackerTelemetry) * no-op: Anonymize VectorParams in VectorsConfig * no-op: Copy fields explicitly in OperationDurationStatistics * no-op: `.clone().map(|x| x.anonymize())` -> `.anonymize()` * no-op: impl Anonymize for ConsensusThreadStatus (used in PeerInfo and ClusterStatusTelemetry)

xzfc added 9 commits March 5, 2025 15:51

Anonymize ShardKey in ReplicaSetTelemetry

87c7ef3

Anonymize vector names in CollectionConfigInternal

ac14394

no-op: impl HardwareUsage for HardwareUsage (used in HardwareTelemetry)

675db66

no-op: impl Anonymize for OptimizersStatus (used in OptimizerTelemetr…

82c7ae8

…y and CollectionsAggregatedTelemetry)

no-op: impl Anonymize for TrackerStatus in (used in TrackerTelemetry)

55bff2f

no-op: Anonymize VectorParams in VectorsConfig

e1476c8

no-op: Copy fields explicitly in OperationDurationStatistics

335104b

no-op: .clone().map(|x| x.anonymize()) -> .anonymize()

54123a0

no-op: impl Anonymize for ConsensusThreadStatus (used in PeerInfo and…

7b60ce6

… ClusterStatusTelemetry)

xzfc requested review from timvisee and generall March 5, 2025 17:11

coderabbitai bot reviewed Mar 5, 2025

View reviewed changes

github-actions bot mentioned this pull request Mar 5, 2025

Flaky test hnsw_discover_test::hnsw_discover_precision #2973

Open

generall approved these changes Mar 5, 2025

View reviewed changes

generall merged commit 749b9b2 into dev Mar 5, 2025
17 checks passed

generall deleted the missing-anonymize branch March 5, 2025 18:04

coderabbitai bot mentioned this pull request Mar 5, 2025

Separate StrictModeConfig output structure #6114

Merged

xzfc mentioned this pull request Mar 12, 2025

Add Anonymize macro #6160

Merged

coderabbitai bot mentioned this pull request Mar 20, 2025

dont anonymize shard transfers #6208

Merged

timvisee mentioned this pull request Mar 21, 2025

Bump version to 1.13.5 #6223

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Anonymize ShardKey in ReplicaSetTelemetry #6115

Anonymize ShardKey in ReplicaSetTelemetry #6115

Uh oh!

xzfc commented Mar 5, 2025

Uh oh!

coderabbitai bot commented Mar 5, 2025

Walkthrough

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 5, 2025

Uh oh!

Uh oh!

Uh oh!

Anonymize ShardKey in ReplicaSetTelemetry #6115

Anonymize ShardKey in ReplicaSetTelemetry #6115

Uh oh!

Conversation

xzfc commented Mar 5, 2025

No-op implementations

Error messages

Uh oh!

coderabbitai bot commented Mar 5, 2025

Walkthrough

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

CodeRabbit Configuration File (`.coderabbit.yaml`)