-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Add Anonymize macro #6160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Anonymize macro #6160
Conversation
📝 WalkthroughWalkthroughThis pull request integrates a unified anonymization approach across the project. A new procedural macro for the Suggested reviewers
Tip ⚡🧪 Multi-step agentic review comment chat (experimental)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/common/telemetry_ops/requests_telemetry.rs (1)
171-181
: Retained manual implementation for WebApiTelemetry.The manual implementation for
WebApiTelemetry
is retained but now uses theanonymize_collection_values
utility function. Consider if this could also be replaced with a derive implementation using attributes.Consider replacing this manual implementation with a derive-based approach:
-impl Anonymize for WebApiTelemetry { - fn anonymize(&self) -> Self { - let responses = self - .responses - .iter() - .map(|(key, value)| (key.clone(), anonymize_collection_values(value))) - .collect(); - - WebApiTelemetry { responses } - } -}And add the derive and attribute to the struct:
-#[derive(Serialize, Clone, Default, Debug, JsonSchema)] +#[derive(Serialize, Clone, Default, Debug, JsonSchema, Anonymize)] pub struct WebApiTelemetry { + #[anonymize(with = anonymize_collection_values)] pub responses: HashMap<String, HashMap<HttpStatusCode, OperationDurationStatistics>>, }lib/collection/src/shards/telemetry.rs (1)
36-47
: Double-check the sensitivity ofvariant_name
.Using
#[anonymize(false)]
onvariant_name
might inadvertently leak environment or runtime details. If that is intentional, no action needed; otherwise, consider anonymizing it.src/common/telemetry_ops/app_telemetry.rs (1)
51-65
: Explicit non-anonymization ofname
andversion
.Retaining application name and version is usually fine for telemetry. Just confirm these do not inadvertently contain environment-specific identifiers.
lib/segment/src/common/operation_time_statistics.rs (1)
105-110
: Fix the typo in the function name.The function name has a typo: "historgram" should be "histogram".
-fn anonymize_historgram(histogram: &[(f32, usize)]) -> Vec<(f32, usize)> { +fn anonymize_histogram(histogram: &[(f32, usize)]) -> Vec<(f32, usize)> {Don't forget to update the reference in the annotation on line 53 as well:
-#[anonymize(with = anonymize_historgram)] +#[anonymize(with = anonymize_histogram)]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (26)
Cargo.toml
(1 hunks)lib/api/src/rest/models.rs
(1 hunks)lib/collection/src/collection_manager/optimizers/mod.rs
(2 hunks)lib/collection/src/config.rs
(4 hunks)lib/collection/src/operations/types.rs
(8 hunks)lib/collection/src/optimizers_builder.rs
(2 hunks)lib/collection/src/shards/telemetry.rs
(2 hunks)lib/collection/src/telemetry.rs
(3 hunks)lib/macros/Cargo.toml
(1 hunks)lib/macros/src/anonymize.rs
(1 hunks)lib/macros/src/lib.rs
(1 hunks)lib/segment/Cargo.toml
(1 hunks)lib/segment/src/common/anonymize.rs
(3 hunks)lib/segment/src/common/operation_time_statistics.rs
(4 hunks)lib/segment/src/index/sparse_index/sparse_index_config.rs
(2 hunks)lib/segment/src/json_path/mod.rs
(1 hunks)lib/segment/src/telemetry.rs
(3 hunks)lib/segment/src/types.rs
(26 hunks)lib/storage/src/types.rs
(5 hunks)src/common/telemetry.rs
(1 hunks)src/common/telemetry_ops/app_telemetry.rs
(4 hunks)src/common/telemetry_ops/cluster_telemetry.rs
(2 hunks)src/common/telemetry_ops/collections_telemetry.rs
(1 hunks)src/common/telemetry_ops/hardware.rs
(1 hunks)src/common/telemetry_ops/memory_telemetry.rs
(1 hunks)src/common/telemetry_ops/requests_telemetry.rs
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus
- GitHub Check: test (windows-latest)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
🔇 Additional comments (93)
Cargo.toml (1)
256-256
: LGTM: New macros package added to workspace.The addition of the macros package to the workspace members is correctly implemented.
lib/segment/Cargo.toml (1)
113-113
: LGTM: Correctly added macros dependency.The segment module now depends on the new macros package, which will allow it to use the derive macro for the Anonymize trait.
lib/macros/src/lib.rs (1)
1-11
: LGTM: Well-structured procedural macro implementation.The
derive_anonymize
proc macro is properly implemented with good error handling. It converts the input TokenStream, processes it through the anonymize module, and returns either the result or a compile error.lib/macros/Cargo.toml (1)
1-21
: LGTM: Properly configured proc-macro crate.The Cargo.toml is correctly set up for a procedural macro with appropriate dependencies:
proc-macro = true
is set to indicate this is a procedural macro- Dependencies on proc-macro2, quote, and syn are included
- Workspace lints are properly inherited
This follows Rust best practices for procedural macro crates.
src/common/telemetry.rs (1)
33-36
: LGTM: Proper implementation of the Anonymize macro for TelemetryDataThe code correctly applies the
Anonymize
derive macro to theTelemetryData
struct, with an appropriate#[anonymize(false)]
annotation on theid
field to preserve its value during anonymization. This aligns with the PR objective of replacing manual implementations with the macro.lib/api/src/rest/models.rs (1)
52-54
: LGTM: Correctly implemented Anonymize for HardwareUsageThe implementation properly adds the
Anonymize
derive macro with#[anonymize(false)]
annotation on the entire struct, effectively creating a no-op implementation that preserves all data. This is consistent with the PR's goal of replacing manual implementations with the macro.lib/collection/src/optimizers_builder.rs (2)
5-5
: LGTM: Proper import of the Anonymize traitThe code correctly imports the
Anonymize
trait from the segment module.
24-25
: LGTM: Correctly implemented Anonymize for OptimizersConfigThe implementation properly adds the
Anonymize
derive macro with#[anonymize(false)]
annotation on the entire struct, which ensures configuration data isn't modified during anonymization processes.src/common/telemetry_ops/memory_telemetry.rs (1)
16-17
: LGTM: Properly implemented Anonymize for MemoryTelemetryThe implementation correctly adds the
Anonymize
derive macro with#[anonymize(false)]
annotation on the struct. This ensures memory telemetry data remains intact during anonymization, which is appropriate since this data doesn't contain sensitive information that would need anonymization.src/common/telemetry_ops/hardware.rs (1)
10-10
: Great use of the derivedAnonymize
trait here.By adding
#[derive(Anonymize)]
, the manual anonymization logic is effectively replaced with a concise and maintainable pattern. No issues identified.lib/segment/src/index/sparse_index/sparse_index_config.rs (3)
14-16
: Good adoption of theAnonymize
trait forSparseIndexType
.Deriving the trait streamlines anonymization and is aligned with the rest of the codebase. Looks solid.
46-48
: Consistent usage of derivedAnonymize
forSparseIndexConfig
.This derivation removes manual anonymization code and is consistent with the project’s move toward macro-based solutions.
54-54
: Clear use of#[anonymize(false)]
onfull_scan_threshold
.This choice accurately reflects that the field should not be anonymized, maintaining the data as-is. Implementation is correct.
lib/macros/src/anonymize.rs (3)
1-49
: Robust enumeration of possible attribute forms.The
AnonymizeAttr
enum and parsing logic elegantly handle different forms of the#[anonymize(...)]
attribute. This design thoroughly covers typical usage scenarios (boolean, value, function).
51-73
: Well-structured helper functions for parsing and generating anonymization expressions.
parse_attrs
andanonymize_expr
give clear, modular handling of attributes. This approach keeps the macro codebase highly maintainable and easy to extend.
75-341
: Comprehensive derivation logic with strong test coverage.
- The macro correctly handles structs, tuple structs, and enums, returning an appropriate compiler error for unions.
- The test suite is thorough and ensures accurate expansions for various attribute combinations.
No correctness, performance, or maintainability issues detected.
lib/segment/src/common/anonymize.rs (4)
6-6
: Good addition: Exposing the derive macro.Well done adding the public export of the Anonymize derive macro. This aligns with the PR objective of replacing manual implementations with the macro.
8-42
: Excellent documentation for the derive macro.The comprehensive documentation with usage examples and attribute explanations makes the macro easy to understand and use. This is particularly important for a derive macro that will be widely used across the codebase.
80-92
: Good utility function for collection anonymization.The
anonymize_collection_values
utility function is a nice addition that simplifies anonymizing collections while keeping keys intact, a common use case seen in the PR.
115-119
: Simple implementation for bool is appropriate.The no-op implementation for
bool
makes sense - boolean values typically don't need anonymization since they represent states rather than sensitive data.lib/collection/src/collection_manager/optimizers/mod.rs (2)
115-116
: Correctly applied derive macro with field customization.Good job replacing the manual implementation with the derive macro for
TrackerTelemetry
. The#[anonymize(false)]
annotation on thename
field ensures it remains identifiable in telemetry data.Also applies to: 118-119
171-173
: Appropriate anonymization control for TrackerStatus.The derive implementation with
#[anonymize(false)]
for theCancelled
andError
variants correctly preserves error messages and cancellation reasons, which are important for debugging while not containing sensitive data.Also applies to: 179-182
src/common/telemetry_ops/requests_telemetry.rs (3)
7-7
: Clean import update including the collection utility.Good update to include both the trait and the new utility function in the import statement.
21-24
: Good use of custom anonymization function.The
#[anonymize(with = anonymize_collection_values)]
annotation shows a great use of the custom function feature of the derive macro. This ensures that response keys are preserved while the values are anonymized.
147-151
: Simplified implementation for RequestsTelemetry.Using the derive macro for
RequestsTelemetry
greatly simplifies the code by eliminating the need for a manual implementation.lib/collection/src/config.rs (3)
34-36
: Good use of struct-level anonymization control.Using
#[anonymize(false)]
at the struct level forWalConfig
is an elegant way to disable anonymization for all fields, which is appropriate for configuration that doesn't contain sensitive data.
66-68
: Appropriate derive for ShardingMethod enum.Adding the
Anonymize
derive to theShardingMethod
enum is correct since it's a simple enum with variants that don't contain sensitive data.
76-77
: Well-structured field-level anonymization control.The selective use of
#[anonymize(false)]
annotations for configuration fields likeshard_number
,replication_factor
,write_consistency_factor
, andread_fan_out_factor
is appropriate. These metrics are important for telemetry and monitoring without revealing sensitive data.Also applies to: 85-86, 96-96, 103-103, 110-110
src/common/telemetry_ops/cluster_telemetry.rs (3)
26-32
: Confirm struct-level anonymization behavior.Deriving
Anonymize
along with#[anonymize(false)]
at the struct level indicates that none of this struct’s fields will be anonymized. Ensure this is intentional if any of the fields can contain sensitive information.
50-65
: Validate selective field anonymization.The usage of
#[anonymize(false)]
for numeric fields and#[anonymize(value = None)]
forpeer_id
correctly configures field-level anonymization. If these fields are not sensitive or must remain visible, this approach looks good.
67-80
: Check non-anonymized fields.Adding
#[anonymize(false)]
forpeers
andmetadata
means these fields will remain intact. Confirm that this is desired if containing environment- or cluster-specific details.src/common/telemetry_ops/collections_telemetry.rs (3)
11-16
: DerivingAnonymize
for aggregated metrics.This change cleanly enables anonymization. No issues noted, as the macro-based approach appears consistent with the rest of the code.
18-23
: Adoption ofAnonymize
in telemetry enum.Deriving
Anonymize
on this untagged enum seems correct. No concerns regarding serialization compatibility if the macro is tested.
25-31
: Careful with non-anonymized collection count.Marking
number_of_collections
with#[anonymize(false)]
ensures it remains visible. If you intend to protect collection details, consider revising; otherwise, this is valid.lib/segment/src/json_path/mod.rs (2)
18-22
: Adding anonymization toJsonPath
.Deriving
Anonymize
onJsonPath
is straightforward. Ensure any hashed or masked logic is appropriate if you expect to conceal parts of the path.
24-33
: Non-anonymized indices inJsonPathItem
.The
#[anonymize(false)]
onIndex
indicates you intend to preserve array indices. If indexes can reveal sensitive info, consider anonymizing them. Otherwise, this looks fine.lib/collection/src/shards/telemetry.rs (5)
15-24
: Confirm intended visibility of theid
field.By specifying
#[anonymize(false)]
forid
, you are preserving its original value in telemetry. Verify whether this field might contain sensitive or user-related information. If so, consider anonymizing it or replacing it with a generic placeholder.
15-24
: Good use of#[anonymize(value = HashMap::new())]
.Ensuring
replicate_states
is anonymized with an emptyHashMap
effectively removes sensitive replication details from telemetry. This is a clean and straightforward implementation choice.
26-34
: Shard ID exposure.Similarly to
id
, marking#[anonymize(false)]
forshard_id
might disclose internal system details. If that’s acceptable for your telemetry usage, it’s fine; otherwise, you could mask or remove it.
26-34
: Correct minimization ofpeer_id
information.By mapping
peer_id
toNone
, you avoid revealing potential identifiers of peers in telemetry data, which is likely important for privacy. This approach is consistent and clear.
49-54
: Straightforward derive updates onOptimizerTelemetry
.Adding
Default
andAnonymize
derivations together keeps code clean and consistent with the rest of the implementation. No concerns here.lib/segment/src/telemetry.rs (3)
8-14
: Properly derivingAnonymize
forSegmentTelemetry
.No field-specific annotations here, so the entire struct defaults to a generic anonymization approach. Verify that none of these fields requires special logic or placeholders.
16-30
: Field-level anonymization choices inPayloadIndexTelemetry
.•
#[anonymize(value = None)]
onfield_name
effectively hides the field name.
•#[anonymize(false)]
onhistogram_bucket_size
retains potentially revealing data. Ensure this is acceptable from a privacy standpoint.
39-71
: Selective anonymization inVectorIndexSearchesTelemetry
.Using
#[anonymize(value = None)]
forindex_name
prevents revealing specific vector index details. The other fields appear to hold performance metrics rather than PII, so this approach looks solid.src/common/telemetry_ops/app_telemetry.rs (4)
23-30
: SensibleAnonymize
derivation onAppFeaturesTelemetry
.All fields are boolean flags, so no special anonymization needed. This is clean and consistent.
32-49
: Environmental details kept in plain view.By marking
distribution
,distribution_version
,cores
, andcpu_flags
with#[anonymize(false)]
, you are preserving detailed system environment information. Confirm that this does not conflict with your privacy requirements.
66-72
: EnumCpuEndian
derived withAnonymize
.Exposing CPU endianness is generally low risk. The addition of the derived trait here is straightforward and does not pose concerns.
187-191
: GPU device names remain visible.By marking the
name
field with#[anonymize(false)]
, you preserve details like GPU brand/model in telemetry. Confirm it doesn’t expose sensitive data about the environment.lib/segment/src/common/operation_time_statistics.rs (3)
18-18
: Good use of the derive macro for Anonymize.The
OperationDurationStatistics
struct now derives theAnonymize
trait, which is consistent with the PR's objective of replacing manual implementations with the derive macro.
28-29
: Appropriate anonymization exclusions for metrics.Marking these duration and metric fields with
#[anonymize(false)]
ensures that actual performance metrics remain unchanged during anonymization, which is important for accurate telemetry.Also applies to: 33-34, 38-39, 42-43
53-54
: Good use of custom anonymization function for histograms.Using a custom anonymization function for the histogram is appropriate since it requires special handling to anonymize counts while preserving bucket boundaries.
lib/collection/src/telemetry.rs (5)
15-16
: Good use of the derive macro for Anonymize.The
CollectionTelemetry
struct now derives theAnonymize
trait, which aligns with the PR's goal of replacing manual implementations with derive macros.
18-19
: Appropriate anonymization exclusion for init time.Preserving the actual initialization time during anonymization is important for telemetry purposes.
22-28
: Good anonymization strategy for collections.Setting empty collections for
transfers
,resharding
, andshard_clean_tasks
during anonymization is a good approach to remove potentially sensitive data while maintaining the structure of the telemetry.
62-63
: Good use of derive macro for CollectionConfigTelemetry.The
CollectionConfigTelemetry
struct correctly derives theAnonymize
trait.
73-74
: Appropriate anonymization for UUID.Setting the UUID to
None
during anonymization helps protect unique identifiers that could be used to correlate data.lib/storage/src/types.rs (8)
18-19
: Good use of anonymize_collection_values.Importing
anonymize_collection_values
in addition toAnonymize
provides a convenient way to anonymize collection values.
139-140
: Added Anonymize trait to PeerInfo.The
PeerInfo
struct now derives theAnonymize
trait, consistent with the PR's goal.
147-149
: Appropriate anonymization strategy for RaftInfo.The whole
RaftInfo
struct is marked with#[anonymize(false)]
to preserve all Raft consensus information, which is important for maintaining accurate system telemetry.
168-169
: Added Anonymize trait to StateRole enum.The
StateRole
enum now derives theAnonymize
trait, which will maintain the role information during anonymization.
201-208
: Good targeted anonymization for ClusterInfo.The
ClusterInfo
struct derivesAnonymize
with appropriate field-specific annotations:
peer_id
is preserved with#[anonymize(false)]
peers
uses custom handling withanonymize_collection_values
This allows precise control over which parts are anonymized.
215-216
: Preserving failure data for diagnostics.Keeping
message_send_failures
non-anonymized is appropriate as this information is valuable for diagnosing cluster issues.
220-221
: Added Anonymize trait to ClusterStatus enum.The
ClusterStatus
enum now derives theAnonymize
trait, consistent with the PR's goal.
229-233
: Appropriate anonymization for ConsensusThreadStatus.Marking the
ConsensusThreadStatus
enum with#[anonymize(false)]
ensures that consensus status information is preserved during anonymization, which is essential for system diagnostics.lib/collection/src/operations/types.rs (8)
83-84
: Added Anonymize trait to ShardStatus enum.The
ShardStatus
enum now derives theAnonymize
trait, consistent with the PR's goal.
116-118
: Appropriate error preservation for OptimizersStatus.The
Error
variant ofOptimizersStatus
is marked with#[anonymize(false)]
to preserve error details, which is important for diagnostics while still anonymizing other parts of the enum.Also applies to: 125-126
1470-1473
: Added Anonymize trait to Datatype enum.The
Datatype
enum now derives theAnonymize
trait, maintaining consistency with the PR's objectives.
1492-1497
: Good anonymization strategy for VectorParams.The
VectorParams
struct derivesAnonymize
with#[anonymize(false)]
at the struct level, preserving all vector parameters during anonymization, which is appropriate since this is configuration data rather than user data.
1554-1558
: Added Anonymize trait to Modifier enum.The
Modifier
enum now derives theAnonymize
trait, which adds no-op implementation where.clone()
was previously used.
1565-1568
: Added Anonymize trait to SparseVectorParams struct.The
SparseVectorParams
struct now derives theAnonymize
trait, consistent with the PR's goal of replacing manual implementations.
1587-1590
: Good targeted anonymization for SparseIndexParams.The
SparseIndexParams
struct derivesAnonymize
with thefull_scan_threshold
field marked as#[anonymize(false)]
to preserve this important threshold parameter during anonymization.Also applies to: 1596-1597
1643-1644
: Added Anonymize trait to VectorsConfig enum.The
VectorsConfig
enum now derives theAnonymize
trait, consistent with the PR's goal of replacing manual implementations with derive macros.lib/segment/src/types.rs (21)
212-225
: Good use of the Anonymize derive macroYou've correctly applied the Anonymize derive macro to the Distance enum, which aligns with the PR objective of replacing manual implementations with derived ones.
316-325
: LGTM: Correctly implemented Anonymize for SegmentTypeThe Anonymize trait has been properly added to the SegmentType enum using the derive macro.
328-356
: LGTM: Anonymize trait appropriately derived for payload information structsYou've correctly implemented the Anonymize trait for both PayloadIndexInfo and VectorDataInfo structs.
364-382
: Correct application of Anonymize to SegmentInfoThe SegmentInfo struct now correctly derives the Anonymize trait, which aligns with the PR objective.
494-505
: LGTM: Anonymize implemented for Indexes enumThe Indexes enum now correctly derives the Anonymize trait.
517-520
: Good use of anonymize(false) attributeYou've properly applied the Anonymize trait to the HnswConfig struct while using the anonymize(false) attribute at the struct level to prevent anonymization of the entire configuration.
668-671
: Properly implemented anonymization exclusion for QuantizationConfigYou've correctly applied the Anonymize trait to the QuantizationConfig enum and used anonymize(false) to prevent anonymization.
742-775
: LGTM: Anonymize implementation for StrictModeSparseConfigThe StrictModeSparseConfigOutput and StrictModeSparseOutput structs now correctly derive the Anonymize trait, with max_length field marked to not be anonymized.
804-837
: Correct implementation for StrictModeMultivector anonymizationThe StrictModeMultivectorConfigOutput and StrictModeMultivectorOutput structs correctly derive the Anonymize trait, with max_vectors field properly excluded from anonymization.
969-1058
: Well-implemented anonymization for StrictModeConfigOutputYou've properly added the Anonymize trait to StrictModeConfigOutput and correctly marked numeric fields with anonymize(false) to prevent them from being anonymized.
1128-1140
: LGTM: PayloadStorageType now derives AnonymizeThe PayloadStorageType enum now correctly derives the Anonymize trait.
1148-1158
: Correctly implemented Anonymize for SegmentConfigThe SegmentConfig struct now properly derives the Anonymize trait.
1221-1243
: LGTM: Storage types now derive AnonymizeVectorStorageType now correctly derives the Anonymize trait.
1246-1258
: Proper anonymization for VectorStorageDatatypeVectorStorageDatatype enum now correctly derives the Anonymize trait.
1260-1276
: LGTM: MultiVectorConfig properly derives AnonymizeBoth MultiVectorConfig struct and MultiVectorComparator enum now correctly derive the Anonymize trait.
1289-1308
: Correct implementation for VectorDataConfigThe VectorDataConfig struct now properly derives the Anonymize trait.
1329-1339
: LGTM: SparseVectorStorageType derives AnonymizeThe SparseVectorStorageType enum now correctly derives the Anonymize trait.
1342-1351
: Correct implementation for SparseVectorDataConfigThe SparseVectorDataConfig struct now properly derives the Anonymize trait.
1613-1626
: LGTM: PayloadSchemaType derives AnonymizeThe PayloadSchemaType enum now correctly derives the Anonymize trait.
1649-1652
: Good use of anonymize(false) for PayloadSchemaParamsYou've correctly applied the Anonymize trait to PayloadSchemaParams and used anonymize(false) to prevent anonymization of the enum.
4321-4329
: Appropriate implementation for ShardKeyThe ShardKey enum now correctly derives the Anonymize trait, with the Number variant properly excluded from anonymization with anonymize(false).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I like hit.
But I wonder what @generall thinks of it 😃
pub peers: Option<HashMap<PeerId, PeerInfo>>, | ||
#[serde(skip_serializing_if = "Option::is_none")] | ||
#[anonymize(false)] | ||
pub metadata: Option<HashMap<String, serde_json::Value>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should anonymize this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Using #[anonymize(value = None)]
is probably good enough to not include it at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still prefer to see if there is in metadata, might be indication of a problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/common/telemetry_ops/cluster_telemetry.rs (1)
50-62
: Confirm non-anonymized fields.Fields marked with
#[anonymize(false)]
will remain unobscured, which might expose potentially sensitive operational metrics (e.g.,number_of_peers
,term
). Ensure this is aligned with your telemetry privacy requirements and does not conflict with any existing policies around anonymization of cluster internals.lib/segment/src/common/anonymize.rs (1)
129-146
: Consider numeric anonymization forserde_json::Value
.Currently,
serde_json::Value::Number(n)
is returned as-is. If you aim for consistent numeric anonymization (e.g., rounding or hashing) similar tousize
, consider applying a parallel approach here.serde_json::Value::Number(n) => { - serde_json::Value::Number(n.clone()) + // Optionally anonymize numeric values to mirror the approach used for `usize` + match n.as_u64() { + Some(num) => serde_json::Value::Number(serde_json::Number::from(anonymize_usize(num as usize))), + None => serde_json::Value::Number(n.clone()), + } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
lib/segment/src/common/anonymize.rs
(3 hunks)src/common/telemetry_ops/cluster_telemetry.rs
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: test (windows-latest)
- GitHub Check: test-consensus
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
🔇 Additional comments (6)
src/common/telemetry_ops/cluster_telemetry.rs (2)
67-75
: Revisitpeers
non-anonymization.By adding
#[anonymize(false)]
to thepeers
field, you leave peer details fully exposed. Re-check if it aligns with prior internal discussions suggesting that peer details may be sensitive and should be replaced or omitted.Please confirm with stakeholders or run the following script to see if there's any prior usage or comments referencing
peers
anonymization:#!/bin/bash rg -A 3 'peers' src/common/telemetry_ops/cluster_telemetry.rs
26-27
:❓ Verification inconclusive
Validate struct-level
#[anonymize(false)]
usage.Applying
#[anonymize(false)]
at the struct level will prevent anonymization for all fields. Verify this is the intended behavior, as this effectively leaves every field (including nested structures) uncloaked.Run the following command to identify any field-level attributes in conflict with this struct-level directive:
🏁 Script executed:
#!/bin/bash # Searching for conflicting field-level anonymize directives within this struct rg -A 5 --no-heading '#\[anonymize\(' src/common/telemetry_ops/cluster_telemetry.rsLength of output: 879
Confirm Struct Anonymization Behavior
The struct in
src/common/telemetry_ops/cluster_telemetry.rs
is marked with#[anonymize(false)]
, and the output confirms that most fields are redundantly annotated with#[anonymize(false)]
. This setup disables anonymization for all fields—including nested structures—which appears intentional. However, note that thepeer_id
field uses#[anonymize(value = None)]
, which deviates from the others. Please verify that this mixed usage is deliberate. If the struct-level directive is meant to blanket-disable anonymization, consider removing the redundant field-level attributes for clarity.lib/segment/src/common/anonymize.rs (4)
6-6
: Re-export ofAnonymize
macro looks good.The
pub use macros::Anonymize;
line cleanly exposes the macro to calling modules.
8-42
: Documentation improvements look solid.The inline usage examples, attribute descriptions, and thorough instructions clarify how to apply the macro.
80-93
: Generic helper for value anonymization.
anonymize_collection_values()
properly transforms only values, retaining the original keys. This meets typical anonymization use-cases.
115-119
: Boolean anonymization approved.Returning the original boolean is consistent with preserving logical states while anonymizing other data. Confirm no contradictory internal policy requiring booleans to be obscured.
* Add Anonymize macro * Use Anonymize macro * anonymize metadata --------- Co-authored-by: generall <andrey@vasnetsov.com>
This PR introduces a derive macro for the
Anonymize
trait and replaces most of the hand-written implementations with it.No change in the logic/behavior are made.
Showcase
So, instead of having this:
We will have this:
No-op implementations
This PR adds a bunch of no-op
Anonymize
implementations in places where.clone()
was used before. See #6115 description for a rationale.E.g., previously:
After this PR: