Skip to content

Conversation

xzfc
Copy link
Contributor

@xzfc xzfc commented Mar 12, 2025

This PR introduces a derive macro for the Anonymize trait and replaces most of the hand-written implementations with it.

No change in the logic/behavior are made.

Showcase

So, instead of having this:

#[derive(Serialize, Clone, Debug, JsonSchema)]
pub struct SegmentTelemetry {
    pub info: SegmentInfo,
    pub config: SegmentConfig,
    pub vector_index_searches: Vec<VectorIndexSearchesTelemetry>,
    pub payload_field_indices: Vec<PayloadIndexTelemetry>,
}

impl Anonymize for SegmentTelemetry {
    fn anonymize(&self) -> Self {
        Self {
            info: self.info.anonymize(),
            config: self.config.anonymize(),
            vector_index_searches: self.vector_index_searches.anonymize(),
            payload_field_indices: self.payload_field_indices.anonymize(),
        }
    }
}

We will have this:

//                                            ↓↓↓↓↓↓↓↓↓
#[derive(Serialize, Clone, Debug, JsonSchema, Anonymize)]
pub struct SegmentTelemetry {
    pub info: SegmentInfo,
    pub config: SegmentConfig,
    pub vector_index_searches: Vec<VectorIndexSearchesTelemetry>,
    pub payload_field_indices: Vec<PayloadIndexTelemetry>,
}

No-op implementations

This PR adds a bunch of no-op Anonymize implementations in places where .clone() was used before. See #6115 description for a rationale.

E.g., previously:

pub struct SparseVectorParams {
    pub index: Option<SparseIndexParams>,
    pub modifier: Option<Modifier>,
}
pub enum Modifier {
    None,
    Idf,
}

impl Anonymize for SparseVectorParams {
    fn anonymize(&self) -> Self {
        Self {
            index: self.index.anonymize(),
            modifier: self.modifier.clone(), // clone
        }
    }
}

After this PR:

#[derive(Anonymize)] // added
pub struct SparseVectorParams {
    pub index: Option<SparseIndexParams>,
    pub modifier: Option<Modifier>,
}
#[derive(Anonymize)] // added
pub enum Modifier {
    None,
    Idf,
}

// Generated code
impl Anonymize for SparseVectorParams {
    fn anonymize(&self) -> Self {
        Self {
            index: self.index.anonymize(),
            modifier: self.modifier.anonymize(), // a call to no-op anonymize implementation
        }
    }
}
// Generated code (no-op)
impl Anonymize for Modifier {
    fn anonymize(&self) -> Self {
        match self {
            Self::None => Self::None,
            Self::Idf => Self::Idf,
        }
    }
}

@xzfc xzfc requested review from timvisee and generall March 12, 2025 18:27
Copy link
Contributor

coderabbitai bot commented Mar 12, 2025

📝 Walkthrough

Walkthrough

This pull request integrates a unified anonymization approach across the project. A new procedural macro for the Anonymize trait has been introduced in a newly added macros package. Multiple data structures—including structs, enums, and configuration types—across various modules now derive the Anonymize trait with field-specific annotations (e.g., #[anonymize(false)] or custom functions). Custom implementations of the trait were removed in favor of this streamlined, derive-based approach. Additionally, the workspace configuration was updated to include the new macros package, and a dependency on it was added for the segment module. These changes span updates in API models, telemetry, storage types, and collection configurations.

Suggested reviewers

  • xzfc

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)
  • We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
    - To enable this feature, set early_access to true under in the settings.
✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/common/telemetry_ops/requests_telemetry.rs (1)

171-181: Retained manual implementation for WebApiTelemetry.

The manual implementation for WebApiTelemetry is retained but now uses the anonymize_collection_values utility function. Consider if this could also be replaced with a derive implementation using attributes.

Consider replacing this manual implementation with a derive-based approach:

-impl Anonymize for WebApiTelemetry {
-    fn anonymize(&self) -> Self {
-        let responses = self
-            .responses
-            .iter()
-            .map(|(key, value)| (key.clone(), anonymize_collection_values(value)))
-            .collect();
-
-        WebApiTelemetry { responses }
-    }
-}

And add the derive and attribute to the struct:

-#[derive(Serialize, Clone, Default, Debug, JsonSchema)]
+#[derive(Serialize, Clone, Default, Debug, JsonSchema, Anonymize)]
 pub struct WebApiTelemetry {
+    #[anonymize(with = anonymize_collection_values)]
     pub responses: HashMap<String, HashMap<HttpStatusCode, OperationDurationStatistics>>,
 }
lib/collection/src/shards/telemetry.rs (1)

36-47: Double-check the sensitivity of variant_name.

Using #[anonymize(false)] on variant_name might inadvertently leak environment or runtime details. If that is intentional, no action needed; otherwise, consider anonymizing it.

src/common/telemetry_ops/app_telemetry.rs (1)

51-65: Explicit non-anonymization of name and version.

Retaining application name and version is usually fine for telemetry. Just confirm these do not inadvertently contain environment-specific identifiers.

lib/segment/src/common/operation_time_statistics.rs (1)

105-110: Fix the typo in the function name.

The function name has a typo: "historgram" should be "histogram".

-fn anonymize_historgram(histogram: &[(f32, usize)]) -> Vec<(f32, usize)> {
+fn anonymize_histogram(histogram: &[(f32, usize)]) -> Vec<(f32, usize)> {

Don't forget to update the reference in the annotation on line 53 as well:

-#[anonymize(with = anonymize_historgram)]
+#[anonymize(with = anonymize_histogram)]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a52a87 and db7cabe.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (26)
  • Cargo.toml (1 hunks)
  • lib/api/src/rest/models.rs (1 hunks)
  • lib/collection/src/collection_manager/optimizers/mod.rs (2 hunks)
  • lib/collection/src/config.rs (4 hunks)
  • lib/collection/src/operations/types.rs (8 hunks)
  • lib/collection/src/optimizers_builder.rs (2 hunks)
  • lib/collection/src/shards/telemetry.rs (2 hunks)
  • lib/collection/src/telemetry.rs (3 hunks)
  • lib/macros/Cargo.toml (1 hunks)
  • lib/macros/src/anonymize.rs (1 hunks)
  • lib/macros/src/lib.rs (1 hunks)
  • lib/segment/Cargo.toml (1 hunks)
  • lib/segment/src/common/anonymize.rs (3 hunks)
  • lib/segment/src/common/operation_time_statistics.rs (4 hunks)
  • lib/segment/src/index/sparse_index/sparse_index_config.rs (2 hunks)
  • lib/segment/src/json_path/mod.rs (1 hunks)
  • lib/segment/src/telemetry.rs (3 hunks)
  • lib/segment/src/types.rs (26 hunks)
  • lib/storage/src/types.rs (5 hunks)
  • src/common/telemetry.rs (1 hunks)
  • src/common/telemetry_ops/app_telemetry.rs (4 hunks)
  • src/common/telemetry_ops/cluster_telemetry.rs (2 hunks)
  • src/common/telemetry_ops/collections_telemetry.rs (1 hunks)
  • src/common/telemetry_ops/hardware.rs (1 hunks)
  • src/common/telemetry_ops/memory_telemetry.rs (1 hunks)
  • src/common/telemetry_ops/requests_telemetry.rs (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (windows-latest)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
🔇 Additional comments (93)
Cargo.toml (1)

256-256: LGTM: New macros package added to workspace.

The addition of the macros package to the workspace members is correctly implemented.

lib/segment/Cargo.toml (1)

113-113: LGTM: Correctly added macros dependency.

The segment module now depends on the new macros package, which will allow it to use the derive macro for the Anonymize trait.

lib/macros/src/lib.rs (1)

1-11: LGTM: Well-structured procedural macro implementation.

The derive_anonymize proc macro is properly implemented with good error handling. It converts the input TokenStream, processes it through the anonymize module, and returns either the result or a compile error.

lib/macros/Cargo.toml (1)

1-21: LGTM: Properly configured proc-macro crate.

The Cargo.toml is correctly set up for a procedural macro with appropriate dependencies:

  • proc-macro = true is set to indicate this is a procedural macro
  • Dependencies on proc-macro2, quote, and syn are included
  • Workspace lints are properly inherited

This follows Rust best practices for procedural macro crates.

src/common/telemetry.rs (1)

33-36: LGTM: Proper implementation of the Anonymize macro for TelemetryData

The code correctly applies the Anonymize derive macro to the TelemetryData struct, with an appropriate #[anonymize(false)] annotation on the id field to preserve its value during anonymization. This aligns with the PR objective of replacing manual implementations with the macro.

lib/api/src/rest/models.rs (1)

52-54: LGTM: Correctly implemented Anonymize for HardwareUsage

The implementation properly adds the Anonymize derive macro with #[anonymize(false)] annotation on the entire struct, effectively creating a no-op implementation that preserves all data. This is consistent with the PR's goal of replacing manual implementations with the macro.

lib/collection/src/optimizers_builder.rs (2)

5-5: LGTM: Proper import of the Anonymize trait

The code correctly imports the Anonymize trait from the segment module.


24-25: LGTM: Correctly implemented Anonymize for OptimizersConfig

The implementation properly adds the Anonymize derive macro with #[anonymize(false)] annotation on the entire struct, which ensures configuration data isn't modified during anonymization processes.

src/common/telemetry_ops/memory_telemetry.rs (1)

16-17: LGTM: Properly implemented Anonymize for MemoryTelemetry

The implementation correctly adds the Anonymize derive macro with #[anonymize(false)] annotation on the struct. This ensures memory telemetry data remains intact during anonymization, which is appropriate since this data doesn't contain sensitive information that would need anonymization.

src/common/telemetry_ops/hardware.rs (1)

10-10: Great use of the derived Anonymize trait here.

By adding #[derive(Anonymize)], the manual anonymization logic is effectively replaced with a concise and maintainable pattern. No issues identified.

lib/segment/src/index/sparse_index/sparse_index_config.rs (3)

14-16: Good adoption of the Anonymize trait for SparseIndexType.

Deriving the trait streamlines anonymization and is aligned with the rest of the codebase. Looks solid.


46-48: Consistent usage of derived Anonymize for SparseIndexConfig.

This derivation removes manual anonymization code and is consistent with the project’s move toward macro-based solutions.


54-54: Clear use of #[anonymize(false)] on full_scan_threshold.

This choice accurately reflects that the field should not be anonymized, maintaining the data as-is. Implementation is correct.

lib/macros/src/anonymize.rs (3)

1-49: Robust enumeration of possible attribute forms.

The AnonymizeAttr enum and parsing logic elegantly handle different forms of the #[anonymize(...)] attribute. This design thoroughly covers typical usage scenarios (boolean, value, function).


51-73: Well-structured helper functions for parsing and generating anonymization expressions.

parse_attrs and anonymize_expr give clear, modular handling of attributes. This approach keeps the macro codebase highly maintainable and easy to extend.


75-341: Comprehensive derivation logic with strong test coverage.

  1. The macro correctly handles structs, tuple structs, and enums, returning an appropriate compiler error for unions.
  2. The test suite is thorough and ensures accurate expansions for various attribute combinations.

No correctness, performance, or maintainability issues detected.

lib/segment/src/common/anonymize.rs (4)

6-6: Good addition: Exposing the derive macro.

Well done adding the public export of the Anonymize derive macro. This aligns with the PR objective of replacing manual implementations with the macro.


8-42: Excellent documentation for the derive macro.

The comprehensive documentation with usage examples and attribute explanations makes the macro easy to understand and use. This is particularly important for a derive macro that will be widely used across the codebase.


80-92: Good utility function for collection anonymization.

The anonymize_collection_values utility function is a nice addition that simplifies anonymizing collections while keeping keys intact, a common use case seen in the PR.


115-119: Simple implementation for bool is appropriate.

The no-op implementation for bool makes sense - boolean values typically don't need anonymization since they represent states rather than sensitive data.

lib/collection/src/collection_manager/optimizers/mod.rs (2)

115-116: Correctly applied derive macro with field customization.

Good job replacing the manual implementation with the derive macro for TrackerTelemetry. The #[anonymize(false)] annotation on the name field ensures it remains identifiable in telemetry data.

Also applies to: 118-119


171-173: Appropriate anonymization control for TrackerStatus.

The derive implementation with #[anonymize(false)] for the Cancelled and Error variants correctly preserves error messages and cancellation reasons, which are important for debugging while not containing sensitive data.

Also applies to: 179-182

src/common/telemetry_ops/requests_telemetry.rs (3)

7-7: Clean import update including the collection utility.

Good update to include both the trait and the new utility function in the import statement.


21-24: Good use of custom anonymization function.

The #[anonymize(with = anonymize_collection_values)] annotation shows a great use of the custom function feature of the derive macro. This ensures that response keys are preserved while the values are anonymized.


147-151: Simplified implementation for RequestsTelemetry.

Using the derive macro for RequestsTelemetry greatly simplifies the code by eliminating the need for a manual implementation.

lib/collection/src/config.rs (3)

34-36: Good use of struct-level anonymization control.

Using #[anonymize(false)] at the struct level for WalConfig is an elegant way to disable anonymization for all fields, which is appropriate for configuration that doesn't contain sensitive data.


66-68: Appropriate derive for ShardingMethod enum.

Adding the Anonymize derive to the ShardingMethod enum is correct since it's a simple enum with variants that don't contain sensitive data.


76-77: Well-structured field-level anonymization control.

The selective use of #[anonymize(false)] annotations for configuration fields like shard_number, replication_factor, write_consistency_factor, and read_fan_out_factor is appropriate. These metrics are important for telemetry and monitoring without revealing sensitive data.

Also applies to: 85-86, 96-96, 103-103, 110-110

src/common/telemetry_ops/cluster_telemetry.rs (3)

26-32: Confirm struct-level anonymization behavior.

Deriving Anonymize along with #[anonymize(false)] at the struct level indicates that none of this struct’s fields will be anonymized. Ensure this is intentional if any of the fields can contain sensitive information.


50-65: Validate selective field anonymization.

The usage of #[anonymize(false)] for numeric fields and #[anonymize(value = None)] for peer_id correctly configures field-level anonymization. If these fields are not sensitive or must remain visible, this approach looks good.


67-80: Check non-anonymized fields.

Adding #[anonymize(false)] for peers and metadata means these fields will remain intact. Confirm that this is desired if containing environment- or cluster-specific details.

src/common/telemetry_ops/collections_telemetry.rs (3)

11-16: Deriving Anonymize for aggregated metrics.

This change cleanly enables anonymization. No issues noted, as the macro-based approach appears consistent with the rest of the code.


18-23: Adoption of Anonymize in telemetry enum.

Deriving Anonymize on this untagged enum seems correct. No concerns regarding serialization compatibility if the macro is tested.


25-31: Careful with non-anonymized collection count.

Marking number_of_collections with #[anonymize(false)] ensures it remains visible. If you intend to protect collection details, consider revising; otherwise, this is valid.

lib/segment/src/json_path/mod.rs (2)

18-22: Adding anonymization to JsonPath.

Deriving Anonymize on JsonPath is straightforward. Ensure any hashed or masked logic is appropriate if you expect to conceal parts of the path.


24-33: Non-anonymized indices in JsonPathItem.

The #[anonymize(false)] on Index indicates you intend to preserve array indices. If indexes can reveal sensitive info, consider anonymizing them. Otherwise, this looks fine.

lib/collection/src/shards/telemetry.rs (5)

15-24: Confirm intended visibility of the id field.

By specifying #[anonymize(false)] for id, you are preserving its original value in telemetry. Verify whether this field might contain sensitive or user-related information. If so, consider anonymizing it or replacing it with a generic placeholder.


15-24: Good use of #[anonymize(value = HashMap::new())].

Ensuring replicate_states is anonymized with an empty HashMap effectively removes sensitive replication details from telemetry. This is a clean and straightforward implementation choice.


26-34: Shard ID exposure.

Similarly to id, marking #[anonymize(false)] for shard_id might disclose internal system details. If that’s acceptable for your telemetry usage, it’s fine; otherwise, you could mask or remove it.


26-34: Correct minimization of peer_id information.

By mapping peer_id to None, you avoid revealing potential identifiers of peers in telemetry data, which is likely important for privacy. This approach is consistent and clear.


49-54: Straightforward derive updates on OptimizerTelemetry.

Adding Default and Anonymize derivations together keeps code clean and consistent with the rest of the implementation. No concerns here.

lib/segment/src/telemetry.rs (3)

8-14: Properly deriving Anonymize for SegmentTelemetry.

No field-specific annotations here, so the entire struct defaults to a generic anonymization approach. Verify that none of these fields requires special logic or placeholders.


16-30: Field-level anonymization choices in PayloadIndexTelemetry.

#[anonymize(value = None)] on field_name effectively hides the field name.
#[anonymize(false)] on histogram_bucket_size retains potentially revealing data. Ensure this is acceptable from a privacy standpoint.


39-71: Selective anonymization in VectorIndexSearchesTelemetry.

Using #[anonymize(value = None)] for index_name prevents revealing specific vector index details. The other fields appear to hold performance metrics rather than PII, so this approach looks solid.

src/common/telemetry_ops/app_telemetry.rs (4)

23-30: Sensible Anonymize derivation on AppFeaturesTelemetry.

All fields are boolean flags, so no special anonymization needed. This is clean and consistent.


32-49: Environmental details kept in plain view.

By marking distribution, distribution_version, cores, and cpu_flags with #[anonymize(false)], you are preserving detailed system environment information. Confirm that this does not conflict with your privacy requirements.


66-72: Enum CpuEndian derived with Anonymize.

Exposing CPU endianness is generally low risk. The addition of the derived trait here is straightforward and does not pose concerns.


187-191: GPU device names remain visible.

By marking the name field with #[anonymize(false)], you preserve details like GPU brand/model in telemetry. Confirm it doesn’t expose sensitive data about the environment.

lib/segment/src/common/operation_time_statistics.rs (3)

18-18: Good use of the derive macro for Anonymize.

The OperationDurationStatistics struct now derives the Anonymize trait, which is consistent with the PR's objective of replacing manual implementations with the derive macro.


28-29: Appropriate anonymization exclusions for metrics.

Marking these duration and metric fields with #[anonymize(false)] ensures that actual performance metrics remain unchanged during anonymization, which is important for accurate telemetry.

Also applies to: 33-34, 38-39, 42-43


53-54: Good use of custom anonymization function for histograms.

Using a custom anonymization function for the histogram is appropriate since it requires special handling to anonymize counts while preserving bucket boundaries.

lib/collection/src/telemetry.rs (5)

15-16: Good use of the derive macro for Anonymize.

The CollectionTelemetry struct now derives the Anonymize trait, which aligns with the PR's goal of replacing manual implementations with derive macros.


18-19: Appropriate anonymization exclusion for init time.

Preserving the actual initialization time during anonymization is important for telemetry purposes.


22-28: Good anonymization strategy for collections.

Setting empty collections for transfers, resharding, and shard_clean_tasks during anonymization is a good approach to remove potentially sensitive data while maintaining the structure of the telemetry.


62-63: Good use of derive macro for CollectionConfigTelemetry.

The CollectionConfigTelemetry struct correctly derives the Anonymize trait.


73-74: Appropriate anonymization for UUID.

Setting the UUID to None during anonymization helps protect unique identifiers that could be used to correlate data.

lib/storage/src/types.rs (8)

18-19: Good use of anonymize_collection_values.

Importing anonymize_collection_values in addition to Anonymize provides a convenient way to anonymize collection values.


139-140: Added Anonymize trait to PeerInfo.

The PeerInfo struct now derives the Anonymize trait, consistent with the PR's goal.


147-149: Appropriate anonymization strategy for RaftInfo.

The whole RaftInfo struct is marked with #[anonymize(false)] to preserve all Raft consensus information, which is important for maintaining accurate system telemetry.


168-169: Added Anonymize trait to StateRole enum.

The StateRole enum now derives the Anonymize trait, which will maintain the role information during anonymization.


201-208: Good targeted anonymization for ClusterInfo.

The ClusterInfo struct derives Anonymize with appropriate field-specific annotations:

  • peer_id is preserved with #[anonymize(false)]
  • peers uses custom handling with anonymize_collection_values

This allows precise control over which parts are anonymized.


215-216: Preserving failure data for diagnostics.

Keeping message_send_failures non-anonymized is appropriate as this information is valuable for diagnosing cluster issues.


220-221: Added Anonymize trait to ClusterStatus enum.

The ClusterStatus enum now derives the Anonymize trait, consistent with the PR's goal.


229-233: Appropriate anonymization for ConsensusThreadStatus.

Marking the ConsensusThreadStatus enum with #[anonymize(false)] ensures that consensus status information is preserved during anonymization, which is essential for system diagnostics.

lib/collection/src/operations/types.rs (8)

83-84: Added Anonymize trait to ShardStatus enum.

The ShardStatus enum now derives the Anonymize trait, consistent with the PR's goal.


116-118: Appropriate error preservation for OptimizersStatus.

The Error variant of OptimizersStatus is marked with #[anonymize(false)] to preserve error details, which is important for diagnostics while still anonymizing other parts of the enum.

Also applies to: 125-126


1470-1473: Added Anonymize trait to Datatype enum.

The Datatype enum now derives the Anonymize trait, maintaining consistency with the PR's objectives.


1492-1497: Good anonymization strategy for VectorParams.

The VectorParams struct derives Anonymize with #[anonymize(false)] at the struct level, preserving all vector parameters during anonymization, which is appropriate since this is configuration data rather than user data.


1554-1558: Added Anonymize trait to Modifier enum.

The Modifier enum now derives the Anonymize trait, which adds no-op implementation where .clone() was previously used.


1565-1568: Added Anonymize trait to SparseVectorParams struct.

The SparseVectorParams struct now derives the Anonymize trait, consistent with the PR's goal of replacing manual implementations.


1587-1590: Good targeted anonymization for SparseIndexParams.

The SparseIndexParams struct derives Anonymize with the full_scan_threshold field marked as #[anonymize(false)] to preserve this important threshold parameter during anonymization.

Also applies to: 1596-1597


1643-1644: Added Anonymize trait to VectorsConfig enum.

The VectorsConfig enum now derives the Anonymize trait, consistent with the PR's goal of replacing manual implementations with derive macros.

lib/segment/src/types.rs (21)

212-225: Good use of the Anonymize derive macro

You've correctly applied the Anonymize derive macro to the Distance enum, which aligns with the PR objective of replacing manual implementations with derived ones.


316-325: LGTM: Correctly implemented Anonymize for SegmentType

The Anonymize trait has been properly added to the SegmentType enum using the derive macro.


328-356: LGTM: Anonymize trait appropriately derived for payload information structs

You've correctly implemented the Anonymize trait for both PayloadIndexInfo and VectorDataInfo structs.


364-382: Correct application of Anonymize to SegmentInfo

The SegmentInfo struct now correctly derives the Anonymize trait, which aligns with the PR objective.


494-505: LGTM: Anonymize implemented for Indexes enum

The Indexes enum now correctly derives the Anonymize trait.


517-520: Good use of anonymize(false) attribute

You've properly applied the Anonymize trait to the HnswConfig struct while using the anonymize(false) attribute at the struct level to prevent anonymization of the entire configuration.


668-671: Properly implemented anonymization exclusion for QuantizationConfig

You've correctly applied the Anonymize trait to the QuantizationConfig enum and used anonymize(false) to prevent anonymization.


742-775: LGTM: Anonymize implementation for StrictModeSparseConfig

The StrictModeSparseConfigOutput and StrictModeSparseOutput structs now correctly derive the Anonymize trait, with max_length field marked to not be anonymized.


804-837: Correct implementation for StrictModeMultivector anonymization

The StrictModeMultivectorConfigOutput and StrictModeMultivectorOutput structs correctly derive the Anonymize trait, with max_vectors field properly excluded from anonymization.


969-1058: Well-implemented anonymization for StrictModeConfigOutput

You've properly added the Anonymize trait to StrictModeConfigOutput and correctly marked numeric fields with anonymize(false) to prevent them from being anonymized.


1128-1140: LGTM: PayloadStorageType now derives Anonymize

The PayloadStorageType enum now correctly derives the Anonymize trait.


1148-1158: Correctly implemented Anonymize for SegmentConfig

The SegmentConfig struct now properly derives the Anonymize trait.


1221-1243: LGTM: Storage types now derive Anonymize

VectorStorageType now correctly derives the Anonymize trait.


1246-1258: Proper anonymization for VectorStorageDatatype

VectorStorageDatatype enum now correctly derives the Anonymize trait.


1260-1276: LGTM: MultiVectorConfig properly derives Anonymize

Both MultiVectorConfig struct and MultiVectorComparator enum now correctly derive the Anonymize trait.


1289-1308: Correct implementation for VectorDataConfig

The VectorDataConfig struct now properly derives the Anonymize trait.


1329-1339: LGTM: SparseVectorStorageType derives Anonymize

The SparseVectorStorageType enum now correctly derives the Anonymize trait.


1342-1351: Correct implementation for SparseVectorDataConfig

The SparseVectorDataConfig struct now properly derives the Anonymize trait.


1613-1626: LGTM: PayloadSchemaType derives Anonymize

The PayloadSchemaType enum now correctly derives the Anonymize trait.


1649-1652: Good use of anonymize(false) for PayloadSchemaParams

You've correctly applied the Anonymize trait to PayloadSchemaParams and used anonymize(false) to prevent anonymization of the enum.


4321-4329: Appropriate implementation for ShardKey

The ShardKey enum now correctly derives the Anonymize trait, with the Number variant properly excluded from anonymization with anonymize(false).

Copy link
Member

@timvisee timvisee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like hit.

But I wonder what @generall thinks of it 😃

pub peers: Option<HashMap<PeerId, PeerInfo>>,
#[serde(skip_serializing_if = "Option::is_none")]
#[anonymize(false)]
pub metadata: Option<HashMap<String, serde_json::Value>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should anonymize this as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Using #[anonymize(value = None)] is probably good enough to not include it at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still prefer to see if there is in metadata, might be indication of a problem

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/common/telemetry_ops/cluster_telemetry.rs (1)

50-62: Confirm non-anonymized fields.

Fields marked with #[anonymize(false)] will remain unobscured, which might expose potentially sensitive operational metrics (e.g., number_of_peers, term). Ensure this is aligned with your telemetry privacy requirements and does not conflict with any existing policies around anonymization of cluster internals.

lib/segment/src/common/anonymize.rs (1)

129-146: Consider numeric anonymization for serde_json::Value.

Currently, serde_json::Value::Number(n) is returned as-is. If you aim for consistent numeric anonymization (e.g., rounding or hashing) similar to usize, consider applying a parallel approach here.

 serde_json::Value::Number(n) => {
-    serde_json::Value::Number(n.clone())
+    // Optionally anonymize numeric values to mirror the approach used for `usize`
+    match n.as_u64() {
+        Some(num) => serde_json::Value::Number(serde_json::Number::from(anonymize_usize(num as usize))),
+        None => serde_json::Value::Number(n.clone()),
+    }
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a26fcfb and 4ee4e5c.

📒 Files selected for processing (2)
  • lib/segment/src/common/anonymize.rs (3 hunks)
  • src/common/telemetry_ops/cluster_telemetry.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test (macos-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test (windows-latest)
  • GitHub Check: test-consensus
  • GitHub Check: test (ubuntu-latest)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (6)
src/common/telemetry_ops/cluster_telemetry.rs (2)

67-75: Revisit peers non-anonymization.

By adding #[anonymize(false)] to the peers field, you leave peer details fully exposed. Re-check if it aligns with prior internal discussions suggesting that peer details may be sensitive and should be replaced or omitted.

Please confirm with stakeholders or run the following script to see if there's any prior usage or comments referencing peers anonymization:

#!/bin/bash
rg -A 3 'peers' src/common/telemetry_ops/cluster_telemetry.rs

26-27:

❓ Verification inconclusive

Validate struct-level #[anonymize(false)] usage.

Applying #[anonymize(false)] at the struct level will prevent anonymization for all fields. Verify this is the intended behavior, as this effectively leaves every field (including nested structures) uncloaked.

Run the following command to identify any field-level attributes in conflict with this struct-level directive:


🏁 Script executed:

#!/bin/bash
# Searching for conflicting field-level anonymize directives within this struct
rg -A 5 --no-heading '#\[anonymize\(' src/common/telemetry_ops/cluster_telemetry.rs

Length of output: 879


Confirm Struct Anonymization Behavior

The struct in src/common/telemetry_ops/cluster_telemetry.rs is marked with #[anonymize(false)], and the output confirms that most fields are redundantly annotated with #[anonymize(false)]. This setup disables anonymization for all fields—including nested structures—which appears intentional. However, note that the peer_id field uses #[anonymize(value = None)], which deviates from the others. Please verify that this mixed usage is deliberate. If the struct-level directive is meant to blanket-disable anonymization, consider removing the redundant field-level attributes for clarity.

lib/segment/src/common/anonymize.rs (4)

6-6: Re-export of Anonymize macro looks good.

The pub use macros::Anonymize; line cleanly exposes the macro to calling modules.


8-42: Documentation improvements look solid.

The inline usage examples, attribute descriptions, and thorough instructions clarify how to apply the macro.


80-93: Generic helper for value anonymization.

anonymize_collection_values() properly transforms only values, retaining the original keys. This meets typical anonymization use-cases.


115-119: Boolean anonymization approved.

Returning the original boolean is consistent with preserving logical states while anonymizing other data. Confirm no contradictory internal policy requiring booleans to be obscured.

@generall generall merged commit 70c4c5a into dev Mar 13, 2025
17 checks passed
@generall generall deleted the anonymize-macro branch March 13, 2025 11:07
timvisee pushed a commit that referenced this pull request Mar 21, 2025
* Add Anonymize macro

* Use Anonymize macro

* anonymize metadata

---------

Co-authored-by: generall <andrey@vasnetsov.com>
@timvisee timvisee mentioned this pull request Mar 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants