Reduce cardinality of metrics exposed by Mimir #1750

@pracucci

@colega did great work analyzing the cardinality of Mimir metrics and which metrics are actually used by our dashboards, alerts or manual queries at Grafana Labs. I took a quick look at the report of both used and unused metrics, and I think we have room to reduce the cardinality of the metrics exposed by Mimir.

Mimir metrics are typically low cardinality, unless you run very large Mimir clusters and/or have a very large number of tenants. Below I'm sharing some ideas on how we could reduce their cardinality.

cortex_ring_tokens_owned

  • Exposed by the ring client
  • For each ring member, it exposes how many tokens (ring virtual nodes) that member has registered in the ring (e.g. 512)

Proposal: remove the metric (I think it's useless).

cortex_ring_member_ownership_percent

  • Exposed by the ring client
  • For each ring member, it exposes the % of tokens owned in the ring

Proposal: remove the metric, since the same information is available in the ring UI page.
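To give a feel for why the two per-member ring metrics above get expensive in large clusters, here is a rough back-of-the-envelope sketch. It assumes that every process running a ring client exposes one series per ring member for each metric; the function name and the cluster sizes are hypothetical and only for illustration.

```go
package main

import "fmt"

// estimateRingSeries returns a rough series count for ONE per-member ring
// metric, assuming each ring client exposes one series per ring member.
// Both the name and the model are illustrative assumptions.
func estimateRingSeries(ringClients, ringMembers int) int {
	return ringClients * ringMembers
}

func main() {
	// Hypothetical cluster: 100 processes holding a client for a ring of 300 members.
	perMetric := estimateRingSeries(100, 300)
	fmt.Println("series per metric:", perMetric)
	// cortex_ring_tokens_owned and cortex_ring_member_ownership_percent together.
	fmt.Println("series for both metrics:", 2*perMetric)
}
```

Under this assumption the series count grows with the product of clients and members, which is why these metrics stand out only in very large clusters.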

cortex_alertmanager_notification_requests_total and cortex_alertmanager_notification_requests_failed_total

  • Exposed by the Alertmanager
  • For each user and receiver integration (even if it's not used!) it exposes the number of notification requests sent / failed

Proposal: do not expose them for unused receivers. We could either propose a change to Prometheus Alertmanager so that it doesn't initialize them, or do a trick in Mimir (where we remap them) to not expose these metrics if their value is 0.

The same could be done for all other Alertmanager counters that have both the user and integration labels, like cortex_alertmanager_notifications_total.
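The "do not initialize them" option could look roughly like the sketch below: a series for a user and integration pair is created only on its first increment, so unused receivers never show up. This is a stdlib-only stand-in, not the actual Alertmanager or Mimir code; `lazyCounters`, `Inc` and `Exposed` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// labelKey identifies one series by its user and integration labels.
type labelKey struct{ user, integration string }

// lazyCounters is a hypothetical counter set that creates a series only the
// first time it is incremented, instead of pre-initializing every integration.
type lazyCounters struct {
	mu     sync.Mutex
	values map[labelKey]float64
}

func newLazyCounters() *lazyCounters {
	return &lazyCounters{values: make(map[labelKey]float64)}
}

// Inc increments the counter for the pair, creating the series if needed.
func (c *lazyCounters) Inc(user, integration string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[labelKey{user, integration}]++
}

// Exposed returns "user/integration" for every existing series, sorted.
func (c *lazyCounters) Exposed() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]string, 0, len(c.values))
	for k := range c.values {
		out = append(out, k.user+"/"+k.integration)
	}
	sort.Strings(out)
	return out
}

func main() {
	c := newLazyCounters()
	c.Inc("tenant-a", "slack")
	c.Inc("tenant-a", "slack")
	c.Inc("tenant-b", "email")
	// Only the two pairs that were actually used are listed.
	fmt.Println(c.Exposed())
}
```

The trade-off is that a series which was never incremented is absent rather than 0, so queries and alerts must tolerate missing series.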

cortex_distributor_ingester_queries_total and cortex_distributor_ingester_query_failures_total

  • Exposed by querier and ruler
  • Exposes the number of queries sent to ingesters (and failed ones)

Proposal: remove both metrics (the same information can be inferred from the generic gRPC requests by route metric).

cortex_alertmanager_alerts_insert_limited_total

This and other Alertmanager counters with only the user label are exposed for all tenants, regardless of whether any value has ever been tracked (so even if the counter value is 0).
If they were normal CounterVec metrics they wouldn't be, but in this case they are because they've been remapped by Mimir.

Proposal: improve the remapping logic for counters so that it can optionally skip exposing a counter whose value is 0.
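A minimal sketch of what the optional behaviour could look like, assuming the remapping step sums a counter across each tenant's registries and emits one value per user. This is not Mimir's actual remapping code; `sumPerUser` and the `skipZero` flag are hypothetical names.

```go
package main

import "fmt"

// sumPerUser mimics a remapping step: it sums a counter's values per user and
// returns one total per user. When skipZero is true, users whose total is 0
// are omitted, so no series would be exposed for them.
func sumPerUser(perUserValues map[string][]float64, skipZero bool) map[string]float64 {
	out := make(map[string]float64, len(perUserValues))
	for user, values := range perUserValues {
		total := 0.0
		for _, v := range values {
			total += v
		}
		if skipZero && total == 0 {
			continue // never-incremented counter: do not expose it
		}
		out[user] = total
	}
	return out
}

func main() {
	// Hypothetical input: tenant-a never hit the limit, tenant-b did 3 times.
	input := map[string][]float64{
		"tenant-a": {0, 0},
		"tenant-b": {1, 2},
	}
	fmt.Println("skipZero=false:", len(sumPerUser(input, false)), "series")
	fmt.Println("skipZero=true: ", len(sumPerUser(input, true)), "series")
}
```

Keeping it opt-in lets counters that dashboards or alerts expect to exist at 0 keep the current behaviour.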

Labels: enhancement