
CFP: Improvements in hubble metrics cardinality #23162

@marqc

Description

Cilium Feature Proposal/Improvement

Proposals to improve Hubble metrics cardinality:

  1. introduce TTL on metrics
  2. introduce limit on total number of metric series
  3. clean up metrics on pod deletion
  4. introduce label or annotation to exclude certain targets/pods from collecting metrics

Is your feature request related to a problem?

I'm currently examining potential Hubble metrics cardinality issues that may lead to OOM errors on either the cilium-agent or the metrics collector (when too many metric series are exposed by the cilium-agent).

The cilium-agent collects metrics in standard Prometheus client_golang counters and, depending on the source & destination context configuration, labels metrics with the pod name (or Cilium identity). Over time pods get deleted and new ones are created, but the Prometheus registry still keeps the counters labeled with the deleted pods. If the cilium-agent is never restarted, this can lead to OOM on the cilium-agent or the metrics collector. The problem is most visible when pods are very short-lived, e.g. with heavily utilized CronJobs or Knative.
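
For illustration, a minimal sketch of this failure mode with client_golang (the metric name and label set here are simplified, not Hubble's exact ones): once a label combination has been observed, the CounterVec keeps that series registered until something explicitly deletes it.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical flow counter labeled by source/destination pod name.
	flows := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "hubble_flows_total", Help: "flows by endpoint"},
		[]string{"source", "destination"},
	)
	prometheus.MustRegister(flows)

	// Each short-lived pod that produces traffic creates a new series.
	for i := 0; i < 3; i++ {
		flows.WithLabelValues(fmt.Sprintf("cronjob-pod-%d", i), "dns-server").Inc()
	}

	// The series outlive the pods: nothing ever calls DeleteLabelValues,
	// so the registry (and every scrape) keeps growing.
	mfs, _ := prometheus.DefaultGatherer.Gather()
	for _, mf := range mfs {
		if mf.GetName() == "hubble_flows_total" {
			fmt.Println("live series:", len(mf.GetMetric())) // prints 3, even once the pods are gone
		}
	}
}
```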

The other failure scenario is a workload/pod that behaves "strangely". A port scanner is one example: say we have a pod that periodically scans the open ports of other pods in the cluster to report potential security vulnerabilities. Each scan attempt may generate thousands of data series with a simple nc -z -v <TARGET_IP> 1-32000. This relates to the port-distribution metric: https://github.com/cilium/cilium/blob/master/pkg/hubble/metrics/port-distribution/handler.go

Describe the feature you'd like

I can think of a few mitigations and improvements and would love to hear your feedback.

1. introduce TTL on metrics

Keep an additional map of the last-write timestamp of each data series and remove metrics that haven't been updated within a specified TTL, as sketched below. This mitigates the problem of keeping data series of non-existing pods forever.
The downside is that it can also remove data series of existing pods when traffic is infrequent.
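
A minimal sketch of what this could look like, assuming a hypothetical wrapper around client_golang's CounterVec (ttlCounterVec and its methods are illustrative, not an existing Hubble API):

```go
package metrics

import (
	"strings"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// ttlCounterVec wraps a CounterVec and remembers when each label
// combination was last written, so stale series can be expired.
type ttlCounterVec struct {
	vec      *prometheus.CounterVec
	mu       sync.Mutex
	lastSeen map[string]time.Time // key: NUL-joined label values
}

func newTTLCounterVec(vec *prometheus.CounterVec) *ttlCounterVec {
	return &ttlCounterVec{vec: vec, lastSeen: map[string]time.Time{}}
}

func (t *ttlCounterVec) Inc(labelValues ...string) {
	t.mu.Lock()
	t.lastSeen[strings.Join(labelValues, "\x00")] = time.Now()
	t.mu.Unlock()
	t.vec.WithLabelValues(labelValues...).Inc()
}

// sweep removes every series that has not been written within ttl.
// Intended to run periodically, e.g. from a time.Ticker goroutine.
func (t *ttlCounterVec) sweep(ttl time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for key, seen := range t.lastSeen {
		if time.Since(seen) > ttl {
			t.vec.DeleteLabelValues(strings.Split(key, "\x00")...)
			delete(t.lastSeen, key)
		}
	}
}
```

The TTL would likely need to be configurable per metric handler: too short and quiet-but-live series flap in and out; too long and the memory win shrinks.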

2. introduce limit on total number of metric series

Introduce a limit that rejects the creation of new metric series once a configured threshold is reached. In combination with proposals (1) and (3), the number of data series can later decrease, allowing new series to be created again. The number of rejections should be tracked in a separate counter as feedback to the operator.

This prevents OOM in any scenario, but may lead to some metrics not being collected.
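
A sketch of how the cap could work, again as a hypothetical wrapper (the rejection counter name hubble_metrics_series_rejected_total is made up for illustration):

```go
package metrics

import (
	"strings"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// cappedCounterVec refuses to create series beyond maxSeries and counts
// the rejections so operators can see that metrics are being dropped.
type cappedCounterVec struct {
	vec       *prometheus.CounterVec
	mu        sync.Mutex
	seen      map[string]struct{}
	maxSeries int
	rejected  prometheus.Counter // e.g. hubble_metrics_series_rejected_total
}

func (c *cappedCounterVec) Inc(labelValues ...string) {
	key := strings.Join(labelValues, "\x00")
	c.mu.Lock()
	if _, ok := c.seen[key]; !ok {
		if len(c.seen) >= c.maxSeries {
			c.mu.Unlock()
			c.rejected.Inc() // feedback for the operator, per the proposal
			return
		}
		c.seen[key] = struct{}{}
	}
	c.mu.Unlock()
	c.vec.WithLabelValues(labelValues...).Inc()
}
```

Existing series keep updating normally once the cap is hit; only brand-new label combinations are dropped, which is what makes this safe as a backstop for proposals (1) and (3).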

3. clean up metrics on pod deletion

If a metric is attributed to a pod or CiliumIdentity, we can track removals of these objects and remove the related metrics. This solution handles infrequent traffic better than solution (1).

On the other hand, it can be quite challenging to determine which data series should be removed: e.g. when source="pod-a" and destination="pod-b", should we remove the data series when "pod-a" is deleted, or wait for both pods to be deleted?
If all pods talk to a long-running pod like the DNS server, we may keep their series forever (until the DNS pod is deleted).
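
One possible eager variant removes the series as soon as either endpoint's pod is deleted. A sketch, assuming the source/destination labels carry the pod name and client_golang v1.13+ (which added DeletePartialMatch):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// onPodDelete drops every series in which the deleted pod appears as
// either source or destination. DeletePartialMatch matches on a subset
// of labels and returns the number of series removed.
func onPodDelete(vec *prometheus.CounterVec, podName string) int {
	removed := vec.DeletePartialMatch(prometheus.Labels{"source": podName})
	removed += vec.DeletePartialMatch(prometheus.Labels{"destination": podName})
	return removed
}
```

The eager variant sidesteps the "wait for both pods" question at the cost of losing counts for flows whose other endpoint is still alive.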

4. introduce label or annotation to exclude certain targets/pods from collecting metrics

We can additionally label/annotate pods with a cilium.io/no-metrics: true label and, when it is present, return no labels from the context resolver. This allows excluding certain workloads/pods from generating data series (the port scanner scenario), as sketched below.
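
A sketch of the resolver-side check, assuming access to the pod object (the resolver shape is illustrative, not Hubble's actual context-resolver interface; the annotation key is the one proposed above):

```go
package metrics

import corev1 "k8s.io/api/core/v1"

// noMetricsAnnotation is the opt-out key proposed above.
const noMetricsAnnotation = "cilium.io/no-metrics"

// labelsForPod returns the context labels for a pod, or nil when the
// pod has opted out, so no data series is ever created for it.
func labelsForPod(pod *corev1.Pod) map[string]string {
	if pod.Annotations[noMetricsAnnotation] == "true" {
		return nil // excluded: the handler skips the metric update entirely
	}
	return map[string]string{"pod": pod.Name, "namespace": pod.Namespace}
}
```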
