Conversation

voelzmo
Member

@voelzmo voelzmo commented May 13, 2025

How to categorize this PR?

/area monitoring
/area autoscaling
/kind bug

What this PR does / why we need it:
Remove the sum statements from the VPA Pod metrics and display the individual Pods instead. This gets a little messier for things like the requests: instead of one connected graph you end up with several discrete sections. But it is also more accurate, even when other issues exist with metrics scraping or label configuration.

Previously, we sometimes saw somewhat misleading graphs like this one, showing short spikes of twice the actual memory requests. The reason is that when a Pod was evicted and a new one was created, there was a temporary state in which data for two Pods was visible, and the values got summed up. For Prometheus the Pod identifier is always prometheus-shoot-0, hence this effect.
Screenshot 2025-05-13 at 09 03 03

In other cases, the data showed double the target recommendation, because there were two data points with different origins (e.g. shoot and seed).

Screenshot 2025-05-13 at 09 17 01

All these issues could be addressed in the dashboard itself, for example by summing by UID (if you want to keep a sum statement) and by ensuring that we don't record the same metrics twice from the seed and the shoot perspective. As long as the sum statements stay in place, however, we don't even notice such problems; we just get confused by values that don't match reality.
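To illustrate the point (a sketch only; the metric name and label selectors below are illustrative assumptions, not the exact queries from the 'recommendations' dashboard), the change amounts to dropping the sum aggregation, or, if an aggregation must be kept, grouping by a unique Pod identifier such as the UID:

```promql
# Before: sums all matching series. During an eviction, the old and the
# new Pod briefly report at the same time, so their values are added up
# and the graph shows twice the actual requests.
sum(kube_pod_container_resource_requests{resource="memory", pod=~"prometheus-shoot-.*"})

# After: one series per Pod, so overlapping data shows up as separate,
# discrete sections instead of an inflated sum.
kube_pod_container_resource_requests{resource="memory", pod=~"prometheus-shoot-.*"}

# If a sum statement is desired: aggregating by the Pod UID keeps one
# series per Pod instance instead of collapsing old and new Pod into one.
sum by (uid) (kube_pod_container_resource_requests{resource="memory", pod=~"prometheus-shoot-.*"})
```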

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Remove sum for VPA Pod metrics in 'recommendations' dashboard

@gardener-prow gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/bug Bug labels May 13, 2025
Contributor

gardener-prow bot commented May 13, 2025

@voelzmo: The label(s) area/autoscaling cannot be applied, because the repository doesn't have them.

In response to this:

How to categorize this PR?

/area monitoring
/area autoscaling
/kind bug


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gardener-prow gardener-prow bot added the cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. label May 13, 2025
@gardener-prow gardener-prow bot requested review from oliver-goetz and tobschli May 13, 2025 07:27
@gardener-prow gardener-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 13, 2025
Member

@istvanballok istvanballok left a comment

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2025
Contributor

gardener-prow bot commented May 13, 2025

LGTM label has been added.

Git tree hash: c97927cdb758c90bd846c844f75dc9a7e99f04b9

@rfranzke
Member

/approve

Contributor

gardener-prow bot commented May 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2025
@voelzmo
Member Author

voelzmo commented May 13, 2025

/label area/auto-scaling

Contributor

gardener-prow bot commented May 13, 2025

@voelzmo: The label(s) /label area/auto-scaling cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, ipcei/oidc, ipcei/workload-identity, skip-review. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label area/auto-scaling

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gardener-prow gardener-prow bot merged commit 5502667 into gardener:master May 13, 2025
19 checks passed
@oliver-goetz
Member

/area auto-scaling

@gardener-prow gardener-prow bot added the area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related label May 13, 2025
istvanballok added a commit to istvanballok/gardener that referenced this pull request May 22, 2025
gardener#12057 removed the aggregation, which helps prevent visual artifacts and spikes; the change in color also makes container restarts visible.

However, if there are several restarts, the legend is polluted with the
"actual usage" series: one entry for each restart.

One mitigation is to hide the series from the legend.
istvanballok added a commit to istvanballok/gardener that referenced this pull request May 22, 2025
If the container is restarted frequently, the legend is polluted with
entries for each "actual usage" series, because the aggregation
was removed in gardener#12057.

Removing the aggregation has several benefits: it avoids the spike artifacts, and the change in color shows the container restarts.

By sorting the legend, we can mitigate the problem of polluted
legend entries.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Victor Herrero Otal <victor.herrero.otal@sap.com>
istvanballok added a commit to istvanballok/gardener that referenced this pull request May 22, 2025
Similar to gardener#12057,
the aggregation could be removed for the CPU usage query as well.

Benefit: the change of color shows when the container restarts.