Conversation

voelzmo
Member

@voelzmo voelzmo commented May 13, 2025

How to categorize this PR?

/area monitoring
/area autoscaling
/kind bug

What this PR does / why we need it:
Remove the sum statements from the VPA Pod metrics and display the individual Pods instead. This gets a little messier for things like the requests: instead of one connected graph you end up with several discrete sections. But it is also more accurate, even when other issues exist with metrics scraping or label configuration.

Previously, we sometimes saw somewhat misleading graphs like this one, showing short spikes of twice the actual memory requests. The reason is that when a Pod was evicted and a new one was created, there was a temporary state in which data for two Pods was visible, and the values got summed up. For Prometheus the Pod identifier is always prometheus-shoot-0, hence this effect.
Screenshot 2025-05-13 at 09 03 03

In other cases, the data showed double the target recommendation, because there were two data points with different origins (e.g. shoot and seed).

Screenshot 2025-05-13 at 09 17 01

All these issues could be addressed in the dashboard itself, for example by summing by UID (if you want to keep a sum statement) and by ensuring that we don't record the same metrics twice from the seed and the shoot perspective. As long as the sum statements stay in place, however, we don't even notice such problems; we just get confused by values that don't match reality.
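To illustrate the point (a sketch only; the metric name and label selectors below are illustrative assumptions, not the exact queries from the 'recommendations' dashboard), the change amounts to dropping the sum aggregation, or, if an aggregation must be kept, grouping by a unique Pod identifier such as the UID:

```promql
# Before: sums all matching series. During an eviction, the old and the
# new Pod briefly report at the same time, so their values are added up
# and the graph shows twice the actual requests.
sum(kube_pod_container_resource_requests{resource="memory", pod=~"prometheus-shoot-.*"})

# After: one series per Pod, so overlapping data shows up as separate,
# discrete sections instead of an inflated sum.
kube_pod_container_resource_requests{resource="memory", pod=~"prometheus-shoot-.*"}

# If a sum statement is desired: aggregating by the Pod UID keeps one
# series per Pod instance instead of collapsing old and new Pod into one.
sum by (uid) (kube_pod_container_resource_requests{resource="memory", pod=~"prometheus-shoot-.*"})
```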

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Remove sum for VPA Pod metrics in 'recommendations' dashboard

@gardener-prow gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/bug Bug labels May 13, 2025
Contributor

gardener-prow bot commented May 13, 2025

@voelzmo: The label(s) area/autoscaling cannot be applied, because the repository doesn't have them.

In response to this:

How to categorize this PR?

/area monitoring
/area autoscaling
/kind bug


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gardener-prow gardener-prow bot added the cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. label May 13, 2025
@gardener-prow gardener-prow bot requested review from oliver-goetz and tobschli May 13, 2025 07:27
@gardener-prow gardener-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 13, 2025
Member

@istvanballok istvanballok left a comment

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2025
Contributor

gardener-prow bot commented May 13, 2025

LGTM label has been added.

Git tree hash: c97927cdb758c90bd846c844f75dc9a7e99f04b9

@rfranzke
Member

/approve

Contributor

gardener-prow bot commented May 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2025
@voelzmo
Member Author

voelzmo commented May 13, 2025

/label area/auto-scaling

Contributor

gardener-prow bot commented May 13, 2025

@voelzmo: The label(s) /label area/auto-scaling cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, ipcei/oidc, ipcei/workload-identity, skip-review. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label area/auto-scaling

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gardener-prow gardener-prow bot merged commit 5502667 into gardener:master May 13, 2025
19 checks passed
@oliver-goetz
Member

/area auto-scaling

@gardener-prow gardener-prow bot added the area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related label May 13, 2025
istvanballok added a commit to istvanballok/gardener that referenced this pull request May 22, 2025
gardener#12057 removed the aggregation, which helps prevent visual artifacts and spikes; the change in color also makes container restarts visible.

However, if there are several restarts, the legend is polluted with the
"actual usage" series: one entry for each restart.

One mitigation is to hide the series from the legend.
istvanballok added a commit to istvanballok/gardener that referenced this pull request May 22, 2025
If the container is restarted frequently, the legend is polluted with
entries for each "actual usage" series, because the aggregation
was removed in gardener#12057.

Removing the aggregation has several benefits: it avoids the spike artifacts, and the change in color shows the container restarts.

By sorting the legend, we can mitigate the problem of polluted
legend entries.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Victor Herrero Otal <victor.herrero.otal@sap.com>
istvanballok added a commit to istvanballok/gardener that referenced this pull request May 22, 2025
Similar to gardener#12057,
the aggregation could be removed for the CPU usage query as well.

Benefit: the change of color shows when the container restarts.