Remove sum for VPA Pod metrics in 'recommendations' dashboard #12057
Conversation
/lgtm
LGTM label has been added. Git tree hash: c97927cdb758c90bd846c844f75dc9a7e99f04b9
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rfranzke
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/label area/auto-scaling
@voelzmo: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/area auto-scaling
gardener#12057 removed the aggregation, which is useful to prevent visual artifacts and spikes; the change in color also shows container restarts. However, if there are several restarts, the legend is polluted with the "actual usage" series: one entry for each restart. One mitigation is to hide the series from the legend.

If the container is restarted frequently, the legend is polluted with an entry for each "actual usage" series, because the aggregation was removed in gardener#12057. Removing the aggregation has several benefits: it avoids the spike artifacts, and the change in color shows container restarts. By sorting the legend, we can mitigate the problem of polluted legend entries. Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Victor Herrero Otal <victor.herrero.otal@sap.com>

Similar to gardener#12057, the aggregation could be removed for the CPU usage query as well. Benefit: the change of color shows when the container restarts.
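For the CPU usage panel, the analogous change might look like the following sketch. It uses the standard cAdvisor metric name as an assumption; the dashboard's actual query and label selectors may differ:

```promql
# Before: one aggregated series across all matching containers,
# so a restart is not visible as a break in the graph.
sum(rate(container_cpu_usage_seconds_total{container="prometheus"}[5m]))

# After: one series per container instance; a restart shows up as one
# series ending and a new one starting (rendered in a new color).
rate(container_cpu_usage_seconds_total{container="prometheus"}[5m])
```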
How to categorize this PR?
/area monitoring
/area autoscaling
/kind bug
What this PR does / why we need it:
Remove the `sum` statements from the VPA Pod metrics and display the individual Pods instead. This gets a little messier for things like the `requests`, as you don't end up with a connected graph, but with several discrete sections. But it is also more accurate, even when other issues exist with metrics scraping or label configuration.

Previously, we sometimes saw slightly misleading graphs showing small spikes of twice the actual memory requests. This is due to the fact that when a Pod was evicted and a new one was created, there was a temporary state in which you were seeing data for two Pods, which got summed up. For `prometheus`, the `Pod` identifier is always `prometheus-shoot-0`, therefore we have this effect.

In other cases, the data showed the sum of the `target` recommendation, because we had two data points with different `origins` (e.g. `shoot` and `seed`).

All these issues can be solved by adjusting things in the dashboard, like switching to a `sum` by UID (if you want to keep a `sum` statement) and ensuring that we don't record the same things twice from the `seed` and `shoot` perspective. However, if we keep the `sum` statements, we don't even see this, but rather get confused with values which don't match reality.

Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
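For illustration, the kind of query change this PR describes could look as follows. This is a sketch only: the metric and label names are the standard kube-state-metrics ones and may differ from what the dashboard actually uses:

```promql
# Before: while an evicted Pod's series and its replacement's series
# briefly overlap, the sum counts both, showing ~2x the real requests.
sum(kube_pod_container_resource_requests{resource="memory", container="prometheus"})

# After: each Pod is displayed as its own series; an eviction shows up
# as one series ending and another starting instead of a summed spike.
kube_pod_container_resource_requests{resource="memory", container="prometheus"}
```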
Release note: