Add alerts for capped VPA recommendations #11136
@vicwicker: GitHub didn't allow me to request PR reviews from the following users: chrkl, rickardsjp. Note that only gardener members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
854bb65 to 6716a4e
pkg/component/observability/monitoring/prometheus/aggregate/prometheusrules_test.go
Outdated, resolved
/lgtm
@chrkl: changing LGTM is restricted to collaborators. In response to this:
/lgtm
LGTM label has been added. Git tree hash: 5eb69209c4371d009596793a2dbd24466a3afcfc
3475d90 to 94ad340
This reduces the diff in the following commits by sorting them.
This reduces the diff in the following commits.
This alert is deployed twofold. First, within the aggregate Prometheus, which in turn is configured to forward alerts to the seed Alertmanager. Second, this alert is also federated from the aggregate into the garden Prometheus. The former alert is routed via the default `email-kubernetes-ops` route, while the latter allows for further notifications from the garden Prometheus.
This alert is also deployed twofold. First, within the shoot Prometheus, which in turn is configured to forward alerts to the seed Alertmanager. The shoot Alertmanager does not play a role here because this alert is not shared with shoot owners (it carries the "visibility = operator" label). Second, the alert is also federated into the garden Prometheus for further notifications. The namespace label distinguishes between resources deployed in the shoot kube-system namespace and those deployed in the control-plane namespace in the seed. It has the value "kube-system" in the former case and "shoot-control-plane" in the latter.
94ad340 to 19ac840
Thanks 😄 /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: oliver-goetz
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
LGTM label has been added. Git tree hash: e05968aa0e43b14c8c1edbe09dcbbaa555d1e8e1
How to categorize this PR?
/area monitoring
/kind enhancement
What this PR does / why we need it:
This PR adds a new set of alerts that fire as soon as a VPA recommendation is capped. A VPA recommendation is typically capped when the target recommendation exceeds the maximum allowed. In that situation, the VPA shows a target recommendation that matches the maximum allowed and an uncapped target recommendation with the proposed value. The aim is to make it more visible when a container's resource usage is abnormal.
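To illustrate the condition described above, here is a minimal Go sketch of how a capped recommendation could be detected by comparing the target with the uncapped target. This is not the code or the alert expression from this PR. The `ResourceList` and `ContainerRecommendation` types are hypothetical simplifications that only loosely mirror the `target` and `uncappedTarget` fields of a VPA container recommendation, and plain `float64` values stand in for Kubernetes quantities. The sketch also assumes that capping only happens at the upper bound.

```go
package main

import "sort"

// ResourceList maps a resource name (e.g. "cpu", "memory") to an amount in
// base units. Hypothetical simplification: float64 stands in for the
// quantity type used by the real VPA API.
type ResourceList map[string]float64

// ContainerRecommendation is a hypothetical, reduced view of a VPA container
// recommendation that keeps only the two fields compared in this sketch.
type ContainerRecommendation struct {
	ContainerName  string
	Target         ResourceList // recommendation after applying the maximum allowed
	UncappedTarget ResourceList // recommendation before applying the maximum allowed
}

// CappedResources returns the sorted names of all resources whose uncapped
// target is strictly greater than the target. Assumption: this models capping
// by the maximum allowed only; capping by a minimum is ignored.
func CappedResources(rec ContainerRecommendation) []string {
	var capped []string
	for name, uncapped := range rec.UncappedTarget {
		target, ok := rec.Target[name]
		if ok && uncapped > target {
			capped = append(capped, name)
		}
	}
	// Map iteration order is random in Go, so sort for stable output.
	sort.Strings(capped)
	return capped
}
```

For example, a recommendation with a CPU target of 2 cores and an uncapped CPU target of 3.5 cores would be reported as capped on `cpu`, while a resource whose target equals its uncapped target would not be reported.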
These alerts cover every VPA deployed by Gardener. Such VPAs can be found in both the kube-system and control-plane namespaces of shoots, in seeds, and in garden clusters. Consequently, these alerts need to be defined in the shoot, aggregate, and garden Prometheus. In addition, the alerts from the shoot and aggregate Prometheus are federated into the garden Prometheus, where they are enriched with additional labels such as the landscape and sent via the garden Alertmanager. However, the alerts sent by the garden Prometheus, including the federated ones, serve only as a minimal notification that a VPA recommendation hit its cap. Once operators receive this notification, they can query the garden Prometheus for more details. This way, we avoid alert saturation when multiple VPA recommendations across different components, seeds, and shoots are capped.
This PR deploys these alerts in the corresponding Prometheus instances, but they are not yet processed by the Alertmanagers. First, we want to assess how frequently they trigger. A follow-up PR will then enable them in the Alertmanagers by setting the corresponding labels.
Note that #9934 removes the `maxAllowed` property from the API server's VPA. While this change is likely to decrease the frequency of this alert triggering, we believe it is still important to assert that this condition remains true in the future.
Special notes for your reviewer:
/cc @istvanballok @chrkl @rickardsjp
Release note: