Skip to content

Conversation

vicwicker
Copy link
Member

@vicwicker vicwicker commented Jan 9, 2025

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR adds a new set of alerts that fire as soon as a VPA recommendation is capped. A capped VPA recommendation will typically occur when the target recommendation exceeds the maximum allowed. In such a situation, the VPA will show a target recommendation that matches the maximum allowed and an uncapped target recommendation with the proposed value. This approach aims at making it more visible when a container's resource usage is abnormal.

These alerts cover every VPA deployed by Gardener. These can be found in both the kube-system and control-plane namespaces of shoots, in seeds and in garden clusters. In consequence, these alerts need to be defined in the shoot, aggregate and garden Prometheus. Not only that, the alerts from the shoot and aggregate Prometheus are federated into the garden Prometheus, where they are enriched with additional labels such as the landscape and sent via the garden Alertmanager. However, the alerts sent by the garden Prometheus, including the federated ones, serve as a minimal notification that a VPA recommendation hit its cap. Once operators receive this notification, they can query the garden Prometheus to find more details. This way, we avoid alert saturation in case multiple VPA recommendations across different components, seeds and shoots are capped.

This PR deploys such alerts in the corresponding Prometheus but they are not yet processed by the Alertmanagers. First, we want to assess how frequently they trigger. Thus, there will be a follow up PR enabling them into the Alertmanagers by setting the corresponding labels.

Note that #9934 removes the maxAllowed property from the API server's VPA. While this change is likely to decrease the frequency of this alert triggering, we believe it is still important to assert that this condition remains true in the future.

Special notes for your reviewer:

/cc @istvanballok @chrkl @rickardsjp

Release note:

Add alerts for capped VPA recommendations

@gardener-prow gardener-prow bot requested a review from istvanballok January 9, 2025 16:11
@gardener-prow gardener-prow bot added the area/monitoring Monitoring (including availability monitoring and alerting) related label Jan 9, 2025
Copy link
Contributor

gardener-prow bot commented Jan 9, 2025

@vicwicker: GitHub didn't allow me to request PR reviews from the following users: chrkl, rickardsjp.

Note that only gardener members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR adds a new set of alerts that fire as soon as a VPA recommendation is capped. A capped VPA recommendation will typically arise when the target recommendation exceeds the maximum allowed. In such a situation, the VPA will show a target recommendation that matches the maximum allowed and an uncapped target recommendation with the value it would set. Because a VPA attempts to learn the resource usage of a specific container, this approach aims at making it more visible when such usage is abnormal.

These alerts cover every VPA deployed by Gardener. These can be found in both the kube-system and control-plane namespaces of shoots, in seeds and in garden clusters. In consequence, these alerts need to be defined in the shoot, aggregate and garden Prometheus. Not only that, the alerts from the shoot and aggregate Prometheus are federated into the garden Prometheus, where they are enriched with additional labels such as the landscape and sent via the garden Alertmanager. The alerts sent by the garden Prometheus (including the federated ones), nevertheless, are a minimal notification that a VPA recommendation is capped. Once operators receive this notification, they can query the garden Prometheus to find more details. This way, we avoid overwhelming alerts in case multiple VPA recommendations across different components, seeds and shoots are capped.

Last but not least, this PR deploys such alerts in the corresponding Prometheus, but they are not yet processed by the Alertmanagers. First, we want to assess how frequently they trigger. Thus, there will be a follow up PR enabling them into the Alertmanagers by setting the corresponding labels.

Special notes for your reviewer:

/cc @istvanballok @chrkl @rickardsjp

Release note:

Add alerts for capped VPA recommendations

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gardener-prow gardener-prow bot added kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 9, 2025
@vicwicker
Copy link
Member Author

@vicwicker: GitHub didn't allow me to request PR reviews from the following users: chrkl, rickardsjp.

Try again?
/cc @chrkl @rickardsjp

Copy link
Contributor

gardener-prow bot commented Jan 9, 2025

@vicwicker: GitHub didn't allow me to request PR reviews from the following users: chrkl, rickardsjp.

Note that only gardener members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@vicwicker: GitHub didn't allow me to request PR reviews from the following users: chrkl, rickardsjp.

Try again?
/cc @chrkl @rickardsjp

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrkl
Copy link
Member

chrkl commented Jan 13, 2025

/lgtm

Copy link
Contributor

gardener-prow bot commented Jan 13, 2025

@chrkl: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrkl
Copy link
Member

chrkl commented Jan 13, 2025

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2025
Copy link
Contributor

gardener-prow bot commented Jan 13, 2025

LGTM label has been added.

Git tree hash: 5eb69209c4371d009596793a2dbd24466a3afcfc

@gardener-prow gardener-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2025
@vicwicker vicwicker force-pushed the add-capped-recommendation-alerts branch from 3475d90 to 94ad340 Compare January 15, 2025 09:58
@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 15, 2025
@gardener-prow gardener-prow bot requested a review from chrkl January 15, 2025 09:58
@gardener-prow gardener-prow bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 15, 2025
This reduces the diff in the following commits by sorting them.
This reduces the diff in the following commits.
This alert is deployed twofold. First, within the aggregate Prometheus,
which in turn is configured to forward alerts to the seed Alertmanager.
Second, this alert is also federated from the aggregate into the garden
Prometheus. The former alert is routed via the default
`email-kubernetes-ops` route, while the latter allows for further
notifications from the garden Prometheus.
This alert is also deployed twofold. First, within the shoot Prometheus,
which in turn is configured to forward alerts to the seed Alertmanager.
The shoot Alertmanager does not play a role here because this alert is
not shared with shoot owners (i.e., the "visibility = operator" label).
Second, the alert is also federated into the garden Prometheus for
further notifications.

The namespace label is used to distinguish between the resources
deployed in the shoot kube-system namespace or in the control-plane
namespace in the seed. It has the "kube-system" value for the former
case and "shoot-control-plane" for the latter.
@vicwicker vicwicker force-pushed the add-capped-recommendation-alerts branch from 94ad340 to 19ac840 Compare January 15, 2025 10:00
@oliver-goetz
Copy link
Member

Thanks 😄

/approve

Copy link
Contributor

gardener-prow bot commented Jan 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: oliver-goetz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 15, 2025
@chrkl
Copy link
Member

chrkl commented Jan 15, 2025

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jan 15, 2025
Copy link
Contributor

gardener-prow bot commented Jan 15, 2025

LGTM label has been added.

Git tree hash: e05968aa0e43b14c8c1edbe09dcbbaa555d1e8e1

@gardener-prow gardener-prow bot merged commit 7aeaf6c into gardener:master Jan 15, 2025
19 checks passed
@vicwicker vicwicker deleted the add-capped-recommendation-alerts branch March 6, 2025 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/monitoring Monitoring (including availability monitoring and alerting) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants