-
Notifications
You must be signed in to change notification settings - Fork 526
Clean up obsolete prometheus-
folder from Prometheus volumes
#12219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up obsolete prometheus-
folder from Prometheus volumes
#12219
Conversation
Skipping CI for Draft Pull Request. |
4bc68cd
to
4ee8da4
Compare
43562e6
to
87d3e2f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
pkg/component/observability/monitoring/prometheus/prometheus.go
Outdated
Show resolved
Hide resolved
87d3e2f
to
26dd380
Compare
ccb35ba
to
2dfa820
Compare
/retest-required |
This commit is only the skeleton to clean up shoot Prometheus folders as part of the startup of the gardenlet. It iterates over the Cluster resources in the seed to find shoots, but executes nothing yet.
Shoot Prometheus instances are actually managed by the GRM. This means any patches directly applied to the Prometheus custom resources will be instantly reverted unless the corresponding ManagedResource is ignored via annotation: resources.gardener.cloud/ignore="true" Upon completion, the ManagedResource can be unignored for the GRM to reset the Prometheus state. This is done in a deferred call to make sure the annotation is always removed, also in case of failures. If the gardenlet process is killed before the cleanup tasks are complete, the ignore annotation could be leaked. To guarantee the state is eventually consistent, the gardenlet makes sure to remove ignore annotations before start up.
This commit prepares the shoot Prometheus cleanup by creating an empty patch for Prometheus resources, applying it and waiting for shoot Prometheus pods to be in running state. The following commits will populate such empty patch.
This is necessary for the gardenlet to be able to read and patch Prometheus resources.
This commit makes sure all shoot Prometheus instances receive the init container that removes the obsolete "prometheus-" folders. Such init container is only injected if it is still not present in the Prometheus resource.
This guarantees the init container runs for hibernated shoots. For running shoots, setting the number of replicas to 1 is a no-op.
If there are many hibernated shoots that need to be woken up, new nodes will need to be spun up, delaying the startup of the gardenlet in few minutes. This commit improves this by letting this cleanup run in the background. Note this is implemented in a way that it can never return an error. Failures can still happen and errors are properly logged but they are not propagated back to the gardenlet. Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
This commit uses a temporal annotation to mark shoot Prometheus instances already cleaned up. This way, the gardenlet doesn't repeatedly attempt to clean up every shoot Prometheus on every restart. This temporal annotation will be removed in a second migration step.
This way, the gardenlet startup does not execute the cleanup for new shoot Prometheus instances. Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
We need to preserve when a shoot Prometheus resource is marked as cleaned up so that gardenlet restarts do not keep attempting to clean up Prometheus instances that were already done. When a new shoot Prometheus instance is deployed, the GRM detects the Prometheus resource does not exist yet and creates it directly with the cleaned up mark. However, the first time the shoot is reconciled the Prometheus resource exists and the mark is reverted. This is not the case for migrated shoot Prometheus instances, where the annotation is owned by the gardenlet. To fix this, this commit makes sure the GRM preserves the annotation across shoot reconciliations. Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Ignoring the errors will guarantee cleaning up the obsolete "prometheus-" folders won't interfere with the startup of the gardenlet. However, it won't guarantee that all Prometheus volumes will be successfully cleaned up once the migration code is removed. With this commit, we favor ensuring the obsolete folders are correctly cleaned up and accept the additional overhead in the gardenlet startup. If an error occurs, now the gardenlet will fail and restart.
After returning errors in previous commit, a new error might be returned if the Cluster resource was not yet installed in the seed when the gardenlet runs.
The retry method does not log returned errors between retries. This commit improves this by logging the error before returning.
Ensure proper error handling when unignoring a ManagedResource: - Log potential unignore errors but return the original error if cleanup fails. - Return the unignore error if cleanup succeeds but unignore fails. This update removes the defer call to unignore the ManagedResource, allowing explicit logging or returning of unignore errors as needed. To enhance code readability and minimize error checks, the unignore operation is now performed as late as possible.
c28b27d
to
d1e7cd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
LGTM label has been added. Git tree hash: 282e783f1fce60ecab3b9e717796615f1f9302ad
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: rfranzke The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
gardener#12219)" This reverts commit 3b6c17c. git fetch origin pull/12219/head # d1e7cd5 git fetch origin pull/12728/head # 4b16e43 gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}') Conflicts: cmd/gardenlet/app/migration.go Resolve conflicts manually by removing the function definition and function call that was added in the original PR. pkg/apis/resources/v1alpha1/types.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/component.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/prometheus.go Manual conflict resolution.
gardener#12219)" This reverts commit 3b6c17c. git fetch origin pull/12219/head # d1e7cd5 git fetch origin pull/12728/head # 4b16e43 gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}') Conflicts: cmd/gardenlet/app/migration.go Resolve conflicts manually by removing the function definition and function call that was added in the original PR. pkg/apis/resources/v1alpha1/types.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/component.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/prometheus.go Manual conflict resolution.
gardener#12219)" This reverts commit 3b6c17c. The Prometheus annotations added in the original commit are not automatically removed by this revert. There will be a follow up commit to remove them. git fetch origin pull/12219/head # d1e7cd5 git fetch origin pull/12728/head # 4b16e43 gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}') Conflicts: cmd/gardenlet/app/migration.go Resolve conflicts manually by removing the function definition and function call that was added in the original PR. pkg/apis/resources/v1alpha1/types.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/component.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/prometheus.go Manual conflict resolution. Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
…up migration (#12728) * Revert "Clean up obsolete `prometheus-` folder from Prometheus volumes (#12219)" This reverts commit 3b6c17c. The Prometheus annotations added in the original commit are not automatically removed by this revert. There will be a follow up commit to remove them. git fetch origin pull/12219/head # d1e7cd5 git fetch origin pull/12728/head # 4b16e43 gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}') Conflicts: cmd/gardenlet/app/migration.go Resolve conflicts manually by removing the function definition and function call that was added in the original PR. pkg/apis/resources/v1alpha1/types.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/component.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/prometheus.go Manual conflict resolution. Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> * Remove Prometheus annotation from gardenlet startup This commit introduces a new migration step during the gardenlet startup to remove the Prometheus annotation used in the previous migration step to mark it as cleaned up. Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> --------- Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
…up migration (gardener#12728) * Revert "Clean up obsolete `prometheus-` folder from Prometheus volumes (gardener#12219)" This reverts commit 3b6c17c. The Prometheus annotations added in the original commit are not automatically removed by this revert. There will be a follow up commit to remove them. git fetch origin pull/12219/head # d1e7cd5 git fetch origin pull/12728/head # 4b16e43 gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}') Conflicts: cmd/gardenlet/app/migration.go Resolve conflicts manually by removing the function definition and function call that was added in the original PR. pkg/apis/resources/v1alpha1/types.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/component.go Manual conflict resolution. pkg/component/observability/monitoring/prometheus/prometheus.go Manual conflict resolution. Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> * Remove Prometheus annotation from gardenlet startup This commit introduces a new migration step during the gardenlet startup to remove the Prometheus annotation used in the previous migration step to mark it as cleaned up. Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> --------- Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
How to categorize this PR?
/area monitoring
/kind enhancement
What this PR does / why we need it:
We have recently discovered that some Prometheus volumes still contain the
prometheus-
folder with old Prometheus data. This folder was used for storage before GEP-19, when the Prometheus operator was introduced. With the operator, the storage folder is nowprometheus-db
:In the past, the
prometheus-
subpath was mounted instead. For example, in the seed.GEP-19 included a migration to transfer data from the old folder to the new one. However, it appears there are remnants where the
prometheus-
folder is either a sibling of the newprometheus-db
or a subfolder. Unfortunately, pinpointing the exact sequence of steps that led to this situation is now very challenging.In some cases, the old
prometheus-
folder has been observed consuming up to 7GB of data. This poses a risk of Prometheus running out of disk space, as it operates, for example, in a control plane Prometheus, on a 20GB volume with a disk retention setting of 15GB.This PR introduces an init container to Prometheus to remove the obsolete
prometheus-
folder. However, this init container is injected differently based on the Prometheus type:gardenlet
. Hibernated shoot Prometheus instances are accordingly woken up to ensure the cleanup runs. Thegardenlet
patches Prometheus custom resources on-the-fly, bypassing the previously ignored ManagedResource. The GRM reverts these patches once the ManagedResource is unignored upon completion. Finally, an annotation marks those shoot Prometheus instances successfully cleaned up:These cleanup steps will be undone in upcoming releases.
Special notes for your reviewer:
/cc @istvanballok @rickardsjp @chrkl @rfranzke
Release note: