Skip to content

Conversation

vicwicker
Copy link
Member

@vicwicker vicwicker commented May 30, 2025

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

We have recently discovered that some Prometheus volumes still contain the prometheus- folder with old Prometheus data. This folder was used for storage before GEP-19, when the Prometheus operator was introduced. With the operator, the storage folder is now prometheus-db:

> k get pods prometheus-shoot-0 -o json | jq -r '
    (.spec.volumes[] | select(.name == "prometheus-db")),
    (.spec.containers[].volumeMounts[] | select(.name == "prometheus-db"))'
{
  "name": "prometheus-db",
  "persistentVolumeClaim": {
    "claimName": "prometheus-db-prometheus-shoot-0"
  }
}
{
  "mountPath": "/prometheus",
  "name": "prometheus-db",
  "subPath": "prometheus-db"
}

In the past, the prometheus- subpath was mounted instead. For example, in the seed.

GEP-19 included a migration to transfer data from the old folder to the new one. However, it appears there are remnants where the prometheus- folder is either a sibling of the new prometheus-db or a subfolder. Unfortunately, pinpointing the exact sequence of steps that led to this situation is now very challenging.

In some cases, the old prometheus- folder has been observed consuming up to 7GB of data. This poses a risk of Prometheus running out of disk space, as it operates, for example, in a control plane Prometheus, on a 20GB volume with a disk retention setting of 15GB.

This PR introduces an init container to Prometheus to remove the obsolete prometheus- folder. However, this init container is injected differently based on the Prometheus type:

  • In shoot Prometheus instances, where reconciliation cannot be guaranteed to successfully occur, such init containers are injected during the startup of the gardenlet. Hibernated shoot Prometheus instances are accordingly woken up to ensure the cleanup runs. The gardenlet patches Prometheus custom resources on-the-fly, bypassing the previously ignored ManagedResource. The GRM reverts these patches once the ManagedResource is unignored upon completion. Finally, an annotation marks those shoot Prometheus instances successfully cleaned up:
monitoring.resources.gardener.cloud/prometheus-obsolete-folder-cleaned-up: "true"
  • Other Prometheus instances include the aforementioned init container as part of their reconciliation.

These cleanup steps will be undone in upcoming releases.

Special notes for your reviewer:

/cc @istvanballok @rickardsjp @chrkl @rfranzke

Release note:

Clean up obsolete `prometheus-` folder from Prometheus volumes. This might be a leftover of GEP-19

Copy link
Contributor

gardener-prow bot commented May 30, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot requested review from chrkl and istvanballok May 30, 2025 11:39
@gardener-prow gardener-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2025
@gardener-prow gardener-prow bot requested review from rfranzke and rickardsjp May 30, 2025 11:39
@gardener-prow gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 30, 2025
@vicwicker vicwicker force-pushed the delete-old-prometheus-folder branch 2 times, most recently from 4bc68cd to 4ee8da4 Compare May 30, 2025 14:23
@vicwicker vicwicker marked this pull request as ready for review May 30, 2025 15:56
@gardener-prow gardener-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2025
@vicwicker vicwicker force-pushed the delete-old-prometheus-folder branch 2 times, most recently from 43562e6 to 87d3e2f Compare June 2, 2025 11:18
Copy link
Member

@istvanballok istvanballok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2025
@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2025
@gardener-prow gardener-prow bot requested a review from istvanballok June 2, 2025 12:25
@vicwicker vicwicker force-pushed the delete-old-prometheus-folder branch from 87d3e2f to 26dd380 Compare June 17, 2025 12:43
@gardener-prow gardener-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 17, 2025
@vicwicker vicwicker requested a review from rfranzke June 17, 2025 12:51
@vicwicker vicwicker force-pushed the delete-old-prometheus-folder branch 2 times, most recently from ccb35ba to 2dfa820 Compare June 17, 2025 13:04
@vicwicker
Copy link
Member Author

/retest-required

vicwicker and others added 15 commits June 26, 2025 17:36
This commit is only the skeleton to clean up shoot Prometheus folders as
part of the startup of the gardenlet. It iterates over the Cluster
resources in the seed to find shoots, but executes nothing yet.
Shoot Prometheus instances are actually managed by the GRM. This means
any patches directly applied to the Prometheus custom resources will be
instantly reverted unless the corresponding ManagedResource is ignored
via annotation:

    resources.gardener.cloud/ignore="true"

Upon completion, the ManagedResource can be unignored for the GRM to
reset the Prometheus state. This is done in a deferred call to make sure
the annotation is always removed, also in case of failures.

If the gardenlet process is killed before the cleanup tasks are
complete, the ignore annotation could be leaked. To guarantee the state
is eventually consistent, the gardenlet makes sure to remove ignore
annotations before start up.
This commit prepares the shoot Prometheus cleanup by creating an empty
patch for Prometheus resources, applying it and waiting for shoot
Prometheus pods to be in running state.

The following commits will populate such empty patch.
This is necessary for the gardenlet to be able to read and patch
Prometheus resources.
This commit makes sure all shoot Prometheus instances receive the init
container that removes the obsolete "prometheus-" folders. Such init
container is only injected if it is still not present in the Prometheus
resource.
This guarantees the init container runs for hibernated shoots. For
running shoots, setting the number of replicas to 1 is a no-op.
If there are many hibernated shoots that need to be woken up, new nodes
will need to be spun up, delaying the startup of the gardenlet in few
minutes.

This commit improves this by letting this cleanup run in the background.
Note this is implemented in a way that it can never return an error.
Failures can still happen and errors are properly logged but they are
not propagated back to the gardenlet.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
This commit uses a temporal annotation to mark shoot Prometheus
instances already cleaned up. This way, the gardenlet doesn't repeatedly
attempt to clean up every shoot Prometheus on every restart.

This temporal annotation will be removed in a second migration step.
This way, the gardenlet startup does not execute the cleanup for new
shoot Prometheus instances.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
We need to preserve when a shoot Prometheus resource is marked as
cleaned up so that gardenlet restarts do not keep attempting to clean up
Prometheus instances that were already done.

When a new shoot Prometheus instance is deployed, the GRM detects the
Prometheus resource does not exist yet and creates it directly with the
cleaned up mark. However, the first time the shoot is reconciled the
Prometheus resource exists and the mark is reverted. This is not the
case for migrated shoot Prometheus instances, where the annotation is
owned by the gardenlet.

To fix this, this commit makes sure the GRM preserves the annotation
across shoot reconciliations.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Ignoring the errors will guarantee cleaning up the obsolete
"prometheus-" folders won't interfere with the startup of the gardenlet.
However, it won't guarantee that all Prometheus volumes will be
successfully cleaned up once the migration code is removed.

With this commit, we favor ensuring the obsolete folders are correctly
cleaned up and accept the additional overhead in the gardenlet startup.
If an error occurs, now the gardenlet will fail and restart.
After returning errors in previous commit, a new error might be returned
if the Cluster resource was not yet installed in the seed when the
gardenlet runs.
The retry method does not log returned errors between retries. This
commit improves this by logging the error before returning.
Ensure proper error handling when unignoring a ManagedResource:

- Log potential unignore errors but return the original error if cleanup
  fails.
- Return the unignore error if cleanup succeeds but unignore fails.

This update removes the defer call to unignore the ManagedResource,
allowing explicit logging or returning of unignore errors as needed. To
enhance code readability and minimize error checks, the unignore
operation is now performed as late as possible.
@vicwicker vicwicker force-pushed the delete-old-prometheus-folder branch from c28b27d to d1e7cd5 Compare June 27, 2025 07:52
@vicwicker vicwicker requested a review from rfranzke June 27, 2025 09:55
Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jun 27, 2025
Copy link
Contributor

gardener-prow bot commented Jun 27, 2025

LGTM label has been added.

Git tree hash: 282e783f1fce60ecab3b9e717796615f1f9302ad

Copy link
Contributor

gardener-prow bot commented Jun 27, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2025
@gardener-prow gardener-prow bot merged commit 3b6c17c into gardener:master Jun 27, 2025
19 checks passed
istvanballok added a commit to istvanballok/gardener that referenced this pull request Aug 18, 2025
gardener#12219)"

This reverts commit 3b6c17c.

git fetch origin pull/12219/head # d1e7cd5
git fetch origin pull/12728/head # 4b16e43
gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}')

Conflicts:
  cmd/gardenlet/app/migration.go
    Resolve conflicts manually by removing the function definition and
    function call that was added in the original PR.
  pkg/apis/resources/v1alpha1/types.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/component.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/prometheus.go
    Manual conflict resolution.
vicwicker added a commit to vicwicker/gardener that referenced this pull request Aug 18, 2025
gardener#12219)"

This reverts commit 3b6c17c.

git fetch origin pull/12219/head # d1e7cd5
git fetch origin pull/12728/head # 4b16e43
gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}')

Conflicts:
  cmd/gardenlet/app/migration.go
    Resolve conflicts manually by removing the function definition and
    function call that was added in the original PR.
  pkg/apis/resources/v1alpha1/types.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/component.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/prometheus.go
    Manual conflict resolution.
vicwicker added a commit to vicwicker/gardener that referenced this pull request Aug 18, 2025
gardener#12219)"

This reverts commit 3b6c17c. The
Prometheus annotations added in the original commit are not
automatically removed by this revert. There will be a follow up commit
to remove them.

git fetch origin pull/12219/head # d1e7cd5
git fetch origin pull/12728/head # 4b16e43
gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}')

Conflicts:
  cmd/gardenlet/app/migration.go
    Resolve conflicts manually by removing the function definition and
    function call that was added in the original PR.
  pkg/apis/resources/v1alpha1/types.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/component.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/prometheus.go
    Manual conflict resolution.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
gardener-prow bot pushed a commit that referenced this pull request Aug 20, 2025
…up migration (#12728)

* Revert "Clean up obsolete `prometheus-` folder from Prometheus volumes (#12219)"

This reverts commit 3b6c17c. The
Prometheus annotations added in the original commit are not
automatically removed by this revert. There will be a follow up commit
to remove them.

git fetch origin pull/12219/head # d1e7cd5
git fetch origin pull/12728/head # 4b16e43
gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}')

Conflicts:
  cmd/gardenlet/app/migration.go
    Resolve conflicts manually by removing the function definition and
    function call that was added in the original PR.
  pkg/apis/resources/v1alpha1/types.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/component.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/prometheus.go
    Manual conflict resolution.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>

* Remove Prometheus annotation from gardenlet startup

This commit introduces a new migration step during the gardenlet startup
to remove the Prometheus annotation used in the previous migration step
to mark it as cleaned up.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>

---------

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
@vicwicker vicwicker deleted the delete-old-prometheus-folder branch August 25, 2025 09:31
Duciwuci pushed a commit to stackitcloud/gardener that referenced this pull request Sep 1, 2025
…up migration (gardener#12728)

* Revert "Clean up obsolete `prometheus-` folder from Prometheus volumes (gardener#12219)"

This reverts commit 3b6c17c. The
Prometheus annotations added in the original commit are not
automatically removed by this revert. There will be a follow up commit
to remove them.

git fetch origin pull/12219/head # d1e7cd5
git fetch origin pull/12728/head # 4b16e43
gitk HEAD 4b16e43 d1e7cd5 ^3b6c17c808^ -- $(git log --oneline -1 --numstat 3b6c17c | tail -n +2 | awk '{print $3}')

Conflicts:
  cmd/gardenlet/app/migration.go
    Resolve conflicts manually by removing the function definition and
    function call that was added in the original PR.
  pkg/apis/resources/v1alpha1/types.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/component.go
    Manual conflict resolution.
  pkg/component/observability/monitoring/prometheus/prometheus.go
    Manual conflict resolution.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>

* Remove Prometheus annotation from gardenlet startup

This commit introduces a new migration step during the gardenlet startup
to remove the Prometheus annotation used in the previous migration step
to mark it as cleaned up.

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>

---------

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/monitoring Monitoring (including availability monitoring and alerting) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants