Closed
Labels
- area/dev-productivity: Developer productivity related (how to improve development)
- area/ipcei: IPCEI (Important Project of Common European Interest)
- area/monitoring: Monitoring (including availability monitoring and alerting) related
- kind/enhancement: Enhancement, improvement, extension
- kind/epic: Large multi-story topic
Description
How to categorize this issue?
/area dev-productivity monitoring
/kind enhancement
What would you like to be added:
The monitoring stack should be migrated from the current custom-built Helm charts to the `prometheus-operator`, as proposed in GEP-19.
Why is this needed:
GEP-19 was accepted and merged a long time ago, so we should complete its implementation. In addition, the garden cluster (managed via `gardener-operator`, ref #7016) does not have a monitoring stack yet. The migration also improves development productivity by cleaning up technical debt and improving the code.
Tasks:
- Deployment of `prometheus-operator`
  - Golang component package: [GEP-19] Introduce `prometheus-operator` in garden and seed clusters #9067
  - Deployment to garden cluster via `gardener-operator`: [GEP-19] Introduce `prometheus-operator` in garden and seed clusters #9067
  - Deployment to seed clusters via `gardenlet`: [GEP-19] Introduce `prometheus-operator` in garden and seed clusters #9067
- Prometheis/Alertmanager responsible for seed cluster
- Prometheis/Alertmanager responsible for garden cluster
- Prometheus/Alertmanager responsible for shoot clusters
  - [GEP-19] Migrate shoot Alertmanager deployment and configuration #9257
    - [GEP-19] [shoot-care] Check for `alertmanager-shoot` when Gardener >= 1.90 #9335
    - [GEP-19] Add namespace to value of PVC migration label in `PersistentVolume`s #9338
    - [GEP-19] Add dedicated namespace label for PVCs in migration #9344
    - [GEP-19] Ensure legacy resource cleanup for shoot Alertmanager is executed #9416
  - [GEP-19] Migrate shoot Prometheus deployment #9695
  - Adapt shoot Prometheus configuration (scrape configs, rules, ...) for components deployed by `gardenlet`
- Adapt how extensions provide their observability configuration
  - `provider-alicloud`: [GEP-19] Adapt monitoring configuration gardener-extension-provider-alicloud#720
  - `provider-aws`: [GEP-19] Adapt monitoring configuration gardener-extension-provider-aws#946
  - `provider-azure`: [GEP-19] Adapt monitoring configuration gardener-extension-provider-azure#853
  - `provider-gcp`: [GEP-19] Adapt monitoring configuration gardener-extension-provider-gcp#754
  - `provider-openstack`: [GEP-19] Adapt monitoring configuration gardener-extension-provider-openstack#766
  - `provider-equinix-metal`: [GEP-19] Adapt monitoring configuration gardener-extension-provider-equinix-metal#307
  - `networking-calico`: [GEP-19] Adapt monitoring configuration gardener-extension-networking-calico#394
  - `networking-cilium`: [GEP-19] Adapt monitoring configuration gardener-extension-networking-cilium#307
  - `shoot-cert-service`: [GEP-19] Adapt monitoring configuration gardener-extension-shoot-cert-service#257
  - `shoot-oidc-service`: [GEP-19] Adapt monitoring configuration gardener-extension-shoot-oidc-service#193
  - `shoot-lakom-service`: [GEP-19] Adapt monitoring configuration gardener-extension-shoot-lakom-service#87
  - `shoot-networking-problemdetector`: [GEP-19] Adapt monitoring configuration gardener-extension-shoot-networking-problemdetector#142
  - `shoot-rsyslog-relp`: [GEP-19] Adapt monitoring configuration gardener-extension-shoot-rsyslog-relp#99
  - `registry-cache`: [GEP-19] Switch to the new contract of providing monitoring configuration gardener-extension-registry-cache#187
- Miscellaneous
  - [GEP-19] Extend health checks of `gardener-resource-manager` for new `Prometheus` and `Alertmanager` resources #9163
  - Consider deployment of admission webhook server (abandoned for now due to other, more important topics)
  - [GEP-19] Add sidecar to Plutono for fetching dashboard `ConfigMap`s dynamically #9624
General notes for the migration (taken from #6319):
- Add temporary migration code for the persistent volume. This ensures that no data is lost.
  - Find the "old" `pvc` and its `pv` and set `persistentVolumeReclaimPolicy=Retain`.
  - Delete the "old" `pvc`.
  - Create a Prometheus object with a `volumeClaimTemplate` that references the `pv` with `volumeName=<existing-pv>`.
  - Migrate the data using an init container.
  - Remove the migration code after 1-2 releases.
- Add all existing Prometheus configuration to an `additionalScrapeConfig`. This will allow us to switch to the `prometheus-operator` without creating `PodMonitors` and `ServiceMonitors` for each component and instead do that migration step by step.
- Add all extension Prometheus configuration to the same `additionalScrapeConfig`. This will allow extensions time to migrate as well.
- Existing rules should be replaced with `PrometheusRules`.
- Once all of these steps are completed, most of the configuration in the `additionalScrapeConfig` can be migrated to `PodMonitors` and `ServiceMonitors`.
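
The persistent volume hand-over described in the notes above can be sketched as an ordered plan of actions. This is only an illustration in Python on plain dicts shaped like Kubernetes manifests, not the code used in Gardener. The function name, the action tuples, and the Secret name and key for the additional scrape configuration are hypothetical. Clearing `claimRef` after deleting the old claim is an assumption that the notes do not spell out: a retained volume normally keeps a reference to its old claim until that reference is removed. The init container data migration step is left out.

```python
def plan_pv_migration(old_pv, old_pvc, prometheus_name, namespace):
    """Return ordered (verb, kind, name, body) actions for reusing an existing PV.

    Inputs are plain dicts shaped like Kubernetes manifests. Nothing is sent
    to a cluster, and the inputs are not modified.
    """
    pv_name = old_pv["metadata"]["name"]
    old_spec = old_pvc["spec"]

    # Carry over the claim properties that must match the existing volume.
    claim_spec = {
        key: old_spec[key]
        for key in ("accessModes", "resources", "storageClassName")
        if key in old_spec
    }
    # Pin the new claim to the existing volume (volumeName=<existing-pv>).
    claim_spec["volumeName"] = pv_name

    prometheus = {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "Prometheus",
        "metadata": {"name": prometheus_name, "namespace": namespace},
        "spec": {
            "storage": {"volumeClaimTemplate": {"spec": claim_spec}},
            # Existing scrape configuration is referenced from a Secret.
            # The Secret name and key below are hypothetical.
            "additionalScrapeConfigs": {
                "name": prometheus_name + "-additional-scrape-configs",
                "key": "prometheus.yaml",
            },
        },
    }

    return [
        # Keep the volume and its data when the old claim goes away.
        ("patch", "PersistentVolume", pv_name,
         {"spec": {"persistentVolumeReclaimPolicy": "Retain"}}),
        ("delete", "PersistentVolumeClaim", old_pvc["metadata"]["name"], None),
        # Assumption: drop the stale claim reference so the new claim can bind.
        ("patch", "PersistentVolume", pv_name, {"spec": {"claimRef": None}}),
        ("create", "Prometheus", prometheus_name, prometheus),
    ]


if __name__ == "__main__":
    example_pv = {
        "metadata": {"name": "pv-example"},
        "spec": {"persistentVolumeReclaimPolicy": "Delete"},
    }
    example_pvc = {
        "metadata": {"name": "old-prometheus-claim", "namespace": "example"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": "20Gi"}},
        },
    }
    for verb, kind, name, _ in plan_pv_migration(
        example_pv, example_pvc, "example", "example"
    ):
        print(verb, kind, name)
```

Returning a plan rather than calling an API keeps the order of the steps explicit and easy to review. A real controller would also have to be idempotent across reconciliations, which this sketch does not attempt.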