Upgrade Prometheus to v3.2 #11552
Conversation
Skipping CI for Draft Pull Request.
/hold I just found by chance that the label
Does it make sense to jump to
The unit tests were originally misconfigured: they query a test time series that only exists for 10 minutes and then goes stale for 5 additional minutes. Note that such a time series is defined as `0x20` while the evaluation interval is 30 seconds. Furthermore, the tests would check the alert after 15 minutes even though it triggers after 10 minutes. It seems that, in this situation, the alert remains triggered after 15 minutes in Prometheus v2 but not in Prometheus v3. This commit adjusts the tests to evaluate the alert after 10 minutes, in line with its configuration. This resolves the issue in Prometheus v3 where the alert does not remain triggered after 15 minutes.
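The kind of adjustment described above can be illustrated with a hypothetical `promtool` rules unit test; the rule, series, and alert names here are invented, not taken from the actual tests:

```yaml
# Sketch of a Prometheus rules unit test (promtool test rules); all names are
# illustrative assumptions.
rule_files:
  - alert.rules.yaml

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # "0x20" expands to the value 0 repeated for 20 further intervals,
      # i.e. the series exists for 10 minutes at a 30s interval and then
      # goes stale.
      - series: 'up{job="example"}'
        values: '0x20'
    alert_rule_test:
      # Evaluate at 10m, when the alert has just fired, rather than at 15m,
      # when the series is already stale.
      - alertname: ExampleTargetDown
        eval_time: 10m
        exp_alerts:
          - exp_labels:
              job: example
```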
(force-pushed from 79b4f60 to 1b5f1ca)
If the scrape URL is provided using static configurations, Prometheus v2 will try to add the port based on the protocol, while Prometheus v3 will not. This is an issue with the seed Prometheus where the "instance" label changes from "prometheus-cache.garden.svc:80" to "prometheus-cache.garden.svc" after the upgrade. This commit updates the seed Prometheus to use service discovery to federate from the cache Prometheus in the seed. This not only resolves the label mismatch but will also make federation configurations consistent in the following commits.
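The change from a static configuration to service discovery can be sketched roughly as follows; the job name, namespace, and label selector are illustrative assumptions, not the actual Gardener configuration:

```yaml
# Hypothetical federation scrape config; names are assumptions.
scrape_configs:
  - job_name: federate
    metrics_path: /federate
    honor_labels: true
    params:
      'match[]':
        - '{job=~".+"}'   # illustrative match expression
    # Before (static configuration): Prometheus v2 derived the "instance"
    # label as "prometheus-cache.garden.svc:80", Prometheus v3 drops the port.
    # static_configs:
    #   - targets: [ prometheus-cache.garden.svc ]
    # After: discover the cache Prometheus endpoints instead, so the
    # "instance" label is derived consistently from the discovered address.
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [ garden ]
    relabel_configs:
      - source_labels: [ __meta_kubernetes_service_name ]
        action: keep
        regex: prometheus-cache
```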
This commit is analogous to the previous one, but for the shoot Prometheus.
(force-pushed from 1b5f1ca to 986bd3a)
I found that the seed, shoot and longterm Prometheus are affected because they used a static configuration for federation. The aggregate Prometheus has service discovery configured and was not affected. I have now changed the seed, shoot and longterm Prometheus configurations to use service discovery for federation, as the aggregate Prometheus does. With this, I expect no change in the
/unhold
Thanks! I missed that there was a patched version :)
(force-pushed from 986bd3a to 413551b)
The longterm Prometheus is also affected by the label change when federating from the garden Prometheus. In this case, the "instance" label changes from "prometheus-garden:80" to "prometheus-garden". Using service discovery will change the "instance" label to "prometheus-garden.garden.svc:80", and it will also make it consistent with the other federation configurations.
(force-pushed from 413551b to d11c0f4)
Thank you :)
/lgtm
LGTM label has been added. Git tree hash: 5f6d3d78a0d78d5b89925d09a7533f96821e74c6
/lgtm
/approve
Thanks for taking care and +1 for the well-structured PR! 👏
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: marc1404. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
How to categorize this PR?
/area monitoring
/kind enhancement
What this PR does / why we need it:
In order to test this major upgrade, we followed the upgrade migration guide provided by Prometheus, which outlines the major changes involved in upgrading to Prometheus v3.x. Based on our assessment, there is only one potential breaking change for Gardener: the case where the scrape protocol from a scrape endpoint is invalid.

The scrape protocol sets the data format that the endpoint uses to expose metrics. In short, when the scrape protocol is invalid, Prometheus v2.x falls back to using the default text protocol, while Prometheus v3.x fails the scrape. Prometheus v3.x introduces a new `fallback_scrape_protocol` configuration property to restore the old behaviour of Prometheus v2.x, but it must be set for each scrape configuration, service monitor, or pod monitor. While we could use this property to avoid being affected by this breaking change, adding it to every scrape configuration feels a bit overwhelming for Gardener, where there are many scrape configurations. Therefore, we opted to validate during testing that none of the URLs scraped by the Prometheus instances managed by Gardener return an invalid scrape protocol.

**Details on how we checked that the scrape protocols are valid**
The scrape protocol information is available in the `Content-Type` field in the response headers from the scrape URL. The list of URLs a Prometheus instance scrapes can be retrieved from its API: `http://<prometheus-ip>:9090/api/v1/targets`. As a general example, the following snippet can be run on a debug pod to iterate over all scrape URLs for a certain Prometheus instance and show the `Content-Type` field in their responses, effectively listing the scrape protocols they use. Running this in a debug pod requires it to have the same network policies and credentials as the Prometheus instance in order to reach and access the scrape URLs. Note that these network policies and credentials differ for each Prometheus instance. By adjusting this snippet to each Prometheus instance and executing it, we observed that all responses contained either
`Content-Type: text/plain; version=0.0.4` or `Content-Type: text/plain`, both of which are valid protocols.

Additionally, we validated the following on a local deployment of Gardener with Prometheus v3.x, among other checks:
Special notes for your reviewer:
/cc @istvanballok @chrkl @rickardsjp
Release note: