Skip to content

Conversation

vicwicker
Copy link
Member

@vicwicker vicwicker commented Feb 28, 2025

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

In order to test this major upgrade, we followed the upgrade migration guide provided by Prometheus. This guide outlines the major changes involved in upgrading to Prometheus v3.x. Based on our assessment, there is only one potential breaking change for Gardener when the scrape protocol from a scrape endpoint is invalid.

The scrape protocol sets the data format that the endpoint uses to expose metrics. In short, when the scrape protocol is invalid, Prometheus v2.x falls back to using the default text protocol, while Prometheus v3.x will fail the scrape. Prometheus v3.x introduces a new fallback_scrape_protocol configuration property to restore the old behaviour from Prometheus v2.x, but this must be set for each scrape configuration, service monitor, or pod monitor. While we could use this property to avoid being affected by this breaking change, adding it to every scrape configuration feels a bit overwhelming for Gardener, where the scrape configurations are many. Therefore, we opted to validate during testing that none of the URLs scraped by the Prometheus instances managed by Gardener return an invalid scrape protocol.

Details on how we checked the scrape protocols are valid

The scrape protocol information is available in the Content-Type field in the response headers from the scrape URL. The list of URLs a Prometheus instance scrapes can be retrieved from its API: http://<prometheus-ip>:9090/api/v1/targets. As a general example, the following snippet can be run on a debug pod to iterate over all scrape URLs for a certain Prometheus instance and show the Content-Type field in their responses, effectively listing the scrape protocols they use.

for url in $(curl -Ss http://<prometheus-ip>:9090/api/v1/targets | jq -r '.data.activeTargets[] | select(.health == "up").scrapeUrl | select(contains("/federate") | not)'); do
    curl -m 1 -i -Ss <additional-flags-for-certs-and-tokens> $url | grep -i content-type:
done

Running this in a debug pod requires it to have the same network policies and credentials as the Prometheus instance to reach and access the scrape URL. Note that these network policies and credentials differ for each Prometheus instance. By adjusting this snippet to each Prometheus instance and executing it, we observed all responses contained either Content-Type: text/plain; version=0.0.4 or Content-Type: text/plain, both of which are valid protocols.

Additionally, we validated the following on a local deployment of Gardener with Prometheus v3.x, among other checks:

  • There are no significant changes in the number of ingested samples.
  • No scrape targets are down.
  • We downloaded the existing series from every Prometheus instance on v2.x, upgraded to Prometheus v3.x, downloaded the series again and compared the old and the new series. None of the old were removed, and a few new series related Prometheus were added, as described in the Prometheus' release notes.

Special notes for your reviewer:

/cc @istvanballok @chrkl @rickardsjp

Release note:

Upgrade Prometheus to v3.2

Copy link
Contributor

gardener-prow bot commented Feb 28, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 28, 2025
@gardener-prow gardener-prow bot requested a review from rickardsjp February 28, 2025 15:12
@gardener-prow gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 28, 2025
@vicwicker vicwicker marked this pull request as ready for review February 28, 2025 15:36
@gardener-prow gardener-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 28, 2025
@vicwicker
Copy link
Member Author

vicwicker commented Feb 28, 2025

/hold

I just found by chance that the label instance="prometheus-cache.garden.svc:80" changes to instance="prometheus-cache.garden.svc" in Prometheus v3.x. I need to check if this could break something.

@gardener-prow gardener-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 28, 2025
@marc1404
Copy link
Member

marc1404 commented Mar 3, 2025

Does it make sense to jump to 3.2.1 directly? See also #11577

The unit tests were originally misconfigured because they query a test
time series that only exists for 10 minutes and gets staled for 5
additional minutes. Note such time series is defined as 0x20 but the
evaluation interval is 30 seconds. Furthermore, the tests would check
the alert after 15 minutes despite it triggers after 10 minutes. It
seems, in this situation, the alert remains triggered after 15 minutes
in Prometheus v2 but not in Prometheus v3.

This commit adjusts the tests to evaluate the alert after 10 minutes, in
line with its configuration. This resolves the issue in Prometheus v3
where the alert does not remain triggered after 15 minutes.
@gardener-prow gardener-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 4, 2025
If the scrape URL is provided using static configurations, Prometheus v2
will try to add the port based on the protocol, while Prometheus v3 will
not. This is an issue with the seed Prometheus where the "instance"
label changes from "prometheus-cache.garden.svc:80" to
"prometheus-cache.garden.svc" after the upgrade.

This commit updates the seed Prometheus to use service discovery to
federate from the cache Prometheus in the seed. This not only resolves
the label mismatch but will also make federation configurations
consistent in the following commits.
This commit is analog to previous commit, but for the shoot Prometheus.
@vicwicker
Copy link
Member Author

I just found by chance that the label instance="prometheus-cache.garden.svc:80" changes to instance="prometheus-cache.garden.svc" in Prometheus v3.x. I need to check if this could break something.

I found the seed, shoot and longterm Prometheus are affected because they used a static configuration for federation. The aggregate Prometheus has a service discovery configured and it was not affected. I now changed the seed, shoot and longterm Prometheus configuration to use service discovery for federation, as the aggregate Prometheus does. With this, I expect no change in the instance label for the seed and shoot Prometheus, but a change from prometheus-garden:80 to prometheus-garden.garden.svc:80 in the longterm Prometheus. That's okay IMHO because it makes it consistent with other federations.

/unhold

Does it make sense to jump to 3.2.1 directly? See also #11577

Thanks! I missed there was a patched version :)

@gardener-prow gardener-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2025
The longterm Prometheus is also affected by the label change when
federating from the garden Prometheus. In this case, the "instance" label
changes from "prometheus-garden:80" to "prometheus-garden".

Using service discovery will update the "instance" label to
"prometheus-garden.garden.svc:80", but it will also make it consistent
with other federation configurations.
Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you :)

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2025
Copy link
Contributor

gardener-prow bot commented Mar 5, 2025

LGTM label has been added.

Git tree hash: 5f6d3d78a0d78d5b89925d09a7533f96821e74c6

@rickardsjp
Copy link
Member

/lgtm

Copy link
Member

@marc1404 marc1404 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Thanks for taking care and +1 for the well-structured PR! 👏

Copy link
Contributor

gardener-prow bot commented Mar 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marc1404

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 7, 2025
@gardener-prow gardener-prow bot merged commit 9077593 into gardener:master Mar 7, 2025
19 checks passed
@vicwicker vicwicker deleted the prometheus-3.0 branch August 25, 2025 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/monitoring Monitoring (including availability monitoring and alerting) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants