Upgrade Prometheus to v3.2 #11552
Conversation
Skipping CI for Draft Pull Request.
/hold I just found by chance that the label
Does it make sense to jump to
The unit tests were originally misconfigured: they query a test time series that only exists for 10 minutes and then goes stale for 5 additional minutes. Note that such a time series is defined as `0x20` while the evaluation interval is 30 seconds. Furthermore, the tests would check the alert after 15 minutes even though it triggers after 10 minutes. It seems that, in this situation, the alert remains triggered after 15 minutes in Prometheus v2 but not in Prometheus v3. This commit adjusts the tests to evaluate the alert after 10 minutes, in line with its configuration. This resolves the issue in Prometheus v3 where the alert does not remain triggered after 15 minutes.
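The kind of adjustment described above can be illustrated with a hypothetical `promtool` rules unit test; the rule, series, and alert names here are invented, not taken from the actual tests:

```yaml
# Sketch of a Prometheus rules unit test (promtool test rules); all names are
# illustrative assumptions.
rule_files:
  - alert.rules.yaml

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # "0x20" expands to the value 0 repeated for 20 further intervals,
      # i.e. the series exists for 10 minutes at a 30s interval and then
      # goes stale.
      - series: 'up{job="example"}'
        values: '0x20'
    alert_rule_test:
      # Evaluate at 10m, when the alert has just fired, rather than at 15m,
      # when the series is already stale.
      - alertname: ExampleTargetDown
        eval_time: 10m
        exp_alerts:
          - exp_labels:
              job: example
```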
(force-pushed from 79b4f60 to 1b5f1ca)
If the scrape URL is provided using static configurations, Prometheus v2 will try to add the port based on the protocol, while Prometheus v3 will not. This is an issue with the seed Prometheus where the "instance" label changes from "prometheus-cache.garden.svc:80" to "prometheus-cache.garden.svc" after the upgrade. This commit updates the seed Prometheus to use service discovery to federate from the cache Prometheus in the seed. This not only resolves the label mismatch but will also make federation configurations consistent in the following commits.
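The change from a static configuration to service discovery can be sketched roughly as follows; the job name, namespace, and label selector are illustrative assumptions, not the actual Gardener configuration:

```yaml
# Hypothetical federation scrape config; names are assumptions.
scrape_configs:
  - job_name: federate
    metrics_path: /federate
    honor_labels: true
    params:
      'match[]':
        - '{job=~".+"}'   # illustrative match expression
    # Before (static configuration): Prometheus v2 derived the "instance"
    # label as "prometheus-cache.garden.svc:80", Prometheus v3 drops the port.
    # static_configs:
    #   - targets: [ prometheus-cache.garden.svc ]
    # After: discover the cache Prometheus endpoints instead, so the
    # "instance" label is derived consistently from the discovered address.
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [ garden ]
    relabel_configs:
      - source_labels: [ __meta_kubernetes_service_name ]
        action: keep
        regex: prometheus-cache
```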
This commit is analogous to the previous one, but for the shoot Prometheus.
(force-pushed from 1b5f1ca to 986bd3a)
I found that the seed, shoot and longterm Prometheus are affected because they used a static configuration for federation. The aggregate Prometheus has service discovery configured and was not affected. I have now changed the seed, shoot and longterm Prometheus configurations to use service discovery for federation, as the aggregate Prometheus does. With this, I expect no change in the
/unhold
Thanks! I missed that there was a patched version :)
(force-pushed from 986bd3a to 413551b)
The longterm Prometheus is also affected by the label change when federating from the garden Prometheus. In this case, the "instance" label changes from "prometheus-garden:80" to "prometheus-garden". Using service discovery will change the "instance" label to "prometheus-garden.garden.svc:80", and it will also make it consistent with the other federation configurations.
(force-pushed from 413551b to d11c0f4)
Thank you :)
/lgtm
LGTM label has been added. Git tree hash: 5f6d3d78a0d78d5b89925d09a7533f96821e74c6
/lgtm
/approve
Thanks for taking care and +1 for the well-structured PR! 👏
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: marc1404. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
How to categorize this PR?
/area monitoring
/kind enhancement
What this PR does / why we need it:
In order to test this major upgrade, we followed the upgrade migration guide provided by Prometheus, which outlines the major changes involved in upgrading to Prometheus v3.x. Based on our assessment, there is only one potential breaking change for Gardener: the case where the scrape protocol from a scrape endpoint is invalid.

The scrape protocol sets the data format that the endpoint uses to expose metrics. In short, when the scrape protocol is invalid, Prometheus v2.x falls back to using the default text protocol, while Prometheus v3.x fails the scrape. Prometheus v3.x introduces a new `fallback_scrape_protocol` configuration property to restore the old behaviour of Prometheus v2.x, but it must be set for each scrape configuration, service monitor, or pod monitor. While we could use this property to avoid being affected by this breaking change, adding it to every scrape configuration feels a bit overwhelming for Gardener, where there are many scrape configurations. Therefore, we opted to validate during testing that none of the URLs scraped by the Prometheus instances managed by Gardener return an invalid scrape protocol.

**Details on how we checked that the scrape protocols are valid**
The scrape protocol information is available in the `Content-Type` field in the response headers from the scrape URL. The list of URLs a Prometheus instance scrapes can be retrieved from its API: `http://<prometheus-ip>:9090/api/v1/targets`. As a general example, the following snippet can be run on a debug pod to iterate over all scrape URLs for a certain Prometheus instance and show the `Content-Type` field in their responses, effectively listing the scrape protocols they use. Running this in a debug pod requires it to have the same network policies and credentials as the Prometheus instance in order to reach and access the scrape URLs. Note that these network policies and credentials differ for each Prometheus instance. By adjusting this snippet to each Prometheus instance and executing it, we observed that all responses contained either
`Content-Type: text/plain; version=0.0.4` or `Content-Type: text/plain`, both of which are valid protocols.

Additionally, we validated the following on a local deployment of Gardener with Prometheus v3.x, among other checks:
Special notes for your reviewer:
/cc @istvanballok @chrkl @rickardsjp
Release note: