Use /livez endpoint of kube-proxy instead of /healthz endpoint for probes. #12015
Conversation
@ScheererJ: GitHub didn't allow me to request PR reviews from the following users: kon-angelo. Note that only gardener members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Use `/livez` endpoint of `kube-proxy` instead of `/healthz` endpoint for probes. `kube-proxy` serves two paths on the health endpoint (port 10256): `/livez` and `/healthz`. Both check whether `kube-proxy` is healthy, but the latter also takes the node into account. It reports failure if the node either has a deletion timestamp or a taint by `cluster-autoscaler` indicating that the node is about to be deleted. Previously, we used the `/healthz` endpoint for the readiness probe, which worked fine for indicating that `kube-proxy` as critical component was ready. However, it could lead to prolonged readiness failures in case a node was about to be deleted by `cluster-autoscaler`, but some `PodDisruptionBudget` or `terminationGracePeriod` delayed the removal. The result was failures in the system components health check. Now, these failures should not be present anymore.
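To illustrate the difference between the two endpoints, here is a minimal Go sketch of the extra node check that `/healthz` performs on top of `/livez`. It is simplified and hedged: the helper name is made up, and the taint key is the usual `cluster-autoscaler` deletion taint rather than anything copied from this PR.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// toBeDeletedTaint is the taint cluster-autoscaler places on nodes it is
// about to remove (assumed key; not taken from this PR).
const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// nodeAboutToBeDeleted mirrors, in simplified form, the extra condition that
// kube-proxy's /healthz reports on top of /livez: the node counts as going
// away if it has a deletion timestamp or the cluster-autoscaler taint.
func nodeAboutToBeDeleted(node *corev1.Node) bool {
	if node.DeletionTimestamp != nil {
		return true
	}
	for _, taint := range node.Spec.Taints {
		if taint.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}

func main() {
	// A plain node without deletion timestamp or taints is not about to be deleted.
	fmt.Println(nodeAboutToBeDeleted(&corev1.Node{})) // false
}
```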
Force-pushed from 8994042 to 27ccc87
Nice!
Thank you very much :)
/lgtm
LGTM label has been added. Git tree hash: 8fe4372bb013b719a2c7f43be4c2f107b76fb659
I think `/livez` was added when `/healthz` behaviour was modified in kubernetes/kubernetes#116470, which seems to be from k8s 1.28.0. Should we keep the old probe for k8s 1.27.0?
I added logic to keep it as it was before for older Kubernetes versions. PTAL.
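For readers following along, a minimal sketch of what such version gating could look like (hypothetical helper, not the actual code added in this PR), assuming `/livez` is only served from Kubernetes 1.28 onwards:

```go
// probePath returns the kube-proxy probe path for a given Kubernetes minor
// version: clusters older than 1.28 keep /healthz, newer ones use /livez.
func probePath(kubernetesMinor int) string {
	if kubernetesMinor >= 28 {
		return "/livez"
	}
	return "/healthz"
}
```

For example, `probePath(27)` would return `/healthz`, matching the previous behaviour for 1.27 clusters.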
The pull-gardener-e2e-kind-ha-multi-zone failed due to a timeout while waiting for the shoot's system components to become healthy. One of the reasons for that was that the
@ScheererJ could you check if it's somehow related to this PR (I don't immediately see how) or just a flake?
As the test reported the following before failing, I think the reconciliation succeeded. Therefore, I think this is just a flake.
/lgtm
/approve
LGTM label has been added. Git tree hash: 02cfa0ed4570a9c68fd82ee2264e0c36a7a60c8c
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: plkokanov. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Isn't there a
As far as I can tell from the
Additionally, according to the corresponding KEP the
Thanks for explaining! I just found it confusing that
@ScheererJ the change makes me wonder why we think it's fine to work around the improvement proposed and implemented in the mentioned KEP, esp. see the motivation section. Can you share more information? Have you considered ignoring unready `kube-proxy` pods instead?
We do not work around the improvement of KEP-3836 at all. Cloud controller managers can still use the `/healthz` endpoint for their health checks.

What we do differently with this pull request is only changing the semantics of the readiness probe of the `kube-proxy` pods.

It is indeed possible, instead of adjusting the readiness probe, to adapt the Gardener logic handling the health check. However, if you think it through, there is a lot of specific logic required to decide when a non-ready `kube-proxy` pod may be ignored. This was always the backup option if there was no other solution, but it is tedious and hard to understand later. The special handling also makes the health checks less clear. Therefore, I prefer the current solution. Feel free to share your point of view if there is something obvious I missed.
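To make the resulting split concrete, here is a hedged Go sketch (field values assumed from the PR description, not the actual gardener code) of a kube-proxy readiness probe targeting `/livez` on the health port, while `/healthz` on the same port remains available for load balancer health checks as intended by KEP-3836:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// kubeProxyReadinessProbe builds a readiness probe against kube-proxy's
// /livez path on its health port (10256, as stated in the PR description).
func kubeProxyReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/livez",
				Port: intstr.FromInt32(10256),
			},
		},
	}
}

func main() {
	_ = kubeProxyReadinessProbe()
}
```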
Thanks @ScheererJ for the detailed information. I'm fine with the current solution as you could clarify that KEP-3836 is still valid 🙂
How to categorize this PR?
/area networking
/area ops-productivity
/area robustness
/kind enhancement
What this PR does / why we need it:
Use `/livez` endpoint of `kube-proxy` instead of `/healthz` endpoint for probes.

`kube-proxy` serves two paths on the health endpoint (port 10256): `/livez` and `/healthz`. Both check whether `kube-proxy` is healthy, but the latter also takes the node into account. It reports failure if the node either has a deletion timestamp or a taint by `cluster-autoscaler` indicating that the node is about to be deleted.

Previously, we used the `/healthz` endpoint for the readiness probe, which worked fine for indicating that `kube-proxy` as critical component was ready. However, it could lead to prolonged readiness failures in case a node was about to be deleted by `cluster-autoscaler`, but some `PodDisruptionBudget` or `terminationGracePeriod` delayed the removal. The result was failures in the system components health check. Now, these failures should not be present anymore.

Which issue(s) this PR fixes:
Fixes #11858
Special notes for your reviewer:
`kube-proxy` health check: https://github.com/kubernetes/kubernetes/blob/288b044e0f3617c7ec3c832edf925bfe7014e392/pkg/proxy/healthcheck/proxy_health.go#L204

Release note:
/cc @rfranzke @dguendisch @kon-angelo @LucaBernstein