
Conversation

ScheererJ
Member

How to categorize this PR?

/area networking
/area ops-productivity
/area robustness
/kind enhancement

What this PR does / why we need it:

Use /livez endpoint of kube-proxy instead of /healthz endpoint for probes.

`kube-proxy` serves two paths on its health endpoint (port 10256): `/livez` and `/healthz`. Both check whether `kube-proxy` itself is healthy, but the latter also takes the node into account: it reports failure if the node either has a deletion timestamp or carries a taint set by `cluster-autoscaler` indicating that the node is about to be deleted.
Previously, we used the `/healthz` endpoint for the readiness probe, which worked fine for indicating that `kube-proxy` as a critical component was ready. However, it could lead to prolonged readiness failures when a node was about to be deleted by `cluster-autoscaler` but a `PodDisruptionBudget` or `terminationGracePeriod` delayed the removal. The result was failures in the system components health check. With this change, these failures should no longer occur.
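For illustration, a minimal sketch of such a readiness probe expressed with the Kubernetes Go API types (not the exact Gardener code; the timing values are made up, only the path and port matter here):

```go
package kubeproxy

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// readinessProbe targets kube-proxy's node-local health server.
// Since Kubernetes 1.28, /livez reflects only kube-proxy's own health,
// while /healthz additionally fails when the node has a deletion
// timestamp or the cluster-autoscaler deletion taint.
func readinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/livez", // previously "/healthz"
				Port: intstr.FromInt32(10256),
			},
		},
		// Illustrative values, not the ones used by Gardener.
		InitialDelaySeconds: 15,
		TimeoutSeconds:      15,
	}
}
```

The essential difference to the previous configuration is the probed path.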

Which issue(s) this PR fixes:
Fixes #11858

Special notes for your reviewer:

kube-proxy health check: https://github.com/kubernetes/kubernetes/blob/288b044e0f3617c7ec3c832edf925bfe7014e392/pkg/proxy/healthcheck/proxy_health.go#L204

Release note:

`kube-proxy` no longer fails its readiness probe in case the node is about to be deleted by `cluster-autoscaler`.

/cc @rfranzke @dguendisch @kon-angelo @LucaBernstein

@gardener-prow gardener-prow bot requested a review from rfranzke May 6, 2025 13:45
@gardener-prow gardener-prow bot added the area/networking Networking related label May 6, 2025
@gardener-prow gardener-prow bot requested a review from dguendisch May 6, 2025 13:45
@gardener-prow gardener-prow bot added the area/ops-productivity Operator productivity related (how to improve operations) label May 6, 2025
@gardener-prow gardener-prow bot requested a review from LucaBernstein May 6, 2025 13:45
@gardener-prow gardener-prow bot added the area/robustness Robustness, reliability, resilience related label May 6, 2025
Contributor

gardener-prow bot commented May 6, 2025

@ScheererJ: GitHub didn't allow me to request PR reviews from the following users: kon-angelo.

Note that only gardener members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

(quoted PR description, identical to the text above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gardener-prow gardener-prow bot added kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 6, 2025
Use `/livez` endpoint of `kube-proxy` instead of `/healthz` endpoint for probes.

`kube-proxy` serves two paths on its health endpoint (port 10256): `/livez` and `/healthz`.
Both check whether `kube-proxy` is healthy, but the latter also takes
the node into account. It reports failure if the node either has a
deletion timestamp or a taint set by `cluster-autoscaler` indicating that
the node is about to be deleted.
Previously, we used the `/healthz` endpoint for the readiness probe,
which worked fine for indicating that `kube-proxy` as a critical component
was ready. However, it could lead to prolonged readiness failures in
case a node was about to be deleted by `cluster-autoscaler`, but some
`PodDisruptionBudget` or `terminationGracePeriod` delayed the removal.
The result was failures in the system components health check.
Now, these failures should not be present anymore.
@ScheererJ ScheererJ force-pushed the fix/kube-proxy-readiness-with-cluster-autoscaler branch from 8994042 to 27ccc87 Compare May 7, 2025 07:01
Member

@tobschli tobschli left a comment


Nice!
Thank you very much :)
/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2025
Contributor

gardener-prow bot commented May 7, 2025

LGTM label has been added.

Git tree hash: 8fe4372bb013b719a2c7f43be4c2f107b76fb659

Contributor

@plkokanov plkokanov left a comment


I think /livez was added when the /healthz behaviour was modified in kubernetes/kubernetes#116470, which seems to be from k8s 1.28.0.
Should we keep the old probe for k8s 1.27.0?

@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label May 8, 2025
@gardener-prow gardener-prow bot requested a review from tobschli May 8, 2025 10:56
@gardener-prow gardener-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 8, 2025
@ScheererJ
Member Author

I think /livez was added when the /healthz behaviour was modified in kubernetes/kubernetes#116470, which seems to be from k8s 1.28.0. Should we keep the old probe for k8s 1.27.0?

I added logic to keep it as it was before for 1.27.x and only use the new endpoint starting with 1.28.0.
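Roughly, the gating could look like the following sketch (hypothetical helper name, using github.com/Masterminds/semver/v3 for the comparison; the actual code may differ):

```go
package kubeproxy

import "github.com/Masterminds/semver/v3"

// healthCheckPath returns the path for the kube-proxy readiness probe.
// /livez only exists from Kubernetes 1.28.0 on, so older versions (1.27.x)
// keep the previous /healthz path.
func healthCheckPath(kubernetesVersion *semver.Version) string {
	if kubernetesVersion.LessThan(semver.MustParse("1.28.0")) {
		return "/healthz"
	}
	return "/livez"
}
```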

PTAL.

@plkokanov
Contributor

The pull-gardener-e2e-kind-ha-multi-zone job failed due to a timeout while waiting for the shoot's system components to become healthy. One of the reasons was that the kube-proxy DaemonSets did not become healthy quickly enough:

  2025-05-08T11:23:47.611Z	INFO	Waiting for reconciliation and healthiness	{"shoot": {"name":"e2e-rot-ip","namespace":"garden-local"}, "lastOperation": "&LastOperation{Description:Labeling resources after modification of encryption config or to encrypt them with new ETCD encryption key, Waiting until shoot worker nodes have been reconciled,LastUpdateTime:2025-05-08 11:23:42 +0000 UTC,Progress:85,State:Processing,Type:Reconcile,}", "reason": "condition type SystemComponentsHealthy is not true yet, had message DaemonSet \"kube-system/kube-proxy-manual-v1.32.0\" is unhealthy: too many unavailable pods found (1/1) with reason DaemonSetUnhealthy"}
  2025-05-08T11:24:17.620Z	INFO	Waiting for reconciliation and healthiness	{"shoot": {"name":"e2e-rot-ip","namespace":"garden-local"}, "lastOperation": "&LastOperation{Description:Waiting until shoot worker nodes have been reconciled,LastUpdateTime:2025-05-08 11:24:12 +0000 UTC,Progress:88,State:Processing,Type:Reconcile,}", "reason": "condition type SystemComponentsHealthy is not true yet, had message DaemonSet \"kube-system/kube-proxy-auto-v1.32.0\" is unhealthy: too many unavailable pods found (1/2) with reason DaemonSetUnhealthy"}
  2025-05-08T11:24:47.627Z	INFO	Waiting for reconciliation and healthiness	{"shoot": {"name":"e2e-rot-ip","namespace":"garden-local"}, "lastOperation": "&LastOperation{Description:Waiting until shoot worker nodes have been reconciled,LastUpdateTime:2025-05-08 11:24:12 +0000 UTC,Progress:88,State:Processing,Type:Reconcile,}", "reason": "condition type SystemComponentsHealthy is not true yet, had message DaemonSet \"kube-system/kube-proxy-auto-v1.32.0\" is unhealthy: too many unavailable pods found (1/2) with reason DaemonSetUnhealthy"}
  2025-05-08T11:25:17.640Z	INFO	Waiting for reconciliation and healthiness	{"shoot": {"name":"e2e-rot-ip","namespace":"garden-local"}, "lastOperation": "&LastOperation{Description:Waiting until shoot worker nodes have been reconciled,LastUpdateTime:2025-05-08 11:24:12 +0000 UTC,Progress:88,State:Processing,Type:Reconcile,}", "reason": "condition type SystemComponentsHealthy is not true yet, had message Deployment \"kube-system/metrics-server\" is unhealthy: condition \"Available\" has invalid status False (expected True) due to MinimumReplicasUnavailable: Deployment does not have minimum availability. with reason DeploymentUnhealthy"}
  2025-05-08T11:25:47.658Z	INFO	Waiting for reconciliation and healthiness	{"shoot": {"name":"e2e-rot-ip","namespace":"garden-local"}, "lastOperation": "&LastOperation{Description:Waiting until operating system configurations for worker nodes have been reconciled,LastUpdateTime:2025-05-08 11:25:44 +0000 UTC,Progress:70,State:Processing,Type:Reconcile,}", "reason": "condition type SystemComponentsHealthy is not true yet, had message DaemonSet \"kube-system/kube-proxy-auto-v1.32.0\" is unhealthy: too many unavailable pods found (1/2) with reason DaemonSetUnhealthy"}
  2025-05-08T11:26:17.664Z	INFO	Waiting for reconciliation and healthiness	{"shoot": {"name":"e2e-rot-ip","namespace":"garden-local"}, "lastOperation": "&LastOperation{Description:Shoot cluster has been successfully reconciled.,LastUpdateTime:2025-05-08 11:26:05 +0000 UTC,Progress:100,State:Succeeded,Type:Reconcile,}", "reason": "condition type SystemComponentsHealthy is not true yet, had message DaemonSet \"kube-system/kube-proxy-manual-v1.32.0\" is unhealthy: too many unavailable pods found (1/1) with reason DaemonSetUnhealthy"}
  2025-05-08T11:26:47.682Z	INFO	Shoot has been reconciled and is healthy	{"shoot": {"name":"e2e-rot-ip","namespace":"garden-local"}}
  < Exit [It] Wait for Shoot to be reconciled @ 05/08/25 11:26:47.682 (10m30.389s)
  << Timeline
------------------------------
• [FAILED] [0.001 seconds]
Shoot Tests Create Shoot, Rotate Credentials and Delete Shoot Shoot with workers with workers rollout, in-place update strategy [It] ca rotation phase should be 'Prepared' [Shoot, default, credentials-rotation, basic, with-workers-rollout, in-place]
/home/prow/go/src/github.com/gardener/gardener/test/e2e/gardener/shoot/internal/rotation/certificate_authorities.go:148

  Timeline >>
  > Enter [BeforeEach] TOP-LEVEL @ 05/08/25 11:26:47.683
  < Exit [BeforeEach] TOP-LEVEL @ 05/08/25 11:26:47.683 (0s)
  > Enter [It] ca rotation phase should be 'Prepared' @ 05/08/25 11:26:47.683
  [FAILED] in [It] - /home/prow/go/src/github.com/gardener/gardener/test/e2e/gardener/shoot/internal/rotation/certificate_authorities.go:149 @ 05/08/25 11:26:47.683
  < Exit [It] ca rotation phase should be 'Prepared' @ 05/08/25 11:26:47.683 (0s)
  << Timeline

  [FAILED] Expected
      <v1beta1.CredentialsRotationPhase>: Preparing
  to equal
      <v1beta1.CredentialsRotationPhase>: Prepared

@ScheererJ could you check if it's somehow related to this PR (I don't immediately see how) or just a flake?

@ScheererJ
Member Author

@ScheererJ could you check if it's somehow related to this PR (I don't immediately see how) or just a flake?

As the test reported the following before failing, I think the reconciliation succeeded. Therefore, I think this is just a flake.

  2025-05-08T11:26:47.682Z	INFO	Shoot has been reconciled and is healthy	{"shoot": {"name":"e2e-rot-ip","namespace":"garden-local"}}

Contributor

@plkokanov plkokanov left a comment


/lgtm
/approve

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2025
Contributor

gardener-prow bot commented May 8, 2025

LGTM label has been added.

Git tree hash: 02cfa0ed4570a9c68fd82ee2264e0c36a7a60c8c

Contributor

gardener-prow bot commented May 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: plkokanov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2025
@ialidzhikov
Member

ialidzhikov commented May 8, 2025

Isn't there a /readyz endpoint of kube-proxy? My feeling so far was that /livez gets used for the liveness probe and /readyz for the readiness probe.

@ScheererJ
Member Author

Isn't there a /readyz endpoint of kube-proxy? My feeling so far was that /livez gets used for the liveness probe and /readyz for the readiness probe.

As far as I can tell from the kube-proxy source code, there does not seem to be any /ready or /readyz endpoint. Previously, there was only /healthz, but with 1.28.0 /livez was added.
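For illustration, both paths can be checked manually against a node's health port, e.g. with a small sketch like this (the node IP is a placeholder):

```go
package main

import (
	"fmt"
	"net/http"
)

// Probe kube-proxy's health server on a node. /healthz exists on all
// supported versions; /livez only answers on Kubernetes >= 1.28.
func main() {
	for _, path := range []string{"/healthz", "/livez"} {
		resp, err := http.Get("http://10.250.0.2:10256" + path) // hypothetical node IP
		if err != nil {
			fmt.Println(path, "error:", err)
			continue
		}
		fmt.Println(path, resp.Status)
		resp.Body.Close()
	}
}
```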

@plkokanov
Contributor

plkokanov commented May 8, 2025

Isn't there a /readyz endpoint of kube-proxy? My feeling so far was that /livez gets used for the liveness probe and /readyz for the readiness probe.

As far as I can tell from the kube-proxy source code, there does not seem to be any /ready or /readyz endpoint. Previously, there was only /healthz, but with 1.28.0 /livez was added.

Additionally, according to the corresponding KEP the /livez endpoint was added to replace the old /healthz semantics 😅

@ialidzhikov
Member

ialidzhikov commented May 8, 2025

Thanks for explaining! I just found it confusing that the /livez endpoint is used for the readiness probe while there is no liveness probe and no /readyz endpoint. But it looks like the confusion is caused by how things are designed and implemented upstream. 🤷

@gardener-prow gardener-prow bot merged commit ceebdbf into gardener:master May 9, 2025
19 checks passed
@timuthy
Member

timuthy commented May 12, 2025

@ScheererJ the change makes me wonder why we think it's fine to work around the improvement proposed and implemented in the mentioned KEP, esp. see motivation.

Can you share more information?

Have you considered ignoring unready kube-proxy pods for to-be-deleted nodes, instead?

@ScheererJ
Member Author

@ScheererJ the change makes me wonder why we think it's fine to work around the improvement proposed and implemented in the mentioned KEP, esp. see motivation.

Can you share more information?

Have you considered ignoring unready kube-proxy pods for to-be-deleted nodes, instead?

We do not work around the improvement of KEP-3836 at all. Cloud controller managers can still use the /healthz endpoint of kube-proxy, and they will get a failing result if the node is about to be terminated. Therefore, infrastructures that support this will be able to drain connections before cluster-autoscaler removes the node. (As far as I can tell, only GCP supports this.)

What we do differently with this pull request is only to change the semantics of the readiness probe of the kube-proxy pods, which is not used by cloud-controller-manager anyway. We introduced the readiness probe so that the critical component check Gardener performs on node startup does something useful for kube-proxy, i.e. a failing kube-proxy startup can prevent a node from reaching the critical-components-ready stage.

It is indeed possible to adapt the Gardener logic handling the health check instead of adjusting the readiness probe. However, if you think it through, a lot of specific logic is required: whenever a non-ready Pod is encountered, we would need to check whether it is kube-proxy and, if so, check the corresponding node for the cluster-autoscaler taint (roughly sketched below).
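For illustration, a rough sketch of the special-casing this alternative would need (function name and label selector are hypothetical; it assumes cluster-autoscaler's ToBeDeletedByClusterAutoscaler taint):

```go
package health

import corev1 "k8s.io/api/core/v1"

// Taint added by cluster-autoscaler to nodes scheduled for removal.
const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// ignoreUnreadyKubeProxyPod reports whether an unready kube-proxy pod may be
// tolerated by the system components health check because its node is about
// to go away anyway.
func ignoreUnreadyKubeProxyPod(pod *corev1.Pod, node *corev1.Node) bool {
	if pod.Labels["app"] != "kube-proxy" { // hypothetical label selector
		return false
	}
	if node.DeletionTimestamp != nil {
		return true
	}
	for _, taint := range node.Spec.Taints {
		if taint.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}
```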

This was always the backup option if there was no other solution, but it is tedious and hard to understand later. The special handling also makes the health checks less clear. Therefore, I prefer the current solution.

Feel free to share your point of view if there is something obvious I missed.

@timuthy
Member

timuthy commented May 14, 2025

Thanks @ScheererJ for the detailed information. I'm fine with the current solution as you could clarify that KEP-3836 is still valid 🙂

Successfully merging this pull request may close these issues.

Adapt to changed kube-proxy readiness behaviour since K8s v1.31