
Adapt to changed kube-proxy readiness behaviour since K8s v1.31 #11858

@dguendisch

Description


How to categorize this issue?

/area ops-productivity
/kind enhancement

What would you like to be added:
Thanks to @elankath for digging out the following background:

kube-proxy changed its behaviour with K8s v1.31: it now fails its readiness probe as soon as its node has the ToBeDeletedByClusterAutoscaler taint.
This makes a shoot cluster fail its SystemComponentsHealthy condition due to a "failing" kube-proxy pod.
However, customers might intentionally defer node removal until some important workload completes, e.g. by setting a huge terminationGracePeriodSeconds (we have cases where node removals are deferred by several hours).
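For illustration, here is a minimal Go sketch (assuming the k8s.io/api types; this is not kube-proxy or Gardener code) of detecting the taint that triggers the new readiness behaviour:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// toBeDeletedTaint is the taint key cluster-autoscaler sets on nodes it has scheduled for removal.
const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// nodeIsBeingDeleted reports whether the node carries the cluster-autoscaler deletion taint,
// i.e. the condition under which kube-proxy >= v1.31 starts failing its readiness probe.
func nodeIsBeingDeleted(node *corev1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}

func main() {
	// Example node as cluster-autoscaler would taint it before removal.
	node := &corev1.Node{
		Spec: corev1.NodeSpec{
			Taints: []corev1.Taint{{Key: toBeDeletedTaint, Effect: corev1.TaintEffectNoSchedule}},
		},
	}
	fmt.Println("node scheduled for removal:", nodeIsBeingDeleted(node))
}
```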

In our Gardener installation we prefer to page support personnel if a productive cluster has a failing SystemComponentsHealthy condition for more than x mins.
With the changed kube-proxy behaviour, however, a failing SystemComponentsHealthy condition is no longer necessarily a signal of "unwanted/unexpected" cluster issues, but can now also mean "works-as-designed/nothing-to-do".

It would be great if Gardener could rework the SystemComponentsHealthy condition to account for those intentional kube-proxy readiness probe failures, e.g. by ignoring kube-proxy readiness probe failures when the respective node also has a ToBeDeletedByClusterAutoscaler taint.
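A rough sketch of what such a check could look like (package and function names are hypothetical, not the actual Gardener health-check code; assuming the k8s.io/api types):

```go
package health

import (
	corev1 "k8s.io/api/core/v1"
)

const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// isExpectedKubeProxyFailure returns true when a kube-proxy pod runs on a node that
// cluster-autoscaler has already tainted for deletion. Such a pod's failing readiness
// probe is "works-as-designed" and could be excluded from the SystemComponentsHealthy
// evaluation instead of degrading the condition.
func isExpectedKubeProxyFailure(pod *corev1.Pod, nodes map[string]*corev1.Node) bool {
	node, ok := nodes[pod.Spec.NodeName]
	if !ok {
		return false
	}
	for _, taint := range node.Spec.Taints {
		if taint.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}
```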

References:

Why is this needed:
To re-establish the SystemComponentsHealthy condition as a signal of failures that actually need to be investigated.
