Skip to content

Conversation

tamilmani1989
Copy link
Contributor

@tamilmani1989 tamilmani1989 commented Jun 22, 2025

This PR addresses this KEP. kube-proxy healthz handler will return 503 if node is being deleted. Leveraged existing local node watcher to watch for deletion events and added IsBeingDeleted field in LocalNode structure. IsBeingDeleted field will be set to true on node deletion. Kube-proxy health handler will get local node details via get call and if IsBeingDeleted set, it returns 503 (ServiceUnavailable) error.

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Fixes: #39911

Fix kube-proxy healthz to return 503 if node in terminating state  

@tamilmani1989 tamilmani1989 requested a review from a team as a code owner June 22, 2025 20:29
@tamilmani1989 tamilmani1989 requested a review from thorn3r June 22, 2025 20:29
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 22, 2025
@tamilmani1989 tamilmani1989 force-pushed the tm/kubeproxy_health_node_deletion branch from 6513be7 to f5d8d33 Compare June 22, 2025 20:51
@tamilmani1989 tamilmani1989 force-pushed the tm/kubeproxy_health_node_deletion branch from f5d8d33 to c8bf210 Compare June 25, 2025 07:40
@tamilmani1989
Copy link
Contributor Author

@thorn3r would you able to review my PR please?

@tamilmani1989 tamilmani1989 changed the title chore: kube-proxy-healthz to return 503 if node terminating fix: kube-proxy-healthz to return 503 if node terminating Jun 27, 2025
@pchaigno pchaigno added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jun 30, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 30, 2025
@pchaigno
Copy link
Member

/test

This PR addresses this KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability#implementation-history.
kubeproxy healthz handler should return 503 if node is being deleted. Leverage local node watcher to watch for deletion events and added `IsBeingDeleted` field in LocalNode structure. `IsBeingDeleted` field will be set to true on node deletion. Kube-proxy health handler will get local node details and if  `IsBeingDeleted` set, it returns 503 (ServiceUnavailable) error.

Signed-off-by: Tamilmani <tamanoha@microsoft.com>
@tamilmani1989 tamilmani1989 force-pushed the tm/kubeproxy_health_node_deletion branch from c8bf210 to 892cfc1 Compare June 30, 2025 22:08
@tamilmani1989
Copy link
Contributor Author

/test

Copy link
Contributor

@thorn3r thorn3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Kudos for the thorough testing 👍

@11siddarth
Copy link

@tamilmani1989 Could you please provide instruction on how to implement/apply this fix/PR on our AKS environment that uses cilium.

@pchaigno pchaigno enabled auto-merge July 1, 2025 07:12
@pchaigno pchaigno added this pull request to the merge queue Jul 1, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 1, 2025
Merged via the queue into cilium:main with commit 5eca81c Jul 1, 2025
69 of 70 checks passed
@tklauser tklauser mentioned this pull request Jul 1, 2025
4 tasks
@tklauser tklauser added the backport/author The backport will be carried out by the author of the PR. label Jul 1, 2025
@tklauser
Copy link
Member

tklauser commented Jul 1, 2025

Hit non-trivial merge conflicts when attempting to backport this to v1.17. This will need a manual backport PR in case we still want this in v1.17.

@tklauser tklauser removed the backport/author The backport will be carried out by the author of the PR. label Jul 1, 2025
@tklauser tklauser mentioned this pull request Jul 1, 2025
2 tasks
@tklauser tklauser added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jul 1, 2025
@aanm aanm removed the backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. label Jul 1, 2025
@tamilmani1989
Copy link
Contributor Author

Hit non-trivial merge conflicts when attempting to backport this to v1.17. This will need a manual backport PR in case we still want this in v1.17.

created backport to 1.17 #40317

@11siddarth
Copy link

11siddarth commented Jul 2, 2025

@tamilmani1989
Thank you for working on it and getting this fix for the issue.
Had couple of questions with this fix :

  1. How can we implement this in AKS cluster for immediate testing. I see that my AKS cluster would pick cilium image from MCR for 1.17.4 , is your changes integrated/applied in MCR as well for cilium images ?
  2. Will this fix gracefully drain existing connections in the intermediate node avoiding service disruption during node termination or if it only stops new connections to the node to be deleted and still the existing connections get affected with this fix?

since this is specific to AKS, i moved the discussion to AKS thread: Azure/AKS#5038 (comment)

@aanm
Copy link
Member

aanm commented Aug 13, 2025

@tamilmani1989 FYI this PR has not been backported to v1.17 even tho it has the needs-backport/1.17 label.

@aanm
Copy link
Member

aanm commented Aug 13, 2025

@tamilmani1989 FYI this PR has not been backported to v1.17 even tho it has the needs-backport/1.17 label.

Oh it turns out it was backported in #40317. All good!

@aanm aanm removed the needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch label Aug 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/author The backport will be carried out by the author of the PR. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Drain connections from a node gracefully when node is terminated.
7 participants