Skip to content

Conversation

DockToFuture
Copy link
Member

@DockToFuture DockToFuture commented Jun 30, 2025

How to categorize this PR?

/area networking
/kind enhancement

What this PR does / why we need it:
Starting from Kubernetes version 1.34, enabling or disabling node-local-dns will no longer trigger node rolling (except kube-proxy is running in IPVS mode). Instead, a cleanup job will be executed. Additionally, node-local-dns is deployed per WorkerPool and node-local-dns is removed from OSC and wokerPoolHash calculation.
Many infrastructures struggle to handle a large number of TCP connections for DNS queries, often resulting in rate throttling leading to "connection timeout" issues during DNS resolution. To address this, UDP connections will be preferred when communicating with the upstream DNS server.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Starting from Kubernetes version 1.34, enabling or disabling node-local-dns will no longer trigger node rolling (except kube-proxy is running in IPVS mode). Instead, a cleanup job will be executed. Additionally, node-local-dns is deployed per WorkerPool and node-local-dns will use UDP as default protocol for DNS queries to the upstream DNS server.
A separate `node-local-dns` `DaemonSet` is deployed for each worker pool such that each `DaemonSet` has the name `node-local-dns-<worker-pool-name>`.
If you are using `gardener-extension-networking-cilium` in your landscape, it is required to update it to a version which supports these new names for the `DaemonSet`s. 
Support is added with https://github.com/gardener/gardener-extension-networking-cilium/pull/622 and included in versions starting from: [`v1.42.1`](https://github.com/gardener/gardener-extension-networking-cilium/releases/tag/v1.42.1), [`v1.41.3`](https://github.com/gardener/gardener-extension-networking-cilium/releases/tag/v1.41.3) and [`v1.40.4`](https://github.com/gardener/gardener-extension-networking-cilium/releases/tag/v1.40.4)

Copy link
Contributor

gardener-prow bot commented Jun 30, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/networking Networking related kind/enhancement Enhancement, improvement, extension labels Jun 30, 2025
@gardener-prow gardener-prow bot requested a review from LucaBernstein June 30, 2025 09:19
@gardener-prow gardener-prow bot added the cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. label Jun 30, 2025
@gardener-prow gardener-prow bot requested a review from oliver-goetz June 30, 2025 09:19
@gardener-prow gardener-prow bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 30, 2025
@DockToFuture DockToFuture marked this pull request as ready for review June 30, 2025 09:21
@gardener-prow gardener-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 30, 2025
@gardener-prow gardener-prow bot requested review from rfranzke and tobschli June 30, 2025 09:21
@shafeeqes
Copy link
Contributor

Can you also adapt the validation at

allErrs = append(allErrs, field.Forbidden(fldPath.Child("systemComponents", "nodeLocalDNS"), "node-local-dns setting can not be changed if shoot has at least one worker pool with update strategy AutoInPlaceUpdate/ManualInPlaceUpdate"))
with the version check?

@DockToFuture DockToFuture force-pushed the update/node-local-dns branch 3 times, most recently from b2232dc to 886f3be Compare June 30, 2025 12:13
@ScheererJ
Copy link
Member

/assign

@DockToFuture DockToFuture changed the title Update/node local dns enable/disable node-local-dns without rolling nodes Jun 30, 2025
@DockToFuture DockToFuture force-pushed the update/node-local-dns branch from 886f3be to a2b1d99 Compare June 30, 2025 12:36
Copy link
Member

@ScheererJ ScheererJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the operational improvement that reduces the amount of node rollouts.

@gardener-prow gardener-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 1, 2025
@shafeeqes
Copy link
Contributor

shafeeqes commented Jul 2, 2025

Please also adapt

#### In-Place Update Triggers
An in-place update of the shoot worker nodes is triggered for rolling update triggers listed under [Rolling Update Triggers](#rolling-update-triggers) except for the following:
* `.spec.provider.workers[].machine.image.name`
* `.spec.provider.workers[].machine.type`
* `.spec.provider.workers[].volume.type`
* `.spec.provider.workers[].volume.size`
* `.spec.provider.workers[].cri.name`
* `.spec.systemComponents.nodeLocalDNS.enabled`
and
#### Rolling Update Triggers
A rolling update of the shoot worker nodes is triggered for some changes to your worker pool specification (`.spec.provider.workers[]`, even if you don't change the Kubernetes or machine image version).
The complete list of fields that trigger a rolling update:
* `.spec.kubernetes.version` (except for patch version changes)
* `.spec.provider.workers[].machine.image.name`
* `.spec.provider.workers[].machine.image.version`
* `.spec.provider.workers[].machine.type`
* `.spec.provider.workers[].volume.type`
* `.spec.provider.workers[].volume.size`
* `.spec.provider.workers[].providerConfig` (provider extension dependent with feature gate `NewWorkerPoolHash`)
* `.spec.provider.workers[].cri.name`
* `.spec.provider.workers[].kubernetes.version` (except for patch version changes)
* `.spec.systemComponents.nodeLocalDNS.enabled`
* `.status.credentials.rotation.certificateAuthorities.lastInitiationTime` (changed by Gardener when a shoot CA rotation is initiated) when worker pool is not part of `.status.credentials.rotation.certificateAuthorities.pendingWorkersRollouts[]`
* `.status.credentials.rotation.serviceAccountKey.lastInitiationTime` (changed by Gardener when a shoot service account signing key rotation is initiated) when worker pool is not part of `.status.credentials.rotation.serviceAccountKey.pendingWorkersRollouts[]`
with the new version constraint.

@rfranzke
Copy link
Member

rfranzke commented Jul 8, 2025

/assign

@DockToFuture DockToFuture force-pushed the update/node-local-dns branch from a2b1d99 to a313ba1 Compare July 8, 2025 13:24
@gardener-prow gardener-prow bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 8, 2025
@DockToFuture DockToFuture force-pushed the update/node-local-dns branch 2 times, most recently from 32744d9 to 277bbd1 Compare July 8, 2025 13:54
@ScheererJ ScheererJ force-pushed the update/node-local-dns branch from 96194bd to 3428e70 Compare August 20, 2025 11:44
@ScheererJ
Copy link
Member

Thanks @ScheererJ, I'm fine with merging now if you make the tests pass :)

Apparently, golangci-lint wanted to highlight that the PR needed to be rebased as something would break with this change. It could have been more verbose, but now the tests should work again.

@ScheererJ
Copy link
Member

/lgtm
/approve

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Aug 21, 2025
Copy link
Contributor

gardener-prow bot commented Aug 21, 2025

LGTM label has been added.

Git tree hash: 46f9fcb85cd3d7f2dbc9d0879161488a92d37d74

Copy link
Contributor

gardener-prow bot commented Aug 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ScheererJ

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2025
@gardener-prow gardener-prow bot merged commit 7b071a0 into gardener:master Aug 21, 2025
19 checks passed
ScheererJ added a commit to ScheererJ/gardener-extension-networking-cilium that referenced this pull request Aug 22, 2025
Before gardener/gardener#12422, node-local-dns
used a single daemonset for all nodes called node-local-dns. After that
change, each worker pool gets a separate daemonset with the name
node-local-dns-<worker-pool-name>.
The webhook mutating the daemonset needs to handle both cases
gracefully. Otherwise, node-local-dns pods may not work as expected.
ScheererJ added a commit to gardener/gardener-extension-networking-cilium that referenced this pull request Aug 22, 2025
Before gardener/gardener#12422, node-local-dns
used a single daemonset for all nodes called node-local-dns. After that
change, each worker pool gets a separate daemonset with the name
node-local-dns-<worker-pool-name>.
The webhook mutating the daemonset needs to handle both cases
gracefully. Otherwise, node-local-dns pods may not work as expected.
ScheererJ added a commit to gardener/gardener-extension-networking-cilium that referenced this pull request Aug 22, 2025
Before gardener/gardener#12422, node-local-dns
used a single daemonset for all nodes called node-local-dns. After that
change, each worker pool gets a separate daemonset with the name
node-local-dns-<worker-pool-name>.
The webhook mutating the daemonset needs to handle both cases
gracefully. Otherwise, node-local-dns pods may not work as expected.

```other operator
Cilium extension now works with worker pool specific node-local-dns daemonsets.
```
ScheererJ added a commit to gardener/gardener-extension-networking-cilium that referenced this pull request Aug 22, 2025
Before gardener/gardener#12422, node-local-dns
used a single daemonset for all nodes called node-local-dns. After that
change, each worker pool gets a separate daemonset with the name
node-local-dns-<worker-pool-name>.
The webhook mutating the daemonset needs to handle both cases
gracefully. Otherwise, node-local-dns pods may not work as expected.

```other operator
Cilium extension now works with worker pool specific node-local-dns daemonsets.
```
ScheererJ added a commit to gardener/gardener-extension-networking-cilium that referenced this pull request Aug 22, 2025
Before gardener/gardener#12422, node-local-dns
used a single daemonset for all nodes called node-local-dns. After that
change, each worker pool gets a separate daemonset with the name
node-local-dns-<worker-pool-name>.
The webhook mutating the daemonset needs to handle both cases
gracefully. Otherwise, node-local-dns pods may not work as expected.

```other operator
Cilium extension now works with worker pool specific node-local-dns daemonsets.
```
gardener-prow bot pushed a commit to gardener/gardener-extension-networking-cilium that referenced this pull request Aug 22, 2025
* Handle old and new node-local-dns daemonsets in cilium webhook.

Before gardener/gardener#12422, node-local-dns
used a single daemonset for all nodes called node-local-dns. After that
change, each worker pool gets a separate daemonset with the name
node-local-dns-<worker-pool-name>.
The webhook mutating the daemonset needs to handle both cases
gracefully. Otherwise, node-local-dns pods may not work as expected.

* Run `make generate`
Duciwuci pushed a commit to stackitcloud/gardener that referenced this pull request Sep 1, 2025
* make node-local-dns configurable per worker group

* run node-local-dns cleanup script when disabling node-local-dns instead of rolling the nodes

* update node-local-dns e2e tests

* address feedback

* add cleanup label to nodes after disabling node local dns and address feedback

* roll nodes when kube-proxy runs in ipvs mode and node-local-dns gets disabled

* add check into IsKubeProxyIPVSMode function

* address feedback

* address feedback

* address feedback

* address feedback

* incorporate further feedback

* incorporate additional feedback

* fix case when some workers are below 1.34 and some are equal or above

* do check controlplane version check before loop

* Address review feedback

* Address review feedback

* Fix operating system configuration test

---------

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>
@DockToFuture DockToFuture deleted the update/node-local-dns branch September 2, 2025 06:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/networking Networking related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants