
Conversation

borkmann
Member

@borkmann borkmann commented Jun 11, 2025

(see commit desc)

Fixes: #39470

(For KPR=true we also need the fix in #40026 to trigger the reconciler.)

@borkmann borkmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. feature/socket-lb Impacts the Socket-LB part of Cilium's kube-proxy replacement. labels Jun 11, 2025
borkmann added 3 commits June 13, 2025 11:47
@m0untains reported that v4-in-v6 mapped sockets connecting to a v4 UDP
backend are not properly terminated:

  - Bind a SOCK_DGRAM AF_INET socket in a server.
  - Create a corresponding kubernetes Service for this server.
  - In a client, create a SOCK_DGRAM AF_INET6 socket. Configure the address
    to use a v4-mapped-on-v6 address type. E.g. if the kubernetes service
    address is 10.2.3.4, configure the address the client will connect to
    as ::ffff:10.2.3.4.
  - Use the connect() + send() syscalls to create a long-lived UDP socket,
    and send packets at some interval (e.g. every 10 seconds) from the client
    to the server. Note: using sendto(), i.e. a short-lived socket, effectively
    works around the issue and does not produce the undesired behavior.
  - Restart the server
  - Notice how the packets from the client are still sent to the old server
    IP address.
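The v4-mapped-on-v6 addressing used in the steps above can be illustrated with a short Go sketch (addresses are the hypothetical ones from the report): both spellings name the same endpoint, yet a client that connects through an AF_INET6 socket is tracked by the kernel as a v6 socket.

```go
package main

import (
	"fmt"
	"net"
)

// sameEndpoint reports whether a dotted-quad v4 address and a
// v4-mapped-on-v6 spelling (::ffff:a.b.c.d) denote the same endpoint.
func sameEndpoint(v4, v6mapped string) bool {
	return net.ParseIP(v4).Equal(net.ParseIP(v6mapped))
}

func main() {
	// Hypothetical Service VIP from the reproduction steps. A client
	// that connect()s an AF_INET6 SOCK_DGRAM socket to the mapped form
	// reaches the v4 backend, but the kernel sees a v6 socket, so a
	// v4-only socket walk never finds it.
	fmt.Println(sameEndpoint("10.2.3.4", "::ffff:10.2.3.4")) // prints "true"
}
```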

For the client v4-in-v6 case we store the revnat entry in the cilium_lb4_reverse_sk
map. When the backend goes down, we iterate over all clients, and the current
logic derives which address family to iterate in netlink from the backend's
address family (v4 in this case). But given the client is a v6 socket, it will
never be found by the v4 iteration. This means that for all v4 backends we also
need to iterate over all v6 sockets to try and find a match.

Closes: #39470
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Clean it up, not needed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The cgroup/sock_release hook implementation forgot about v6 sockets
which are connected to v4 backends via v4-in-v6 addressing.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann
Copy link
Member Author

/test

@borkmann borkmann requested a review from a team as a code owner June 16, 2025 10:10
@borkmann borkmann requested a review from youngnick June 16, 2025 10:10
In my local testing I ran Cilium as a standalone process rather than a
container. As a result, /var/run/cilium/netns was not mounted.

When testing socket termination, it turned out that for a given service,
the first time a backend with client connections was deleted, it was
terminated properly, that is, the client sockets got destroyed fine.

However, re-adding the backend with new client connections and then
terminating it again turned out /not/ to work. After the grace period
the client socket was still alive:

  c3-small-x86-02-bpf-fix> db/watch backends
  Address                  Instances                        Shadows   NodeName
  147.28.227.99:6443/TCP   default/kubernetes (https)
  10.244.0.128:80/TCP      default/nginx-service                      c3-small-x86-02-bpf-fix
  10.244.0.58:80/TCP       default/nginx-service                      c3-small-x86-02-bpf-fix
  10.244.0.231:9153/TCP    kube-system/kube-dns (metrics)             c3-small-x86-02-bpf-fix
  10.244.0.231:53/UDP      kube-system/kube-dns (dns)                 c3-small-x86-02-bpf-fix
  10.244.0.231:53/TCP      kube-system/kube-dns (dns-tcp)             c3-small-x86-02-bpf-fix
  10.244.0.213:9153/TCP    kube-system/kube-dns (metrics)             c3-small-x86-02-bpf-fix
  10.244.0.213:53/UDP      kube-system/kube-dns (dns)                 c3-small-x86-02-bpf-fix
  10.244.0.213:53/TCP      kube-system/kube-dns (dns-tcp)             c3-small-x86-02-bpf-fix
  10.244.0.99:9090/UDP     default/udp-service                        c3-small-x86-02-bpf-fix
  10.244.0.99:9090/UDP     default/udp-service [terminating]          c3-small-x86-02-bpf-fix
  10.244.0.99:9090/UDP     default/udp-service [terminating]          c3-small-x86-02-bpf-fix (deleted) <-- termination worked
  10.244.0.153:9090/UDP    default/udp-service                        c3-small-x86-02-bpf-fix
  10.244.0.153:9090/UDP    default/udp-service [terminating]          c3-small-x86-02-bpf-fix
  10.244.0.153:9090/UDP    default/udp-service [terminating]          c3-small-x86-02-bpf-fix (deleted) <-- termination did not work

From debugging what happened was that the reconciler was stuck when
trying to iterate through all netns'es:

  goroutine 1207 [chan send, 3 minutes]:
  github.com/cilium/cilium/pkg/netns.All()
          /root/go/src/github.com/cilium/cilium/pkg/netns/netns_linux.go:263 +0xf3
  github.com/cilium/cilium/pkg/loadbalancer/reconciler.terminateUDPConnectionsToBackend({{{}}, {0x4fd2370, 0xc001da62d0}, 0xc002a80c90, {0x5074e58, 0xc00207b980}, 0xc0019d5d40, {{0x2faf080, 0x2faf080, 0x10000, ...}, ...}, ...}, ...)
          /root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:249 +0x6ff
  github.com/cilium/cilium/pkg/loadbalancer/reconciler.socketTerminationLoop-range1({0xc003e75270?, 0x448be80?, 0xc0?}, 0xc0038bfab8?)
          /root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:156 +0x2bd
  github.com/cilium/statedb.(*changeIterator[...]).Next.func2()
          /root/go/src/github.com/cilium/cilium/vendor/github.com/cilium/statedb/iterator.go:327 +0xc4
  github.com/cilium/cilium/pkg/loadbalancer/reconciler.socketTerminationLoop({{{}}, {0x4fd2370, 0xc001da62d0}, 0xc002a80c90, {0x5074e58, 0xc00207b980}, 0xc0019d5d40, {{0x2faf080, 0x2faf080, 0x10000, ...}, ...}, ...}, ...)
          /root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:133 +0x38b
  github.com/cilium/cilium/pkg/loadbalancer/reconciler.registerSocketTermination.func1({0x50079f8?, 0xc0004e9650?}, {0x5010da0?, 0xc002ee6000?})
          /root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:112 +0x5e
  github.com/cilium/hive/job.(*jobOneShot).start(0xc0007e3200, {0x50079f8, 0xc0004e9650}, 0x0?, {0x5010da0, 0xc0007e2c60}, {{{0xc001417c80, 0x1, 0x1}}, 0xc0019d5d40, ...})
          /root/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/job/oneshot.go:138 +0x4fd
  created by github.com/cilium/hive/job.(*queuedJob).Start.func1 in goroutine 1
          /root/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/job/job.go:126 +0x16f

In particular, the errCh in All() was unbuffered, and thus All() cannot
send to it without blocking itself. We hit an error here since on my
instance /var/run/cilium/netns was not mounted.

That alone is not enough to fix the issue, as I subsequently hit a nil
pointer dereference. The iter returned from p.NetNSOps.all needs a nil
check since upon error we return nil there.
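A minimal sketch of the errCh fix, with illustrative names (allNetNS stands in for pkg/netns.All(), which this is not the actual code of): buffering the error channel lets the producer report a setup failure even before any reader exists.

```go
package main

import (
	"errors"
	"fmt"
)

// allNetNS is an illustrative stand-in for pkg/netns.All(): it returns an
// iterator over netns names plus an error channel. The key detail is the
// buffer of 1 on errCh: with an unbuffered channel, a setup error would be
// sent before any reader exists and the goroutine would block forever,
// which is exactly the hang seen in the reconciler stack trace above.
func allNetNS(mountMissing bool) (<-chan string, <-chan error) {
	errCh := make(chan error, 1) // buffered: the send below cannot block
	out := make(chan string)
	go func() {
		defer close(errCh)
		defer close(out)
		if mountMissing {
			errCh <- errors.New("/var/run/cilium/netns not mounted")
			return
		}
		out <- "cni-1234"
	}()
	return out, errCh
}

func main() {
	iter, errs := allNetNS(true)
	// Mirror the described fix: guard against a nil iterator (the real
	// API can return nil on error) and drain it before collecting errors.
	if iter != nil {
		for ns := range iter {
			fmt.Println("netns:", ns)
		}
	}
	for err := range errs {
		fmt.Println("skipping:", err)
	}
}
```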

After all that things worked as expected. Kudos to Jussi who quickly
spotted the bug in errCh!

Suggested-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann
Member Author

/test

@borkmann borkmann added the backport/1.17 This PR represents a backport for Cilium 1.17.x of a PR that was merged to main. label Jun 16, 2025
Contributor

@youngnick youngnick left a comment


LGTM from sig-k8s

@gentoo-root gentoo-root added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jun 16, 2025
@borkmann borkmann added release-blocker/1.18 This issue will prevent the release of the next version of Cilium. and removed dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs labels Jun 16, 2025
@borkmann
Member Author

(As discussed with Max, given this is a regression and fix for KPR, we need to land it before release.)

@joestringer joestringer moved this from Proposed to Planned in Release blockers Jun 16, 2025
@joestringer joestringer moved this from Planned to Active in Release blockers Jun 16, 2025
@joestringer joestringer added this to the 1.18 milestone Jun 16, 2025
@borkmann borkmann added needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch and removed backport/1.17 This PR represents a backport for Cilium 1.17.x of a PR that was merged to main. labels Jun 16, 2025
Contributor

@joamaki joamaki left a comment


LGTM

Comment on lines +157 to +158
static __always_inline int sock4_delete_revnat(struct bpf_sock *ctx,
					       struct bpf_sock *ctx_full)
Member

@dylandreimerink dylandreimerink Jun 17, 2025


Nit: this is confusing. When I see ctx I expect it's a usable context. I would make the first argument ctx_fake and the second ctx. Hopefully that hints that the first isn't usable as a proper context.

Better yet, change it to int sock4_delete_revnat(struct bpf_sock *ctx, __u32 address, __u16 port); since most of the fake ctx is uninitialized anyway.

Member Author


Nit: this is confusing. When I see ctx I expect it's a usable context. I would make the first argument ctx_fake and the second ctx. Hopefully that hints that the first isn't usable.

The ctx + ctx_full scheme is the same as in the other function signatures, so I didn't want to suddenly reverse the order compared to the others.

Better yet, change it to int sock4_delete_revnat(struct bpf_sock *ctx, __u32 address, __u16 port); since most of the fake ctx is uninitialized anyway.

I can look into it, though then it's a bit less in line with sock6_delete_revnat, so maybe we need something similar there too to be consistent.

@borkmann borkmann added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 17, 2025
@borkmann borkmann merged commit 2296784 into main Jun 17, 2025
316 of 320 checks passed
@borkmann borkmann deleted the pr/sock-v4-in-v6 branch June 17, 2025 08:57
@github-project-automation github-project-automation bot moved this from Active to Done in Release blockers Jun 17, 2025
@joamaki joamaki mentioned this pull request Jun 17, 2025
5 tasks
@joamaki joamaki added the backport/author The backport will be carried out by the author of the PR. label Jun 17, 2025
	for err := range errs {
		p.Log.Debug("Error opening netns, skipping",
			logfields.Error, err)
	}
	if iter != nil {
Contributor


This does not work. We must iterate over iter first, as otherwise we'll block forever in for err := range errs.
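A sketch of why this ordering matters, with hypothetical names (not Cilium's code): the producer only closes the error channel after it finishes sending items, so the consumer must drain the iterator before ranging over the errors.

```go
package main

import "fmt"

// produce sends items on an unbuffered channel and only closes the error
// channel once iteration finishes. A consumer that does
// `for err := range errs` before draining iter deadlocks, because the
// producer is parked on `out <- ns` and never reaches close(errCh).
func produce() (<-chan string, <-chan error) {
	out := make(chan string)
	errCh := make(chan error)
	go func() {
		defer close(errCh) // runs last: errs only closes after iteration
		defer close(out)
		for _, ns := range []string{"cni-1", "cni-2"} {
			out <- ns // blocks until the consumer reads
		}
	}()
	return out, errCh
}

func main() {
	iter, errs := produce()
	for ns := range iter { // correct order: drain the iterator first...
		fmt.Println("netns:", ns)
	}
	for err := range errs { // ...then collect any errors
		fmt.Println("skipping:", err)
	}
}
```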

@borkmann borkmann added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jul 1, 2025
Successfully merging this pull request may close these issues.

bpf-lb-sock-terminate-pod-connections does not correctly terminate long-lived udp ipv6 connections