
Conversation

borkmann
Member

@borkmann borkmann commented Jun 11, 2025

(see commit desc)

Fixes: #39470

(For KPR=true we also need the fix in #40026 to trigger the reconciler.)

@borkmann borkmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. feature/socket-lb Impacts the Socket-LB part of Cilium's kube-proxy replacement. labels Jun 11, 2025
borkmann added 3 commits June 13, 2025 11:47
@m0untains reported that v4-in-v6 mapped sockets connecting to a v4 UDP
backend are not properly terminated:

  - Bind a SOCK_DGRAM AF_INET socket in a server.
  - Create a corresponding kubernetes Service for this server.
  - In a client, create a SOCK_DGRAM AF_INET6 socket. Configure the address
    to use a v4-mapped-on-v6 address type. E.g. if the kubernetes service
    address is 10.2.3.4, configure the address the client will connect to
    as ::ffff:10.2.3.4.
  - Use the connect() + send() syscalls to create a long-lived UDP socket,
    and send packets at some interval (e.g. every 10 seconds) from the client
    to the server. Note: using sendto(), i.e. a short-lived socket, effectively
    works around the issue and does not produce the undesired behavior.
  - Restart the server
  - Notice how the packets from the client are still sent to the old server
    IP address.
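The v4-mapped-on-v6 addressing used in the steps above can be illustrated with a short Go sketch (addresses are the hypothetical ones from the report): both spellings name the same endpoint, yet a client that connects through an AF_INET6 socket is tracked by the kernel as a v6 socket.

```go
package main

import (
	"fmt"
	"net"
)

// sameEndpoint reports whether a dotted-quad v4 address and a
// v4-mapped-on-v6 spelling (::ffff:a.b.c.d) denote the same endpoint.
func sameEndpoint(v4, v6mapped string) bool {
	return net.ParseIP(v4).Equal(net.ParseIP(v6mapped))
}

func main() {
	// Hypothetical Service VIP from the reproduction steps. A client
	// that connect()s an AF_INET6 SOCK_DGRAM socket to the mapped form
	// reaches the v4 backend, but the kernel sees a v6 socket, so a
	// v4-only socket walk never finds it.
	fmt.Println(sameEndpoint("10.2.3.4", "::ffff:10.2.3.4")) // prints "true"
}
```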

For the client v4-in-v6 case we store the revnat entry in the cilium_lb4_reverse_sk
map. When the backend goes down, we iterate over all clients, and the current
logic derives which address family to iterate in netlink from the backend's
address family (v4 in this case). But given the client is a v6 socket, it will
never be found by the v4 iteration. This means that for all v4 backends we also
need to iterate over all v6 sockets to try and find a match.

Closes: #39470
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Clean it up, not needed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The cgroup/sock_release hook implementation forgot about v6 sockets
which are connected to v4 backends via v4-in-v6 addressing.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann
Copy link
Member Author

/test

@borkmann borkmann requested a review from a team as a code owner June 16, 2025 10:10
@borkmann borkmann requested a review from youngnick June 16, 2025 10:10
In my local testing I ran Cilium as a standalone process rather than a
container. As a result, /var/run/cilium/netns was not mounted.

When testing socket termination, it turned out that for a given service,
the first time a backend with client connections was deleted, it was
terminated properly, that is, the client sockets got destroyed fine.

However, re-adding the backend with new client connections and then
terminating it again turned out /not/ to work. After the grace period
the client socket was still alive:

  c3-small-x86-02-bpf-fix> db/watch backends
  Address                  Instances                        Shadows   NodeName
  147.28.227.99:6443/TCP   default/kubernetes (https)
  10.244.0.128:80/TCP      default/nginx-service                      c3-small-x86-02-bpf-fix
  10.244.0.58:80/TCP       default/nginx-service                      c3-small-x86-02-bpf-fix
  10.244.0.231:9153/TCP    kube-system/kube-dns (metrics)             c3-small-x86-02-bpf-fix
  10.244.0.231:53/UDP      kube-system/kube-dns (dns)                 c3-small-x86-02-bpf-fix
  10.244.0.231:53/TCP      kube-system/kube-dns (dns-tcp)             c3-small-x86-02-bpf-fix
  10.244.0.213:9153/TCP    kube-system/kube-dns (metrics)             c3-small-x86-02-bpf-fix
  10.244.0.213:53/UDP      kube-system/kube-dns (dns)                 c3-small-x86-02-bpf-fix
  10.244.0.213:53/TCP      kube-system/kube-dns (dns-tcp)             c3-small-x86-02-bpf-fix
  10.244.0.99:9090/UDP     default/udp-service                        c3-small-x86-02-bpf-fix
  10.244.0.99:9090/UDP     default/udp-service [terminating]          c3-small-x86-02-bpf-fix
  10.244.0.99:9090/UDP     default/udp-service [terminating]          c3-small-x86-02-bpf-fix (deleted) <-- termination worked
  10.244.0.153:9090/UDP    default/udp-service                        c3-small-x86-02-bpf-fix
  10.244.0.153:9090/UDP    default/udp-service [terminating]          c3-small-x86-02-bpf-fix
  10.244.0.153:9090/UDP    default/udp-service [terminating]          c3-small-x86-02-bpf-fix (deleted) <-- termination did not work

From debugging what happened was that the reconciler was stuck when
trying to iterate through all netns'es:

  goroutine 1207 [chan send, 3 minutes]:
  github.com/cilium/cilium/pkg/netns.All()
          /root/go/src/github.com/cilium/cilium/pkg/netns/netns_linux.go:263 +0xf3
  github.com/cilium/cilium/pkg/loadbalancer/reconciler.terminateUDPConnectionsToBackend({{{}}, {0x4fd2370, 0xc001da62d0}, 0xc002a80c90, {0x5074e58, 0xc00207b980}, 0xc0019d5d40, {{0x2faf080, 0x2faf080, 0x10000, ...}, ...}, ...}, ...)
          /root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:249 +0x6ff
  github.com/cilium/cilium/pkg/loadbalancer/reconciler.socketTerminationLoop-range1({0xc003e75270?, 0x448be80?, 0xc0?}, 0xc0038bfab8?)
          /root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:156 +0x2bd
  github.com/cilium/statedb.(*changeIterator[...]).Next.func2()
          /root/go/src/github.com/cilium/cilium/vendor/github.com/cilium/statedb/iterator.go:327 +0xc4
  github.com/cilium/cilium/pkg/loadbalancer/reconciler.socketTerminationLoop({{{}}, {0x4fd2370, 0xc001da62d0}, 0xc002a80c90, {0x5074e58, 0xc00207b980}, 0xc0019d5d40, {{0x2faf080, 0x2faf080, 0x10000, ...}, ...}, ...}, ...)
          /root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:133 +0x38b
  github.com/cilium/cilium/pkg/loadbalancer/reconciler.registerSocketTermination.func1({0x50079f8?, 0xc0004e9650?}, {0x5010da0?, 0xc002ee6000?})
          /root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:112 +0x5e
  github.com/cilium/hive/job.(*jobOneShot).start(0xc0007e3200, {0x50079f8, 0xc0004e9650}, 0x0?, {0x5010da0, 0xc0007e2c60}, {{{0xc001417c80, 0x1, 0x1}}, 0xc0019d5d40, ...})
          /root/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/job/oneshot.go:138 +0x4fd
  created by github.com/cilium/hive/job.(*queuedJob).Start.func1 in goroutine 1
          /root/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/job/job.go:126 +0x16f

In particular, the errCh in All() was unbuffered, and thus All() cannot
send to it without blocking itself. We hit an error here since on my
instance /var/run/cilium/netns was not mounted.

That alone is not enough to fix the issue, as I subsequently hit a nil
pointer dereference. The iter returned from p.NetNSOps.all needs a nil
check since upon error we return nil there.
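A minimal sketch of the errCh fix, with illustrative names (allNetNS stands in for pkg/netns.All(), which this is not the actual code of): buffering the error channel lets the producer report a setup failure even before any reader exists.

```go
package main

import (
	"errors"
	"fmt"
)

// allNetNS is an illustrative stand-in for pkg/netns.All(): it returns an
// iterator over netns names plus an error channel. The key detail is the
// buffer of 1 on errCh: with an unbuffered channel, a setup error would be
// sent before any reader exists and the goroutine would block forever,
// which is exactly the hang seen in the reconciler stack trace above.
func allNetNS(mountMissing bool) (<-chan string, <-chan error) {
	errCh := make(chan error, 1) // buffered: the send below cannot block
	out := make(chan string)
	go func() {
		defer close(errCh)
		defer close(out)
		if mountMissing {
			errCh <- errors.New("/var/run/cilium/netns not mounted")
			return
		}
		out <- "cni-1234"
	}()
	return out, errCh
}

func main() {
	iter, errs := allNetNS(true)
	// Mirror the described fix: guard against a nil iterator (the real
	// API can return nil on error) and drain it before collecting errors.
	if iter != nil {
		for ns := range iter {
			fmt.Println("netns:", ns)
		}
	}
	for err := range errs {
		fmt.Println("skipping:", err)
	}
}
```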

After all that things worked as expected. Kudos to Jussi who quickly
spotted the bug in errCh!

Suggested-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann
Member Author

/test

@borkmann borkmann added the backport/1.17 This PR represents a backport for Cilium 1.17.x of a PR that was merged to main. label Jun 16, 2025
Contributor

@youngnick youngnick left a comment


LGTM from sig-k8s

@gentoo-root gentoo-root added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jun 16, 2025
@borkmann borkmann added release-blocker/1.18 This issue will prevent the release of the next version of Cilium. and removed dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs labels Jun 16, 2025
@borkmann
Member Author

(As discussed with Max, given this is a regression and fix for KPR, we need to land it before release.)

@joestringer joestringer moved this from Proposed to Planned in Release blockers Jun 16, 2025
@joestringer joestringer moved this from Planned to Active in Release blockers Jun 16, 2025
@joestringer joestringer added this to the 1.18 milestone Jun 16, 2025
@borkmann borkmann added needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch and removed backport/1.17 This PR represents a backport for Cilium 1.17.x of a PR that was merged to main. labels Jun 16, 2025
Contributor

@joamaki joamaki left a comment


LGTM

Comment on lines +157 to +158
static __always_inline int sock4_delete_revnat(struct bpf_sock *ctx,
					       struct bpf_sock *ctx_full)
Member

@dylandreimerink dylandreimerink Jun 17, 2025


Nit: this is confusing. When I see ctx I expect it's a usable context. I would make the first argument ctx_fake and the second ctx. Hopefully that hints that the first isn't usable as a proper context.

Better yet, change it to int sock4_delete_revnat(struct bpf_sock *ctx, __u32 address, __u16 port); since most of the fake ctx is uninitialized anyway.

Member Author


Nit: this is confusing. When I see ctx I expect it's a usable context. I would make the first argument ctx_fake and the second ctx. Hopefully that hints that the first isn't usable.

The ctx + ctx_full scheme is the same as in the other function signatures, so I didn't want to suddenly reverse the order compared to the others.

Better yet, change it to int sock4_delete_revnat(struct bpf_sock *ctx, __u32 address, __u16 port); since most of the fake ctx is uninitialized anyway.

I can look into it, though then it's a bit less in line with sock6_delete_revnat, so maybe we need something similar there too to be consistent.

@borkmann borkmann added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 17, 2025
@borkmann borkmann merged commit 2296784 into main Jun 17, 2025
316 of 320 checks passed
@borkmann borkmann deleted the pr/sock-v4-in-v6 branch June 17, 2025 08:57
@github-project-automation github-project-automation bot moved this from Active to Done in Release blockers Jun 17, 2025
@joamaki joamaki mentioned this pull request Jun 17, 2025
5 tasks
@joamaki joamaki added the backport/author The backport will be carried out by the author of the PR. label Jun 17, 2025
	for err := range errs {
		p.Log.Debug("Error opening netns, skipping",
			logfields.Error, err)
	}
	if iter != nil {
Contributor


This does not work. We must iterate over iter first, as otherwise we'll block forever in for err := range errs.
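A sketch of why this ordering matters, with hypothetical names (not Cilium's code): the producer only closes the error channel after it finishes sending items, so the consumer must drain the iterator before ranging over the errors.

```go
package main

import "fmt"

// produce sends items on an unbuffered channel and only closes the error
// channel once iteration finishes. A consumer that does
// `for err := range errs` before draining iter deadlocks, because the
// producer is parked on `out <- ns` and never reaches close(errCh).
func produce() (<-chan string, <-chan error) {
	out := make(chan string)
	errCh := make(chan error)
	go func() {
		defer close(errCh) // runs last: errs only closes after iteration
		defer close(out)
		for _, ns := range []string{"cni-1", "cni-2"} {
			out <- ns // blocks until the consumer reads
		}
	}()
	return out, errCh
}

func main() {
	iter, errs := produce()
	for ns := range iter { // correct order: drain the iterator first...
		fmt.Println("netns:", ns)
	}
	for err := range errs { // ...then collect any errors
		fmt.Println("skipping:", err)
	}
}
```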

@borkmann borkmann added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jul 1, 2025
Successfully merging this pull request may close these issues.

bpf-lb-sock-terminate-pod-connections does not correctly terminate long-lived udp ipv6 connections