cilium: fix socket termination for v4-in-v6 clients #39994
Conversation
@m0untains reported that v4-in-v6 mapped sockets connecting to a v4 UDP backend are not properly terminated:

- Bind a SOCK_DGRAM AF_INET socket in a server.
- Create a corresponding Kubernetes Service for this server.
- In a client, create a SOCK_DGRAM AF_INET6 socket and configure it to use a v4-mapped-on-v6 address. E.g. if the Kubernetes service address is 10.2.3.4, configure the address the client will connect to as ::ffff:10.2.3.4.
- Use the connect() + send() syscalls to create a long-lived UDP socket, and send packets at some interval (e.g. every 10 seconds) from the client to the server. Note: using sendto(), i.e. a short-lived socket, effectively works around the issue and does not produce the undesired behavior.
- Restart the server.
- Notice how the packets from the client are still sent to the old server IP address.

For the client v4-in-v6 case we store the revnat entry in the cilium_lb4_reverse_sk map. When the backend goes down, we iterate all clients, and the current logic derives where to iterate in netlink based on the backend's address family (in this case v4). But given the client is a v6 socket, it will never be found by the v4 iteration. This means that for all v4 backends we also need to iterate all v6 sockets to try and find a match.

Closes: #39470
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Clean it up, not needed. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The cgroup/sock_release hook implementation forgot about v6 sockets which are connected to v4 backends via v4-in-v6 addressing. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 98b6da4 to 08fd79f
/test
Force-pushed from 6a5b501 to 8797929
In my local testing I ran Cilium as a standalone process rather than in a container. As a result, /var/run/cilium/netns was not mounted. Testing socket termination showed that for a given service, the first time a backend with client connections was deleted, it was terminated properly, that is, client sockets got destroyed fine. However, re-adding the backend with new client connections and then terminating it again turned out /not/ to work. After the grace period the client socket was still alive:

```
c3-small-x86-02-bpf-fix> db/watch backends
Address                  Instances                           Shadows   NodeName
147.28.227.99:6443/TCP   default/kubernetes (https)
10.244.0.128:80/TCP      default/nginx-service                         c3-small-x86-02-bpf-fix
10.244.0.58:80/TCP       default/nginx-service                         c3-small-x86-02-bpf-fix
10.244.0.231:9153/TCP    kube-system/kube-dns (metrics)                c3-small-x86-02-bpf-fix
10.244.0.231:53/UDP      kube-system/kube-dns (dns)                    c3-small-x86-02-bpf-fix
10.244.0.231:53/TCP      kube-system/kube-dns (dns-tcp)                c3-small-x86-02-bpf-fix
10.244.0.213:9153/TCP    kube-system/kube-dns (metrics)                c3-small-x86-02-bpf-fix
10.244.0.213:53/UDP      kube-system/kube-dns (dns)                    c3-small-x86-02-bpf-fix
10.244.0.213:53/TCP      kube-system/kube-dns (dns-tcp)                c3-small-x86-02-bpf-fix
10.244.0.99:9090/UDP     default/udp-service                           c3-small-x86-02-bpf-fix
10.244.0.99:9090/UDP     default/udp-service [terminating]             c3-small-x86-02-bpf-fix
10.244.0.99:9090/UDP     default/udp-service [terminating]             c3-small-x86-02-bpf-fix (deleted)  <-- termination worked
10.244.0.153:9090/UDP    default/udp-service                           c3-small-x86-02-bpf-fix
10.244.0.153:9090/UDP    default/udp-service [terminating]             c3-small-x86-02-bpf-fix
10.244.0.153:9090/UDP    default/udp-service [terminating]             c3-small-x86-02-bpf-fix (deleted)  <-- termination did not work
```

From debugging, what happened was that the reconciler was stuck trying to iterate through all netns'es:

```
goroutine 1207 [chan send, 3 minutes]:
github.com/cilium/cilium/pkg/netns.All()
	/root/go/src/github.com/cilium/cilium/pkg/netns/netns_linux.go:263 +0xf3
github.com/cilium/cilium/pkg/loadbalancer/reconciler.terminateUDPConnectionsToBackend({{{}}, {0x4fd2370, 0xc001da62d0}, 0xc002a80c90, {0x5074e58, 0xc00207b980}, 0xc0019d5d40, {{0x2faf080, 0x2faf080, 0x10000, ...}, ...}, ...}, ...)
	/root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:249 +0x6ff
github.com/cilium/cilium/pkg/loadbalancer/reconciler.socketTerminationLoop-range1({0xc003e75270?, 0x448be80?, 0xc0?}, 0xc0038bfab8?)
	/root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:156 +0x2bd
github.com/cilium/statedb.(*changeIterator[...]).Next.func2()
	/root/go/src/github.com/cilium/cilium/vendor/github.com/cilium/statedb/iterator.go:327 +0xc4
github.com/cilium/cilium/pkg/loadbalancer/reconciler.socketTerminationLoop({{{}}, {0x4fd2370, 0xc001da62d0}, 0xc002a80c90, {0x5074e58, 0xc00207b980}, 0xc0019d5d40, {{0x2faf080, 0x2faf080, 0x10000, ...}, ...}, ...}, ...)
	/root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:133 +0x38b
github.com/cilium/cilium/pkg/loadbalancer/reconciler.registerSocketTermination.func1({0x50079f8?, 0xc0004e9650?}, {0x5010da0?, 0xc002ee6000?})
	/root/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/termination.go:112 +0x5e
github.com/cilium/hive/job.(*jobOneShot).start(0xc0007e3200, {0x50079f8, 0xc0004e9650}, 0x0?, {0x5010da0, 0xc0007e2c60}, {{{0xc001417c80, 0x1, 0x1}}, 0xc0019d5d40, ...})
	/root/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/job/oneshot.go:138 +0x4fd
created by github.com/cilium/hive/job.(*queuedJob).Start.func1 in goroutine 1
	/root/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/job/job.go:126 +0x16f
```

In particular, the errCh in All() was unbuffered, so the goroutine could not send to it without blocking itself. We hit an error here because on my instance /var/run/cilium/netns was not mounted.

That alone is not enough to fix the problem, as I subsequently hit a nil pointer dereference: the iter returned from p.NetNSOps.all needs a nil check, since upon error we return nil there. After all that, things worked as expected. Kudos to Jussi, who quickly spotted the bug in errCh!

Suggested-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 8797929 to 9cfc6c7
/test
LGTM from sig-k8s
(As discussed with Max, given this is a regression and fix for KPR, we need to land it before the release.)
LGTM
```c
static __always_inline int sock4_delete_revnat(struct bpf_sock *ctx,
					       struct bpf_sock *ctx_full)
```
Nit: this is confusing. When I see `ctx` I expect it's a usable context. I would make the first argument `ctx_fake` and the second `ctx`. Hopefully that hints that the first isn't usable as a proper context.

Better yet, change to `int sock4_delete_revnat(struct bpf_sock *ctx, __u32 address, __u16 port);` since most of the fake ctx is uninitialized anyway.
> Nit: this is confusing. When I see `ctx` I expect it's a usable context. I would make the first argument `ctx_fake` and the second `ctx`. Hopefully that hints that the first isn't usable.
The ctx + ctx_full scheme follows the same pattern as other function signatures, so I didn't want to suddenly reverse the order compared to the others.
> Better yet, change to `int sock4_delete_revnat(struct bpf_sock *ctx, __u32 address, __u16 port);` since most of the fake ctx is uninitialized anyway.
I can look into it, though then it's a bit less in line with sock6_delete_revnat, so maybe we need something similar there too to be consistent.
```go
	for err := range errs {
		p.Log.Debug("Error opening netns, skipping",
			logfields.Error, err)
	}
	if iter != nil {
```
This does not work. We must iterate over `iter` first, as otherwise we'll block forever in `for err := range errs`.
(see commit desc)
Fixes: #39470
(For KPR=true we also need the fix in #40026 to trigger the reconciler.)