sockets: Terminate sockets with BPF socket iterators #38693
Conversation
Move cilium_lb(4|6)_reverse_sk map definitions to sock.h so that they can be shared with bpf_sock_term.c, a program added in subsequent commits which can destroy sockets connected to a particular backend. Fixes: cilium#37907 Signed-off-by: Jordan Rife <jrife@google.com>
Implement bpf_sock_term.c, the program that handles socket destruction, and its unit tests. cil_sock_(udp|tcp)_destroy_v4 and cil_sock_(udp|tcp)_destroy_v6 are meant to be used in conjunction with BPF socket iterators to iterate through all sockets in a network namespace and destroy any matching the configured filter. cilium_sock_term_filter is a variable containing the filter, a value consisting of a backend's address family, address, and port.

The programs combine the current socket's cookie and the configured filter into a cilium_lb(4|6)_reverse_sk map key and check for the existence of that key to see if this socket was connected to the now obsolete backend. If so, it destroys it with the bpf_sock_destroy() kfunc.

An earlier prototype of the program simply used DECLARE_CONFIG for each of the filter fields, requiring a new program to be configured and loaded for each new backend filter. This was convenient since there was no need to coordinate access to the variable in userspace between parallel calls to TerminateUDPConnectionsToBackend(), but was ultimately too slow (about 100x slower according to a simple benchmark). So, userspace needs to synchronize access to cilium_sock_term_filter and this program using a mutex. Each "iterator instance" needs to follow this sequence:

1) Acquire some lock.
2) Set the value of the filter in cilium_sock_term_filter.
3) Create a new socket iterator.
4) Read all items in the iterator.
5) Release the lock.

It's OK for different calls to TerminateConnectionsToBackend() to intertwine as long as these "critical sections" are sequential between threads.

Fixes: cilium#37907
Signed-off-by: Jordan Rife <jrife@google.com>
bpf_sock_term.o was built from the start to be "clang free". Use bpf2go to embed compiled BPF bytecode and generate Go skeletons.

Reorder the struct definitions inside union u6addr to ensure that the generated Go struct has a [16]uint8 for the Addr field, and name the anonymous structs to get bpf2go to stop complaining.

Add llvm-strip, a dependency of bpf2go, to the builder image, and use the builder image for BPF generation (`make -C bpf generate`) to ensure reproducible results on systems with different LLVM versions.

Bump the github.com/cilium/ebpf module version to pull in [1].

[1]: cilium/ebpf#1829

Fixes: cilium#37907
Signed-off-by: Jordan Rife <jrife@google.com>
Implement an alternative strategy for destroying sockets in the current net namespace called BPFSocketDestroyer, and move the original netlink-based Destroy() function to a new SocketDestroyer called NetlinkSocketDestroyer. At startup, check if the bpf_sock_destroy kfunc is supported on the current kernel, and if not, fall back to NetlinkSocketDestroyer.

BPFSockDestroyer#Destroy() sets the value of the socket filter, initializes the socket iterator, then iterates to completion. It takes a lock over this whole sequence to ensure calls to Destroy() are serialized across parallel instances of TerminateConnectionsToBackend().

Create the SocketDestroyer inside the one-shot job inside registerSocketTermination to guarantee that all LB maps have been initialized, and plumb the socket revnat maps through to LoadSockTerm.

Fixes: cilium#37907
Signed-off-by: Jordan Rife <jrife@google.com>
Introduce test coverage and benchmarks for both SocketDestroyer implementations.

Fixes: cilium#37907
Signed-off-by: Jordan Rife <jrife@google.com>
/test
/ci-ipsec-e2e
Timo is out for a few weeks, and it looks like @jrife has done his best to address or respond to the earlier feedback. Strictly speaking that means that @cilium/loader hasn't given an approval, so I think it's reasonable to give a little extra time for someone there to see (maybe @rgo3 or @dylandreimerink ?) but if there's no further feedback in the next few days then it seems like this should be fine to merge. We can always iterate more in-tree.
Filed #40925 for merge queue status check failure case, we've seen this before and it's unrelated to this PR.
EDIT: @dylandreimerink found the culprit (🙏🏼 ), going to open a PR in a minute.
Any idea how to fix this?
In #38693 we added a new bpf program bpf/bpf_sock_term.c, and generated the Go skeleton via bpf2go. Such targets, respectively pkg/datapath/bpf/sockterm_bpf{el,be}.go, are however excluded while building docker images via `make kind-image-agent`, resulting in the error `0.456 ../pkg/datapath/bpf/sockterm_bpfel.go:198:12: pattern sockterm_bpfel.o: no matching files found`.

The generated Dockerfile.dockerignore includes the rule **/*.o, which instructs docker to omit any .o files. But it also includes the rule !pkg/datapath/bpf/**/*.o, which says it should add them to the context. Due to the ordering, the second rule does not work: it is placed before the first in the file. Once you swap them, everything works.

This commit modifies how we compute the GIT_IGNORE_FILES variable. The command finds all .gitignore files in the current directory and its subdirectories, excluding those within any vendor directories. It then sorts these .gitignore file paths by their "depth" in the directory structure, from shallowest to deepest.

Many thanks again to @dylandreimerink for the amazing investigation.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
This is a follow-up to #37907. `bpf_sock_destroy()`, a kfunc introduced in the 6.5 kernel by @aditighag, was implemented as a way to destroy UDP sockets connected to a stale UDP backend. While the original RFC mentions one possible advantage being the elimination of exposure and traversal of every network namespace, the final patch series still requires that socket iteration be done per namespace. However, this approach still reduces dependence on `CONFIG_INET_DIAG_DESTROY`, which the RFC claims is disabled by default. Perhaps a future kernel improvement to BPF socket iterators could allow namespace iteration to be pushed to the BPF layer as well.

It seems that this change has been planned for a while, so this PR takes a stab at implementing it. Originally, this was blocked by the availability of `bpf_sock_destroy()` in production kernels and the availability of newer versions of LLVM inside Cilium. However, 6.8+ is now the default kernel version on many Ubuntu releases, and LLVM has long since been upgraded; not that the LLVM version shipped with Cilium matters much, since this PR bakes `bpf_sock_term.o` into the Cilium agent image (more on this below).

This PR splits BPF, control plane, and test changes into different commits for easier review.
BPF Changes:

`bpf_sock_term.c` implements `cil_sock_udp_destroy`. `cil_sock_udp_destroy` is meant to be used in conjunction with BPF socket iterators to iterate through all sockets in a network namespace and destroy any matching the configured filter. `cilium_sock_term_filter` is a single-element array containing the filter, a value consisting of a backend's address family, address, and port.

An earlier prototype of the program simply used `DECLARE_CONFIG` for each of the filter fields, requiring a new program to be configured and loaded for each new backend filter. This was convenient since there was no need to coordinate access to the map in userspace between parallel calls to `TerminateUDPConnectionsToBackend()`, but was ultimately too slow (about 100x slower according to a simple benchmark). So, userspace needs to synchronize access to the map and this program using a mutex. Each "iterator instance" needs to follow this sequence:

1. Acquire some lock.
2. Set the value of the filter in `cilium_sock_term_filter`.
3. Create a new socket iterator.
4. Read all items in the iterator.
5. Release the lock.

It's OK for different calls to `TerminateUDPConnectionsToBackend()` to intertwine as long as these "critical sections" are sequential between threads.

The program combines the current socket's cookie and the configured filter into a `cilium_lb(4|6)_reverse_sk` map key and checks for the existence of that key to see if this socket was connected to the now obsolete backend. If so, it destroys it with the `bpf_sock_destroy()` kfunc.

A couple of points worth calling out:
- `bpf_sock_term.c` uses sparse definitions of `struct bpf_iter__udp` and its dependencies to avoid pulling in `vmlinux.h` or a long chain of dependencies; it's not as straightforward as other definitions, which can be cleanly copied in by syncing some header from the kernel.
- When reading `sk` from the iterator context, I couldn't pass that pointer to the `bpf_sock_destroy()` kfunc without the verifier complaining about unsafe pointers or some such thing. This could use some further investigation, but at least right now I'm not sure how to get this to work. So, I guess there's some risk of the field accesses breaking on newer kernels if offsets change in `struct bpf_iter__udp`. This probably warrants some discussion.
- `bpf_sock_term.c` is compiled once and built into the Cilium agent image, a big departure from other programs, which are compiled at runtime, but hopefully not an unwelcome one. With the clang-free milestone underway, I figured I would just skip right to the end with this new program and bake it into the image. I'm not sure how well this aligns with the long-term vision, so it would probably be good to get some input from @ti-mo and @dylandreimerink, who are driving the clang-free effort.

Control Plane Changes
Beyond the obvious plumbing that needs to happen to initialize programs and maps, this PR implements an alternative strategy for destroying sockets in the current net namespace called `BPFSocketDestroyer` and moves the original netlink-based `Destroy()` function to a new `SocketDestroyer` called `NetlinkSocketDestroyer`. `backendConnectionHandler` uses `BPFSocketDestroyer` by default but falls back to `NetlinkSocketDestroyer` if that fails, as would be the case if the current kernel does not support the `bpf_sock_destroy()` kfunc.

`BPFSockDestroyer#Destroy()` lazy-initializes the socket iterator program, sets the value in the socket filter map, initializes the socket iterator, then iterates to completion. It takes a lock over this whole sequence to ensure calls to `Destroy()` are serialized across parallel instances of `TerminateUDPConnectionsToBackend()`.

Test

This PR includes BPF unit tests as well as integration tests for the netlink and BPF socket destroyer variants, but here's a small demo of the BPF socket destroyer working with a workload I put together earlier to test connected UDP sockets.
Related Kernel Patches
I made some improvements upstream that eliminate scenarios where some sockets may be skipped or repeated during iteration for TCP and UDP socket iterators. This should make socket iterators as a mechanism for this sort of thing a bit more reliable.
[RFC PATCH bpf-next 0/3] Avoid skipping sockets with socket iterators
[PATCH v7 bpf-next 0/7] bpf: udp: Exactly-once socket iteration
[PATCH v6 bpf-next 00/12] bpf: tcp: Exactly-once socket iteration
Fixes: #37907