cilium: netkit support #32429
Conversation
Force-pushed from 45e9d67 to cbba89b (Compare)
Add netkit as well as netkit-l2 mode as another option next to veth. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
/test
Nice work. Two minor nits about error messages, but overall pretty good.
Add netkit support to Cilium. netkit is a new, minimal BPF-programmable device we presented at LSF/MM/BPF, LPC and KubeCons some time ago. The core idea is that BPF programs are executed within the driver's xmit routine, which in the case of containers/Pods moves BPF processing closer to the source. One of the goals was that for Pod egress traffic this allows moving BPF programs from hostns tcx ingress into the device itself, providing earlier drop or forward mechanisms: for example, if the BPF program determines that the skb must be sent out of the node, then a redirect to the physical device can take place directly without going through the per-CPU backlog queue. This helps to shift processing for such traffic from softirq to process context, leading to better scheduling decisions/performance (see measurements in the slides).

Going forward, we plan to use netkit devices in Cilium as the main device type for connecting Pods. They will be operated in L3 mode in order to simplify a Pod's neighbor management, and the peer will operate in default drop mode so that no traffic leaves between the time a Pod is brought up by the CNI plugin and the time programs are attached by the agent. This PR supports both netkit operation modes, with L3 mode being the main/recommended one.

Note: netkit in L3 mode has an all-zero mac, so the eth header has dst == src == zero mac, hence the tests for mac len in the template code, as the node mac is not needed/used. Also, in L3 mode any ARP responder is compiled out, as a Pod with netkit does not attempt to resolve mac addresses (it's a NOARP device).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231024214904.29825-1-daniel@iogearbox.net
Link: https://sched.co/1R2s5
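For context, here is a minimal sketch of what creating such a netkit pair in L3 mode with a blackhole peer policy might look like, assuming the vishvananda/netlink library's netkit support; names and fields are illustrative, not Cilium's actual connector code:

```go
package sketch

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// createNetkitPair sketches how a CNI plugin could bring up a netkit pair in
// L3 mode: the host-facing primary forwards traffic, while the Pod-facing
// peer starts out in blackhole (drop) mode so that nothing leaves the Pod
// before the agent has attached its BPF programs.
func createNetkitPair(hostIfName, peerIfName string, podNetNsFd int) error {
	nk := &netlink.Netkit{
		LinkAttrs:  netlink.LinkAttrs{Name: hostIfName},
		Mode:       netlink.NETKIT_MODE_L3,          // L3: all-zero MACs, NOARP, no neighbor management
		Policy:     netlink.NETKIT_POLICY_FORWARD,   // primary (host) side forwards
		PeerPolicy: netlink.NETKIT_POLICY_BLACKHOLE, // peer (Pod) side drops until programs are attached
	}
	nk.SetPeerAttrs(&netlink.LinkAttrs{
		Name:      peerIfName,
		Namespace: netlink.NsFd(podNetNsFd), // move the peer into the Pod's netns
	})
	if err := netlink.LinkAdd(nk); err != nil {
		return fmt.Errorf("creating netkit pair %s/%s: %w", hostIfName, peerIfName, err)
	}
	return nil
}
```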
/test
Overall LGTM, but I think attachSKBProgram gets much clearer if we return early when attaching to netkit devices, and tcx and netkit link removal can be merged into one function.
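To illustrate the suggestion, a hypothetical reshaping of attachSKBProgram with an early return for netkit could look like the sketch below; the signature and the two helpers are stand-ins, not the actual code:

```go
package sketch

import (
	"github.com/cilium/ebpf"
	"github.com/vishvananda/netlink"
)

// Hypothetical stand-ins for the existing attachment paths.
var (
	attachNetkitProgram func(netlink.Link, *ebpf.Program, string, string, ebpf.AttachType) error
	attachTCXOrLegacyTC func(netlink.Link, *ebpf.Program, string, string, ebpf.AttachType) error
)

// attachSKBProgram, reshaped with an early return: netkit devices are always
// link/mprog based, so they skip the tcx-with-legacy-tc-fallback logic.
func attachSKBProgram(device netlink.Link, prog *ebpf.Program, progName, bpffsDir string, attach ebpf.AttachType) error {
	if device.Type() == "netkit" {
		return attachNetkitProgram(device, prog, progName, bpffsDir, attach)
	}
	return attachTCXOrLegacyTC(device, prog, progName, bpffsDir, attach)
}
```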
// hasCiliumNetkitLinks returns true if device has a Cilium-managed
// netkit program with the given attach type.
func hasCiliumNetkitLinks(device netlink.Link, attach ebpf.AttachType) (bool, error) {
Nit: this function is almost exactly the same as hasCiliumTCXLinks. Maybe we can merge those two or factor out the similarities as a follow-up.
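For illustration, the merged helper could look roughly like the sketch below; it assumes the cilium/ebpf link.QueryPrograms API and leaves out the check that the attached programs are actually Cilium-managed:

```go
package sketch

import (
	"fmt"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/vishvananda/netlink"
)

// hasCiliumBPFLinks is a hypothetical merged helper covering both the tcx and
// netkit cases: since both are bpf_mprog based, the only difference is the
// attach type passed in (e.g. ebpf.AttachTCXIngress vs. ebpf.AttachNetkitPeer).
func hasCiliumBPFLinks(device netlink.Link, attach ebpf.AttachType) (bool, error) {
	result, err := link.QueryPrograms(link.QueryOptions{
		Target: device.Attrs().Index,
		Attach: attach,
	})
	if err != nil {
		return false, fmt.Errorf("querying %s programs on %s: %w", attach, device.Attrs().Name, err)
	}
	// The real helpers additionally inspect the returned programs (e.g. by
	// checking for the "cil_" name prefix) to confirm they are Cilium-managed;
	// that detail is omitted in this sketch.
	return len(result.Programs) > 0, nil
}
```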
Yes, makes sense, I'll look into this as a follow-up.
looks great
Force-pushed from 3a1bb09 to 309ad35 (Compare)
/test
LGTM
Force-pushed from 6a935ae to 675d688 (Compare)
/test
Add a loading mechanism similar to the one we recently got with tcx, but for netkit attachments. Given both utilize bpf_mprog underneath, the API looks very similar. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
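As a rough sketch of the shape of such an attachment, assuming the cilium/ebpf library's link.AttachNetkit API (pin path and function name are placeholders):

```go
package sketch

import (
	"fmt"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// attachNetkitProgram attaches a program to a netkit device via a bpf link
// and pins it to bpffs, mirroring the tcx flow: the pin keeps the attachment
// alive independently of the agent process.
func attachNetkitProgram(prog *ebpf.Program, ifindex int, pinPath string) error {
	l, err := link.AttachNetkit(link.NetkitOptions{
		Program:   prog,
		Attach:    ebpf.AttachNetkitPeer, // or ebpf.AttachNetkitPrimary for the host-facing side
		Interface: ifindex,
	})
	if err != nil {
		return fmt.Errorf("attaching netkit program: %w", err)
	}
	defer l.Close() // the pinned link survives closing the fd

	if err := l.Pin(pinPath); err != nil {
		return fmt.Errorf("pinning netkit link at %s: %w", pinPath, err)
	}
	return nil
}
```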
The programs we attach via tcx on the physical devices use bpf_redirect_peer() for inbound traffic into the netkit device. Similarly, we use bpf_redirect_neigh() for the way out, pushing from the netkit peer to the physical device directly. For local Pod-to-Pod traffic on the same node it is slightly different: as opposed to veth devices, we do not use bpf_redirect_peer() to go from ingress->ingress, since with netkit we are on the egress side. Thus egress->egress needs to use the regular bpf_redirect(). From a performance PoV there is no difference, since both cases need to go via the per-CPU backlog once. Whether we are on the ingress or egress side is detected via the ingress_ifindex: in the case of netkit it is set to 0 shortly beforehand via skb scrubbing, whereas in the case of tcx on the ingress side it is set to the host-facing veth (> 0). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Dump the device mode in use into cilium status, so that it is picked up by the sysdump. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Both just remove the link file, therefore create a common helper. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
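A shared helper of that sort might look roughly like this sketch (the function name is made up here; both detach paths would call it with their respective pin path):

```go
package sketch

import (
	"errors"
	"fmt"
	"os"
)

// removePinnedLink removes a pinned bpf link file from bpffs. Both the tcx
// and netkit detach paths can call this instead of duplicating the logic;
// a missing file is not treated as an error.
func removePinnedLink(pinPath string) error {
	if err := os.Remove(pinPath); err != nil && !errors.Is(err, os.ErrNotExist) {
		return fmt.Errorf("removing pinned link %s: %w", pinPath, err)
	}
	return nil
}
```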
Add bpf.datapathMode to Helm, so that netkit can be selected as an alternative to veth. Example via Cilium CLI:

./cilium-cli install --wait --chart-directory=install/kubernetes/cilium \
  --helm-set=hubble.eventBufferCapacity=65535 \
  --helm-set=bpf.monitorAggregation=none \
  --helm-set=cluster.name=default \
  --nodes-without-cilium \
  --helm-set-string=kubeProxyReplacement=true \
  --set='' \
  --helm-set=image.repository=localhost:5000/cilium/cilium-dev \
  --helm-set=image.useDigest=false \
  --helm-set=image.tag=local \
  --helm-set=image.pullPolicy=IfNotPresent \
  --helm-set=operator.image.repository=localhost:5000/cilium/operator \
  --helm-set=operator.image.suffix="" \
  --helm-set=operator.image.tag=local \
  --helm-set=operator.image.pullPolicy=IfNotPresent \
  --helm-set=operator.image.useDigest=false \
  --helm-set-string=bpf.datapathMode=netkit \
  --helm-set=ipv6.enabled=true \
  --helm-set=tunnelProtocol=geneve

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add a section on netkit. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add a small summary to help guide users. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
/test
This is super exciting, I can't wait to see this released. I left a small nit in the tuning page about the helm settings, but not a blocker. The Recommendation section is a great addition. Will netkit-l2 and the differences between l3 and l2 be documented as a follow-up?
--set enableIPv6BIGTCP=true \
--set ipv4.enabled=true \
--set enableIPv4BIGTCP=true \
--set kubeProxyReplacement=true
Nit: missing
--set k8sServiceHost=${API_SERVER_IP} \
--set k8sServicePort=${API_SERVER_PORT}
https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#quick-start
Ok, I will add this as a follow-up, given the older/existing sections need this too. Between netkit and netkit-l2 there should not be much difference in terms of performance. netkit in L3 mode basically tells the Pod that it doesn't need to resolve ARP, which is great since that's not really necessary anyway; we can compile out our BPF ARP responder and such, keeping things simpler as they should be. I can add a note.
Sounds like a plan, thank you!
To learn more about netkit, see also: https://sched.co/1R2s5
Connectivity tests for netkit and netkit-l2 are passing, including netkit flavors with BPF host routing / tunneling / legacy routing.
For local netkit-l2 testing, the following merged kernel patches are needed:
For local netkit testing with tunneling, the following merged kernel patches are needed:
These kernel patches are in the process of making it into the stable trees as well as Ubuntu 24.04 LTS (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2068655).
Once the kernel patches are all in, I'll extend the e2e connectivity tests with netkit. Tracked in #32939
LVH PR: cilium/little-vm-helper-images#525
Merged e2e tests: #33005