-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Workaround IPv6 underlay checksum issue #39279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
In the subsequent commit, we'll need to know in nodeport_rev_dnat_ipv6 if the packet being processed is egressing or ingressing the node. This commit passes that information via the tail calls. The approach taken here is to introduce a new tail call to differentiate the call sites of ingress and egress. Another approach could use skb->cb to pass the information. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Refill your drink. This is going to be a bit long... -- CONTEXT -- This commit works around a bug in the bpf_l4_csum_replace BPF helper, which we use to update the L4 checksum after reverse NATing the IPv6 source address. When doing so, we typically use bpf_csum_diff first to compute the checksum diff, then replace it via bpf_l4_csum_replace. That calls inet_proto_csum_replace_by_diff [1]: 1: void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb, 2: __wsum diff, bool pseudohdr) 3: { 4: if (skb->ip_summed != CHECKSUM_PARTIAL) { 5: csum_replace_by_diff(sum, diff); 6: if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr) 7: skb->csum = ~csum_sub(diff, skb->csum); 8: } else if (pseudohdr) { 9: *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum))); 10: } 11: } sum points to the L4 header checksum. We typically pass the BPF_F_PSEUDO_HDR flag such that pseudohdr is true. -- THE BUG -- The bug happens when we're in the CHECKSUM_COMPLETE state. We've just updated one of the IPv6 addresses. Our helper now updates the L4 header checksum on line 5. Next we'll update the skb->csum on line 7. We shouldn't. For an IPv6 packet, the updates of the IPv6 address and of the L4 checksum will cancel each other. The checksums are computed such that computing a checksum over the packet including its checksum will result in a sum of 0. So the same is true here when we update the L4 checksum on line 5. We'll update it as to cancel the previous IPv6 address update. Hence skb->csum should remain untouched in this case. -- WHY IPV4 IS FINE -- The same issue doesn't affect IPv4 packets because, in that case, three fields are updated: the IPv4 address, the IP checksum, and the L4 checksum. The change to the IPv4 address and one of the checksums still cancel each other in skb->csum, but we're left with one checksum update and should therefore update skb->csum accordingly. That's exactly what inet_proto_csum_replace_by_diff does. -- WHY NATIVE ROUTING IS FINE -- This bug doesn't happen in native routing mode. Since we didn't have support for an IPv6 underlay until recently, we didn't hit this bug before. The pwru trace below shows what happens in IPv6 native routing mode, with the checksum state at each function when non-zero. When the packet to be reverse NATed is received, its skb->csum is computed and the state is set to CHECKSUM_COMPLETE (0x2). However, when that same packet is forwarded, its skb->csum is nulled and the state set to CHECKSUM_NONE (0x0) by skb_forward_csum [2]. As a result, when we get into inet_proto_csum_replace_by_diff, we don't update skb->csum. SKB IFACE LEN TUPLE FUNC 0xffffa08b76f7c800 eth0 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ipv6_rcv [...] 0xffffa08b76f7c800 eth0 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) __skb_checksum_complete 0xffffa08b76f7c800 eth0 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) udp_v6_early_demux .ip_summed = (__u8)0x2, .ip_summed = (__u8)0x2, 0xffffa08b76f7c800 eth0 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_route_input .ip_summed = (__u8)0x2, .ip_summed = (__u8)0x2, 0xffffa08b76f7c800 eth0 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_forward .ip_summed = (__u8)0x2, .ip_summed = (__u8)0x2, 0xffffa08b76f7c800 eth0 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) nf_hook_slow 0xffffa08b76f7c800 eth0 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_output 0xffffa08b76f7c800 cilium_host 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) nf_hook_slow 0xffffa08b76f7c800 cilium_host 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) apparmor_ip_postroute 0xffffa08b76f7c800 cilium_host 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_finish_output 0xffffa08b76f7c800 cilium_host 208 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_finish_output2 0xffffa08b76f7c800 cilium_host 222 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) __dev_queue_xmit 0xffffa08b76f7c800 cilium_host 222 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) qdisc_pkt_len_init 0xffffa08b76f7c800 cilium_host 222 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable 0xffffa08b76f7c800 cilium_host 222 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable 0xffffa08b76f7c800 cilium_host 222 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable 0xffffa08b76f7c800 cilium_host 222 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable 0xffffa08b76f7c800 cilium_host 222 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable 0xffffa08b76f7c800 cilium_host 222 [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable 0xffffa08b76f7c800 cilium_host 222 [fd00:10:96::a]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable 0xffffa08b76f7c800 cilium_host 222 [fd00:10:96::a]:53->[fd00:10:244:2::830f]:35250(udp) inet_proto_csum_replace_by_diff -- THE WORKAROUND -- To really fix this bug, we would need to patch the kernel to either fix bpf_l4_csum_replace or introduce another helper that can properly update the L4 checksum for IPv6 address changes. This commit introduces a workaround until such a long-term solution is available upstream. Let me copy inet_proto_csum_replace_by_diff here again to save some scrolling: 1: void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb, 2: __wsum diff, bool pseudohdr) 3: { 4: if (skb->ip_summed != CHECKSUM_PARTIAL) { 5: csum_replace_by_diff(sum, diff); 6: if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr) 7: skb->csum = ~csum_sub(diff, skb->csum); 8: } else if (pseudohdr) { 9: *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum))); 10: } 11: } What we want is to avoid updating skb->csum for IPv6 packets unless we're in state CHECKSUM_PARTIAL. In our case, we expect CHECKSUM_PARTIAL is only ever set on egress (for checksum offloading). Conversely, the reverse NATing of IPv6 traffic that is affected by the bug is only happening on ingress. In inet_proto_csum_replace_by_diff, we can use the pseudohdr flag to prevent an update of skb->csum. Given the above ingress/egress distinction, we can set pseudohdr to true only on egress, when we may be in CHECKSUM_PARTIAL state and do want to update skb->csum. On ingress, we leave pseudohdr unset, such that skb->csum is never updated. -- CONCLUSION -- We have an issue in a BPF helper. We need a proper solution upstream, but have a Cilium workaround for now. You just read a very long commit and you can't even complain because this is only the 14th longest commit in the Cilium codebase. 1 - https://elixir.bootlin.com/linux/v6.13.7/source/net/core/utils.c#L475 2 - https://elixir.bootlin.com/linux/v6.13.7/source/include/linux/skbuff.h#L5109 Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
The previous commit fix a bug when running with per-packet load balancing (ex. under kube-proxy) and an IPv6 underlay. We should cover that configuration in CI. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
db84e76
to
8547974
Compare
/test |
aanm
approved these changes
May 2, 2025
christarazi
approved these changes
May 5, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
feature/ipv6
Relates to IPv6 protocol support
release-note/misc
This PR makes changes that have no direct user impact.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This pull request works around a bug in the
bpf_l4_csum_replace
BPF helper, which we use to update the L4 checksum after reverse NATing the IPv6 source address. The bug was only exposed once we started to support IPv6 underlays in #38296. CI didn't fail because I didn't introduce a test case covering this (on purpose, I didn't want to block my PR 😜).The second commit has the main explanation and workaround logic. The first commit refactors tail calls a bit to provide information needed in the workaround. The third commit adds CI coverage.