Skip to content

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented May 1, 2025

This pull request works around a bug in the bpf_l4_csum_replace BPF helper, which we use to update the L4 checksum after reverse NATing the IPv6 source address. The bug was only exposed once we started to support IPv6 underlays in #38296. CI didn't fail because I didn't introduce a test case covering this (on purpose, I didn't want to block my PR 😜).

The second commit has the main explanation and workaround logic. The first commit refactors tail calls a bit to provide information needed in the workaround. The third commit adds CI coverage.

In the subsequent commit, we'll need to know in nodeport_rev_dnat_ipv6
if the packet being processed is egressing or ingressing the node. This
commit passes that information via the tail calls.

The approach taken here is to introduce a new tail call to differentiate
the call sites of ingress and egress. Another approach could use skb->cb
to pass the information.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. feature/ipv6 Relates to IPv6 protocol support labels May 1, 2025
Refill your drink. This is going to be a bit long...

-- CONTEXT --

This commit works around a bug in the bpf_l4_csum_replace BPF helper,
which we use to update the L4 checksum after reverse NATing the IPv6
source address.

When doing so, we typically use bpf_csum_diff first to compute the
checksum diff, then replace it via bpf_l4_csum_replace. That calls
inet_proto_csum_replace_by_diff [1]:

    1:  void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
    2: 	                                     __wsum diff, bool pseudohdr)
    3:  {
    4: 	    if (skb->ip_summed != CHECKSUM_PARTIAL) {
    5: 	        csum_replace_by_diff(sum, diff);
    6: 	        if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
    7:              skb->csum = ~csum_sub(diff, skb->csum);
    8:      } else if (pseudohdr) {
    9:          *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum)));
    10:     }
    11: }

sum points to the L4 header checksum. We typically pass the
BPF_F_PSEUDO_HDR flag such that pseudohdr is true.

-- THE BUG --

The bug happens when we're in the CHECKSUM_COMPLETE state. We've just
updated one of the IPv6 addresses. Our helper now updates the L4 header
checksum on line 5.

Next we'll update the skb->csum on line 7. We shouldn't.

For an IPv6 packet, the updates of the IPv6 address and of the L4
checksum will cancel each other. The checksums are computed such that
computing a checksum over the packet including its checksum will result
in a sum of 0. So the same is true here when we update the L4 checksum
on line 5. We'll update it as to cancel the previous IPv6 address
update. Hence skb->csum should remain untouched in this case.

-- WHY IPV4 IS FINE --

The same issue doesn't affect IPv4 packets because, in that case, three
fields are updated: the IPv4 address, the IP checksum, and the L4
checksum. The change to the IPv4 address and one of the checksums still
cancel each other in skb->csum, but we're left with one checksum update
and should therefore update skb->csum accordingly. That's exactly what
inet_proto_csum_replace_by_diff does.

-- WHY NATIVE ROUTING IS FINE --

This bug doesn't happen in native routing mode. Since we didn't have
support for an IPv6 underlay until recently, we didn't hit this bug
before. The pwru trace below shows what happens in IPv6 native routing
mode, with the checksum state at each function when non-zero. When the
packet to be reverse NATed is received, its skb->csum is computed and
the state is set to CHECKSUM_COMPLETE (0x2). However, when that same
packet is forwarded, its skb->csum is nulled and the state set to
CHECKSUM_NONE (0x0) by skb_forward_csum [2]. As a result, when we get
into inet_proto_csum_replace_by_diff, we don't update skb->csum.

    SKB                  IFACE         LEN   TUPLE                                                     FUNC
    0xffffa08b76f7c800   eth0          208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ipv6_rcv
    [...]
    0xffffa08b76f7c800   eth0          208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) __skb_checksum_complete
    0xffffa08b76f7c800   eth0          208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) udp_v6_early_demux
       .ip_summed = (__u8)0x2,
       .ip_summed = (__u8)0x2,
    0xffffa08b76f7c800   eth0          208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_route_input
       .ip_summed = (__u8)0x2,
       .ip_summed = (__u8)0x2,
    0xffffa08b76f7c800   eth0          208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_forward
       .ip_summed = (__u8)0x2,
       .ip_summed = (__u8)0x2,
    0xffffa08b76f7c800   eth0          208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) nf_hook_slow
    0xffffa08b76f7c800   eth0          208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_output
    0xffffa08b76f7c800   cilium_host   208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) nf_hook_slow
    0xffffa08b76f7c800   cilium_host   208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) apparmor_ip_postroute
    0xffffa08b76f7c800   cilium_host   208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_finish_output
    0xffffa08b76f7c800   cilium_host   208   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) ip6_finish_output2
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) __dev_queue_xmit
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) qdisc_pkt_len_init
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:244:1::a77]:53->[fd00:10:244:2::830f]:35250(udp) skb_ensure_writable
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:96::a]:53->[fd00:10:244:2::830f]:35250(udp)      skb_ensure_writable
    0xffffa08b76f7c800   cilium_host   222   [fd00:10:96::a]:53->[fd00:10:244:2::830f]:35250(udp)      inet_proto_csum_replace_by_diff

-- THE WORKAROUND --

To really fix this bug, we would need to patch the kernel to either fix
bpf_l4_csum_replace or introduce another helper that can properly update
the L4 checksum for IPv6 address changes. This commit introduces a
workaround until such a long-term solution is available upstream.

Let me copy inet_proto_csum_replace_by_diff here again to save some
scrolling:

    1:  void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
    2:                                       __wsum diff, bool pseudohdr)
    3:  {
    4:      if (skb->ip_summed != CHECKSUM_PARTIAL) {
    5:          csum_replace_by_diff(sum, diff);
    6:          if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
    7:              skb->csum = ~csum_sub(diff, skb->csum);
    8:      } else if (pseudohdr) {
    9:          *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum)));
    10:     }
    11: }

What we want is to avoid updating skb->csum for IPv6 packets unless
we're in state CHECKSUM_PARTIAL. In our case, we expect CHECKSUM_PARTIAL
is only ever set on egress (for checksum offloading). Conversely, the
reverse NATing of IPv6 traffic that is affected by the bug is only
happening on ingress.

In inet_proto_csum_replace_by_diff, we can use the pseudohdr flag to
prevent an update of skb->csum. Given the above ingress/egress
distinction, we can set pseudohdr to true only on egress, when we may be
in CHECKSUM_PARTIAL state and do want to update skb->csum. On ingress,
we leave pseudohdr unset, such that skb->csum is never updated.

-- CONCLUSION --

We have an issue in a BPF helper. We need a proper solution upstream,
but have a Cilium workaround for now. You just read a very long commit
and you can't even complain because this is only the 14th longest
commit in the Cilium codebase.

1 - https://elixir.bootlin.com/linux/v6.13.7/source/net/core/utils.c#L475
2 - https://elixir.bootlin.com/linux/v6.13.7/source/include/linux/skbuff.h#L5109
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno changed the title Workaround IPv6 checksum issue Workaround IPv6 underlay checksum issue May 1, 2025
The previous commit fix a bug when running with per-packet load
balancing (ex. under kube-proxy) and an IPv6 underlay. We should cover
that configuration in CI.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno force-pushed the pr/pchaigno/workaround-ipv6-checksum-issue branch from db84e76 to 8547974 Compare May 1, 2025 11:51
@pchaigno
Copy link
Member Author

pchaigno commented May 1, 2025

/test

@pchaigno pchaigno marked this pull request as ready for review May 1, 2025 12:17
@pchaigno pchaigno requested review from a team as code owners May 1, 2025 12:17
@pchaigno pchaigno requested review from ldelossa, aanm and christarazi May 1, 2025 12:17
@pchaigno pchaigno added this pull request to the merge queue May 7, 2025
Merged via the queue into main with commit e1a1b9b May 7, 2025
302 checks passed
@pchaigno pchaigno deleted the pr/pchaigno/workaround-ipv6-checksum-issue branch May 7, 2025 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. feature/ipv6 Relates to IPv6 protocol support release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants