
Pod to pod packets via egress + ingress proxy are MTU dropped when IPsec is enabled #33168

@jschwinger233

Description


Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Steps:

  1. Install Cilium on a kind cluster
     export IMAGE=kindest/node:v1.29.4@sha256:3abb816a5b1061fb15c6e9e60856ec40d56b7b52bcea5f5f1350bc6e2320b6f8
     ./contrib/scripts/kind.sh --xdp --secondary-network "" 3 "" "" none dual 0.0.0.0 6443
     kubectl patch node kind-worker3 --type=json -p='[{"op":"add","path":"/metadata/labels/cilium.io~1no-schedule","value":"true"}]'

  if [[ "gcm(aes)" == "gcm(aes)" ]]; then
    key="rfc4106(gcm(aes)) $(dd if=/dev/urandom count=20 bs=1 2> /dev/null | xxd -p -c 64) 128"
  elif [[ "gcm(aes)" == "cbc(aes)" ]]; then
    key="hmac(sha256) $(dd if=/dev/urandom count=32 bs=1 2> /dev/null| xxd -p -c 64) cbc(aes) $(dd if=/dev/urandom count=32 bs=1 2> /dev/null| xxd -p -c 64)"
  else
    echo "Invalid key type"; exit 1
  fi
  kubectl create -n kube-system secret generic cilium-ipsec-keys \
    --from-literal=keys="3+ ${key}"
  ./cilium-cli install --wait \
    --chart-directory=./install/kubernetes/cilium \
    --helm-set=debug.enabled=true \
    --helm-set=debug.verbose=envoy \
    --helm-set=hubble.eventBufferCapacity=65535 \
    --helm-set=bpf.monitorAggregation=none \
    --helm-set=cluster.name=default \
    --helm-set=authentication.mutual.spire.enabled=false \
    --nodes-without-cilium \
    --helm-set-string=kubeProxyReplacement=true \
    --set='' \
    --helm-set=image.repository=quay.io/cilium/cilium-ci \
    --helm-set=image.useDigest=false \
    --helm-set=image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 \
    --helm-set=operator.image.repository=quay.io/cilium/operator \
    --helm-set=operator.image.suffix=-ci \
    --helm-set=operator.image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 \
    --helm-set=operator.image.useDigest=false \
    --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci \
    --helm-set=hubble.relay.image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 \
    --helm-set=hubble.relay.image.useDigest=false \
    --helm-set-string=routingMode=native \
    --helm-set-string=autoDirectNodeRoutes=true \
    --helm-set-string=ipv4NativeRoutingCIDR=10.244.0.0/16 \
    --helm-set-string=ipv6NativeRoutingCIDR=fd00:10:244::/56 \
    --helm-set-string=endpointRoutes.enabled=true \
    --helm-set=ipv6.enabled=true \
    --helm-set=bpf.masquerade=true \
    --helm-set=encryption.enabled=true \
    --helm-set=encryption.type=ipsec \
    --helm-set=encryption.nodeEncryption=false
  2. Install cilium-cli-next
     cid=$(docker create quay.io/cilium/cilium-cli-ci:5401ce3551cc46052489b7153468b577830a63a4 ls)
     docker cp $cid:/usr/local/bin/cilium ./cilium-cli-next
     docker rm $cid
  3. Run the connectivity test and observe failures
     ./cilium-cli-next connectivity test --include-unsafe-tests --flush-ct --test "pod-to-pod-with-l7-policy-encryption/" -v -p

Please note it only happens when both an ingress policy and an egress policy are active at the same time. Once either the ingress or the egress policy is deleted, connectivity is restored.

Cilium Version

$ ./cilium-cli version
cilium-cli: v0.16.7 compiled with go1.22.2 on linux/amd64
cilium image (default): v1.15.4
cilium image (stable): v1.15.6
cilium image (running): 1.16.0-dev

Kernel Version

$ uname -a
Linux liangzc-l-PF4RDLEQ 6.5.0-1024-oem #25-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 14:47:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

This is yet another MTU issue; IPsec xfrm makes it even harder to track down.

Fact one: IPsec xfrm reduces MTU from 1500 to 1446.

Even if every network interface on the Cilium node is set to MTU 1500 (including those referenced from the 2005 route table), xfrm silently lowers the effective MTU to 1446.

This happens in ip_forward():

// https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/ip_forward.c#L124
int ip_forward(struct sk_buff *skb)
{
[...]
	if (!xfrm4_route_forward(skb)) {
		SKB_DR_SET(reason, XFRM_POLICY);
		goto drop;
	}
	rt = skb_rtable(skb);

	if (opt->is_strictroute && rt->rt_uses_gateway)
		goto sr_failed;

	IPCB(skb)->flags |= IPSKB_FORWARDED;
	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
	if (ip_exceeds_mtu(skb, mtu)) {
[...]
}

We know that before the packet enters ip_forward(), ip_route_input_noref() has already set skb->_skb_refdst, which is what determines the MTU. However, the code above shows that the presence of xfrm can change the MTU inside ip_forward(): xfrm4_route_forward() may replace skb->_skb_refdst with the result of xfrm_lookup_with_ifid().

The xfrm MTU is computed by a special algorithm at https://elixir.bootlin.com/linux/v6.2/source/net/xfrm/xfrm_state.c#L2747. I didn't check all the details, but I did use bpftrace to fetch the new MTU from xfrm4_route_forward(); in our Cilium case the MTU is indeed reduced from 1500 to 1446.

Fact two: traffic from the (local egress) proxy to the (remote ingress) proxy has a different MTU than pod-to-pod traffic

Both pod-to-pod and pod-to-remote-ingress-proxy traffic use MTU 1423, which is set on the route rather than on the interface:

$ nspod client2-ccd7b8bdf-nt4nn ip r
default via 10.244.1.12 dev eth0 mtu 1423 
10.244.1.12 dev eth0 scope link

However, proxy-to-proxy traffic uses MTU 1500. This subtle difference explains why we only see the issue when ingress and egress policies are installed together, a scenario our tests never covered before.

Proposed solution

It seems we can simply set the MTU on the routes in the 2005 table, because proxy traffic always ends up there due to the 0xa00/0xb00 marks.

For example, the current routes in 2005 table are

$ nscontainer kind-control-plane ip r s t 2005
default via 10.244.0.100 dev cilium_host proto kernel 
10.244.0.100 dev cilium_host proto kernel scope link 

We could change that to

default via 10.244.0.100 dev cilium_host proto kernel mtu 1446
10.244.0.100 dev cilium_host proto kernel scope link mtu 1446

(I haven't checked IPv6 yet.)
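As a manual sketch of that change (a config fragment only: the 10.244.0.100 gateway is the example node above, and a real fix would derive the 1446 value from the configured IPsec algorithm's overhead rather than hard-coding it):

```shell
# Replace the 2005-table routes with an explicit route MTU so that
# proxy-to-proxy traffic is clamped below the xfrm MTU.
ip route replace table 2005 default via 10.244.0.100 dev cilium_host proto kernel mtu 1446
ip route replace table 2005 10.244.0.100 dev cilium_host proto kernel scope link mtu 1446
```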

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • affects/v1.13: This issue affects v1.13 branch
  • affects/v1.14: This issue affects v1.14 branch
  • affects/v1.15: This issue affects v1.15 branch
  • area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
  • area/encryption: Impacts encryption support such as IPSec, WireGuard, or kTLS.
  • area/mtu: Relates to MTU management in Cilium.
  • area/proxy: Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers.
  • feature/ipsec: Relates to Cilium's IPsec feature
  • feature/ipv6: Relates to IPv6 protocol support
  • kind/bug: This is a bug in the Cilium logic.
