
Pod to pod packets via egress + ingress proxy are MTU dropped when IPsec is enabled #33168

@jschwinger233

Description


Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Steps:

  1. Install Cilium on a kind cluster
     export IMAGE=kindest/node:v1.29.4@sha256:3abb816a5b1061fb15c6e9e60856ec40d56b7b52bcea5f5f1350bc6e2320b6f8
     ./contrib/scripts/kind.sh --xdp --secondary-network "" 3 "" "" none dual 0.0.0.0 6443
     kubectl patch node kind-worker3 --type=json -p='[{"op":"add","path":"/metadata/labels/cilium.io~1no-schedule","value":"true"}]'

  if [[ "gcm(aes)" == "gcm(aes)" ]]; then
    key="rfc4106(gcm(aes)) $(dd if=/dev/urandom count=20 bs=1 2> /dev/null | xxd -p -c 64) 128"
  elif [[ "gcm(aes)" == "cbc(aes)" ]]; then
    key="hmac(sha256) $(dd if=/dev/urandom count=32 bs=1 2> /dev/null| xxd -p -c 64) cbc(aes) $(dd if=/dev/urandom count=32 bs=1 2> /dev/null| xxd -p -c 64)"
  else
    echo "Invalid key type"; exit 1
  fi
  kubectl create -n kube-system secret generic cilium-ipsec-keys \
    --from-literal=keys="3+ ${key}"
  ./cilium-cli install --wait \
    --chart-directory=./install/kubernetes/cilium \
    --helm-set=debug.enabled=true \
    --helm-set=debug.verbose=envoy \
    --helm-set=hubble.eventBufferCapacity=65535 \
    --helm-set=bpf.monitorAggregation=none \
    --helm-set=cluster.name=default \
    --helm-set=authentication.mutual.spire.enabled=false \
    --nodes-without-cilium \
    --helm-set-string=kubeProxyReplacement=true \
    --set='' \
    --helm-set=image.repository=quay.io/cilium/cilium-ci \
    --helm-set=image.useDigest=false \
    --helm-set=image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 \
    --helm-set=operator.image.repository=quay.io/cilium/operator \
    --helm-set=operator.image.suffix=-ci \
    --helm-set=operator.image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 \
    --helm-set=operator.image.useDigest=false \
    --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci \
    --helm-set=hubble.relay.image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 \
    --helm-set=hubble.relay.image.useDigest=false \
    --helm-set-string=routingMode=native \
    --helm-set-string=autoDirectNodeRoutes=true \
    --helm-set-string=ipv4NativeRoutingCIDR=10.244.0.0/16 \
    --helm-set-string=ipv6NativeRoutingCIDR=fd00:10:244::/56 \
    --helm-set-string=endpointRoutes.enabled=true \
    --helm-set=ipv6.enabled=true \
    --helm-set=bpf.masquerade=true \
    --helm-set=encryption.enabled=true \
    --helm-set=encryption.type=ipsec \
    --helm-set=encryption.nodeEncryption=false
  2. Install cilium-cli-next
     cid=$(docker create quay.io/cilium/cilium-cli-ci:5401ce3551cc46052489b7153468b577830a63a4 ls)
     docker cp $cid:/usr/local/bin/cilium ./cilium-cli-next
     docker rm $cid
  3. Run the connectivity test and observe failures
     ./cilium-cli-next connectivity test --include-unsafe-tests --flush-ct --test "pod-to-pod-with-l7-policy-encryption/" -v -p

Please note it only happens when both an ingress policy and an egress policy are active at the same time. Once either the ingress or the egress policy is deleted, connectivity is restored.

Cilium Version

$ ./cilium-cli version
cilium-cli: v0.16.7 compiled with go1.22.2 on linux/amd64
cilium image (default): v1.15.4
cilium image (stable): v1.15.6
cilium image (running): 1.16.0-dev

Kernel Version

$ uname -a
Linux liangzc-l-PF4RDLEQ 6.5.0-1024-oem #25-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 14:47:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

This is yet another MTU issue; IPsec xfrm makes it even harder to track down.

Fact one: IPsec xfrm reduces MTU from 1500 to 1446.

Even if every network interface on the Cilium node is set to MTU 1500 (including those referenced from the 2005 route table), xfrm silently lowers the effective MTU to 1446.

This happens in ip_forward():

// https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/ip_forward.c#L124
int ip_forward(struct sk_buff *skb)
{
[...]
	if (!xfrm4_route_forward(skb)) {
		SKB_DR_SET(reason, XFRM_POLICY);
		goto drop;
	}
	rt = skb_rtable(skb);

	if (opt->is_strictroute && rt->rt_uses_gateway)
		goto sr_failed;

	IPCB(skb)->flags |= IPSKB_FORWARDED;
	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
	if (ip_exceeds_mtu(skb, mtu)) {
[...]
}

We know that before the packet enters ip_forward(), ip_route_input_noref() has already set skb->_skb_refdst, which is what determines the MTU. However, the code above shows that the presence of xfrm can change the MTU inside ip_forward(): xfrm4_route_forward() may replace skb->_skb_refdst with the result of xfrm_lookup_with_ifid().

The xfrm MTU is computed by a special algorithm at https://elixir.bootlin.com/linux/v6.2/source/net/xfrm/xfrm_state.c#L2747. I didn't check all the details, but I did use bpftrace to fetch the new MTU from xfrm4_route_forward(); in our Cilium case the MTU is indeed reduced from 1500 to 1446.

Fact two: traffic from the (local egress) proxy to the (remote ingress) proxy has a different MTU than pod-to-pod traffic

Both pod-to-pod and pod-to-remote-ingress-proxy traffic use MTU 1423, which is set on the route rather than on the interface:

$ nspod client2-ccd7b8bdf-nt4nn ip r
default via 10.244.1.12 dev eth0 mtu 1423 
10.244.1.12 dev eth0 scope link

However, proxy-to-proxy traffic uses MTU 1500. This subtle difference explains why we only see the issue when ingress and egress policies are installed together, a scenario our tests never covered before.

Proposed solution

It seems we can simply set the MTU on the routes in the 2005 table, because proxy traffic always ends up there due to the 0xa00/0xb00 marks.

For example, the current routes in 2005 table are

$ nscontainer kind-control-plane ip r s t 2005
default via 10.244.0.100 dev cilium_host proto kernel 
10.244.0.100 dev cilium_host proto kernel scope link 

We could change that to

default via 10.244.0.100 dev cilium_host proto kernel mtu 1446
10.244.0.100 dev cilium_host proto kernel scope link mtu 1446

(I haven't checked IPv6 yet.)
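As a manual sketch of that change (a config fragment only: the 10.244.0.100 gateway is the example node above, and a real fix would derive the 1446 value from the configured IPsec algorithm's overhead rather than hard-coding it):

```shell
# Replace the 2005-table routes with an explicit route MTU so that
# proxy-to-proxy traffic is clamped below the xfrm MTU.
ip route replace table 2005 default via 10.244.0.100 dev cilium_host proto kernel mtu 1446
ip route replace table 2005 10.244.0.100 dev cilium_host proto kernel scope link mtu 1446
```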

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • affects/v1.13: This issue affects v1.13 branch
  • affects/v1.14: This issue affects v1.14 branch
  • affects/v1.15: This issue affects v1.15 branch
  • area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
  • area/encryption: Impacts encryption support such as IPSec, WireGuard, or kTLS.
  • area/mtu: Relates to MTU management in Cilium.
  • area/proxy: Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers.
  • feature/ipsec: Relates to Cilium's IPsec feature
  • feature/ipv6: Relates to IPv6 protocol support
  • kind/bug: This is a bug in the Cilium logic.
