Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Steps:
- Create a kind cluster and install Cilium
export IMAGE=kindest/node:v1.29.4@sha256:3abb816a5b1061fb15c6e9e60856ec40d56b7b52bcea5f5f1350bc6e2320b6f8
./contrib/scripts/kind.sh --xdp --secondary-network "" 3 "" "" none dual 0.0.0.0 6443
kubectl patch node kind-worker3 --type=json -p='[{"op":"add","path":"/metadata/labels/cilium.io~1no-schedule","value":"true"}]'
if [[ "gcm(aes)" == "gcm(aes)" ]]; then
key="rfc4106(gcm(aes)) $(dd if=/dev/urandom count=20 bs=1 2> /dev/null | xxd -p -c 64) 128"
elif [[ "gcm(aes)" == "cbc(aes)" ]]; then
key="hmac(sha256) $(dd if=/dev/urandom count=32 bs=1 2> /dev/null| xxd -p -c 64) cbc(aes) $(dd if=/dev/urandom count=32 bs=1 2> /dev/null| xxd -p -c 64)"
else
echo "Invalid key type"; exit 1
fi
kubectl create -n kube-system secret generic cilium-ipsec-keys \
--from-literal=keys="3+ ${key}"
./cilium-cli install --wait --chart-directory=./install/kubernetes/cilium --helm-set=debug.enabled=true --helm-set=debug.verbose=envoy --helm-set=hubble.eventBufferCapacity=65535 --helm-set=bpf.monitorAggregation=none --helm-set=cluster.name=default --helm-set=authentication.mutual.spire.enabled=false --nodes-without-cilium --helm-set-string=kubeProxyReplacement=true --set='' --helm-set=image.repository=quay.io/cilium/cilium-ci --helm-set=image.useDigest=false --helm-set=image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 --helm-set=operator.image.repository=quay.io/cilium/operator --helm-set=operator.image.suffix=-ci --helm-set=operator.image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 --helm-set=operator.image.useDigest=false --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci --helm-set=hubble.relay.image.tag=412a46c753eaeff229cba3d83332e2dea3e192f3 --helm-set=hubble.relay.image.useDigest=false --helm-set-string=routingMode=native --helm-set-string=autoDirectNodeRoutes=true --helm-set-string=ipv4NativeRoutingCIDR=10.244.0.0/16 --helm-set-string=ipv6NativeRoutingCIDR=fd00:10:244::/56 --helm-set-string=endpointRoutes.enabled=true --helm-set=ipv6.enabled=true --helm-set=bpf.masquerade=true --helm-set=encryption.enabled=true --helm-set=encryption.type=ipsec --helm-set=encryption.nodeEncryption=false
- Install cilium-cli-next
cid=$(docker create quay.io/cilium/cilium-cli-ci:5401ce3551cc46052489b7153468b577830a63a4 ls)
docker cp $cid:/usr/local/bin/cilium .//cilium-cli-next
docker rm $cid
- Run the connectivity test and observe the failures
./cilium-cli-next connectivity test --include-unsafe-tests --flush-ct --test "pod-to-pod-with-l7-policy-encryption/" -v -p
Note that this only happens when both an ingress policy and an egress policy are in place at the same time. Deleting either the ingress or the egress policy restores connectivity.
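The policies involved can be inspected and removed with plain kubectl; a minimal sketch (the namespace and policy name below are illustrative, since the connectivity test generates its own):
# List the CiliumNetworkPolicies created by the connectivity test.
kubectl get ciliumnetworkpolicies -A
# Deleting either direction restores connectivity (name is hypothetical).
kubectl delete ciliumnetworkpolicy -n cilium-test client-egress-l7-http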
Cilium Version
$ ./cilium-cli version
cilium-cli: v0.16.7 compiled with go1.22.2 on linux/amd64
cilium image (default): v1.15.4
cilium image (stable): v1.15.6
cilium image (running): 1.16.0-dev
Kernel Version
$ uname -a
Linux liangzc-l-PF4RDLEQ 6.5.0-1024-oem #25-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 14:47:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4
Regression
No response
Sysdump
No response
Relevant log output
No response
Anything else?
This is yet another MTU issue; IPsec xfrm makes it even harder to pin down.
Fact one: IPsec xfrm reduces the MTU from 1500 to 1446.
Even if every network interface on the Cilium node is set to MTU 1500 (including the ones used by the 2005 route table), xfrm silently lowers the effective MTU to 1446.
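As a quick sanity check, the per-interface MTUs on the node can be listed as follows (a sketch; run inside the node's network namespace):
# Print each interface name and its MTU; all of them report 1500 here,
# yet the effective path MTU under xfrm is 1446.
ip -o link show | awk '{print $2, $5}'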
The reduction happens in ip_forward():
// https://elixir.bootlin.com/linux/v6.2/source/net/ipv4/ip_forward.c#L124
int ip_forward(struct sk_buff *skb)
{
    [...]
    if (!xfrm4_route_forward(skb)) {
        SKB_DR_SET(reason, XFRM_POLICY);
        goto drop;
    }
    rt = skb_rtable(skb);
    if (opt->is_strictroute && rt->rt_uses_gateway)
        goto sr_failed;
    IPCB(skb)->flags |= IPSKB_FORWARDED;
    mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
    if (ip_exceeds_mtu(skb, mtu)) {
    [...]
}
We know that before entering ip_forward(), ip_route_input_noref() has already set skb->_skb_refdst, which is what determines the MTU. However, the code above shows that the presence of xfrm can still change the MTU inside ip_forward(): xfrm4_route_forward() can replace skb->_skb_refdst with the result of xfrm_lookup_with_ifid().
The xfrm MTU is computed by a special algorithm (see https://elixir.bootlin.com/linux/v6.2/source/net/xfrm/xfrm_state.c#L2747). I didn't check all the details, but I used bpftrace to fetch the new MTU after xfrm4_route_forward(), and in our Cilium case the MTU is indeed reduced from 1500 to 1446.
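If you want to reproduce the measurement, something along these lines works; this is only a sketch and assumes xfrm_state_mtu() (which the link above points into) is the function computing the reduced value on your kernel:
# Print the MTU computed by the xfrm state machinery while traffic is forwarded.
bpftrace -e 'kretprobe:xfrm_state_mtu { printf("xfrm mtu: %d\n", retval); }'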
Fact two: traffic from the (local egress) proxy to the (remote ingress) proxy uses a different MTU than pod-to-pod traffic.
Both pod-to-pod and pod-to-remote-ingress-proxy traffic use MTU 1423, which is set on the route rather than on the interface:
$ nspod client2-ccd7b8bdf-nt4nn ip r
default via 10.244.1.12 dev eth0 mtu 1423
10.244.1.12 dev eth0 scope link
However, proxy-to-proxy traffic uses MTU 1500. This subtle difference explains why we only see issues when ingress and egress policies are installed together, a scenario we never covered in tests before.
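One way to confirm where proxy-originated traffic is routed is to ask the kernel from the node's network namespace; this is a sketch, the destination is a hypothetical remote pod IP, and 0xa00 is the from-proxy packet mark discussed below:
# Route lookup as seen by proxy-marked traffic: it falls into table 2005,
# whose routes carry no mtu today, so the device MTU of 1500 applies.
ip route get 10.244.1.50 mark 0xa00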
Proposed solution
It seems we can simply set an MTU on the routes in the 2005 table, because proxy traffic always ends up there due to the 0xa00/0xb00 marks.
For example, the current routes in the 2005 table are:
$ nscontainer kind-control-plane ip r s t 2005
default via 10.244.0.100 dev cilium_host proto kernel
10.244.0.100 dev cilium_host proto kernel scope link
We could change that to
default via 10.244.0.100 dev cilium_host proto kernel mtu 1446
10.244.0.100 dev cilium_host proto kernel scope link mtu 1446
(Haven't checked IPv6)
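For manually trying the idea out on a node, the change roughly amounts to re-adding the two routes with an explicit mtu (a sketch, IPv4 only, using the kind-control-plane addresses above):
ip route replace table 2005 default via 10.244.0.100 dev cilium_host proto kernel mtu 1446
ip route replace table 2005 10.244.0.100 dev cilium_host proto kernel scope link mtu 1446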
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct