Broken connectivity when using BPF masquerade, no tunnel and BPF Host Routing on EKS with 1 device #27343

@margamanterola

Description

I tracked down this issue to a change that happened between v1.14.0-snapshot.1 and v1.14.0-snapshot.2.

On an EKS cluster, packets that are supposed to go to the EKS kube-apiserver get silently dropped when this Helm config is applied:

bpf:
  masquerade: true
devices: eth0
eni:
  enabled: true
ipam:
  mode: eni
kubeProxyReplacement: strict
tunnel: disabled

Disabling BPF masquerading, using eth+ for devices, or setting tunnel: vxlan each restores communication with the kube-apiserver. kubeProxyReplacement (KRP) is unrelated, but it needs to be set to some value when using the 1.14.0 chart, because its default changed to false.

Same Helm configuration, but as a command (for easier testing):

helm upgrade -n kube-system cilium cilium/cilium --version 1.14.0  \
  --set image.tag=v1.14.0-snapshot.2 --set image.useDigest=false \
  --set bpf.masquerade=true \
  --set devices=eth0 \
  --set tunnel=disabled \
  --set eni.enabled=true \
  --set ipam.mode=eni \
  --set kubeProxyReplacement=strict
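
For reference, applying any one of the following overrides on top of the command above restores connectivity (these mirror the workarounds listed earlier):

  --set bpf.masquerade=false   # disable BPF masquerading
  --set devices=eth+           # match all ENI interfaces instead of only eth0
  --set tunnel=vxlan           # tunnel instead of direct routing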

When setting devices: eth+, I can see with cilium monitor that the packet is now being routed through eth1.
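
The traces below were captured from inside the agent pod, roughly like this (the exec target and endpoint filter are illustrative; 3678 is the local endpoint ID seen in the verdict lines):

kubectl -n kube-system exec ds/cilium -- cilium monitor --related-to 3678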

With 1.13.5, the packet goes and comes back over ifindex 0.

Policy verdict log: flow 0x93b86448 local EP ID 3678, remote ID 16777218, proto 6, egress, action allow, match L3-L4, 10.4.201.117:40748 -> 10.4.1.206:443 tcp SYN
-> stack flow 0x93b86448 , identity 35566->16777218 state new ifindex 0 orig-ip 0.0.0.0: 10.4.201.117:40748 -> 10.4.1.206:443 tcp SYN
-> endpoint 3678 flow 0xe857e857 , identity 16777218->35566 state reply ifindex 0 orig-ip 10.4.1.206: 10.4.1.206:443 -> 10.4.201.117:40748 tcp SYN, ACK
-> stack flow 0x93b86448 , identity 35566->16777218 state established ifindex 0 orig-ip 0.0.0.0: 10.4.201.117:40748 -> 10.4.1.206:443 tcp ACK

With 1.14.0-snapshot.2 & devices=eth0, the packet is sent out on eth0 but the SYN,ACK never comes back:

Policy verdict log: flow 0xcf0da62e local EP ID 3678, remote ID 16777268, proto 6, egress, action allow, match L4-Only, 10.4.201.117:51344 -> 10.4.1.206:443 tcp SYN
-> network flow 0xcf0da62e , identity 35566->16777268 state new ifindex eth0 orig-ip 0.0.0.0: 10.4.201.117:51344 -> 10.4.1.206:443 tcp SYN
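
To double-check whether the datapath at least reports a drop for this flow, the drop events can be watched separately (sketch, same illustrative exec target as above):

kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop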

With 1.14.0-snapshot.2 & devices=eth+, the packet is sent out on eth1 and the SYN,ACK makes it back:

Policy verdict log: flow 0x2d05b7a4 local EP ID 3678, remote ID 16777217, proto 6, egress, action allow, match L4-Only, 10.4.201.117:35980 -> 10.4.1.206:443 tcp SYN
-> network flow 0x2d05b7a4 , identity 35566->16777217 state new ifindex eth1 orig-ip 0.0.0.0: 10.4.201.117:35980 -> 10.4.1.206:443 tcp SYN
-> endpoint 3678 flow 0xa6c8a6c8 , identity 16777217->35566 state reply ifindex lxc6bedd6da59c0 orig-ip 10.4.1.206: 10.4.1.206:443 -> 10.4.201.117:35980 tcp SYN, ACK
-> network flow 0x2d05b7a4 , identity 35566->16777217 state established ifindex eth1 orig-ip 0.0.0.0: 10.4.201.117:35980 -> 10.4.1.206:443 tcp ACK

While discussing this with Julian, I realized that #22006 (merged during the mentioned window) made changes around BPF host routing. So I tried switching from BPF Host Routing to legacy host routing, and connectivity was restored.
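
As a sketch, that workaround is a single extra override on top of the repro command above:

  --set bpf.hostLegacyRouting=true   # fall back to legacy host routing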

With 1.14.0-snapshot.2, devices=eth0 and bpf.hostLegacyRouting=true, the SYN and SYN,ACK again travel over ifindex 0:

Policy verdict log: flow 0x2050dec6 local EP ID 3678, remote ID 16777268, proto 6, egress, action allow, match L4-Only, 10.4.201.117:34694 -> 10.4.1.206:443 tcp SYN
-> stack flow 0x2050dec6 , identity 35566->16777268 state new ifindex 0 orig-ip 0.0.0.0: 10.4.201.117:34694 -> 10.4.1.206:443 tcp SYN
-> endpoint 3678 flow 0xb495b495 , identity 16777268->35566 state reply ifindex 0 orig-ip 10.4.1.206: 10.4.1.206:443 -> 10.4.201.117:34694 tcp SYN, ACK
-> stack flow 0x2050dec6 , identity 35566->16777268 state established ifindex 0 orig-ip 0.0.0.0: 10.4.201.117:34694 -> 10.4.1.206:443 tcp ACK

I'm not sure if it means anything, but it's interesting that the Cilium identities for the kube-apiserver change depending on the configuration. With legacy host routing OR devices: eth0, it uses these two identities:

16777268   cidr:10.4.2.114/32
           reserved:kube-apiserver
           reserved:world
16777269   cidr:10.4.1.206/32
           reserved:kube-apiserver
           reserved:world

With BPF Host Routing AND devices: eth+, it uses these other two:

16777217   cidr:10.4.2.114/32
           reserved:kube-apiserver
           reserved:world
16777218   cidr:10.4.1.206/32
           reserved:kube-apiserver
           reserved:world
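
For completeness, the identity allocations above come from the agent and can be dumped with something like this (illustrative exec target):

kubectl -n kube-system exec ds/cilium -- cilium identity list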

So, this particular combination of features doesn't work:

  • EKS environment (eni.enabled: true + ipam.mode: eni)
  • BPF Masquerade
  • Direct Routing (tunnel: disabled)
  • Only one interface (devices: eth0)
  • BPF Host Routing (selected by default)

This is a change in behavior. The same configuration that worked correctly up until 1.14.0-snapshot.1 stopped working after 1.14.0-snapshot.2.

If this is now expected behavior and people need to manually switch to legacy host routing in this scenario, we should at least document it. But ideally Cilium should do the right thing rather than silently drop packets.

CC: @julianwiedmann, @aspsk

Metadata

Labels

  • area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium.
