-
Notifications
You must be signed in to change notification settings - Fork 3.4k
bpf: Remove link scope of cilium_host
's IPv4 address
#23241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpf: Remove link scope of cilium_host
's IPv4 address
#23241
Conversation
Kube-proxy always masquerades DNATed packets going to NodePort services. This is to ensure that reply packets always flow through the intermediate, DNATing node. Consider the following path: pod@node1 -> nodeport@node2 -> backend@node3 A packet is sent from pod@node1 to a NodePort service with node2's IP address. Node2 DNATs the packet and forwards it to the backend on node3. If node2 doesn't also masquerade the packet, the reply packet will be sent directly to node1, bypassing the reverse DNAT. In tunneling mode however, kube-proxy appears unable to pick the correct source IP for masquerading. Consider the following packet flow (under VXLAN + endpoint routes + IPsec [1]): <- endpoint 656 flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN -> stack flow 0x5c7eb4 , identity 20590->host state new ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN <- host flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex lxc7e0fe2229abe orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN -> stack flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex cilium_host orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN <- stack encrypted flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_net orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN -> overlay encrypted flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN Client pod 10.0.1.172 sends a packet to NodePort 30656 on node 2. That packet is masqueraded to 10.0.2.15 (line 3), the IP on the default interface. This choice is incorrect as the packet will then go through the tunnel and not the underlay. The reply will therefore not be sent through the tunnel and may even fail if 10.0.2.15 isn't routable from node 2 (as is the case in our testing setup). Instead, kube-proxy should pick the IP address of cilium_host, which belongs to the node's pod CIDR, thus ensuring the reply will be routed through the tunnel. Why isn't it? Checking the kernel's source code [2], we can see that the scope of IP addresses on the interfaces is taken into account in addition to the destination IP (and other packet information in case of source routing, etc.). Specifically, in the case of netfilter's masquerading, inet_select_addr is called with a scope of RT_SCOPE_UNIVERSE (0). Therefore, only IP addresses with a scope equal to RT_SCOPE_UNIVERSE will be picked. This commit thus removes the link scope on the IPv4 address of cilium_host, such that the address now has a RT_SCOPE_UNIVERSE scope (default). This will be tested in the Cilium Datapath workflow via a subsequent pull request, but we need to fix one other bug before we can do that. 1 - IPsec doesn't matter to the bug here. Endpoint routes however does. If endpoint routes is enabled, Cilium adds a masquerading rule in front of kube-proxy's to always masquerade DNATed pod traffic to cilium_host IP address. See [3] for details. 2 - https://github.com/torvalds/linux/blob/v5.19/net/ipv4/devinet.c#L1324 3 - https://github.com/cilium/cilium/blob/v1.13.0-rc4/pkg/datapath/iptables/iptables.go#L1216-L1242 Co-authored-by: Liu Xu <liuxu623@gmail.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
Previous run failed in https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/514/testReport/Suite-k8s-1/26/K8sDatapathServicesTest_Checks_N_S_loadbalancing_Tests_with_direct_routing_and_DSR/. Seems likely to be a flake given this change shouldn't affect KPR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find! That must have taken a while to debug, only to find its a one line change.
I had help :-) While debugging this I had in mind #21738, which made the same change before. And Daniel also pointed to the kernel code when I asked about the masquerading logic. |
Kube-proxy always masquerades DNATed packets going to NodePort services. This is to ensure that reply packets always flow through the intermediate, DNATing node. Consider the following path:
A packet is sent from pod@node1 to a NodePort service with node2's IP address. Node2 DNATs the packet and forwards it to the backend on node3. If node2 doesn't also masquerade the packet, the reply packet will be sent directly to node1, bypassing the reverse DNAT.
In tunneling mode however, kube-proxy appears unable to pick the correct source IP for masquerading. Consider the following packet flow (under VXLAN + endpoint routes + IPsec [1]):
Client pod 10.0.1.172 sends a packet to NodePort 30656 on node 2. That packet is masqueraded to 10.0.2.15 (line 3), the IP on the default interface. This choice is incorrect as the packet will then go through the tunnel and not the underlay. The reply will therefore not be sent through the tunnel and may even fail if 10.0.2.15 isn't routable from node 2 (as is the case in our testing setup).
Instead, kube-proxy should pick the IP address of
cilium_host
, which belongs to the node's pod CIDR, thus ensuring the reply will be routed through the tunnel. Why isn't it?Checking the kernel's source code [2], we can see that the scope of IP addresses on the interfaces is taken into account in addition to the destination IP (and other packet information in case of source routing, etc.). Specifically, in the case of netfilter's masquerading,
inet_select_addr
is called with a scope ofRT_SCOPE_UNIVERSE
(0). Therefore, only IP addresses with a scope equal toRT_SCOPE_UNIVERSE
will be picked.This pull request thus removes the link scope on the IPv4 address of
cilium_host
, such that the address now has aRT_SCOPE_UNIVERSE
scope (default).This will be tested in the Cilium Datapath workflow via a subsequent pull request, but we need to fix one other bug before we can do that.
1 - IPsec doesn't matter to the bug here. Endpoint routes however does. If endpoint routes is enabled, Cilium adds a masquerading rule in front of kube-proxy's to always masquerade DNATed pod traffic to
cilium_host
IP address. See [3] for details.Co-authored-by: Liu Xu liuxu623@gmail.com
Signed-off-by: Paul Chaignon paul@cilium.io