
Conversation

@pchaigno pchaigno commented Jan 23, 2023

Fix masquerading bug that caused kube-proxy to pick the wrong IPv4 address in case of tunneling with endpoint routes.

Kube-proxy always masquerades DNATed packets going to NodePort services.
This is to ensure that reply packets always flow through the
intermediate, DNATing node. Consider the following path:

    pod@node1 -> nodeport@node2 -> backend@node3

A packet is sent from pod@node1 to a NodePort service with node2's IP
address. Node2 DNATs the packet and forwards it to the backend on node3.
If node2 doesn't also masquerade the packet, the reply packet will be
sent directly to node1, bypassing the reverse DNAT.
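
As a rough sketch, the DNAT-plus-masquerade combination described above
corresponds to nat rules of the following shape (illustrative only;
kube-proxy generates equivalent rules in its own KUBE-* chains and
matches on a packet mark rather than per-service addresses):

```shell
# On node2: DNAT NodePort traffic to the chosen backend, then masquerade
# it so that replies flow back through node2 for the reverse DNAT.
iptables -t nat -A PREROUTING -p tcp --dport 30656 \
  -j DNAT --to-destination 10.0.0.165:8080
iptables -t nat -A POSTROUTING -p tcp -d 10.0.0.165 --dport 8080 \
  -j MASQUERADE
```

With MASQUERADE, the kernel selects the source address only when the
packet is routed out, which is where the selection goes wrong below.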

In tunneling mode, however, kube-proxy appears unable to pick the
correct source IP for masquerading. Consider the following packet flow
(under VXLAN + endpoint routes + IPsec [1]):

    <- endpoint 656 flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN
    -> stack flow 0x5c7eb4 , identity 20590->host state new ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN
    <- host flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex lxc7e0fe2229abe orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
    -> stack flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex cilium_host orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
    <- stack encrypted  flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_net orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
    -> overlay encrypted  flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN

Client pod 10.0.1.172 sends a packet to NodePort 30656 on node2. That
packet is masqueraded to 10.0.2.15 (line 3 of the trace), the IP on the
default interface. This choice is incorrect because the packet will
then go through the tunnel and not the underlay. The reply will
therefore not be sent through the tunnel and may even fail if 10.0.2.15
isn't routable from node2 (as is the case in our testing setup).
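
The kernel's choice can be inspected directly on node2 (a diagnostic
sketch; the backend address is the one from the trace above and will
differ in other setups):

```shell
# The "src" field in the output is the source address the kernel prefers
# for this destination; MASQUERADE makes the same universe-scope-only
# selection via inet_select_addr.
ip route get 10.0.0.165
```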

Instead, kube-proxy should pick the IP address of cilium_host, which
belongs to the node's pod CIDR, thus ensuring the reply will be routed
through the tunnel. Why isn't it?

Checking the kernel's source code [2], we can see that the scope of IP
addresses on the interfaces is taken into account in addition to the
destination IP (and other packet information in case of source routing,
etc.). Specifically, in the case of netfilter's masquerading,
inet_select_addr is called with a scope of RT_SCOPE_UNIVERSE (0).
Therefore, only IP addresses with a scope equal to RT_SCOPE_UNIVERSE
will be picked.
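
The scopes are visible from userspace (sketch; the address on
cilium_host varies per node):

```shell
# An address listed with "scope link" is skipped by inet_select_addr
# when it is called with RT_SCOPE_UNIVERSE, so MASQUERADE can never
# pick it.
ip -4 addr show dev cilium_host
```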

This commit thus removes the link scope on the IPv4 address of
cilium_host, such that the address now has the default
RT_SCOPE_UNIVERSE scope.
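
In ip(8) terms, the change amounts to the following (the address is
illustrative; Cilium performs the equivalent via netlink when
configuring cilium_host):

```shell
# Before the fix, the address was installed link-scoped, e.g.:
#   ip addr add 10.0.1.126/32 dev cilium_host scope link
# Omitting "scope link" leaves the default global (RT_SCOPE_UNIVERSE)
# scope, which makes the address eligible for masquerade source
# selection.
ip addr replace 10.0.1.126/32 dev cilium_host
```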

This will be tested in the Cilium Datapath workflow via a subsequent
pull request, but we need to fix one other bug before we can do that.

1 - IPsec doesn't matter to the bug here; endpoint routes, however, do.
    If endpoint routes are enabled, Cilium adds a masquerading rule in
    front of kube-proxy's to always masquerade DNATed pod traffic to
    cilium_host's IP address. See [3] for details.
2 - https://github.com/torvalds/linux/blob/v5.19/net/ipv4/devinet.c#L1324
3 - https://github.com/cilium/cilium/blob/v1.13.0-rc4/pkg/datapath/iptables/iptables.go#L1216-L1242
Co-authored-by: Liu Xu <liuxu623@gmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno added kind/bug This is a bug in the Cilium logic. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.12 labels Jan 23, 2023
@pchaigno pchaigno commented Jan 23, 2023

Previous run failed in https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/514/testReport/Suite-k8s-1/26/K8sDatapathServicesTest_Checks_N_S_loadbalancing_Tests_with_direct_routing_and_DSR/. Seems likely to be a flake given this change shouldn't affect KPR.
/test-1.26-net-next

@pchaigno pchaigno marked this pull request as ready for review January 23, 2023 16:10
@pchaigno pchaigno requested a review from a team as a code owner January 23, 2023 16:11
@dylandreimerink dylandreimerink left a comment

Nice find! That must have taken a while to debug, only to find it's a one-line change.

@pchaigno

> Nice find! That must have taken a while to debug, only to find it's a one-line change.

I had help :-) While debugging this I had in mind #21738, which made the same change before. And Daniel also pointed to the kernel code when I asked about the masquerading logic.

@pchaigno

k8s-1.26-kernel-net-next first failed with #23309 then with #23307. Both seem unlikely to be caused by the present change given they run with KPR. Dylan reviewed. Marking ready to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 24, 2023
@ldelossa ldelossa merged commit 92a3e31 into cilium:master Jan 25, 2023
@pchaigno pchaigno deleted the fix-iptables-masquerading-tunneling branch January 25, 2023 16:20
@qmonnet qmonnet mentioned this pull request Jan 30, 2023
@qmonnet qmonnet mentioned this pull request Jan 31, 2023
@qmonnet qmonnet added backport-pending/1.13 backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed needs-backport/1.13 labels Jan 31, 2023
@asauber asauber added the affects/v1.11 This issue affects v1.11 branch label Aug 28, 2023