Skip to content

bpf: Enforce symmetric routing for endpoints with parent interfaces #35298

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

dylandreimerink
Copy link
Member

@dylandreimerink dylandreimerink commented Oct 8, 2024

In certain setups, such as AWS ENI, it is expected that non-masqueraded traffic from an endpoint always egresses via a designated interface. We call this interface its parent interface.

Normally, this is taken care of by routes and assumptions. We have routes that direct traffic to the parent interface, for pod-to-pod and CIDRs in the non-masquerade list.

However, there are cases such as when the AWS NLB is configured to run in "target type: ip" (balance traffic directly to pod IPs) and with the preserve_client_ip option enabled (do not SNAT the source IP) that the route based approach breaks down. In this case, Cilium sees world traffic directly to pod IPs, which normally can't happen, except
for the NLB transparent traffic modifications. For ingress traffic this is not an issue, but return traffic does not hit any of the routes that would direct it to the parent interface. Nor can we make a route since every possible public IP can be a source.

As a fix, this PR adds logic to the snat/fwd path to check if we see reply traffic from an endpoint with a parent interface egressing out of any interface other than the parent interface. If we do we redirect the traffic to the parent interface. The location of this logic is chosen so that its compatible with proxied traffic as well.

This is a pwru log from a manual test in EKS:

SKB                CPU PROCESS          NETNS      MARK/x        IFACE       PROTO  MTU   LEN   TUPLE FUNC
# Reply packet gets created in process
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026533432 0               0         0x0000 8841  60    10.26.102.102:80->77.168.92.51:48766(tcp) __ip_local_out
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026533432 0               0         0x0800 8841  60    10.26.102.102:80->77.168.92.51:48766(tcp) ip_output
# Gets routed to eth0 inside the pod
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026533432 0            eth0:39      0x0800 8841  60    10.26.102.102:80->77.168.92.51:48766(tcp) ip_finish_output
[...]
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026533432 0            eth0:39      0x0800 8921  74    10.26.102.102:80->77.168.92.51:48766(tcp) eth_type_trans
# Packet traverses the veth pair, now enters the host NS
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026531992 0        ~007925156fc8:40 0x0800 8921  60    10.26.102.102:80->77.168.92.51:48766(tcp) netif_rx
[...]
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026531992 b1980f00 ~007925156fc8:40 0x0800 8921  60    10.26.102.102:80->77.168.92.51:48766(tcp) ip_output
# Packet gets routed to eth0 of the host (expected for traffic to a world IP)
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026531992 b1980f00      eth0:2      0x0800 8921  60    10.26.102.102:80->77.168.92.51:48766(tcp) nf_hook_slow
[...]
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026531992 b1980f00      eth0:2      0x0800 8921  74    10.26.102.102:80->77.168.92.51:48766(tcp) __bpf_redirect
# The new logic kicks in, detects its reply traffic and redirects to eth1 (the parent interface for 10.26.102.102)
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026531992 b1980f00     eth1:34      0x0800 8921  74    10.26.102.102:80->77.168.92.51:48766(tcp) dev_queue_xmit
[...]
# Packet actually is sent out
0xffff88d67b311f00 1   ~/bin/pwru:18652 4026531992 b1980f00     eth1:34      0x0800 8921  74    10.26.102.102:80->77.168.92.51:48766(tcp) dev_hard_start_xmit
[...]
0xffff88d67b311f00 3   ~/bin/pwru:18652 4026531992 b1980f00     eth1:34      0x0800 8921  74    10.26.102.102:80->77.168.92.51:48766(tcp) kfree_skbmem

@dylandreimerink dylandreimerink added kind/enhancement This would improve or streamline existing functionality. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/eni Impacts ENI based IPAM. release-note/misc This PR makes changes that have no direct user impact. labels Oct 8, 2024
@dylandreimerink dylandreimerink changed the title Feature/eni reply pod traffic via parent interface bpf: Enforce symmetric routing for endpoints with parent interfaces Oct 8, 2024
@dylandreimerink dylandreimerink force-pushed the feature/eni-reply-pod-traffic-via-parent-interface branch from 4eeed88 to 8757292 Compare October 9, 2024 14:53
@dylandreimerink
Copy link
Member Author

/test

1 similar comment
@dylandreimerink
Copy link
Member Author

/test

@dylandreimerink dylandreimerink force-pushed the feature/eni-reply-pod-traffic-via-parent-interface branch 2 times, most recently from 1f45d3d to 940ddb2 Compare November 4, 2024 14:47
@dylandreimerink dylandreimerink force-pushed the feature/eni-reply-pod-traffic-via-parent-interface branch 3 times, most recently from 4852be3 to 7f17b08 Compare November 5, 2024 14:49
@dylandreimerink
Copy link
Member Author

/test

@dylandreimerink dylandreimerink force-pushed the feature/eni-reply-pod-traffic-via-parent-interface branch from 7f17b08 to 9fa8e21 Compare November 5, 2024 15:05
@dylandreimerink
Copy link
Member Author

/test

@dylandreimerink dylandreimerink marked this pull request as ready for review November 6, 2024 11:08
@dylandreimerink dylandreimerink requested review from a team as code owners November 6, 2024 11:08
@gandro gandro self-requested a review November 6, 2024 16:43
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super clean implementation! Looks good to me from an IPAM/ENI perspective.

I wonder if we could actually use the parent if index to get rid of the ip rules in certain KPR modes too.

The ENI "ip rule" lifecycle management have been a repeated source of bugs. I think having only a field in the endpoint map rather than a set or rules and routes would simplify state management quite a bit. But that's of course work for later, not something we should do in this PR.

julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Jan 31, 2025
As discussed in [0], this feature should currently only be used when
BPF masquerading is enabled.

[0] cilium#35298 (comment)

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Jan 31, 2025
Explain why we're doing a plain redirect, without rewriting the DMAC.
See cilium#35298 (comment).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Feb 6, 2025
As discussed in [0], this feature should currently only be used when
BPF masquerading is enabled.

[0] #35298 (comment)

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Feb 6, 2025
Explain why we're doing a plain redirect, without rewriting the DMAC.
See #35298 (comment).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
jongj pushed a commit to jongj/cilium that referenced this pull request Feb 11, 2025
As discussed in [0], this feature should currently only be used when
BPF masquerading is enabled.

[0] cilium#35298 (comment)

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
jongj pushed a commit to jongj/cilium that referenced this pull request Feb 11, 2025
Explain why we're doing a plain redirect, without rewriting the DMAC.
See cilium#35298 (comment).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
nbusseneau pushed a commit that referenced this pull request Feb 14, 2025
[ upstream commit 94432bf ]

As discussed in [0], this feature should currently only be used when
BPF masquerading is enabled.

[0] #35298 (comment)

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
nbusseneau pushed a commit that referenced this pull request Feb 14, 2025
[ upstream commit f1ab7eb ]

Explain why we're doing a plain redirect, without rewriting the DMAC.
See #35298 (comment).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
nbusseneau pushed a commit that referenced this pull request Feb 14, 2025
[ upstream commit 94432bf ]

As discussed in [0], this feature should currently only be used when
BPF masquerading is enabled.

[0] #35298 (comment)

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
nbusseneau pushed a commit that referenced this pull request Feb 14, 2025
[ upstream commit f1ab7eb ]

Explain why we're doing a plain redirect, without rewriting the DMAC.
See #35298 (comment).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
nbusseneau pushed a commit that referenced this pull request Feb 14, 2025
[ upstream commit 94432bf ]

As discussed in [0], this feature should currently only be used when
BPF masquerading is enabled.

[0] #35298 (comment)

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
nbusseneau pushed a commit that referenced this pull request Feb 14, 2025
[ upstream commit f1ab7eb ]

Explain why we're doing a plain redirect, without rewriting the DMAC.
See #35298 (comment).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Feb 20, 2025
[ upstream commit 94432bf ]

As discussed in [0], this feature should currently only be used when
BPF masquerading is enabled.

[0] #35298 (comment)

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Feb 20, 2025
[ upstream commit f1ab7eb ]

Explain why we're doing a plain redirect, without rewriting the DMAC.
See #35298 (comment).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
dylandreimerink added a commit to dylandreimerink/cilium that referenced this pull request May 16, 2025
In cilium#35298 endpoints got the parentIfIndex field to store the parent
interface index, which is used to ensure endpoint traffic is routed
out of the parent interface when set.

This interface index is given to the agent by the CNI plugin and stored
in the endpoint object. However, when the agent is restarted, it has
to restore endpoint state from disk. The parentIfIndex field was not
being stored or restored, so after a restart the parentIfIndex would be
set to 0, which would cause the agent to not route traffic out of the
parent interface.

This commit adds the parentIfIndex field to the endpoint state
serialization and deserialization process, so that it is stored and
restored correctly.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
dylandreimerink added a commit to dylandreimerink/cilium that referenced this pull request May 21, 2025
In cilium#35298 endpoints got the parentIfIndex field to store the parent
interface index, which is used to ensure endpoint traffic is routed
out of the parent interface when set.

This interface index is given to the agent by the CNI plugin and stored
in the endpoint object. However, when the agent is restarted, it has
to restore endpoint state from disk. The parentIfIndex field was not
being stored or restored, so after a restart the parentIfIndex would be
set to 0, which would cause the agent to not route traffic out of the
parent interface.

This commit adds the parentIfIndex field to the endpoint state
serialization and deserialization process, so that it is stored and
restored correctly.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
dylandreimerink added a commit to dylandreimerink/cilium that referenced this pull request May 22, 2025
In cilium#35298 endpoints got the parentIfIndex field to store the parent
interface index, which is used to ensure endpoint traffic is routed
out of the parent interface when set.

This interface index is given to the agent by the CNI plugin and stored
in the endpoint object. However, when the agent is restarted, it has
to restore endpoint state from disk. The parentIfIndex field was not
being stored or restored, so after a restart the parentIfIndex would be
set to 0, which would cause the agent to not route traffic out of the
parent interface.

This commit adds the parentIfIndex field to the endpoint state
serialization and deserialization process, so that it is stored and
restored correctly.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
dylandreimerink added a commit to dylandreimerink/cilium that referenced this pull request May 22, 2025
In cilium#35298 endpoints got the parentIfIndex field to store the parent
interface index, which is used to ensure endpoint traffic is routed
out of the parent interface when set.

This interface index is given to the agent by the CNI plugin and stored
in the endpoint object. However, when the agent is restarted, it has
to restore endpoint state from disk. The parentIfIndex field was not
being stored or restored, so after a restart the parentIfIndex would be
set to 0, which would cause the agent to not route traffic out of the
parent interface.

This commit adds the parentIfIndex field to the endpoint state
serialization and deserialization process, so that it is stored and
restored correctly.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request May 26, 2025
In #35298 endpoints got the parentIfIndex field to store the parent
interface index, which is used to ensure endpoint traffic is routed
out of the parent interface when set.

This interface index is given to the agent by the CNI plugin and stored
in the endpoint object. However, when the agent is restarted, it has
to restore endpoint state from disk. The parentIfIndex field was not
being stored or restored, so after a restart the parentIfIndex would be
set to 0, which would cause the agent to not route traffic out of the
parent interface.

This commit adds the parentIfIndex field to the endpoint state
serialization and deserialization process, so that it is stored and
restored correctly.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
viktor-kurchenko pushed a commit that referenced this pull request May 28, 2025
[ upstream commit 0b6adf6 ]

In #35298 endpoints got the parentIfIndex field to store the parent
interface index, which is used to ensure endpoint traffic is routed
out of the parent interface when set.

This interface index is given to the agent by the CNI plugin and stored
in the endpoint object. However, when the agent is restarted, it has
to restore endpoint state from disk. The parentIfIndex field was not
being stored or restored, so after a restart the parentIfIndex would be
set to 0, which would cause the agent to not route traffic out of the
parent interface.

This commit adds the parentIfIndex field to the endpoint state
serialization and deserialization process, so that it is stored and
restored correctly.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Signed-off-by: viktor-kurchenko <viktor.kurchenko@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request May 28, 2025
[ upstream commit 0b6adf6 ]

In #35298 endpoints got the parentIfIndex field to store the parent
interface index, which is used to ensure endpoint traffic is routed
out of the parent interface when set.

This interface index is given to the agent by the CNI plugin and stored
in the endpoint object. However, when the agent is restarted, it has
to restore endpoint state from disk. The parentIfIndex field was not
being stored or restored, so after a restart the parentIfIndex would be
set to 0, which would cause the agent to not route traffic out of the
parent interface.

This commit adds the parentIfIndex field to the endpoint state
serialization and deserialization process, so that it is stored and
restored correctly.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Signed-off-by: viktor-kurchenko <viktor.kurchenko@isovalent.com>
zzuckerfrei pushed a commit to zzuckerfrei/cilium that referenced this pull request May 29, 2025
In cilium#35298 endpoints got the parentIfIndex field to store the parent
interface index, which is used to ensure endpoint traffic is routed
out of the parent interface when set.

This interface index is given to the agent by the CNI plugin and stored
in the endpoint object. However, when the agent is restarted, it has
to restore endpoint state from disk. The parentIfIndex field was not
being stored or restored, so after a restart the parentIfIndex would be
set to 0, which would cause the agent to not route traffic out of the
parent interface.

This commit adds the parentIfIndex field to the endpoint state
serialization and deserialization process, so that it is stored and
restored correctly.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/eni Impacts ENI based IPAM. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants