bpf: lxc: simplify RevNAT path for loopback replies #32480
Conversation
/test
force-pushed from 5ecbae1 to a7a8d53
/test
force-pushed from c4a430e to afd51f7
/test
force-pushed from afd51f7 to 5c1b041
/test
force-pushed from 5c1b041 to 5e7d3f5
/test
force-pushed from 5e7d3f5 to 206e9a9
Rebase to resolve a trivial conflict.
/test
Glad the hack is gone. 👍 Can't really give a meaningful comprehensive review, though.
force-pushed from 206e9a9 to f04f064
Cleaned up a bit more code in
/test
The usual flow for handling service traffic to a local backend is as follows:

* requests are load-balanced in from-container. This entails selecting a backend (and caching the selection in a CT_SERVICE entry), DNATing the packet, creating a CT_EGRESS entry for the resulting `client -> backend` flow, applying egress network policy, and local delivery to the backend pod. As part of the local delivery, we also create a CT_INGRESS entry and apply ingress network policy.
* replies bypass the backend's egress network policy (because the CT lookup returns CT_REPLY), and pass to the client via local delivery. In the client's ingress path they bypass ingress network policy (the packets match as reply against the CT_EGRESS entry), and we apply RevDNAT based on the `rev_nat_index` in the CT_EGRESS entry.

For a loopback connection (where the client pod is selected as backend for the connection) this looks slightly more complicated:

* As we can't establish a `client -> client` connection, the requests are also SNATed with IPV4_LOOPBACK. Network policy in forward direction is explicitly skipped (as the matched CT entries have the `.loopback` flag set).
* In reply direction, we can't deliver to IPV4_LOOPBACK (as that's not a valid IP for an endpoint lookup). So a reply already gets fully RevNATed by from-container, using the CT_INGRESS entry's `rev_nat_index`. But this means that when passing into the client pod (either via to-container, or via the ingress policy tail-call), the packet doesn't match as reply to the CT_EGRESS entry - and so we don't benefit from automatic network policy bypass.

We ended up with two workarounds for this aspect: (1) when to-container is installed, it contains custom logic to match the packet as a loopback reply, and skip ingress policy (see cilium#27798). (2) otherwise we skip the ingress policy tailcall, and forward the packet straight into the client pod.

The downside of these workarounds is that we bypass the *whole* ingress program, not just the network policy part. So the CT_EGRESS entry doesn't get updated (lifetime, statistics, observed packet flags, ...), and we have the hidden risk that when we add more logic to the ingress program, it doesn't get executed for loopback replies.

This patch aims to eliminate the need for such workarounds. At its core, it detects loopback replies in from-container and overrides the packet's destination IP. Instead of attempting an endpoint lookup for IPV4_LOOPBACK, we can now look up the actual client endpoint - and deliver to the ingress policy program, *without* needing to early-RevNAT the packet. Instead the replies follow the usual packet flow, match the CT_EGRESS entry in the ingress program, naturally bypass ingress network policy, and are *then* RevNATed based on the CT_EGRESS entry's `rev_nat_index`.

Consequently we follow the standard datapath, without needing to skip over policy programs. The CT_EGRESS entry is updated for every reply.

Thus we can also remove the manual policy bypass for loopback replies when using per-EP routing. It's no longer needed, and in fact the replies will no longer match the lookup logic, as they haven't been RevNATed yet. This effectively reverts e2829a0 ("bpf: lxc: support Pod->Service->Pod hairpinning with endpoint routes").

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
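To make the core idea concrete, here is a minimal standalone sketch of how from-container can pick the address for the endpoint lookup once it sees a loopback reply. This is an illustrative model only - the types and the helper name are simplified stand-ins, not Cilium's actual definitions:

```c
/*
 * Illustrative model only: CT_REPLY and the ct_state loopback flag mirror the
 * commit message above, but everything here is a simplified stand-in rather
 * than Cilium's real datapath code.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t be32;	/* stand-in for __be32 */

enum ct_status { CT_NEW, CT_ESTABLISHED, CT_REPLY };

struct ct_state {
	bool loopback;		/* client was selected as its own backend */
	uint16_t rev_nat_index;	/* used later, on the ingress side */
};

/* Address that from-container should feed into the endpoint lookup. */
static be32 endpoint_lookup_addr(be32 saddr, be32 daddr,
				 enum ct_status status,
				 const struct ct_state *cs)
{
	/*
	 * A loopback reply is addressed to IPV4_LOOPBACK, which can't be
	 * resolved to an endpoint. But on such a connection the client *is*
	 * the backend, so the real destination equals the reply's source
	 * address. No early RevNAT is needed - the ingress program does it.
	 */
	if (status == CT_REPLY && cs->loopback)
		return saddr;

	return daddr;
}
```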
The `hairpin_flow` parameter was previously needed so that loopback replies could bypass the ingress network policy. But now that all callers set this parameter to `false`, we can safely remove the corresponding logic. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
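As a rough illustration of why the explicit hairpin bypass becomes redundant, here is a small standalone model of the ingress-side flow. The helpers `ingress_policy_allows()` and `lb_rev_nat()` are hypothetical stand-ins for policy enforcement and service reverse NAT, not Cilium's real API:

```c
/*
 * Standalone model of the ingress-side flow: a reply that matches the
 * client's CT_EGRESS entry already bypasses ingress policy and is RevNATed
 * from that entry's rev_nat_index - loopback or not, so no separate
 * hairpin_flow bypass is required.
 */
#include <stdbool.h>
#include <stdint.h>

enum ct_status { CT_NEW, CT_ESTABLISHED, CT_REPLY };

struct ct_entry {
	uint16_t rev_nat_index;	/* recorded when the request was load-balanced */
};

struct packet {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Hypothetical stand-ins for policy enforcement and LB reverse NAT. */
static bool ingress_policy_allows(const struct packet *p) { (void)p; return true; }
static void lb_rev_nat(struct packet *p, uint16_t index) { (void)p; (void)index; }

/* Return 0 to deliver to the client pod, -1 to drop. */
static int ingress_handle(struct packet *p, enum ct_status status,
			  const struct ct_entry *ct_egress)
{
	/* Replies matching the CT_EGRESS entry bypass ingress policy... */
	if (status != CT_REPLY && !ingress_policy_allows(p))
		return -1;

	/*
	 * ...and are RevNATed based on that entry, so once loopback replies
	 * take this path too, they need no special casing here.
	 */
	if (status == CT_REPLY && ct_egress->rev_nat_index)
		lb_rev_nat(p, ct_egress->rev_nat_index);

	return 0;
}
```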
force-pushed from f04f064 to 6461fc8
/test
(no-change rebase to refresh CI result)
my understanding of the current state of things:
- when a client selects itself as the service backend, we need to SNAT the packet with IPV4_LOOPBACK to deal with the martian-source issue; otherwise the kernel would just drop such traffic
- when dealing with replies, we decided to go for an early rev-SNAT approach to allow looking up the destination endpoint, as the lookup would otherwise fail with the IPV4_LOOPBACK IP
- because of this we need to handle a couple of special cases (which have the side effect of skipping a good chunk of the ingress program), as we don't have a CT entry to detect such reply traffic

the proposed solution is:
- don't do the early rev-SNAT, and instead override the daddr we pass to lookup_ip4_endpoint with the saddr, as for hairpin packets we can assume revNATed daddr == saddr (see the worked example below)
- this allows handling this traffic like regular traffic
- we then do revSNAT as for any other regular traffic

and that LGTM
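To make the "revNATed daddr == saddr" assumption concrete, here is a worked walk-through of one loopback connection. The addresses and ports are made up for illustration, and IPV4_LOOPBACK stands for Cilium's reserved loopback source address:

```c
/*
 * Worked example (made-up addresses): client pod 10.0.1.5 connects to service
 * VIP 10.96.0.10:80 and is selected as its own backend.
 *
 * Request, leaving the client (from-container):
 *   10.0.1.5:41000 -> 10.96.0.10:80       original packet
 *   10.0.1.5:41000 -> 10.0.1.5:80         after DNAT to the backend (itself)
 *   IPV4_LOOPBACK:41000 -> 10.0.1.5:80    after loopback SNAT (avoids a martian source)
 *
 * Reply, hitting from-container again:
 *   10.0.1.5:80 -> IPV4_LOOPBACK:41000    as sent by the backend
 *   The endpoint lookup now uses the saddr (10.0.1.5) instead of
 *   IPV4_LOOPBACK - the fully RevNATed destination would be 10.0.1.5 anyway.
 *
 * Reply, in the client's ingress program:
 *   matches the CT_EGRESS entry as a reply -> ingress policy bypassed,
 *   then RevNAT yields 10.96.0.10:80 -> 10.0.1.5:41000, delivered to the client.
 */
```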
This PR aims to clean up some long-standing tech debt in how we handle reply traffic for a loopback connection (a client connecting to itself through a service).
Quoting from the first patch: