Implement proxy redirection logic in eBPF #11279

joestringer · 2020-05-01T01:55:59Z

Summary

Introduce a new implementation of the proxy redirect handling in the eBPF datapath using a combination of socket lookups and socket assign (+ packet redirects where necessary to ensure that socket assign occurs at TC ingress).

This functionality is disabled by default in the daemon, but enabled in new v1.9 deployments via Helm. When enabled on Linux 5.7 or later, it removes use of the -j TPROXY target, and removes Cilium's usage of the upper 16 bits of the mark for the purpose of sharing the tproxy target port with iptables rules.

API impact

Three new monitor error codes are added for cases where the redirect cannot be applied, meaning that traffic subject to L7 policy will be dropped if there is a configuration or lookup issue:

DROP_PROXY_LOOKUP_FAILED
- Lookup for the target proxy socket failed. This will trigger drop when the datapath cannot find any establish proxy connection for this tuple AND the datapath cannot find a TPROXY listening socket corresponding to the redirect port in the eBPF policy map. If the proxy is not available for any reason then this could be a reason to see this error.
DROP_PROXY_SET_FAILED
- Failed to assign the socket to the packet context. Typically programmer error but may trigger if the socket which was found previously is unavailable or has incompatible options set (eg SO_REUSEPORT). In the event of this error, the default monitor logs will additionally provide an errno to determine which subfailure occurred.
DROP_PROXY_UNKNOWN_PROTO
- Safety catchall in case we attempt to redirect traffic other than TCP/UDP to a socket. As of Linux 5.7, no other L4 protocols are supported for tproxy redirection.

Implementation

Previously, in the eBPF datapath during policy apply and redirect processing, we have relied purely on the tproxy port which is available in the policymap to prepare the packet for TPROXY redirection. Mechanically this path involved performing a policy lookup using the src/dst identity, then in the policymap entry, using non-zero redirect port as a signal for redirect. This port would then be set into the ctx->mark along with some bits to identify this traffic as "needs proxy redirection". Then the packet would be passed up the stack towards an iptables PREROUTING rule which would match on the mark then perform a socket lookup and assign the socket to the skb.

The first step of the new implementation is to ensure that the relevant forwarding logic has access to the full packet tuple. There is one fairly significant condition for applying the socket to the skb: It must be done at TC ingress. In many cases, Cilium already applies all logic at TC ingress eBPF programs so this isn't too difficult. For endpoint-routes mode however we apply pod ingress policy at the TC egress hook point, and therefore must first hairpin to another device, then extract the tuple & perform the socket lookups to assign the socket to the skb.

With this in place, the actual lookup & assign is relatively straightforward. First we must look for any established connections corresponding to the tuple in the local network namespace.These represent existing proxy connections that were already established. If we find the socket here, we assign it to the skb & pass up the stack. If we don't find it, then we zero out the tuple and set the TPROXY redirect port from the policy into the tuple and perform a second lookup for listen sockets. This should provide the parent listener socket associated with the proxy which we can assign to establish a new connection in the proxy.

For now, this PR introduces dedicated tuple extraction logic in bpf_host.c for the endpoint routes mode case rather than sharing it with host firewall, primarily because it predates host firewall but also because the tuple extraction needs to be performed if eBPF tproxy is enabled but host firewall is not. A future cleanup could look at shifting tuple extraction into common code which would be enabled by either path and shared. Further optimizations could be explored to perhaps avoid further packet touching in this prog by sharing some of that metadata from the previous program that was run at TC egress of the endpoint subject to L7 policy.

This PR replaces the per-redirect iptables rule logic but not the static -m socket logic as this requires further exploration to determine the simplest path forward. Fortunately, the newly added connectivity checks for proxy paths makes it easy to reproduce failures on these paths to allow investigation and experimentation.

Testing

The recently-added examples/kubernetes/connectivity-check-proxy.yaml is now added to the main CI for testing various proxy paths to prevent regressions. Additionally I have manually deployed this YAML for debugging in a GKE environment (default instructions, endpoint-routes mode).

The BPF checks / checkpatch (pull_request) GitHub action job doesn't seem to like my use of macros to avoid duplicating IPv4/IPv6 implementations, but I've looked through the complaints and AFAICT they're all safe to ignore.

Tasks

Daemon logic to override enable/disable
- Iptables compatibility mode that still configures rules for compatibility with existing rulesets (eg kube-proxy)
IPv6 support
Endpoint-routes mode validation
CI tests to validate the new paths
Issue description for reviewers :-)

Deferred to follow up PRs

Replace -m socket rules - I looked briefly into this, should be doable in one of a couple of ways but it's more involved so I figured it's better to get the basics in first.
This PR touches the issue Unable to upgrade from Cilium 1.8 to 1.9 without breaking compatibility #12288 but does not resolve it.

I marked this 'release-note/major' with the (ambitious?) goal of it being listed with other PRs as part of iptables removal for v1.9, but ideally it should have no user-facing impact :-) Happy to move it to /minor if we think this is not v1.9 release higlight worthy.

Related: #9921

coveralls · 2020-05-01T02:22:20Z

Coverage decreased (-0.003%) to 37.134% when pulling 49693ef219db40ca8eae710c11a368afc74cf5bd on joestringer:submit/bpf-tproxy into bd89e83 on cilium:master.

errordeveloper

CLI and Helm changes LGTM.

Use the define here to improve readability and locating the code that changes specific offsets in the context metadata (eg skb->cb). The value is the same, this is a purely cosmetic change. Signed-off-by: Joe Stringer <joe@cilium.io>

When hairpinning, it's useful to have two datapoints for the proxy redirect: One in the original program and one in the program that actually prepares the context to send the packet to the proxy. The ctx_redirect_to_proxy_hairpin one should occur in the original program, so change it to DBG_CAPTURE_PROXY_PRE. The corresponding DBG_CAPTURE_PROXY_POST should be handled in the "to-host" program later. Signed-off-by: Joe Stringer <joe@cilium.io>

Add drop and debug notifications that will be used by upcoming commits to emit informational events from the datapath when BPF TPROXY is enabled. Signed-off-by: Joe Stringer <joe@cilium.io>

Define these helpers for use in upcoming BPF TPROXY commits. Signed-off-by: Joe Stringer <joe@cilium.io>

In preparation for implementing BPF TPROXY in ctx_redirect_to_proxy(), pass the tuple into this function from all of its callers. In this commit it will remain unused but in future commits this will begin to be used in particular cases to perform a socket lookup in the callee. When endpoint-routes mode is enabled, we're executing on TC egress which can't make use of the tuple yet so we won't extract the tuple out at this point (pass NULL). The "hairpin" redirect will redirect this back to a device where we can perform the tuple extraction & socket lookup. NOTE: At this point when passing the tuple in, I am making the assumption that it is no longer needed at these call points so ctx_redirect_to_proxy() is welcome to thrash it and use it as scratch buffer as it searches for the relevant sockets. As such it is deliberately not marked "const". If this assumption is invalid or reviewers have suggestions how to alleviate this assumption, please comment in the review. If you're reading this in the future and have hit an issue due to rewrites of the tuple in this function, then.... sorry? :-) Signed-off-by: Joe Stringer <joe@cilium.io>

In cases where the traffic is being processed in a BPF program at TC ingress, ie the traffic is already heading towards the host, we can implement the socket lookup & assign logic directly in the proxy redirect logic here. This applies, for instance, to traffic arriving from an endpoint on the lxc device, or from a tunnel (via tail call to the endpoint policy program). In these cases, we attempt to first locate any full sockets associated with the tuple, and if the connection is already established, assign the corresponding socket to the skb to notify the stack to pass the traffic into the socket. If the full socket cannot be found, then we fall back to listen sockets for the proxy port and assign that one. Signed-off-by: Joe Stringer <joe@cilium.io>

For stack packet receive ~reasons~, the ability to assign a socket to the context is only supported at TC ingress. As such, when for instance the host itself is sending traffic to endpoints on the same node, or when the egress proxy is sending traffic towards an endpoint that also has ingress L7 policy applied, we need to redirect the packet towards an ingress TC program before assigning the socket to the context. The process of redirecting the traffic towards an ingress qdisc was already present as this is necessary to actually present the traffic to the stack to send to the proxy; this commit extends that logic to perform the same logic as the previous commit for socket assign from the host ingress program. In the case of the host ingress program, we don't have the luxury of having already extracted the packet tuple for performing socket lookup, so in this case we are required to perform this extraction again. (*) (*) This was written pre-hostfw so there may be some room to perform better sharing of the implementation for this case now. Signed-off-by: Joe Stringer <joe@cilium.io>

This flag supports manually enabling or disabling the BPF-based tproxy functionality as supported on Linux 5.7 or later. If it is enabled, then the iptables rules which perform tproxy action will be disabled and the equivalent functionality will be enabled in BPF in the datapath. Some rules such as ones that configure the stack to ignore proxy flows and avoid connection tracking them via iptables are left in place in case the user still has iptables enabled in their environment. Signed-off-by: Joe Stringer <joe@cilium.io>

Default it to enabled for new v1.9 installs, v1.8 upgrades will disable. Signed-off-by: Joe Stringer <joe@cilium.io>

Add a test which deploys connectivity-check-proxy.yaml from the examples/kubernetes/connectivity-check/ dirs, to validate the various egress/ingress/dual proxy cases. Related: cilium#12714 Signed-off-by: Joe Stringer <joe@cilium.io>

The upstream BPF TPROXY support doesn't support REUSEPORT sockets at this time, so disable REUSEPORT if BPF TPROXY is enabled. Signed-off-by: Joe Stringer <joe@cilium.io>

When making use of eBPF TPROXY, avoid setting bits in the upper portions of the mark which would previously be used in non-eBPF TPROXY for determining the transparent socket port. These bits do not need to be set in this case, so leaving them unset helps to better integrate with other potential users of the mark. Suggested-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>

joestringer · 2020-08-26T16:11:16Z

test-me-please

joestringer · 2020-08-26T20:08:44Z

This PR has now passed all required CI checks and has green codeowners. It does not address the "upgrade issue" (#12288) but makes some extensions to the YAMLs following the existing code. In the absence of further feedback in that area I will merge this as-is and we can follow up separately in that relevant issue on how to properly handle that case.

bpf/include/bpf/helpers.h

bpf/bpf_lxc.c

bpf/lib/proxy.h

This comment has been minimized.

Sign in to view

maintainer-s-little-helper bot added the dont-merge/needs-release-note label May 1, 2020

joestringer force-pushed the submit/bpf-tproxy branch 3 times, most recently from 453394c to dab004c Compare May 7, 2020 21:11

This comment has been minimized.

Sign in to view

joestringer force-pushed the submit/bpf-tproxy branch from dab004c to d88b94a Compare May 8, 2020 23:10

This comment has been minimized.

Sign in to view

joestringer added the ci/fail-fast label May 8, 2020

joestringer force-pushed the submit/bpf-tproxy branch 4 times, most recently from cfc9a2f to 6858357 Compare May 15, 2020 00:02

This comment has been minimized.

Sign in to view

stale bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jun 14, 2020

joestringer removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jun 18, 2020

joestringer self-assigned this Jul 2, 2020

joestringer force-pushed the submit/bpf-tproxy branch from 6858357 to 4522290 Compare July 17, 2020 00:45

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 17, 2020

joestringer added the ci/net-next label Jul 17, 2020

This comment has been minimized.

Sign in to view

joestringer force-pushed the submit/bpf-tproxy branch from 4522290 to 4994c61 Compare August 4, 2020 00:57

joestringer added release-note/major This PR introduces major new functionality to Cilium. and removed ci/net-next dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 4, 2020

joestringer force-pushed the submit/bpf-tproxy branch from 4994c61 to 09263c7 Compare August 4, 2020 00:59

errordeveloper approved these changes Aug 19, 2020

View reviewed changes

joestringer added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Aug 25, 2020

joestringer mentioned this pull request Aug 25, 2020

Proxy redirect issue when running Cilium on top of Calico (CNI-Chaining) #12454

Closed

joestringer added 12 commits August 26, 2020 09:09

bpf: Use CB_PROXY_MAGIC to store proxy meta magic

ab258c8

Use the define here to improve readability and locating the code that changes specific offsets in the context metadata (eg skb->cb). The value is the same, this is a purely cosmetic change. Signed-off-by: Joe Stringer <joe@cilium.io>

monitor: Add notifications for socket failures

c8ad7d5

Add drop and debug notifications that will be used by upcoming commits to emit informational events from the datapath when BPF TPROXY is enabled. Signed-off-by: Joe Stringer <joe@cilium.io>

bpf: Add socket lookup,assign helpers

105b673

Define these helpers for use in upcoming BPF TPROXY commits. Signed-off-by: Joe Stringer <joe@cilium.io>

install: Add support to configure BPF tproxy

8a7e544

Default it to enabled for new v1.9 installs, v1.8 upgrades will disable. Signed-off-by: Joe Stringer <joe@cilium.io>

test: Add proxy datapaths test

3b1a5ea

Add a test which deploys connectivity-check-proxy.yaml from the examples/kubernetes/connectivity-check/ dirs, to validate the various egress/ingress/dual proxy cases. Related: cilium#12714 Signed-off-by: Joe Stringer <joe@cilium.io>

dnsproxy: Disable SO_REUSEPORT with bpf tproxy

718c164

The upstream BPF TPROXY support doesn't support REUSEPORT sockets at this time, so disable REUSEPORT if BPF TPROXY is enabled. Signed-off-by: Joe Stringer <joe@cilium.io>

joestringer force-pushed the submit/bpf-tproxy branch from 827f7b6 to 691449d Compare August 26, 2020 16:10

joestringer removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Aug 26, 2020

joestringer merged commit 88688c3 into cilium:master Aug 26, 2020

joestringer deleted the submit/bpf-tproxy branch August 26, 2020 20:08

borkmann reviewed Aug 26, 2020

View reviewed changes

bpf/include/bpf/helpers.h Show resolved Hide resolved

borkmann reviewed Aug 26, 2020

View reviewed changes

bpf/bpf_lxc.c Show resolved Hide resolved

borkmann reviewed Aug 26, 2020

View reviewed changes

bpf/lib/proxy.h Show resolved Hide resolved

borkmann reviewed Aug 26, 2020

View reviewed changes

bpf/lib/proxy.h Show resolved Hide resolved

borkmann mentioned this pull request Aug 26, 2020

Unable to upgrade from Cilium 1.8 to 1.9 without breaking compatibility #12288

Closed

joestringer mentioned this pull request Sep 2, 2020

Address code tidiness items for bpf tproxy #13050

Merged

joestringer mentioned this pull request Oct 12, 2020

Prepare for release v1.9.0-rc1 #13518

Merged

brb mentioned this pull request Mar 25, 2021

Add Wireguard support #15383

Merged

pchaigno mentioned this pull request Jan 10, 2022

bpf: Fix implicit cast for BPF TPROXY debug message #18429

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement proxy redirection logic in eBPF #11279

Implement proxy redirection logic in eBPF #11279

Uh oh!

joestringer commented May 1, 2020 •

edited

Loading

Uh oh!

This comment has been minimized.

coveralls commented May 1, 2020 •

edited

Loading

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

errordeveloper left a comment

Uh oh!

joestringer commented Aug 26, 2020

Uh oh!

joestringer commented Aug 26, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Implement proxy redirection logic in eBPF #11279

Implement proxy redirection logic in eBPF #11279

Uh oh!

Conversation

joestringer commented May 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

API impact

Implementation

Testing

Tasks

Uh oh!

This comment has been minimized.

coveralls commented May 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

errordeveloper left a comment

Choose a reason for hiding this comment

Uh oh!

joestringer commented Aug 26, 2020

Uh oh!

joestringer commented Aug 26, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

joestringer commented May 1, 2020 •

edited

Loading

coveralls commented May 1, 2020 •

edited

Loading