Skip to content

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented May 1, 2020

Summary

Introduce a new implementation of the proxy redirect handling in the eBPF datapath using a combination of socket lookups and socket assign (+ packet redirects where necessary to ensure that socket assign occurs at TC ingress).

This functionality is disabled by default in the daemon, but enabled in new v1.9 deployments via Helm. When enabled on Linux 5.7 or later, it removes use of the -j TPROXY target, and removes Cilium's usage of the upper 16 bits of the mark for the purpose of sharing the tproxy target port with iptables rules.

API impact

Three new monitor error codes are added for cases where the redirect cannot be applied, meaning that traffic subject to L7 policy will be dropped if there is a configuration or lookup issue:

  • DROP_PROXY_LOOKUP_FAILED
    • Lookup for the target proxy socket failed. This will trigger drop when the datapath cannot find any establish proxy connection for this tuple AND the datapath cannot find a TPROXY listening socket corresponding to the redirect port in the eBPF policy map. If the proxy is not available for any reason then this could be a reason to see this error.
  • DROP_PROXY_SET_FAILED
    • Failed to assign the socket to the packet context. Typically programmer error but may trigger if the socket which was found previously is unavailable or has incompatible options set (eg SO_REUSEPORT). In the event of this error, the default monitor logs will additionally provide an errno to determine which subfailure occurred.
  • DROP_PROXY_UNKNOWN_PROTO
    • Safety catchall in case we attempt to redirect traffic other than TCP/UDP to a socket. As of Linux 5.7, no other L4 protocols are supported for tproxy redirection.

Implementation

Previously, in the eBPF datapath during policy apply and redirect processing, we have relied purely on the tproxy port which is available in the policymap to prepare the packet for TPROXY redirection. Mechanically this path involved performing a policy lookup using the src/dst identity, then in the policymap entry, using non-zero redirect port as a signal for redirect. This port would then be set into the ctx->mark along with some bits to identify this traffic as "needs proxy redirection". Then the packet would be passed up the stack towards an iptables PREROUTING rule which would match on the mark then perform a socket lookup and assign the socket to the skb.

The first step of the new implementation is to ensure that the relevant forwarding logic has access to the full packet tuple. There is one fairly significant condition for applying the socket to the skb: It must be done at TC ingress. In many cases, Cilium already applies all logic at TC ingress eBPF programs so this isn't too difficult. For endpoint-routes mode however we apply pod ingress policy at the TC egress hook point, and therefore must first hairpin to another device, then extract the tuple & perform the socket lookups to assign the socket to the skb.

With this in place, the actual lookup & assign is relatively straightforward. First we must look for any established connections corresponding to the tuple in the local network namespace.These represent existing proxy connections that were already established. If we find the socket here, we assign it to the skb & pass up the stack. If we don't find it, then we zero out the tuple and set the TPROXY redirect port from the policy into the tuple and perform a second lookup for listen sockets. This should provide the parent listener socket associated with the proxy which we can assign to establish a new connection in the proxy.

For now, this PR introduces dedicated tuple extraction logic in bpf_host.c for the endpoint routes mode case rather than sharing it with host firewall, primarily because it predates host firewall but also because the tuple extraction needs to be performed if eBPF tproxy is enabled but host firewall is not. A future cleanup could look at shifting tuple extraction into common code which would be enabled by either path and shared. Further optimizations could be explored to perhaps avoid further packet touching in this prog by sharing some of that metadata from the previous program that was run at TC egress of the endpoint subject to L7 policy.

This PR replaces the per-redirect iptables rule logic but not the static -m socket logic as this requires further exploration to determine the simplest path forward. Fortunately, the newly added connectivity checks for proxy paths makes it easy to reproduce failures on these paths to allow investigation and experimentation.

Testing

The recently-added examples/kubernetes/connectivity-check-proxy.yaml is now added to the main CI for testing various proxy paths to prevent regressions. Additionally I have manually deployed this YAML for debugging in a GKE environment (default instructions, endpoint-routes mode).

The BPF checks / checkpatch (pull_request) GitHub action job doesn't seem to like my use of macros to avoid duplicating IPv4/IPv6 implementations, but I've looked through the complaints and AFAICT they're all safe to ignore.

Tasks

  • Daemon logic to override enable/disable
    • Iptables compatibility mode that still configures rules for compatibility with existing rulesets (eg kube-proxy)
  • IPv6 support
  • Endpoint-routes mode validation
  • CI tests to validate the new paths
  • Issue description for reviewers :-)

Deferred to follow up PRs

I marked this 'release-note/major' with the (ambitious?) goal of it being listed with other PRs as part of iptables removal for v1.9, but ideally it should have no user-facing impact :-) Happy to move it to /minor if we think this is not v1.9 release higlight worthy.

Related: #9921

@maintainer-s-little-helper

This comment has been minimized.

@coveralls
Copy link

coveralls commented May 1, 2020

Coverage Status

Coverage decreased (-0.003%) to 37.134% when pulling 49693ef219db40ca8eae710c11a368afc74cf5bd on joestringer:submit/bpf-tproxy into bd89e83 on cilium:master.

@joestringer joestringer force-pushed the submit/bpf-tproxy branch 3 times, most recently from 453394c to dab004c Compare May 7, 2020 21:11
@joestringer

This comment has been minimized.

@joestringer joestringer force-pushed the submit/bpf-tproxy branch from dab004c to d88b94a Compare May 8, 2020 23:10
@joestringer

This comment has been minimized.

@joestringer joestringer force-pushed the submit/bpf-tproxy branch 4 times, most recently from cfc9a2f to 6858357 Compare May 15, 2020 00:02
@joestringer

This comment has been minimized.

@stale

This comment has been minimized.

@stale stale bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jun 14, 2020
@joestringer joestringer removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jun 18, 2020
@joestringer joestringer self-assigned this Jul 2, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 17, 2020
@joestringer

This comment has been minimized.

@joestringer joestringer added release-note/major This PR introduces major new functionality to Cilium. and removed ci/net-next dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 4, 2020
Copy link
Contributor

@errordeveloper errordeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLI and Helm changes LGTM.

@joestringer joestringer added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Aug 25, 2020
Use the define here to improve readability and locating the code that
changes specific offsets in the context metadata (eg skb->cb). The value
is the same, this is a purely cosmetic change.

Signed-off-by: Joe Stringer <joe@cilium.io>
When hairpinning, it's useful to have two datapoints for the proxy
redirect: One in the original program and one in the program that
actually prepares the context to send the packet to the proxy. The
ctx_redirect_to_proxy_hairpin one should occur in the original program,
so change it to DBG_CAPTURE_PROXY_PRE. The corresponding
DBG_CAPTURE_PROXY_POST should be handled in the "to-host" program later.

Signed-off-by: Joe Stringer <joe@cilium.io>
Add drop and debug notifications that will be used by upcoming commits
to emit informational events from the datapath when BPF TPROXY is
enabled.

Signed-off-by: Joe Stringer <joe@cilium.io>
Define these helpers for use in upcoming BPF TPROXY commits.

Signed-off-by: Joe Stringer <joe@cilium.io>
In preparation for implementing BPF TPROXY in ctx_redirect_to_proxy(),
pass the tuple into this function from all of its callers. In this
commit it will remain unused but in future commits this will begin to be
used in particular cases to perform a socket lookup in the callee.

When endpoint-routes mode is enabled, we're executing on TC egress which
can't make use of the tuple yet so we won't extract the tuple out at
this point (pass NULL). The "hairpin" redirect will redirect this back
to a device where we can perform the tuple extraction & socket lookup.

NOTE: At this point when passing the tuple in, I am making the
assumption that it is no longer needed at these call points so
ctx_redirect_to_proxy() is welcome to thrash it and use it as scratch
buffer as it searches for the relevant sockets. As such it is
deliberately not marked "const". If this assumption is invalid or
reviewers have suggestions how to alleviate this assumption, please
comment in the review.

If you're reading this in the future and have hit an issue due to
rewrites of the tuple in this function, then.... sorry? :-)

Signed-off-by: Joe Stringer <joe@cilium.io>
In cases where the traffic is being processed in a BPF program at TC
ingress, ie the traffic is already heading towards the host, we can
implement the socket lookup & assign logic directly in the proxy
redirect logic here. This applies, for instance, to traffic arriving
from an endpoint on the lxc device, or from a tunnel (via tail call to
the endpoint policy program).

In these cases, we attempt to first locate any full sockets
associated with the tuple, and if the connection is already established,
assign the corresponding socket to the skb to notify the stack to pass
the traffic into the socket. If the full socket cannot be found, then we
fall back to listen sockets for the proxy port and assign that one.

Signed-off-by: Joe Stringer <joe@cilium.io>
For stack packet receive ~reasons~, the ability to assign a socket to
the context is only supported at TC ingress. As such, when for instance
the host itself is sending traffic to endpoints on the same node, or
when the egress proxy is sending traffic towards an endpoint that also
has ingress L7 policy applied, we need to redirect the packet towards an
ingress TC program before assigning the socket to the context.

The process of redirecting the traffic towards an ingress qdisc was
already present as this is necessary to actually present the traffic to
the stack to send to the proxy; this commit extends that logic to
perform the same logic as the previous commit for socket assign from the
host ingress program.

In the case of the host ingress program, we don't have the luxury of
having already extracted the packet tuple for performing socket lookup,
so in this case we are required to perform this extraction again. (*)

(*) This was written pre-hostfw so there may be some room to perform
better sharing of the implementation for this case now.

Signed-off-by: Joe Stringer <joe@cilium.io>
This flag supports manually enabling or disabling the BPF-based tproxy
functionality as supported on Linux 5.7 or later. If it is enabled, then
the iptables rules which perform tproxy action will be disabled and the
equivalent functionality will be enabled in BPF in the datapath. Some
rules such as ones that configure the stack to ignore proxy flows and
avoid connection tracking them via iptables are left in place in case
the user still has iptables enabled in their environment.

Signed-off-by: Joe Stringer <joe@cilium.io>
Default it to enabled for new v1.9 installs, v1.8 upgrades will disable.

Signed-off-by: Joe Stringer <joe@cilium.io>
Add a test which deploys connectivity-check-proxy.yaml from the
examples/kubernetes/connectivity-check/ dirs, to validate the various
egress/ingress/dual proxy cases.

Related: cilium#12714

Signed-off-by: Joe Stringer <joe@cilium.io>
The upstream BPF TPROXY support doesn't support REUSEPORT sockets at
this time, so disable REUSEPORT if BPF TPROXY is enabled.

Signed-off-by: Joe Stringer <joe@cilium.io>
When making use of eBPF TPROXY, avoid setting bits in the upper portions
of the mark which would previously be used in non-eBPF TPROXY for
determining the transparent socket port. These bits do not need to be
set in this case, so leaving them unset helps to better integrate with
other potential users of the mark.

Suggested-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer joestringer removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Aug 26, 2020
@joestringer
Copy link
Member Author

test-me-please

@joestringer
Copy link
Member Author

This PR has now passed all required CI checks and has green codeowners. It does not address the "upgrade issue" (#12288) but makes some extensions to the YAMLs following the existing code. In the absence of further feedback in that area I will merge this as-is and we can follow up separately in that relevant issue on how to properly handle that case.

@joestringer joestringer merged commit 88688c3 into cilium:master Aug 26, 2020
@joestringer joestringer deleted the submit/bpf-tproxy branch August 26, 2020 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/major This PR introduces major new functionality to Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants