Implement proxy redirection logic in eBPF #11279
Conversation
Coverage decreased (-0.003%) to 37.134% when pulling 49693ef219db40ca8eae710c11a368afc74cf5bd on joestringer:submit/bpf-tproxy into bd89e83 on cilium:master.
CLI and Helm changes LGTM.
Use the define here to improve readability and to make it easier to locate the code that changes specific offsets in the context metadata (e.g. skb->cb). The value is the same; this is a purely cosmetic change. Signed-off-by: Joe Stringer <joe@cilium.io>
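A minimal before/after illustration of the pattern (the define name `CB_PROXY_PORT` and the offset are hypothetical, chosen only to show the idea):

```c
/* Hypothetical example: the value is identical, only the name changes. */
#define CB_PROXY_PORT 1

ctx->cb[1] = proxy_port;             /* before: bare offset into skb->cb */
ctx->cb[CB_PROXY_PORT] = proxy_port; /* after: named and greppable       */
```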
When hairpinning, it's useful to have two data points for the proxy redirect: one in the original program and one in the program that actually prepares the context to send the packet to the proxy. The ctx_redirect_to_proxy_hairpin one should occur in the original program, so change it to DBG_CAPTURE_PROXY_PRE. The corresponding DBG_CAPTURE_PROXY_POST should be handled in the "to-host" program later. Signed-off-by: Joe Stringer <joe@cilium.io>
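Roughly, the two capture points then bracket the hairpin (the helper call follows Cilium's `dbg.h` conventions, but treat the exact signature as an assumption):

```c
/* In the original program, before hairpinning to the other device: */
cilium_dbg_capture(ctx, DBG_CAPTURE_PROXY_PRE, proxy_port);

/* Later, in the "to-host" program that prepares the context: */
cilium_dbg_capture(ctx, DBG_CAPTURE_PROXY_POST, proxy_port);
```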
Add drop and debug notifications that will be used by upcoming commits to emit informational events from the datapath when BPF TPROXY is enabled. Signed-off-by: Joe Stringer <joe@cilium.io>
Define these helpers for use in upcoming BPF TPROXY commits. Signed-off-by: Joe Stringer <joe@cilium.io>
In preparation for implementing BPF TPROXY in ctx_redirect_to_proxy(), pass the tuple into this function from all of its callers. In this commit it will remain unused, but in future commits it will begin to be used in particular cases to perform a socket lookup in the callee. When endpoint-routes mode is enabled, we're executing on TC egress, which can't make use of the tuple yet, so we won't extract the tuple at this point (pass NULL). The "hairpin" redirect will redirect this back to a device where we can perform the tuple extraction & socket lookup. NOTE: At this point when passing the tuple in, I am making the assumption that it is no longer needed at these call points, so ctx_redirect_to_proxy() is welcome to thrash it and use it as a scratch buffer as it searches for the relevant sockets. As such it is deliberately not marked "const". If this assumption is invalid or reviewers have suggestions on how to alleviate this assumption, please comment in the review. If you're reading this in the future and have hit an issue due to rewrites of the tuple in this function, then.... sorry? :-) Signed-off-by: Joe Stringer <joe@cilium.io>
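A sketch of the resulting shape of the function (the parameter list is illustrative, not the exact one in the tree); note the tuple is deliberately not `const`:

```c
/* Callers pass the tuple in; the callee may reuse it as scratch space
 * while searching for the relevant sockets, so it is not const. */
static __always_inline int
ctx_redirect_to_proxy(struct __ctx_buff *ctx, struct ipv4_ct_tuple *tuple,
		      __be16 proxy_port);
```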
In cases where the traffic is being processed in a BPF program at TC ingress, i.e. the traffic is already heading towards the host, we can implement the socket lookup & assign logic directly in the proxy redirect logic here. This applies, for instance, to traffic arriving from an endpoint on the lxc device, or from a tunnel (via tail call to the endpoint policy program). In these cases, we first attempt to locate any full sockets associated with the tuple, and if the connection is already established, assign the corresponding socket to the skb to notify the stack to pass the traffic into the socket. If the full socket cannot be found, then we fall back to listen sockets for the proxy port and assign that one. Signed-off-by: Joe Stringer <joe@cilium.io>
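A minimal sketch of the first half of that flow for IPv4/TCP at TC ingress, using the upstream `bpf_sk_lookup_tcp()`/`bpf_sk_assign()` helpers (function and variable names here are illustrative, not Cilium's exact code):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Try to assign an established socket matching the tuple; returns 0 on
 * success so the caller can skip the listener fallback. */
static __always_inline int
sk_assign_established4(struct __sk_buff *ctx, struct bpf_sock_tuple *tuple)
{
	struct bpf_sock *sk;

	/* Full-socket lookup in the current network namespace. */
	sk = bpf_sk_lookup_tcp(ctx, tuple, sizeof(tuple->ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return -1;	/* Fall back to the proxy listener lookup. */

	/* Tell the stack to deliver this skb into the found socket. */
	if (bpf_sk_assign(ctx, sk, 0) < 0) {
		bpf_sk_release(sk);
		return -1;	/* Would surface as DROP_PROXY_SET_FAILED. */
	}
	bpf_sk_release(sk);
	return 0;
}
```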
For stack packet receive ~reasons~, the ability to assign a socket to the context is only supported at TC ingress. As such, when, for instance, the host itself is sending traffic to endpoints on the same node, or when the egress proxy is sending traffic towards an endpoint that also has ingress L7 policy applied, we need to redirect the packet towards an ingress TC program before assigning the socket to the context. The process of redirecting the traffic towards an ingress qdisc was already present, as this is necessary to actually present the traffic to the stack to send to the proxy; this commit extends that logic to perform the same socket lookup & assign as the previous commit from the host ingress program. In the case of the host ingress program, we don't have the luxury of having already extracted the packet tuple for performing the socket lookup, so in this case we are required to perform this extraction again. (*) (*) This was written pre-hostfw, so there may be some room for better sharing of the implementation for this case now. Signed-off-by: Joe Stringer <joe@cilium.io>
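The redirect itself can be as simple as the upstream `bpf_redirect()` helper with `BPF_F_INGRESS` (a sketch; `HOST_IFINDEX` here stands in for whichever device hosts the ingress program):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#ifndef HOST_IFINDEX
#define HOST_IFINDEX 1	/* illustrative: ifindex of the target device */
#endif

/* Bounce the packet to the ingress (receive) path of the host device,
 * where a TC ingress program can legally call bpf_sk_assign(). */
static __always_inline int redirect_to_host_ingress(void)
{
	return bpf_redirect(HOST_IFINDEX, BPF_F_INGRESS);
}
```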
This flag supports manually enabling or disabling the BPF-based tproxy functionality as supported on Linux 5.7 or later. If it is enabled, then the iptables rules which perform tproxy action will be disabled and the equivalent functionality will be enabled in BPF in the datapath. Some rules such as ones that configure the stack to ignore proxy flows and avoid connection tracking them via iptables are left in place in case the user still has iptables enabled in their environment. Signed-off-by: Joe Stringer <joe@cilium.io>
Default it to enabled for new v1.9 installs; v1.8 upgrades will disable it. Signed-off-by: Joe Stringer <joe@cilium.io>
Add a test which deploys connectivity-check-proxy.yaml from the examples/kubernetes/connectivity-check/ directory, to validate the various egress/ingress/dual proxy cases. Related: cilium#12714 Signed-off-by: Joe Stringer <joe@cilium.io>
The upstream BPF TPROXY support doesn't support REUSEPORT sockets at this time, so disable REUSEPORT if BPF TPROXY is enabled. Signed-off-by: Joe Stringer <joe@cilium.io>
When making use of eBPF TPROXY, avoid setting bits in the upper portion of the mark that were previously used by non-eBPF TPROXY to determine the transparent socket port. These bits do not need to be set in this case, so leaving them unset helps to better integrate with other potential users of the mark. Suggested-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
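Conceptually (the constant name approximates Cilium's mark "magic" values; treat the exact layout as an assumption):

```c
/* iptables TPROXY: the proxy port is carried in the upper 16 bits of
 * the mark so the -j TPROXY rule can find it. */
ctx->mark = MARK_MAGIC_TO_PROXY | (__u32)proxy_port << 16;

/* eBPF TPROXY: the socket is already assigned in BPF, so the upper
 * bits stay clear for other users of the mark. */
ctx->mark = MARK_MAGIC_TO_PROXY;
```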
test-me-please
This PR has now passed all required CI checks and has green codeowners. It does not address the "upgrade issue" (#12288) but makes some extensions to the YAMLs following the existing code. In the absence of further feedback in that area I will merge this as-is and we can follow up separately in that relevant issue on how to properly handle that case.
Summary
Introduce a new implementation of the proxy redirect handling in the eBPF datapath using a combination of socket lookups and socket assign (+ packet redirects where necessary to ensure that socket assign occurs at TC ingress).
This functionality is disabled by default in the daemon, but enabled in new v1.9 deployments via Helm. When enabled on Linux 5.7 or later, it removes use of the `-j TPROXY` target, and removes Cilium's usage of the upper 16 bits of the mark for the purpose of sharing the tproxy target port with iptables rules.

API impact
Three new monitor error codes are added for cases where the redirect cannot be applied, meaning that traffic subject to L7 policy will be dropped if there is a configuration or lookup issue:

- `DROP_PROXY_LOOKUP_FAILED`
- `DROP_PROXY_SET_FAILED` (which can occur, for example, if the target socket uses `SO_REUSEPORT`, which upstream BPF TPROXY does not support). In the event of this error, the default monitor logs will additionally provide an errno to determine which subfailure occurred.
- `DROP_PROXY_UNKNOWN_PROTO`
Implementation
Previously, in the eBPF datapath during policy apply and redirect processing, we have relied purely on the tproxy port which is available in the policymap to prepare the packet for TPROXY redirection. Mechanically, this path involved performing a policy lookup using the src/dst identity, then, in the policymap entry, using a non-zero redirect port as a signal for redirect. This port would then be set into the `ctx->mark` along with some bits to identify this traffic as "needs proxy redirection". Then the packet would be passed up the stack towards an iptables PREROUTING rule, which would match on the mark, then perform a socket lookup and assign the socket to the skb.

The first step of the new implementation is to ensure that the relevant forwarding logic has access to the full packet tuple. There is one fairly significant condition for applying the socket to the skb: it must be done at TC ingress. In many cases, Cilium already applies all logic in TC ingress eBPF programs, so this isn't too difficult. For `endpoint-routes` mode, however, we apply pod ingress policy at the TC egress hook point, and therefore must first hairpin to another device, then extract the tuple and perform the socket lookups to assign the socket to the skb.

With this in place, the actual lookup & assign is relatively straightforward. First we must look for any established connections corresponding to the tuple in the local network namespace. These represent existing proxy connections that were already established. If we find the socket here, we assign it to the skb and pass it up the stack. If we don't find it, then we zero out the tuple, set the TPROXY redirect port from the policy into the tuple, and perform a second lookup for listen sockets. This should provide the parent listener socket associated with the proxy, which we can assign to establish a new connection in the proxy.
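A sketch of that second, listener-side lookup under the same assumptions as the earlier sketch (names illustrative; the tuple is reused as scratch space):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Fall back to the proxy's listening socket: zero the tuple and set only
 * the destination port to the proxy port from the policy entry. */
static __always_inline int
sk_assign_listener4(struct __sk_buff *ctx, struct bpf_sock_tuple *tuple,
		    __be16 proxy_port)
{
	struct bpf_sock *sk;

	__builtin_memset(&tuple->ipv4, 0, sizeof(tuple->ipv4));
	tuple->ipv4.dport = proxy_port;

	sk = bpf_sk_lookup_tcp(ctx, tuple, sizeof(tuple->ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return -1;	/* Would surface as DROP_PROXY_LOOKUP_FAILED. */
	if (bpf_sk_assign(ctx, sk, 0) < 0) {
		bpf_sk_release(sk);
		return -1;
	}
	bpf_sk_release(sk);
	return 0;
}
```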
For now, this PR introduces dedicated tuple extraction logic in `bpf_host.c` for the endpoint-routes mode case rather than sharing it with the host firewall, primarily because it predates the host firewall, but also because the tuple extraction needs to be performed if eBPF tproxy is enabled but the host firewall is not (a sketch of this extraction follows below). A future cleanup could look at shifting tuple extraction into common code which would be enabled by either path and shared. Further optimizations could be explored to perhaps avoid further packet touching in this program by sharing some of that metadata from the previous program that ran at TC egress of the endpoint subject to L7 policy.

This PR replaces the per-redirect iptables rule logic, but not the static `-m socket` logic, as the latter requires further exploration to determine the simplest path forward. Fortunately, the newly added connectivity checks for proxy paths make it easy to reproduce failures on these paths to allow investigation and experimentation.
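For reference, the re-extraction mentioned above amounts to something like the following for the IPv4/TCP case (a simplified sketch that ignores IP options and IPv6; the actual `bpf_host.c` code handles more cases):

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>

/* Re-extract the 4-tuple from the packet at host ingress, with the
 * bounds checks the verifier requires. */
static __always_inline int
extract_tuple4(struct __sk_buff *ctx, struct bpf_sock_tuple *tuple)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *ip4;
	struct tcphdr *tcp;

	if (data + sizeof(*eth) + sizeof(*ip4) > data_end)
		return -1;
	ip4 = (void *)(eth + 1);
	if (ip4->protocol != IPPROTO_TCP)
		return -1;	/* Would surface as DROP_PROXY_UNKNOWN_PROTO. */
	if (ip4->ihl != 5)	/* keep the sketch simple: no IP options */
		return -1;
	tcp = (void *)(ip4 + 1);
	if ((void *)(tcp + 1) > data_end)
		return -1;

	tuple->ipv4.saddr = ip4->saddr;
	tuple->ipv4.daddr = ip4->daddr;
	tuple->ipv4.sport = tcp->source;
	tuple->ipv4.dport = tcp->dest;
	return 0;
}
```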
Testing

The recently-added `examples/kubernetes/connectivity-check-proxy.yaml` is now added to the main CI for testing various proxy paths to prevent regressions. Additionally, I have manually deployed this YAML for debugging in a GKE environment (default instructions, `endpoint-routes` mode).

The `BPF checks / checkpatch (pull_request)` GitHub Actions job doesn't seem to like my use of macros to avoid duplicating the IPv4/IPv6 implementations, but I've looked through the complaints and AFAICT they're all safe to ignore.

Tasks
Deferred to follow-up PRs

- `-m socket` rules - I looked briefly into this; it should be doable in one of a couple of ways, but it's more involved, so I figured it's better to get the basics in first.

I marked this 'release-note/major' with the (ambitious?) goal of it being listed with other PRs as part of iptables removal for v1.9, but ideally it should have no user-facing impact :-) Happy to move it to /minor if we think this is not v1.9 release highlight worthy.
Related: #9921