-
Notifications
You must be signed in to change notification settings - Fork 3.4k
bpf: Apply host firewall before NodePort SNATing #12011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpf: Apply host firewall before NodePort SNATing #12011
Conversation
We currently incorrectly enforce host policies after BPF-based NodePort SNATing (in to-netdev bpf_host section). Thus, at the time we enforce the host policies, a packet destined to a NodePort will have a source IP of the host. We may then drop that packet based on egress policies because we believe it comes from the host namespace. This commit fixes this, as well as the reverse path (reverse NAT). This change also allows us to simplify the code a bit by removing the skip_redirect checks. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Signed-off-by: Paul Chaignon <paul@cilium.io>
test-me-please |
@@ -755,6 +744,17 @@ handle_ipv4(struct __ctx_buff *ctx, __u32 secctx, bool from_host) | |||
if (!revalidate_data(ctx, &data, &data_end, &ip4)) | |||
return DROP_INVALID; | |||
|
|||
#ifdef ENABLE_HOST_FIREWALL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of a request destined to a remote backend (NodePort SNAT), we don't recircle back to handle_ipv{4,6}()
from CILIUM_CALL_IPV{4,6}_NODEPORT_NAT
, so this particular code move looks good.
In the case of a reply from the remote backend, this code will see a packet |src=backendPodIP|dst=SNATIP
. Is it OK, or it is expecting to see a real dst IP (the original client IP)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of a packet destined to remote backend, we want to see |src=clientIP|dst=backendPodIP|
. In the case of a reply from remote backend, we want to see |src=backendPodIP|dst=clientIP|
. This PR doesn't give us the correct dst
in second case and that is why I switched it to draft mode yesterday evening.
So ideally, I would like the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT on reply from remote backend. The first case is easy to get because DNAT and SNAT are separated in nodeport_lb{4,6}
and nodeport_nat_fwd
respectively. The second case, for replies, is harder to get because all reverse NAT operations seem to happen in nodeport_lb{4,6}
; they are not decoupled.
As a temporary workaround, I could maybe:
- Save the source IP, i.e.,
backendPodIP
. - Call
nodeport_lb{4,6}
. - Enforce host policies with the saved source IP instead of the packet's new source IP.
That would only be a temporary workaround, because as far as I know, once we want to support L7 policies, the host firewall will need to forward the proper reply packet to the proxy, i.e., a packet with |src=backendPodIP|dst=clientIP|
.
Does that make sense? Am I understanding this correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second case, for replies, is harder to get because all reverse NAT operations seem to happen in nodeport_lb{4,6}
Have you considered doing the host-fw validation for this case in CILIUM_CALL_IPV{4,6}_NODEPORT_REVNAT
? This is exactly the place in which a packet is already rev-SNAT xlated, but DNAT is going to be performed soon. Just one crux is that it's also invoked for a case when a local backend sends a reply.
Does that make sense?
Anyway, the proposed solution makes sense too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered doing the host-fw validation for this case in
CILIUM_CALL_IPV{4,6}_NODEPORT_REVNAT
? This is exactly the place in which a packet is already rev-SNAT xlated, but DNAT is going to be performed soon.
Not really; I'm a bit wary of mixing the NodePort and host firewall code.
But yeah, from a quick look, it seems that the approach you propose would allow us to get packets exactly in the state we need them. So it would look something like this for the !from_host
path:
#ifdef ENABLE_NODEPORT
if (ctx_get_xfer(ctx) != XFER_PKT_NO_SVC &&
!bpf_skip_nodeport(ctx)) {
ret = nodeport_lb4(ctx, secctx);
if (ret < 0)
return ret;
# ifdef ENABLE_HOST_FIREWALL
// AFAIU, this call is needed to handle packets
// for which we don't take the tail call in nodeport_lb4().
// We enforce host policies here to avoid enforcing twice
// for packets for which we did take the tail call in nodeport_lb4().
ret = ipv4_host_policy_ingress(ctx);
if (IS_ERR(ret))
return ret;
# endif
}
#elif defined(ENABLE_HOST_FIREWALL)
ret = ipv4_host_policy_ingress(ctx);
if (IS_ERR(ret))
return ret;
#endif
with another call to ipv4_host_policy_ingress()
in tail_rev_nodeport_lb4()
.
When BPF-based NodePort is enabled, we would never call ipv4_host_policy_ingress()
for packets on their way to a remote backend, but I guess that's fine since the host policies wouldn't apply to those anyway.
As for the verifier complexity impact, we would probably be fine in CILIUM_CALL_IPV{4,6}_FROM_LXC
because, AFAICS, the above code would results in a few less possible paths through the CFG than we currently have (because the ipv4_host_policy_ingress()
call is in the bpf_skip_nodeport()
condition).
The verifier complexity (and program sizes) will increase for CILIUM_CALL_IPV{4,6}_NODEPORT_NAT
however. Do you know if those programs are close to the limits currently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if those programs are close to the limits currently?
I haven't seen the verifier complaying about this section, so I assume we should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would work too?
#ifdef ENABLE_NODEPORT
if (ctx_get_xfer(ctx) != XFER_PKT_NO_SVC &&
!bpf_skip_nodeport(ctx)) {
ret = nodeport_lb4(ctx, secctx);
if (ret < 0)
return ret;
}
#endif
#if defined(ENABLE_HOST_FIREWALL)
ret = ipv4_host_policy_ingress(ctx);
if (IS_ERR(ret))
return ret;
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, I'm not sure what I had in mind. We will not hit the host policy enforcement twice since in the first case we take the tail call and don't come back to _FROM_LXC
👍
(!defined(ENABLE_DSR) || \ | ||
(defined(ENABLE_DSR) && defined(ENABLE_DSR_HYBRID))) | ||
if ((ctx->mark & MARK_MAGIC_SNAT_DONE) != MARK_MAGIC_SNAT_DONE) { | ||
ret = nodeport_nat_fwd(ctx, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, the host fw will see the pod IP as src, as BPF masquerade can happen in nodeport_nat_fwd
.
We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>
Closing in favor of #12345. |
We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. Packets on their way to a remote backend will be redirect by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should, because they won't have the mark indicating they come from the host namespace. See subsequent commit in this serie where we introduce the check on the mark in to-netdev. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>
We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. Packets on their way to a remote backend will be redirect by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should, because they won't have the mark indicating they come from the host namespace. See subsequent commit in this serie where we introduce the check on the mark in to-netdev. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>
We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. Packets on their way to a remote backend will be redirect by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should, because they won't have the mark indicating they come from the host namespace. See subsequent commit in this serie where we introduce the check on the mark in to-netdev. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 56a80e7 ] We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. Packets on their way to a remote backend will be redirect by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should, because they won't have the mark indicating they come from the host namespace. See subsequent commit in this serie where we introduce the check on the mark in to-netdev. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 56a80e7 ] We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. Packets on their way to a remote backend will be redirect by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should, because they won't have the mark indicating they come from the host namespace. See subsequent commit in this serie where we introduce the check on the mark in to-netdev. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 56a80e7 ] We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. Packets on their way to a remote backend will be redirect by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should, because they won't have the mark indicating they come from the host namespace. See subsequent commit in this serie where we introduce the check on the mark in to-netdev. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>
We currently incorrectly enforce host policies after BPF-based NodePort SNATing (in
to-netdev
bpf_host
section). Thus, at the time we enforce the host policies, a packet destined to a NodePort will have a source IP of the host. We may then drop that packet based on egress policies because we believe it comes from the host namespace.For example, in the following example, packet (2) might get dropped due to the egress policies of Worker Node A because the packet now has source IP

1.1.1.1
.This commit fixes this, as well as the reverse path (reverse NAT). This change also allows us to simplify the code a bit by removing the
skip_redirect
checks.Fixes: #11507
Updates: #11799
/cc @borkmann You might need to rebase #11977 to include this commit instead of the previous partial fix you cherry-picked.