bpf: Apply host firewall before NodePort SNATing #12011

pchaigno · 2020-06-10T18:01:32Z

We currently incorrectly enforce host policies after BPF-based NodePort SNATing (in to-netdev bpf_host section). Thus, at the time we enforce the host policies, a packet destined to a NodePort will have a source IP of the host. We may then drop that packet based on egress policies because we believe it comes from the host namespace.

For example, in the following example, packet (2) might get dropped due to the egress policies of Worker Node A because the packet now has source IP 1.1.1.1.

This commit fixes this, as well as the reverse path (reverse NAT). This change also allows us to simplify the code a bit by removing the skip_redirect checks.

Fixes: #11507
Updates: #11799
/cc @borkmann You might need to rebase #11977 to include this commit instead of the previous partial fix you cherry-picked.

We currently incorrectly enforce host policies after BPF-based NodePort SNATing (in to-netdev bpf_host section). Thus, at the time we enforce the host policies, a packet destined to a NodePort will have a source IP of the host. We may then drop that packet based on egress policies because we believe it comes from the host namespace. This commit fixes this, as well as the reverse path (reverse NAT). This change also allows us to simplify the code a bit by removing the skip_redirect checks. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Signed-off-by: Paul Chaignon <paul@cilium.io>

pchaigno · 2020-06-10T18:52:12Z

test-me-please

coveralls · 2020-06-10T18:53:39Z

Coverage increased (+0.004%) to 37.05% when pulling 8ad986a on pr/pchaigno/apply-host-fw-before-nodeport-snatting into 8a02976 on master.

brb · 2020-06-11T08:27:02Z

bpf/bpf_host.c

@@ -755,6 +744,17 @@ handle_ipv4(struct __ctx_buff *ctx, __u32 secctx, bool from_host)
 	if (!revalidate_data(ctx, &data, &data_end, &ip4))
 		return DROP_INVALID;

+#ifdef ENABLE_HOST_FIREWALL


In the case of a request destined to a remote backend (NodePort SNAT), we don't recircle back to handle_ipv{4,6}() from CILIUM_CALL_IPV{4,6}_NODEPORT_NAT, so this particular code move looks good.

In the case of a reply from the remote backend, this code will see a packet |src=backendPodIP|dst=SNATIP. Is it OK, or it is expecting to see a real dst IP (the original client IP)?

In the case of a packet destined to remote backend, we want to see |src=clientIP|dst=backendPodIP|. In the case of a reply from remote backend, we want to see |src=backendPodIP|dst=clientIP|. This PR doesn't give us the correct dst in second case and that is why I switched it to draft mode yesterday evening.

So ideally, I would like the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT on reply from remote backend. The first case is easy to get because DNAT and SNAT are separated in nodeport_lb{4,6} and nodeport_nat_fwd respectively. The second case, for replies, is harder to get because all reverse NAT operations seem to happen in nodeport_lb{4,6}; they are not decoupled.

As a temporary workaround, I could maybe:

Save the source IP, i.e., backendPodIP.

Call nodeport_lb{4,6}.

Enforce host policies with the saved source IP instead of the packet's new source IP.

That would only be a temporary workaround, because as far as I know, once we want to support L7 policies, the host firewall will need to forward the proper reply packet to the proxy, i.e., a packet with |src=backendPodIP|dst=clientIP|.

Does that make sense? Am I understanding this correctly?

The second case, for replies, is harder to get because all reverse NAT operations seem to happen in nodeport_lb{4,6}

Have you considered doing the host-fw validation for this case in CILIUM_CALL_IPV{4,6}_NODEPORT_REVNAT? This is exactly the place in which a packet is already rev-SNAT xlated, but DNAT is going to be performed soon. Just one crux is that it's also invoked for a case when a local backend sends a reply.

Does that make sense?

Anyway, the proposed solution makes sense too.

Have you considered doing the host-fw validation for this case in CILIUM_CALL_IPV{4,6}_NODEPORT_REVNAT? This is exactly the place in which a packet is already rev-SNAT xlated, but DNAT is going to be performed soon.

Not really; I'm a bit wary of mixing the NodePort and host firewall code.
But yeah, from a quick look, it seems that the approach you propose would allow us to get packets exactly in the state we need them. So it would look something like this for the !from_host path:

#ifdef ENABLE_NODEPORT if (ctx_get_xfer(ctx) != XFER_PKT_NO_SVC && !bpf_skip_nodeport(ctx)) { ret = nodeport_lb4(ctx, secctx); if (ret < 0) return ret; # ifdef ENABLE_HOST_FIREWALL // AFAIU, this call is needed to handle packets // for which we don't take the tail call in nodeport_lb4(). // We enforce host policies here to avoid enforcing twice // for packets for which we did take the tail call in nodeport_lb4(). ret = ipv4_host_policy_ingress(ctx); if (IS_ERR(ret)) return ret; # endif } #elif defined(ENABLE_HOST_FIREWALL) ret = ipv4_host_policy_ingress(ctx); if (IS_ERR(ret)) return ret; #endif

with another call to ipv4_host_policy_ingress() in tail_rev_nodeport_lb4().

When BPF-based NodePort is enabled, we would never call ipv4_host_policy_ingress() for packets on their way to a remote backend, but I guess that's fine since the host policies wouldn't apply to those anyway.

As for the verifier complexity impact, we would probably be fine in CILIUM_CALL_IPV{4,6}_FROM_LXC because, AFAICS, the above code would results in a few less possible paths through the CFG than we currently have (because the ipv4_host_policy_ingress() call is in the bpf_skip_nodeport() condition).

The verifier complexity (and program sizes) will increase for CILIUM_CALL_IPV{4,6}_NODEPORT_NAT however. Do you know if those programs are close to the limits currently?

Do you know if those programs are close to the limits currently?

I haven't seen the verifier complaying about this section, so I assume we should be fine.

I think this would work too?

#ifdef ENABLE_NODEPORT if (ctx_get_xfer(ctx) != XFER_PKT_NO_SVC && !bpf_skip_nodeport(ctx)) { ret = nodeport_lb4(ctx, secctx); if (ret < 0) return ret; } #endif #if defined(ENABLE_HOST_FIREWALL) ret = ipv4_host_policy_ingress(ctx); if (IS_ERR(ret)) return ret; #endif

Ah right, I'm not sure what I had in mind. We will not hit the host policy enforcement twice since in the first case we take the tail call and don't come back to _FROM_LXC 👍

brb · 2020-06-11T08:34:38Z

bpf/bpf_host.c

+	(!defined(ENABLE_DSR) || \
+	 (defined(ENABLE_DSR) && defined(ENABLE_DSR_HYBRID)))
+	if ((ctx->mark & MARK_MAGIC_SNAT_DONE) != MARK_MAGIC_SNAT_DONE) {
+		ret = nodeport_nat_fwd(ctx, false);


In this case, the host fw will see the pod IP as src, as BPF masquerade can happen in nodeport_nat_fwd.

We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>

pchaigno · 2020-07-06T06:25:07Z

Closing in favor of #12345.

We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. Packets on their way to a remote backend will be redirect by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should, because they won't have the mark indicating they come from the host namespace. See subsequent commit in this serie where we introduce the check on the mark in to-netdev. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>

[ upstream commit 56a80e7 ] We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. Packets on their way to a remote backend will be redirect by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should, because they won't have the mark indicating they come from the host namespace. See subsequent commit in this serie where we introduce the check on the mark in to-netdev. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>

[ upstream commit 56a80e7 ] We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from remote backend. This way, we always see packets with the backend pod's IP and the client's IP. Packets on their way to a remote backend will be redirect by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should, because they won't have the mark indicating they come from the host namespace. See subsequent commit in this serie where we introduce the check on the mark in to-netdev. See #12011 for details and discussion on this change. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Fixes: 489dbef ("bpf: Enforce host policies for IPv6") Related: #12011 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>

pchaigno added priority/release-blocker release-note/bug This PR fixes an issue in a previous release of Cilium. labels Jun 10, 2020

pchaigno requested review from brb and a team June 10, 2020 18:01

pchaigno mentioned this pull request Jun 10, 2020

Followups for host endpoint and firewall #11799

Closed

34 tasks

pchaigno marked this pull request as draft June 10, 2020 20:22

brb reviewed Jun 11, 2020

View reviewed changes

pchaigno removed the priority/release-blocker label Jun 15, 2020

pchaigno closed this Jul 6, 2020

pchaigno deleted the pr/pchaigno/apply-host-fw-before-nodeport-snatting branch July 6, 2020 06:25

pchaigno added the area/host-firewall Impacts the host firewall or the host endpoint. label Jul 20, 2020

qmonnet removed the needs-backport/1.8 label Aug 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bpf: Apply host firewall before NodePort SNATing #12011

bpf: Apply host firewall before NodePort SNATing #12011

Uh oh!

pchaigno commented Jun 10, 2020

Uh oh!

pchaigno commented Jun 10, 2020

Uh oh!

coveralls commented Jun 10, 2020

Uh oh!

brb Jun 11, 2020

Uh oh!

pchaigno Jun 11, 2020

Uh oh!

brb Jun 11, 2020 •

edited

Loading

Uh oh!

pchaigno Jun 11, 2020

Uh oh!

brb Jun 12, 2020 •

edited

Loading

Uh oh!

brb Jun 12, 2020

Uh oh!

pchaigno Jun 12, 2020

Uh oh!

brb Jun 11, 2020

Uh oh!

pchaigno commented Jul 6, 2020

Uh oh!

Uh oh!

bpf: Apply host firewall before NodePort SNATing #12011

bpf: Apply host firewall before NodePort SNATing #12011

Uh oh!

Conversation

pchaigno commented Jun 10, 2020

Uh oh!

pchaigno commented Jun 10, 2020

Uh oh!

coveralls commented Jun 10, 2020

Uh oh!

brb Jun 11, 2020

Choose a reason for hiding this comment

Uh oh!

pchaigno Jun 11, 2020

Choose a reason for hiding this comment

Uh oh!

brb Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pchaigno Jun 11, 2020

Choose a reason for hiding this comment

Uh oh!

brb Jun 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

brb Jun 12, 2020

Choose a reason for hiding this comment

Uh oh!

pchaigno Jun 12, 2020

Choose a reason for hiding this comment

Uh oh!

brb Jun 11, 2020

Choose a reason for hiding this comment

Uh oh!

pchaigno commented Jul 6, 2020

Uh oh!

Uh oh!

brb Jun 11, 2020 •

edited

Loading

brb Jun 12, 2020 •

edited

Loading