Skip to content

Conversation

pchaigno
Copy link
Member

We currently incorrectly enforce host policies after BPF-based NodePort SNATing (in to-netdev bpf_host section). Thus, at the time we enforce the host policies, a packet destined to a NodePort will have a source IP of the host. We may then drop that packet based on egress policies because we believe it comes from the host namespace.

For example, in the following example, packet (2) might get dropped due to the egress policies of Worker Node A because the packet now has source IP 1.1.1.1.
cilium-snat-illustration

This commit fixes this, as well as the reverse path (reverse NAT). This change also allows us to simplify the code a bit by removing the skip_redirect checks.

Fixes: #11507
Updates: #11799
/cc @borkmann You might need to rebase #11977 to include this commit instead of the previous partial fix you cherry-picked.

We currently incorrectly enforce host policies after BPF-based NodePort
SNATing (in to-netdev bpf_host section). Thus, at the time we enforce
the host policies, a packet destined to a NodePort will have a source IP
of the host. We may then drop that packet based on egress policies because
we believe it comes from the host namespace.

This commit fixes this, as well as the reverse path (reverse NAT). This
change also allows us to simplify the code a bit by removing the
skip_redirect checks.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno added priority/release-blocker release-note/bug This PR fixes an issue in a previous release of Cilium. labels Jun 10, 2020
@pchaigno pchaigno requested review from brb and a team June 10, 2020 18:01
@pchaigno
Copy link
Member Author

test-me-please

@coveralls
Copy link

Coverage Status

Coverage increased (+0.004%) to 37.05% when pulling 8ad986a on pr/pchaigno/apply-host-fw-before-nodeport-snatting into 8a02976 on master.

@pchaigno pchaigno marked this pull request as draft June 10, 2020 20:22
@@ -755,6 +744,17 @@ handle_ipv4(struct __ctx_buff *ctx, __u32 secctx, bool from_host)
if (!revalidate_data(ctx, &data, &data_end, &ip4))
return DROP_INVALID;

#ifdef ENABLE_HOST_FIREWALL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of a request destined to a remote backend (NodePort SNAT), we don't recircle back to handle_ipv{4,6}() from CILIUM_CALL_IPV{4,6}_NODEPORT_NAT, so this particular code move looks good.

In the case of a reply from the remote backend, this code will see a packet |src=backendPodIP|dst=SNATIP. Is it OK, or it is expecting to see a real dst IP (the original client IP)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of a packet destined to remote backend, we want to see |src=clientIP|dst=backendPodIP|. In the case of a reply from remote backend, we want to see |src=backendPodIP|dst=clientIP|. This PR doesn't give us the correct dst in second case and that is why I switched it to draft mode yesterday evening.

So ideally, I would like the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT on reply from remote backend. The first case is easy to get because DNAT and SNAT are separated in nodeport_lb{4,6} and nodeport_nat_fwd respectively. The second case, for replies, is harder to get because all reverse NAT operations seem to happen in nodeport_lb{4,6}; they are not decoupled.

As a temporary workaround, I could maybe:

  • Save the source IP, i.e., backendPodIP.
  • Call nodeport_lb{4,6}.
  • Enforce host policies with the saved source IP instead of the packet's new source IP.

That would only be a temporary workaround, because as far as I know, once we want to support L7 policies, the host firewall will need to forward the proper reply packet to the proxy, i.e., a packet with |src=backendPodIP|dst=clientIP|.

Does that make sense? Am I understanding this correctly?

Copy link
Member

@brb brb Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second case, for replies, is harder to get because all reverse NAT operations seem to happen in nodeport_lb{4,6}

Have you considered doing the host-fw validation for this case in CILIUM_CALL_IPV{4,6}_NODEPORT_REVNAT? This is exactly the place in which a packet is already rev-SNAT xlated, but DNAT is going to be performed soon. Just one crux is that it's also invoked for a case when a local backend sends a reply.

Does that make sense?

Anyway, the proposed solution makes sense too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered doing the host-fw validation for this case in CILIUM_CALL_IPV{4,6}_NODEPORT_REVNAT? This is exactly the place in which a packet is already rev-SNAT xlated, but DNAT is going to be performed soon.

Not really; I'm a bit wary of mixing the NodePort and host firewall code.
But yeah, from a quick look, it seems that the approach you propose would allow us to get packets exactly in the state we need them. So it would look something like this for the !from_host path:

#ifdef ENABLE_NODEPORT
if (ctx_get_xfer(ctx) != XFER_PKT_NO_SVC &&
    !bpf_skip_nodeport(ctx)) {
	ret = nodeport_lb4(ctx, secctx);
	if (ret < 0)
		return ret;
# ifdef ENABLE_HOST_FIREWALL
	// AFAIU, this call is needed to handle packets
	// for which we don't take the tail call in nodeport_lb4().
	// We enforce host policies here to avoid enforcing twice
	// for packets for which we did take the tail call in nodeport_lb4().
	ret = ipv4_host_policy_ingress(ctx);
	if (IS_ERR(ret))
		return ret;
# endif
}
#elif defined(ENABLE_HOST_FIREWALL)
ret = ipv4_host_policy_ingress(ctx);
if (IS_ERR(ret))
	return ret;
#endif

with another call to ipv4_host_policy_ingress() in tail_rev_nodeport_lb4().

When BPF-based NodePort is enabled, we would never call ipv4_host_policy_ingress() for packets on their way to a remote backend, but I guess that's fine since the host policies wouldn't apply to those anyway.

As for the verifier complexity impact, we would probably be fine in CILIUM_CALL_IPV{4,6}_FROM_LXC because, AFAICS, the above code would results in a few less possible paths through the CFG than we currently have (because the ipv4_host_policy_ingress() call is in the bpf_skip_nodeport() condition).

The verifier complexity (and program sizes) will increase for CILIUM_CALL_IPV{4,6}_NODEPORT_NAT however. Do you know if those programs are close to the limits currently?

Copy link
Member

@brb brb Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if those programs are close to the limits currently?

I haven't seen the verifier complaying about this section, so I assume we should be fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would work too?

#ifdef ENABLE_NODEPORT
if (ctx_get_xfer(ctx) != XFER_PKT_NO_SVC &&
    !bpf_skip_nodeport(ctx)) {
	ret = nodeport_lb4(ctx, secctx);
	if (ret < 0)
		return ret;
}
#endif
#if defined(ENABLE_HOST_FIREWALL)
ret = ipv4_host_policy_ingress(ctx);
if (IS_ERR(ret))
	return ret;
#endif

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, I'm not sure what I had in mind. We will not hit the host policy enforcement twice since in the first case we take the tail call and don't come back to _FROM_LXC 👍

(!defined(ENABLE_DSR) || \
(defined(ENABLE_DSR) && defined(ENABLE_DSR_HYBRID)))
if ((ctx->mark & MARK_MAGIC_SNAT_DONE) != MARK_MAGIC_SNAT_DONE) {
ret = nodeport_nat_fwd(ctx, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the host fw will see the pod IP as src, as BPF masquerade can happen in nodeport_nat_fwd.

pchaigno added a commit that referenced this pull request Jul 6, 2020
We want the host firewall to see ingress-to-host packets (1) after the
DNAT but before the SNAT on the way to the remote backend, and (2) after
the reverse SNAT but before the reverse DNAT for replies from remote
backend. This way, we always see packets with the backend pod's IP and the
client's IP.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno
Copy link
Member Author

pchaigno commented Jul 6, 2020

Closing in favor of #12345.

@pchaigno pchaigno closed this Jul 6, 2020
@pchaigno pchaigno deleted the pr/pchaigno/apply-host-fw-before-nodeport-snatting branch July 6, 2020 06:25
pchaigno added a commit that referenced this pull request Jul 15, 2020
We want the host firewall to see ingress-to-host packets (1) after the
DNAT but before the SNAT on the way to the remote backend, and (2) after
the reverse SNAT but before the reverse DNAT for replies from remote
backend. This way, we always see packets with the backend pod's IP and the
client's IP.

Packets on their way to a remote backend will be redirect by
nodeport_lb{4,6} to the egress of the native device (towards the remote
node). Egress host policies will be bypassed for these packets, as they
should, because they won't have the mark indicating they come from the
host namespace. See subsequent commit in this serie where we introduce
the check on the mark in to-netdev.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
pchaigno added a commit that referenced this pull request Jul 16, 2020
We want the host firewall to see ingress-to-host packets (1) after the
DNAT but before the SNAT on the way to the remote backend, and (2) after
the reverse SNAT but before the reverse DNAT for replies from remote
backend. This way, we always see packets with the backend pod's IP and the
client's IP.

Packets on their way to a remote backend will be redirect by
nodeport_lb{4,6} to the egress of the native device (towards the remote
node). Egress host policies will be bypassed for these packets, as they
should, because they won't have the mark indicating they come from the
host namespace. See subsequent commit in this serie where we introduce
the check on the mark in to-netdev.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
joestringer pushed a commit that referenced this pull request Jul 16, 2020
We want the host firewall to see ingress-to-host packets (1) after the
DNAT but before the SNAT on the way to the remote backend, and (2) after
the reverse SNAT but before the reverse DNAT for replies from remote
backend. This way, we always see packets with the backend pod's IP and the
client's IP.

Packets on their way to a remote backend will be redirect by
nodeport_lb{4,6} to the egress of the native device (towards the remote
node). Egress host policies will be bypassed for these packets, as they
should, because they won't have the mark indicating they come from the
host namespace. See subsequent commit in this serie where we introduce
the check on the mark in to-netdev.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
joestringer pushed a commit that referenced this pull request Jul 16, 2020
[ upstream commit 56a80e7 ]

We want the host firewall to see ingress-to-host packets (1) after the
DNAT but before the SNAT on the way to the remote backend, and (2) after
the reverse SNAT but before the reverse DNAT for replies from remote
backend. This way, we always see packets with the backend pod's IP and the
client's IP.

Packets on their way to a remote backend will be redirect by
nodeport_lb{4,6} to the egress of the native device (towards the remote
node). Egress host policies will be bypassed for these packets, as they
should, because they won't have the mark indicating they come from the
host namespace. See subsequent commit in this serie where we introduce
the check on the mark in to-netdev.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
@pchaigno pchaigno added the area/host-firewall Impacts the host firewall or the host endpoint. label Jul 20, 2020
pchaigno added a commit that referenced this pull request Jul 21, 2020
[ upstream commit 56a80e7 ]

We want the host firewall to see ingress-to-host packets (1) after the
DNAT but before the SNAT on the way to the remote backend, and (2) after
the reverse SNAT but before the reverse DNAT for replies from remote
backend. This way, we always see packets with the backend pod's IP and the
client's IP.

Packets on their way to a remote backend will be redirect by
nodeport_lb{4,6} to the egress of the native device (towards the remote
node). Egress host policies will be bypassed for these packets, as they
should, because they won't have the mark indicating they come from the
host namespace. See subsequent commit in this serie where we introduce
the check on the mark in to-netdev.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
rolinh pushed a commit that referenced this pull request Jul 21, 2020
[ upstream commit 56a80e7 ]

We want the host firewall to see ingress-to-host packets (1) after the
DNAT but before the SNAT on the way to the remote backend, and (2) after
the reverse SNAT but before the reverse DNAT for replies from remote
backend. This way, we always see packets with the backend pod's IP and the
client's IP.

Packets on their way to a remote backend will be redirect by
nodeport_lb{4,6} to the egress of the native device (towards the remote
node). Egress host policies will be bypassed for these packets, as they
should, because they won't have the mark indicating they come from the
host namespace. See subsequent commit in this serie where we introduce
the check on the mark in to-netdev.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/host-firewall Impacts the host firewall or the host endpoint. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants