Skip to content

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented Jun 30, 2020

This pull request fixes various host firewall bugs. Please see the release notes and review commit by commit.

How I Tested This

I'm still tracking some bugs in the host firewall, so seems important to highlight what I tested and is known to work.

I tested the host firewall with the ingress+egress L3+L4 example policy included at the end of this series. It attempts to whitelist connections in the most fine-grained way possible. Specifically, I launched the connectivity-test and loaded that policy in the net-next development VMs, with options kube-proxy-replacement=strict, enable-remote-node-identity, enable-host-firewall, and the devices set.

I then checked that the connectivity check was okay for 20min+ and verified all packet drops were expected.

Impact on CI

The host firewall is enabled in our Packet K8s CI builds. The main reason to enable it was to be able to detect any regression due to BPF complexity issues (it adds a lot of code to the datapath). That was fine because having the host firewall enabled with allow-all default shouldn't impact anything in Cilium (apart from dropping packets if there's a bug).

However, with the present fixes, enabling the host firewall now changes the path of some packets to force them through the firewall's ingress. That means we are really not testing the default setup (i.e., host firewall disabled) the users are likely to have.

I will therefore send a follow up PR to only enable the host firewall in CI if we have some label set in the PR (as we do for other features).

Updates: #11799

Fix incorrect host firewall enforcement when used with BPF-based NodePort services
Fix host firewall ingress bypass on path from pods to local host
Fix potential ingress host firewall bypass in tunneling mode for remote pods
Fix handling of ICMPv6 messages by host firewall
Fix failure to recognize established IPv6 connections on egress of the host firewall
Fix potential host firewall drops on egress of the node in case of SNAT

@pchaigno pchaigno added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jun 30, 2020
@pchaigno pchaigno force-pushed the pr/pchaigno/host-firewall-fixes branch from 5382653 to 1c86738 Compare June 30, 2020 20:27
@pchaigno pchaigno added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. needs-backport/1.8 labels Jun 30, 2020
@coveralls

This comment has been minimized.

@pchaigno pchaigno force-pushed the pr/pchaigno/host-firewall-fixes branch from 1c86738 to 7e0d055 Compare July 1, 2020 10:54
@pchaigno pchaigno added the kind/bug This is a bug in the Cilium logic. label Jul 1, 2020
@pchaigno pchaigno force-pushed the pr/pchaigno/host-firewall-fixes branch 4 times, most recently from 5646508 to b603d50 Compare July 2, 2020 06:53
@pchaigno pchaigno marked this pull request as ready for review July 2, 2020 09:01
@pchaigno pchaigno requested a review from a team July 2, 2020 09:01
@pchaigno pchaigno requested review from a team as code owners July 2, 2020 09:01
Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from a few minor nits, the PR looks good to me. Thanks for this work!

Copy link
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI bits lgtm

@pchaigno pchaigno force-pushed the pr/pchaigno/host-firewall-fixes branch from b603d50 to c4bf244 Compare July 6, 2020 06:24
@pchaigno pchaigno requested a review from qmonnet July 6, 2020 06:29
@pchaigno pchaigno requested a review from a team July 8, 2020 10:17
@brb
Copy link
Member

brb commented Jul 8, 2020

Re We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend from 69c4efe: please correct me if I'm wrong, but I think we might still miss this case, as in the case of a remote backend, CILIUM_CALL_IPV{4,6}_NODEPORT_NAT (taill-called by nodeport_lb{4,6}()) will not recircle back to bpf_host and will invoke ctx_redirect().

@pchaigno
Copy link
Member Author

pchaigno commented Jul 8, 2020

Re We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend from 69c4efe: please correct me if I'm wrong, but I think we might still miss this case, as in the case of a remote backend, CILIUM_CALL_IPV{4,6}_NODEPORT_NAT (taill-called by nodeport_lb{4,6}()) will not recircle back to bpf_host and will invoke ctx_redirect().

Ah, right. So if the backend is remote, I can just skip the ingress host policy enforcement. Is it correct to assume that if CILIUM_CALL_IPV{4,6}_NODEPORT_NAT is called with NAT_DIR_EGRESS, then the backend is remote? If that's always true, I could maybe set a mark to skip the host firewall (I'm guessing ctx->tc_index isn't preserved on redirects?).

@brb
Copy link
Member

brb commented Jul 9, 2020

Is it correct to assume that if CILIUM_CALL_IPV{4,6}_NODEPORT_NAT is called with NAT_DIR_EGRESS, then the backend is remote?

Yes:

  • NAT_DIR_EGRESS - a request to a remote backend.
  • NAT_DIR_INGRESS - a reply from a remote backend + other packets.

@pchaigno pchaigno requested a review from a team as a code owner July 15, 2020 20:50
@pchaigno pchaigno force-pushed the pr/pchaigno/host-firewall-fixes branch from d430a28 to 0425a2c Compare July 15, 2020 20:54
pchaigno added 9 commits July 16, 2020 16:45
The host firewall functions will be called from nodeport code in next
commit. A few minor changes were also made in the code moved to please
checkpatch.

Signed-off-by: Paul Chaignon <paul@cilium.io>
In handle_ipv{4,6}, we implement several operations:

  1. Handle NodePort services
  2. (Handle ICMPv6 messages)
  3. Swap MAC addresses
  4. Enforce host policies

However, NodePort services do not process ICMPv6 messages so we can just
process those first. In addition, it is useless to swap MAC addresses if
the packet is going to be dropped by host policies next. Thus, this
commit reorders these operations as follows.

  1. (Handle ICMPv6 messages)
  2. Handle NodePort services
  3. Enforce host policies
  4. Swap MAC addresses

Signed-off-by: Paul Chaignon <paul@cilium.io>
We want the host firewall to see ingress-to-host packets (1) after the
DNAT but before the SNAT on the way to the remote backend, and (2) after
the reverse SNAT but before the reverse DNAT for replies from remote
backend. This way, we always see packets with the backend pod's IP and the
client's IP.

Packets on their way to a remote backend will be redirect by
nodeport_lb{4,6} to the egress of the native device (towards the remote
node). Egress host policies will be bypassed for these packets, as they
should, because they won't have the mark indicating they come from the
host namespace. See subsequent commit in this serie where we introduce
the check on the mark in to-netdev.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
When the host firewall and vxlan are enabled, we need to send traffic from
pods to remote nodes through the tunnel to preserve the pods' security
IDs. If we don't and masquerading is enabled, those packets will be
SNATed and we will lose the source security ID.

Traffic from pods is automatically sent through the tunnel when the
tunnel_endpoint value in the ipcache is set. Thus, this commit ensures
that value is set to the node's IP for all remote nodes.

Before:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 0.0.0.0
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

After:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 192.168.33.11
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

I tested this change with the dev. VMs, vxlan and the host firewall
enabled, and a host-level L4 policy loaded. Traffic from a pod on the k8s1
was successfully sent through the tunnel to k8s2 and rejected by host
policies at k8s2. Connections allowed by policies took the same path and
were successfully established.
Since the host firewall is enabled in all Jenkins' CIs, passing tests
should also ensure this change does not break connectivity in other
scenarios.

When kube-proxy is enabled, this change makes the host firewall
incompatible with externalTrafficPolicy=Local services and portmap
chaining. These incompatibilities will require additional fixes.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
When traffic from a pod is destined to the local host, on egress from the
container, it is passed to the stack and doesn't go through the host
device (e.g., cilium_host). This results in a host firewall bypass on
ingress.

To fix this, we redirect traffic egressing pods to the host device when
the host firewall is enabled and the destination ID is that of the host.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
The copy of the source IPv6 address into the tuple used for conntrack
lookup was dropped during a previous rebase. This bug can cause
connections to fail when egress host policies are enforced in IPv6.

Signed-off-by: Paul Chaignon <paul@cilium.io>
For ICMPv6 messages we allow, we need to bypass the host firewall. We
cannot simply pass these messages up the stack because they are not
destined to the host.

I rebased 5244b68 not long before it was merged and likely introduced
the bug at that point. I validated the rebase with an ingress host
policy instead of an egress, and missed the regression. This time I
validated with an extensive ingress+egress L3L4 host policy included at
the end of this series.

Fixes: 5244b68 ("bpf: Hande icmpv6 in host firewall")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Packets egressing on the native device (i.e., to-netdev in bpf_host)
may be SNATed and therefore have the source IP address of the host
device. We need to take that into account when resolving the srcID to
avoid matching host policies on packets that egress from pods.

To that end, we rely on the same MARK_MAGIC_HOST mark (marking packets
as they egress the host namespace) as for packets egressing the host
namespace on the host device.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Signed-off-by: Paul Chaignon <paul@cilium.io>
This policy can be used to test the host firewall. It has been tested on
the development VMs with kube-proxy-replacement=strict,
enable-remote-node-identity, and the devices configured. It attempts to
define the most fine-grained possible policies (L3+L4) on both ingress and
egress such that the connectivity-check works.

The L3-only rules are there to ensure ICMP echo messages between nodes
and with the health endpoint are not dropped. The connectivity-check
still works if these rules are removed.

This policy shouldn't be used outside of the dev. VM context as (1) it
is likely more extensive than necessary and (2) some adjustments would
be necessary to prevent breakages.

Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno force-pushed the pr/pchaigno/host-firewall-fixes branch from 0425a2c to 82c746c Compare July 16, 2020 14:46
@pchaigno
Copy link
Member Author

test-me-please

@joestringer
Copy link
Member

BPF checks / GKE are the only failures, BPF checks is false positive and GKE is not required. Merging.

@joestringer joestringer merged commit f55ec90 into master Jul 16, 2020
@joestringer joestringer deleted the pr/pchaigno/host-firewall-fixes branch July 16, 2020 18:22
@pchaigno pchaigno added the area/host-firewall Impacts the host firewall or the host endpoint. label Jul 20, 2020
christarazi pushed a commit that referenced this pull request Jul 20, 2020
[ upstream commit 9654cb4 ]

The host firewall currently only works if remote node identities are
distinct from the host identity (i.e., REMOTE_NODE_ID is supported).

It's not clear if this restriction can be removed in the future,
especially once #12345 is merged. In particular, #12345 will redirect
packets from local endpoints through the host device if the destination
security ID is HOST_ID. If the HOST_ID includes remote nodes, then this
redirection is likely to result in packet drops.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
@pchaigno pchaigno mentioned this pull request Jul 25, 2020
pchaigno added a commit to pchaigno/cilium that referenced this pull request May 25, 2021
Commit 7e953b9 (".github: Cancel outdated comment workflows") introduced
concurrency groups for workflows triggered by comments. In each
concurrency group, a single workflow can be running at any time, with
previous workflows cancelled when more recent are scheduled.

However, in the context of comment-triggered workflows, a workflow is
triggered for every single comment in the pull request. The actual tests
on the other hand are only triggered for specific comments. But even if
those comments don't contain a phrase that triggers the test (e.g.,
test-me-please or ci-gke), they will cancel previously-running workflows.

To fix this, we need to ensure that the concurrency group with comments
that trigger tests does not include any comments which don't trigger
tests. We can achieve that by appending the actual comment text to the
concurrency group name.

So for example, a comment with "test-me-please" on PR 12345 will trigger
a workflow which belong to concurrency group:

    ConformanceEKS (ci-eks) cilium#12345 test-me-please

If GKE tests are then triggered with ci-gke, the new workflow will
belong to a second concurrency group and won't cancel the first:

    ConformanceEKS (ci-eks) cilium#12345 ci-gke

That is probably okay since it will preserve most of the benefits of
concurrency groups without cancelling everything as soon as someone
posts a comment.

Fixes: 7e953b9 (".github: Cancel outdated comment workflows")
Signed-off-by: Paul Chaignon <paul@cilium.io>
twpayne pushed a commit that referenced this pull request May 26, 2021
Commit 7e953b9 (".github: Cancel outdated comment workflows") introduced
concurrency groups for workflows triggered by comments. In each
concurrency group, a single workflow can be running at any time, with
previous workflows cancelled when more recent are scheduled.

However, in the context of comment-triggered workflows, a workflow is
triggered for every single comment in the pull request. The actual tests
on the other hand are only triggered for specific comments. But even if
those comments don't contain a phrase that triggers the test (e.g.,
test-me-please or ci-gke), they will cancel previously-running workflows.

To fix this, we need to ensure that the concurrency group with comments
that trigger tests does not include any comments which don't trigger
tests. We can achieve that by appending the actual comment text to the
concurrency group name.

So for example, a comment with "test-me-please" on PR 12345 will trigger
a workflow which belong to concurrency group:

    ConformanceEKS (ci-eks) #12345 test-me-please

If GKE tests are then triggered with ci-gke, the new workflow will
belong to a second concurrency group and won't cancel the first:

    ConformanceEKS (ci-eks) #12345 ci-gke

That is probably okay since it will preserve most of the benefits of
concurrency groups without cancelling everything as soon as someone
posts a comment.

Fixes: 7e953b9 (".github: Cancel outdated comment workflows")
Signed-off-by: Paul Chaignon <paul@cilium.io>
qmonnet pushed a commit to qmonnet/cilium that referenced this pull request Jun 4, 2021
[ upstream commit c569ead ]

Commit 7e953b9 (".github: Cancel outdated comment workflows") introduced
concurrency groups for workflows triggered by comments. In each
concurrency group, a single workflow can be running at any time, with
previous workflows cancelled when more recent are scheduled.

However, in the context of comment-triggered workflows, a workflow is
triggered for every single comment in the pull request. The actual tests
on the other hand are only triggered for specific comments. But even if
those comments don't contain a phrase that triggers the test (e.g.,
test-me-please or ci-gke), they will cancel previously-running workflows.

To fix this, we need to ensure that the concurrency group with comments
that trigger tests does not include any comments which don't trigger
tests. We can achieve that by appending the actual comment text to the
concurrency group name.

So for example, a comment with "test-me-please" on PR 12345 will trigger
a workflow which belong to concurrency group:

    ConformanceEKS (ci-eks) cilium#12345 test-me-please

If GKE tests are then triggered with ci-gke, the new workflow will
belong to a second concurrency group and won't cancel the first:

    ConformanceEKS (ci-eks) cilium#12345 ci-gke

That is probably okay since it will preserve most of the benefits of
concurrency groups without cancelling everything as soon as someone
posts a comment.

Fixes: 7e953b9 (".github: Cancel outdated comment workflows")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
nathanjsweet pushed a commit that referenced this pull request Jun 4, 2021
[ upstream commit c569ead ]

Commit 7e953b9 (".github: Cancel outdated comment workflows") introduced
concurrency groups for workflows triggered by comments. In each
concurrency group, a single workflow can be running at any time, with
previous workflows cancelled when more recent are scheduled.

However, in the context of comment-triggered workflows, a workflow is
triggered for every single comment in the pull request. The actual tests
on the other hand are only triggered for specific comments. But even if
those comments don't contain a phrase that triggers the test (e.g.,
test-me-please or ci-gke), they will cancel previously-running workflows.

To fix this, we need to ensure that the concurrency group with comments
that trigger tests does not include any comments which don't trigger
tests. We can achieve that by appending the actual comment text to the
concurrency group name.

So for example, a comment with "test-me-please" on PR 12345 will trigger
a workflow which belong to concurrency group:

    ConformanceEKS (ci-eks) #12345 test-me-please

If GKE tests are then triggered with ci-gke, the new workflow will
belong to a second concurrency group and won't cancel the first:

    ConformanceEKS (ci-eks) #12345 ci-gke

That is probably okay since it will preserve most of the benefits of
concurrency groups without cancelling everything as soon as someone
posts a comment.

Fixes: 7e953b9 (".github: Cancel outdated comment workflows")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/host-firewall Impacts the host firewall or the host endpoint. dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants