Host firewall fixes #12345
Conversation
Aside from a few minor nits, the PR looks good to me. Thanks for this work!
CI bits lgtm
Ah, right. So if the backend is remote, I can just skip the ingress host policy enforcement. Is it correct to assume that if |

Yes: |
The host firewall functions will be called from the nodeport code in the next commit. A few minor changes were also made to the moved code to please checkpatch.

Signed-off-by: Paul Chaignon <paul@cilium.io>
In handle_ipv{4,6}, we implement several operations:

1. Handle NodePort services
2. (Handle ICMPv6 messages)
3. Swap MAC addresses
4. Enforce host policies

However, NodePort services do not process ICMPv6 messages, so we can just process those first. In addition, it is useless to swap MAC addresses if the packet is going to be dropped by host policies next. Thus, this commit reorders these operations as follows:

1. (Handle ICMPv6 messages)
2. Handle NodePort services
3. Enforce host policies
4. Swap MAC addresses

Signed-off-by: Paul Chaignon <paul@cilium.io>
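The reordering above can be modeled with a toy Python pipeline. This is not Cilium's actual BPF C datapath code; the function names and the packet dict are hypothetical stand-ins, used only to show that with the new order no MAC swap is wasted on a packet that host policies will drop:

```python
DROP, CONTINUE = "drop", "continue"

def handle_icmpv6(pkt):
    # Runs first: the NodePort code never needs to see ICMPv6 messages.
    return CONTINUE

def handle_nodeport(pkt):
    return CONTINUE

def enforce_host_policies(pkt):
    return CONTINUE if pkt.get("allowed", True) else DROP

def swap_mac_addresses(pkt):
    # Runs last, so we never swap MACs on a packet that gets dropped.
    pkt["src_mac"], pkt["dst_mac"] = pkt["dst_mac"], pkt["src_mac"]
    return CONTINUE

def handle_ip(pkt):
    # New order: ICMPv6, NodePort, host policies, MAC swap.
    for step in (handle_icmpv6, handle_nodeport,
                 enforce_host_policies, swap_mac_addresses):
        if step(pkt) == DROP:
            return DROP
    return CONTINUE
```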
We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from the remote backend. This way, we always see packets with the backend pod's IP and the client's IP.

Packets on their way to a remote backend will be redirected by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should be, because they won't have the mark indicating they come from the host namespace. See the subsequent commit in this series, where we introduce the check on the mark in to-netdev.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
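To make the hook placement concrete, here is a minimal Python model of the forward direction (not Cilium's actual BPF code; NODE_IP and the dict fields are invented for illustration). It shows why running the firewall between DNAT and SNAT lets policies match on the client's IP and the backend pod's IP:

```python
NODE_IP = "10.0.0.1"  # hypothetical node address used for SNAT

def dnat(pkt):
    pkt["dst"] = pkt["backend"]            # service VIP -> backend pod IP

def snat(pkt):
    pkt["src"] = NODE_IP                   # masquerade towards the remote node

def host_firewall(pkt, allowed, seen):
    seen.append((pkt["src"], pkt["dst"]))  # addresses the policy matches on
    return "forward" if (pkt["src"], pkt["dst"]) in allowed else "drop"

def ingress_to_remote_backend(pkt, allowed, seen):
    dnat(pkt)                              # (1) after the DNAT...
    if host_firewall(pkt, allowed, seen) == "drop":
        return "drop"
    snat(pkt)                              # ...but before the SNAT
    return "forward"
```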
When the host firewall and vxlan are enabled, we need to send traffic from pods to remote nodes through the tunnel to preserve the pods' security IDs. If we don't and masquerading is enabled, those packets will be SNATed and we will lose the source security ID. Traffic from pods is automatically sent through the tunnel when the tunnel_endpoint value in the ipcache is set. Thus, this commit ensures that value is set to the node's IP for all remote nodes.

Before:

  $ sudo cilium bpf ipcache get 192.168.33.11
  192.168.33.11 maps to identity 6 0 0.0.0.0
  $ sudo cilium bpf ipcache get 192.168.33.12
  192.168.33.12 maps to identity 1 0 0.0.0.0

After:

  $ sudo cilium bpf ipcache get 192.168.33.11
  192.168.33.11 maps to identity 6 0 192.168.33.11
  $ sudo cilium bpf ipcache get 192.168.33.12
  192.168.33.12 maps to identity 1 0 0.0.0.0

I tested this change with the dev. VMs, vxlan and the host firewall enabled, and a host-level L4 policy loaded. Traffic from a pod on k8s1 was successfully sent through the tunnel to k8s2 and rejected by host policies at k8s2. Connections allowed by policies took the same path and were successfully established. Since the host firewall is enabled in all Jenkins CIs, passing tests should also ensure this change does not break connectivity in other scenarios.

When kube-proxy is enabled, this change makes the host firewall incompatible with externalTrafficPolicy=Local services and portmap chaining. These incompatibilities will require additional fixes.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
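The forwarding decision this commit relies on can be sketched in Python (illustrative names, not Cilium's actual Go/C identifiers): with tunnel_endpoint set in the ipcache entry, pod traffic is encapsulated and keeps its source security ID; with it unset, the packet takes the native path and the ID is lost to SNAT.

```python
def forward_pod_packet(ipcache_entry, src_identity):
    # Sketch only: models the behavior described in the commit message.
    if ipcache_entry["tunnel_endpoint"] != "0.0.0.0":
        # vxlan path: the security ID travels in the encapsulation header
        return ("tunnel", src_identity)
    # native path with masquerading: the source ID is lost after SNAT
    return ("native", None)
```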
When traffic from a pod is destined to the local host, on egress from the container, it is passed to the stack and doesn't go through the host device (e.g., cilium_host). This results in a host firewall bypass on ingress. To fix this, we redirect traffic egressing pods to the host device when the host firewall is enabled and the destination ID is that of the host. Fixes: #11507 Signed-off-by: Paul Chaignon <paul@cilium.io>
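The redirect decision can be sketched as follows (a Python toy model with made-up names; HOST_ID here stands in for the reserved host identity, and the cilium_host device name follows the example in the commit message):

```python
HOST_ID = 1  # stand-in for the reserved host identity in this sketch

def pod_egress_target(dst_identity, host_firewall_enabled):
    # Without the fix, host-bound traffic went straight to the stack and
    # bypassed the ingress host firewall. With it, such traffic is
    # redirected to the host device (e.g., cilium_host) first.
    if host_firewall_enabled and dst_identity == HOST_ID:
        return "redirect:cilium_host"
    return "stack"
```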
The copy of the source IPv6 address into the tuple used for conntrack lookup was dropped during a previous rebase. This bug can cause connections to fail when egress host policies are enforced in IPv6. Signed-off-by: Paul Chaignon <paul@cilium.io>
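To illustrate the class of bug (a Python sketch with invented field names, not the actual C struct code): the conntrack lookup tuple must carry the packet's source address, or lookups for established connections miss and traffic is dropped under egress policies.

```python
def build_ct_tuple(pkt):
    # Every field of the tuple must be populated for the lookup to hit.
    return {
        "saddr": pkt["saddr"],   # this copy is what the rebase had dropped
        "daddr": pkt["daddr"],
        "sport": pkt["sport"],
        "dport": pkt["dport"],
    }
```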
For ICMPv6 messages we allow, we need to bypass the host firewall. We cannot simply pass these messages up the stack because they are not destined to the host. I rebased 5244b68 not long before it was merged and likely introduced the bug at that point. I validated the rebase with an ingress host policy instead of an egress, and missed the regression. This time I validated with an extensive ingress+egress L3L4 host policy included at the end of this series. Fixes: 5244b68 ("bpf: Hande icmpv6 in host firewall") Signed-off-by: Paul Chaignon <paul@cilium.io>
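The bypass can be sketched as follows (hypothetical Python; the ICMPv6 type numbers are the standard NDP neighbor solicitation/advertisement types, used here only as examples of messages the datapath allows):

```python
ALLOWED_ICMPV6_TYPES = {135, 136}  # neighbor solicitation / advertisement

def ingress_verdict(pkt, policy_verdict):
    # Allowed ICMPv6 messages bypass host policy enforcement entirely;
    # they are not destined to the host, so simply passing them up the
    # stack is not an option.
    if pkt.get("icmpv6_type") in ALLOWED_ICMPV6_TYPES:
        return "allow"
    return policy_verdict
```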
Packets egressing on the native device (i.e., to-netdev in bpf_host) may be SNATed and therefore have the source IP address of the host device. We need to take that into account when resolving the srcID to avoid matching host policies on packets that egress from pods. To that end, we rely on the same MARK_MAGIC_HOST mark (marking packets as they egress the host namespace) as for packets egressing the host namespace on the host device. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Signed-off-by: Paul Chaignon <paul@cilium.io>
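Conceptually, the source-identity resolution looks like this (Python sketch; the mask and magic values are placeholders, not Cilium's actual constants):

```python
MARK_MAGIC_MASK = 0xF00   # placeholder values for the sketch
MARK_MAGIC_HOST = 0xC00
HOST_ID = 1

def resolve_src_identity(pkt, ip_lookup_identity):
    # The source IP alone is ambiguous on to-netdev: SNATed pod packets
    # also carry the host's IP. Trust the mark instead, which is only
    # set on packets as they leave the host namespace.
    if pkt["mark"] & MARK_MAGIC_MASK == MARK_MAGIC_HOST:
        return HOST_ID
    return ip_lookup_identity
```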
This policy can be used to test the host firewall. It has been tested on the development VMs with kube-proxy-replacement=strict, enable-remote-node-identity, and the devices configured. It attempts to define the most fine-grained possible policies (L3+L4) on both ingress and egress such that the connectivity-check works. The L3-only rules are there to ensure ICMP echo messages between nodes and with the health endpoint are not dropped. The connectivity-check still works if these rules are removed. This policy shouldn't be used outside of the dev. VM context as (1) it is likely more extensive than necessary and (2) some adjustments would be necessary to prevent breakages. Signed-off-by: Paul Chaignon <paul@cilium.io>
test-me-please

BPF checks and GKE are the only failures; the BPF checks failure is a false positive, and GKE is not required. Merging.
[ upstream commit 9654cb4 ] The host firewall currently only works if remote node identities are distinct from the host identity (i.e., REMOTE_NODE_ID is supported). It's not clear if this restriction can be removed in the future, especially once #12345 is merged. In particular, #12345 will redirect packets from local endpoints through the host device if the destination security ID is HOST_ID. If the HOST_ID includes remote nodes, then this redirection is likely to result in packet drops. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>
Commit 7e953b9 (".github: Cancel outdated comment workflows") introduced concurrency groups for workflows triggered by comments. In each concurrency group, a single workflow can be running at any time, with previous workflows cancelled when more recent ones are scheduled.

However, in the context of comment-triggered workflows, a workflow is triggered for every single comment in the pull request. The actual tests, on the other hand, are only triggered for specific comments. But even if those comments don't contain a phrase that triggers the tests (e.g., test-me-please or ci-gke), they will cancel previously-running workflows.

To fix this, we need to ensure that the concurrency group for comments that trigger tests does not include any comments which don't trigger tests. We can achieve that by appending the actual comment text to the concurrency group name. So for example, a comment with "test-me-please" on PR 12345 will trigger a workflow which belongs to concurrency group:

  ConformanceEKS (ci-eks) cilium#12345 test-me-please

If GKE tests are then triggered with ci-gke, the new workflow will belong to a second concurrency group and won't cancel the first:

  ConformanceEKS (ci-eks) cilium#12345 ci-gke

That is probably okay since it will preserve most of the benefits of concurrency groups without cancelling everything as soon as someone posts a comment.

Fixes: 7e953b9 (".github: Cancel outdated comment workflows")
Signed-off-by: Paul Chaignon <paul@cilium.io>
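The naming scheme amounts to string concatenation; a minimal Python sketch (the helper name is invented, the format mirrors the examples in the commit message):

```python
def concurrency_group(workflow, pr_ref, comment_body):
    # Appending the comment text keeps non-triggering comments in their
    # own groups, so they can't cancel a running test workflow.
    return f"{workflow} {pr_ref} {comment_body.strip()}"
```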
This pull request fixes various host firewall bugs. Please see the release notes and review commit by commit.
How I Tested This
I'm still tracking some bugs in the host firewall, so it seems important to highlight what I tested and what is known to work.
I tested the host firewall with the ingress+egress L3+L4 example policy included at the end of this series. It attempts to whitelist connections in the most fine-grained way possible. Specifically, I launched the connectivity-test and loaded that policy in the net-next development VMs, with options kube-proxy-replacement=strict, enable-remote-node-identity, and enable-host-firewall, and the devices set. I then checked that the connectivity check was okay for 20min+ and verified all packet drops were expected.
Impact on CI
The host firewall is enabled in our Packet K8s CI builds. The main reason to enable it was to be able to detect any regression due to BPF complexity issues (it adds a lot of code to the datapath). That was fine because having the host firewall enabled with allow-all default shouldn't impact anything in Cilium (apart from dropping packets if there's a bug).
However, with the present fixes, enabling the host firewall now changes the path of some packets to force them through the firewall's ingress. That means we are really not testing the default setup (i.e., host firewall disabled) the users are likely to have.
I will therefore send a follow-up PR to enable the host firewall in CI only when a specific label is set on the PR (as we do for other features).
Updates: #11799