Host firewall fixes #12345
Conversation
Aside from a few minor nits, the PR looks good to me. Thanks for this work!
CI bits lgtm
Ah, right. So if the backend is remote, I can just skip the ingress host policy enforcement. Is it correct to assume that if |

Yes: |
The host firewall functions will be called from the nodeport code in the next commit. A few minor changes were also made to the moved code to please checkpatch.

Signed-off-by: Paul Chaignon <paul@cilium.io>
In handle_ipv{4,6}, we implement several operations:

1. Handle NodePort services
2. (Handle ICMPv6 messages)
3. Swap MAC addresses
4. Enforce host policies

However, NodePort services do not process ICMPv6 messages, so we can just process those first. In addition, it is useless to swap MAC addresses if the packet is going to be dropped by host policies next. Thus, this commit reorders these operations as follows:

1. (Handle ICMPv6 messages)
2. Handle NodePort services
3. Enforce host policies
4. Swap MAC addresses

Signed-off-by: Paul Chaignon <paul@cilium.io>
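The reordering above can be modeled with a toy Python pipeline. This is not Cilium's actual BPF C datapath code; the function names and the packet dict are hypothetical stand-ins, used only to show that with the new order no MAC swap is wasted on a packet that host policies will drop:

```python
DROP, CONTINUE = "drop", "continue"

def handle_icmpv6(pkt):
    # Runs first: the NodePort code never needs to see ICMPv6 messages.
    return CONTINUE

def handle_nodeport(pkt):
    return CONTINUE

def enforce_host_policies(pkt):
    return CONTINUE if pkt.get("allowed", True) else DROP

def swap_mac_addresses(pkt):
    # Runs last, so we never swap MACs on a packet that gets dropped.
    pkt["src_mac"], pkt["dst_mac"] = pkt["dst_mac"], pkt["src_mac"]
    return CONTINUE

def handle_ip(pkt):
    # New order: ICMPv6, NodePort, host policies, MAC swap.
    for step in (handle_icmpv6, handle_nodeport,
                 enforce_host_policies, swap_mac_addresses):
        if step(pkt) == DROP:
            return DROP
    return CONTINUE
```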
We want the host firewall to see ingress-to-host packets (1) after the DNAT but before the SNAT on the way to the remote backend, and (2) after the reverse SNAT but before the reverse DNAT for replies from the remote backend. This way, we always see packets with the backend pod's IP and the client's IP.

Packets on their way to a remote backend will be redirected by nodeport_lb{4,6} to the egress of the native device (towards the remote node). Egress host policies will be bypassed for these packets, as they should be, because they won't have the mark indicating they come from the host namespace. See the subsequent commit in this series, where we introduce the check on the mark in to-netdev.

See #12011 for details and discussion on this change.

Fixes: 88bf291 ("bpf: Enforce host policies for IPv4")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Related: #12011 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
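To make the hook placement concrete, here is a minimal Python model of the forward direction (not Cilium's actual BPF code; NODE_IP and the dict fields are invented for illustration). It shows why running the firewall between DNAT and SNAT lets policies match on the client's IP and the backend pod's IP:

```python
NODE_IP = "10.0.0.1"  # hypothetical node address used for SNAT

def dnat(pkt):
    pkt["dst"] = pkt["backend"]            # service VIP -> backend pod IP

def snat(pkt):
    pkt["src"] = NODE_IP                   # masquerade towards the remote node

def host_firewall(pkt, allowed, seen):
    seen.append((pkt["src"], pkt["dst"]))  # addresses the policy matches on
    return "forward" if (pkt["src"], pkt["dst"]) in allowed else "drop"

def ingress_to_remote_backend(pkt, allowed, seen):
    dnat(pkt)                              # (1) after the DNAT...
    if host_firewall(pkt, allowed, seen) == "drop":
        return "drop"
    snat(pkt)                              # ...but before the SNAT
    return "forward"
```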
When the host firewall and vxlan are enabled, we need to send traffic from pods to remote nodes through the tunnel to preserve the pods' security IDs. If we don't and masquerading is enabled, those packets will be SNATed and we will lose the source security ID. Traffic from pods is automatically sent through the tunnel when the tunnel_endpoint value in the ipcache is set. Thus, this commit ensures that value is set to the node's IP for all remote nodes.

Before:

  $ sudo cilium bpf ipcache get 192.168.33.11
  192.168.33.11 maps to identity 6 0 0.0.0.0
  $ sudo cilium bpf ipcache get 192.168.33.12
  192.168.33.12 maps to identity 1 0 0.0.0.0

After:

  $ sudo cilium bpf ipcache get 192.168.33.11
  192.168.33.11 maps to identity 6 0 192.168.33.11
  $ sudo cilium bpf ipcache get 192.168.33.12
  192.168.33.12 maps to identity 1 0 0.0.0.0

I tested this change with the dev. VMs, vxlan and the host firewall enabled, and a host-level L4 policy loaded. Traffic from a pod on k8s1 was successfully sent through the tunnel to k8s2 and rejected by host policies at k8s2. Connections allowed by policies took the same path and were successfully established. Since the host firewall is enabled in all Jenkins CIs, passing tests should also ensure this change does not break connectivity in other scenarios.

When kube-proxy is enabled, this change makes the host firewall incompatible with externalTrafficPolicy=Local services and portmap chaining. These incompatibilities will require additional fixes.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
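The forwarding decision this commit relies on can be sketched in Python (illustrative names, not Cilium's actual Go/C identifiers): with tunnel_endpoint set in the ipcache entry, pod traffic is encapsulated and keeps its source security ID; with it unset, the packet takes the native path and the ID is lost to SNAT.

```python
def forward_pod_packet(ipcache_entry, src_identity):
    # Sketch only: models the behavior described in the commit message.
    if ipcache_entry["tunnel_endpoint"] != "0.0.0.0":
        # vxlan path: the security ID travels in the encapsulation header
        return ("tunnel", src_identity)
    # native path with masquerading: the source ID is lost after SNAT
    return ("native", None)
```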
When traffic from a pod is destined to the local host, on egress from the container, it is passed to the stack and doesn't go through the host device (e.g., cilium_host). This results in a host firewall bypass on ingress. To fix this, we redirect traffic egressing pods to the host device when the host firewall is enabled and the destination ID is that of the host. Fixes: #11507 Signed-off-by: Paul Chaignon <paul@cilium.io>
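The redirect decision can be sketched as follows (a Python toy model with made-up names; HOST_ID here stands in for the reserved host identity, and the cilium_host device name follows the example in the commit message):

```python
HOST_ID = 1  # stand-in for the reserved host identity in this sketch

def pod_egress_target(dst_identity, host_firewall_enabled):
    # Without the fix, host-bound traffic went straight to the stack and
    # bypassed the ingress host firewall. With it, such traffic is
    # redirected to the host device (e.g., cilium_host) first.
    if host_firewall_enabled and dst_identity == HOST_ID:
        return "redirect:cilium_host"
    return "stack"
```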
The copy of the source IPv6 address into the tuple used for conntrack lookup was dropped during a previous rebase. This bug can cause connections to fail when egress host policies are enforced in IPv6. Signed-off-by: Paul Chaignon <paul@cilium.io>
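To illustrate the class of bug (a Python sketch with invented field names, not the actual C struct code): the conntrack lookup tuple must carry the packet's source address, or lookups for established connections miss and traffic is dropped under egress policies.

```python
def build_ct_tuple(pkt):
    # Every field of the tuple must be populated for the lookup to hit.
    return {
        "saddr": pkt["saddr"],   # this copy is what the rebase had dropped
        "daddr": pkt["daddr"],
        "sport": pkt["sport"],
        "dport": pkt["dport"],
    }
```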
For ICMPv6 messages we allow, we need to bypass the host firewall. We cannot simply pass these messages up the stack because they are not destined to the host. I rebased 5244b68 not long before it was merged and likely introduced the bug at that point. I validated the rebase with an ingress host policy instead of an egress, and missed the regression. This time I validated with an extensive ingress+egress L3L4 host policy included at the end of this series. Fixes: 5244b68 ("bpf: Hande icmpv6 in host firewall") Signed-off-by: Paul Chaignon <paul@cilium.io>
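The bypass can be sketched as follows (hypothetical Python; the ICMPv6 type numbers are the standard NDP neighbor solicitation/advertisement types, used here only as examples of messages the datapath allows):

```python
ALLOWED_ICMPV6_TYPES = {135, 136}  # neighbor solicitation / advertisement

def ingress_verdict(pkt, policy_verdict):
    # Allowed ICMPv6 messages bypass host policy enforcement entirely;
    # they are not destined to the host, so simply passing them up the
    # stack is not an option.
    if pkt.get("icmpv6_type") in ALLOWED_ICMPV6_TYPES:
        return "allow"
    return policy_verdict
```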
Packets egressing on the native device (i.e., to-netdev in bpf_host) may be SNATed and therefore have the source IP address of the host device. We need to take that into account when resolving the srcID to avoid matching host policies on packets that egress from pods. To that end, we rely on the same MARK_MAGIC_HOST mark (marking packets as they egress the host namespace) as for packets egressing the host namespace on the host device. Fixes: 88bf291 ("bpf: Enforce host policies for IPv4") Signed-off-by: Paul Chaignon <paul@cilium.io>
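Conceptually, the source-identity resolution looks like this (Python sketch; the mask and magic values are placeholders, not Cilium's actual constants):

```python
MARK_MAGIC_MASK = 0xF00   # placeholder values for the sketch
MARK_MAGIC_HOST = 0xC00
HOST_ID = 1

def resolve_src_identity(pkt, ip_lookup_identity):
    # The source IP alone is ambiguous on to-netdev: SNATed pod packets
    # also carry the host's IP. Trust the mark instead, which is only
    # set on packets as they leave the host namespace.
    if pkt["mark"] & MARK_MAGIC_MASK == MARK_MAGIC_HOST:
        return HOST_ID
    return ip_lookup_identity
```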
This policy can be used to test the host firewall. It has been tested on the development VMs with kube-proxy-replacement=strict, enable-remote-node-identity, and the devices configured. It attempts to define the most fine-grained possible policies (L3+L4) on both ingress and egress such that the connectivity-check works. The L3-only rules are there to ensure ICMP echo messages between nodes and with the health endpoint are not dropped. The connectivity-check still works if these rules are removed. This policy shouldn't be used outside of the dev. VM context as (1) it is likely more extensive than necessary and (2) some adjustments would be necessary to prevent breakages. Signed-off-by: Paul Chaignon <paul@cilium.io>
test-me-please

BPF checks and GKE are the only failures; the BPF checks failure is a false positive, and GKE is not required. Merging.
[ upstream commit 9654cb4 ] The host firewall currently only works if remote node identities are distinct from the host identity (i.e., REMOTE_NODE_ID is supported). It's not clear if this restriction can be removed in the future, especially once #12345 is merged. In particular, #12345 will redirect packets from local endpoints through the host device if the destination security ID is HOST_ID. If the HOST_ID includes remote nodes, then this redirection is likely to result in packet drops. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>
Commit 7e953b9 (".github: Cancel outdated comment workflows") introduced concurrency groups for workflows triggered by comments. In each concurrency group, a single workflow can be running at any time, with previous workflows cancelled when more recent ones are scheduled.

However, in the context of comment-triggered workflows, a workflow is triggered for every single comment in the pull request. The actual tests, on the other hand, are only triggered for specific comments. But even if those comments don't contain a phrase that triggers the tests (e.g., test-me-please or ci-gke), they will cancel previously-running workflows.

To fix this, we need to ensure that the concurrency group for comments that trigger tests does not include any comments which don't trigger tests. We can achieve that by appending the actual comment text to the concurrency group name. So for example, a comment with "test-me-please" on PR 12345 will trigger a workflow which belongs to concurrency group:

  ConformanceEKS (ci-eks) cilium#12345 test-me-please

If GKE tests are then triggered with ci-gke, the new workflow will belong to a second concurrency group and won't cancel the first:

  ConformanceEKS (ci-eks) cilium#12345 ci-gke

That is probably okay since it will preserve most of the benefits of concurrency groups without cancelling everything as soon as someone posts a comment.

Fixes: 7e953b9 (".github: Cancel outdated comment workflows")
Signed-off-by: Paul Chaignon <paul@cilium.io>
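The naming scheme amounts to string concatenation; a minimal Python sketch (the helper name is invented, the format mirrors the examples in the commit message):

```python
def concurrency_group(workflow, pr_ref, comment_body):
    # Appending the comment text keeps non-triggering comments in their
    # own groups, so they can't cancel a running test workflow.
    return f"{workflow} {pr_ref} {comment_body.strip()}"
```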
This pull request fixes various host firewall bugs. Please see the release notes and review commit by commit.
How I Tested This
I'm still tracking some bugs in the host firewall, so it seems important to highlight what I tested and what is known to work.
I tested the host firewall with the ingress+egress L3+L4 example policy included at the end of this series. It attempts to whitelist connections in the most fine-grained way possible. Specifically, I launched the connectivity-test and loaded that policy in the net-next development VMs, with options kube-proxy-replacement=strict, enable-remote-node-identity, and enable-host-firewall, and the devices set. I then checked that the connectivity check was okay for 20min+ and verified all packet drops were expected.
Impact on CI
The host firewall is enabled in our Packet K8s CI builds. The main reason to enable it was to be able to detect any regression due to BPF complexity issues (it adds a lot of code to the datapath). That was fine because having the host firewall enabled with allow-all default shouldn't impact anything in Cilium (apart from dropping packets if there's a bug).
However, with the present fixes, enabling the host firewall now changes the path of some packets to force them through the firewall's ingress. That means we are really not testing the default setup (i.e., host firewall disabled) the users are likely to have.
I will therefore send a follow-up PR to enable the host firewall in CI only when a specific label is set on the PR (as we do for other features).
Updates: #11799