Conversation

@pchaigno pchaigno (Member) commented May 12, 2020

This pull request adds network policies (CIDR, labels, and ports) for the host, enforced through the new host endpoint. It also introduces a new nodeSelector field in our internal JSON policy.

As a summary:

  1. Introduce a new --enable-host-firewall option.
  2. Prepare BPF policy functions to accept a localID as an argument.
  3. Introduce host network policies on the Go side.
  4. Enforce IPv4 policies for the host on the BPF side.
  5. Enforce IPv6 policies for the host on the BPF side.
  6. Support egress LB.
  7. Bypass host policies on paths from/to the proxy.
  8. Reduce BPF verifier complexity using relax_verifier() to improve state pruning.
  9. To reduce complexity, disable egress LB on older kernels when host policies are enabled.

The last piece, watching and updating the node's labels, will come in a separate pull request. That piece also depends on the host endpoint PR, but is independent of the present one.

Fixes #9915
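For readers unfamiliar with the internal JSON policy format, the sketch below shows roughly what a host policy selected via nodeSelector could look like. The types are simplified stand-ins for illustration only, not the actual definitions from pkg/policy/api, and the labels and values are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-in types; the real rule structure in pkg/policy/api
// carries many more fields.
type Selector struct {
	MatchLabels map[string]string `json:"matchLabels,omitempty"`
}

type PortRule struct {
	Port     string `json:"port"`
	Protocol string `json:"protocol"`
}

type IngressRule struct {
	FromCIDR []string   `json:"fromCIDR,omitempty"`
	ToPorts  []PortRule `json:"toPorts,omitempty"`
}

type Rule struct {
	// A nodeSelector (instead of an endpointSelector) marks the rule as a
	// host policy, enforced by the new host endpoint.
	NodeSelector Selector      `json:"nodeSelector"`
	Ingress      []IngressRule `json:"ingress,omitempty"`
}

func main() {
	// Hypothetical example: allow TCP/6443 from 10.0.0.0/8 on worker nodes.
	r := Rule{
		NodeSelector: Selector{MatchLabels: map[string]string{"node-role": "worker"}},
		Ingress: []IngressRule{{
			FromCIDR: []string{"10.0.0.0/8"},
			ToPorts:  []PortRule{{Port: "6443", Protocol: "TCP"}},
		}},
	}
	out, _ := json.MarshalIndent(r, "", "  ")
	fmt.Println(string(out))
}
```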

@pchaigno pchaigno requested reviews from several teams, including as code owners, May 12, 2020 20:57
@pchaigno pchaigno added kind/feature This introduces new functionality. release-note/major This PR introduces major new functionality to Cilium. labels May 12, 2020
@coveralls commented May 12, 2020

Coverage Status

Coverage increased (+0.03%) to 37.035% when pulling 56f9713661a593579170d81addb5f232c731b878 on pr/pchaigno/host-policies into 13bcf96 on master.

@joestringer joestringer (Member) left a comment

I made a first pass; the feedback below is mostly high-level discussion points, though I couldn't pick many nits anyway as it overall looks really clean!

I didn't look closely at the last 3 patches (the IPv6 and complexity/LB patches). Several of my points below are appropriate for separate follow-up later.

@pchaigno pchaigno changed the base branch from master to pr/pchaigno/host-fw May 13, 2020 08:26
@pchaigno pchaigno added the dont-merge/blocked Another PR must be merged before this one. label May 13, 2020
@aanm aanm (Member) left a comment

Only small nits; overall LGTM. I only reviewed the control-plane sections.

@@ -70,6 +83,63 @@ func NewRule() *Rule {
return &Rule{}
}

// MarshalJSON returns the JSON encoding of Rule r. We need to overwrite it to
// enforce omitempty on the EndpointSelector nested structures.
func (r *Rule) MarshalJSON() ([]byte, error) {
Member

Discussed offline

Member Author

Added a note to rework this in follow-ups because it's less trivial than we thought. Using reflect, I might have a solution that doesn't require as many code changes and is easier to maintain, but it still needs some work.

@christarazi christarazi (Member) commented Jul 31, 2020

I just noticed that this PR introduced custom JSON marshaling for Rule. This has some implications for #11607. Specifically, one of the reasons I forked controller-tools is to remove a constraint which says: any type implementing custom JSON marshaling will have its validation schema replaced with type: Any. In the upstream, kubernetes-sigs/controller-tools#427 is responsible for this. In the fork, I've reverted that support.

My question is: why do we need this? I asked on K8s Slack about opting out of this feature in controller-tools. It's possible that we can keep it behind a knob in controller-tools, or solve the problem another way.

Member Author

We need this to properly implement the omitempty tags of NodeSelector and EndpointSelector. Because they are not pointers, without the custom marshaller, marshalling then unmarshalling would create a new field where none existed. That is, when marshalling, encoding/json would check r.EndpointSelector (a struct value, so never considered empty) instead of r.EndpointSelector.LabelSelector, and thus always emit the corresponding JSON entry.

You can easily reproduce this by removing the custom marshalling and running the unit tests on the package. One of the tests checks that json.Unmarshal(json.Marshal(rule)) round-trips back to the original rule.
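A minimal, self-contained sketch of the problem described above, using stand-in types rather than the actual Cilium definitions: encoding/json ignores omitempty on struct values, so without a custom MarshalJSON both selectors are always emitted, and a marshal/unmarshal round trip gains entries the original rule never had.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Stand-in types for illustration: EndpointSelector wraps a *LabelSelector
// and is embedded by value in Rule, mirroring the situation described above.
type LabelSelector struct {
	MatchLabels map[string]string `json:"matchLabels,omitempty"`
}

type EndpointSelector struct {
	*LabelSelector
}

type Rule struct {
	EndpointSelector EndpointSelector `json:"endpointSelector,omitempty"`
	NodeSelector     EndpointSelector `json:"nodeSelector,omitempty"`
}

// Without this method, encoding/json ignores omitempty on the struct-valued
// selectors and always emits "endpointSelector":{} and "nodeSelector":{}.
// Checking the nested *LabelSelector lets us drop unset selectors.
func (r Rule) MarshalJSON() ([]byte, error) {
	type alias struct {
		EndpointSelector *EndpointSelector `json:"endpointSelector,omitempty"`
		NodeSelector     *EndpointSelector `json:"nodeSelector,omitempty"`
	}
	var a alias
	if r.EndpointSelector.LabelSelector != nil {
		a.EndpointSelector = &r.EndpointSelector
	}
	if r.NodeSelector.LabelSelector != nil {
		a.NodeSelector = &r.NodeSelector
	}
	return json.Marshal(a)
}

func main() {
	r := Rule{NodeSelector: EndpointSelector{LabelSelector: &LabelSelector{}}}
	out, _ := json.Marshal(r)
	fmt.Println(string(out)) // {"nodeSelector":{}} — endpointSelector is omitted
}
```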

@pchaigno pchaigno force-pushed the pr/pchaigno/host-fw branch from bf8ca9f to 76c8466 Compare May 13, 2020 09:08
@pchaigno pchaigno force-pushed the pr/pchaigno/host-policies branch from cdcf227 to 5a9fd36 Compare May 13, 2020 09:12
@pchaigno pchaigno force-pushed the pr/pchaigno/host-fw branch from 76c8466 to b489494 Compare May 13, 2020 16:10
@pchaigno pchaigno force-pushed the pr/pchaigno/host-policies branch from 5a9fd36 to 980fee7 Compare May 13, 2020 16:10
@pchaigno pchaigno force-pushed the pr/pchaigno/host-fw branch 2 times, most recently from 11821ee to f8e8b7b Compare May 14, 2020 10:32
@pchaigno pchaigno force-pushed the pr/pchaigno/host-policies branch from 980fee7 to ee404f4 Compare May 14, 2020 10:33
@pchaigno pchaigno force-pushed the pr/pchaigno/host-fw branch from f8e8b7b to 1cc2f9c Compare May 14, 2020 11:05
@pchaigno pchaigno force-pushed the pr/pchaigno/host-policies branch 2 times, most recently from 455e5fe to 1e0cbec Compare May 14, 2020 14:17
pchaigno added a commit that referenced this pull request Jul 1, 2020
When the host firewall and vxlan are enabled, we need to send traffic from
pods to remote nodes through the tunnel to preserve the pods' security
IDs. Traffic from pods is automatically sent through the tunnel when the
tunnel_endpoint value in the ipcache is set. Thus, this commit ensures
that value is set to the node's IP for all remote nodes.

Before:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 0.0.0.0
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

After:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 192.168.33.11
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

I tested this change with the dev VMs, with vxlan and the host firewall
enabled and a host-level L4 policy loaded. Traffic from a pod on k8s1
was successfully sent through the tunnel to k8s2 and rejected by the host
policies on k8s2. Connections allowed by policies took the same path and
were successfully established.
Since the host firewall is enabled in all Jenkins CI jobs, passing tests
should also ensure this change does not break connectivity in other
scenarios.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
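To make the ipcache output above easier to parse, here is a rough Go model of what `cilium bpf ipcache get` prints (identity, encrypt key, tunnel endpoint). It is an illustrative sketch, not the actual BPF map value layout or Cilium's Go types.

```go
package main

import (
	"fmt"
	"net"
)

// Illustrative model of an ipcache entry as printed above: identity,
// encrypt key, tunnel endpoint (not the real map value layout).
type ipcacheEntry struct {
	Identity       uint32
	EncryptKey     uint8
	TunnelEndpoint net.IP
}

// Pod traffic only takes the vxlan tunnel, preserving the pod's security
// identity, when the tunnel endpoint is set to the remote node's IP.
func (e ipcacheEntry) usesTunnel() bool {
	return e.TunnelEndpoint != nil && !e.TunnelEndpoint.IsUnspecified()
}

func main() {
	before := ipcacheEntry{Identity: 6, TunnelEndpoint: net.ParseIP("0.0.0.0")}
	after := ipcacheEntry{Identity: 6, TunnelEndpoint: net.ParseIP("192.168.33.11")}
	fmt.Println(before.usesTunnel(), after.usesTunnel()) // false true
}
```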
pchaigno added a commit that referenced this pull request Jul 1, 2020
When traffic from a pod is destined to the local host, on egress from the
container, it is passed to the stack and doesn't go through the host
device (e.g., cilium_host). This results in a host firewall bypass on
ingress.

To fix this, we redirect traffic egressing pods to the host device when
the host firewall is enabled and the destination ID is that of the host.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
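The decision the datapath makes here is small; the hedged Go sketch below merely restates it (the actual check lives in the BPF egress program, not in Go). Identity 1 is Cilium's reserved host identity, as also visible in the ipcache output above.

```go
package main

import "fmt"

// Reserved identity for the local host in Cilium (also visible in the
// ipcache output above, where 192.168.33.12 maps to identity 1).
const hostIdentity = 1

// redirectToHostDevice restates the datapath decision from the commit
// message: with the host firewall enabled, pod traffic whose destination
// identity is the host must go via the host device (e.g. cilium_host)
// instead of being handed straight to the stack, so ingress host policies
// are not bypassed.
func redirectToHostDevice(hostFirewallEnabled bool, dstIdentity uint32) bool {
	return hostFirewallEnabled && dstIdentity == hostIdentity
}

func main() {
	fmt.Println(redirectToHostDevice(true, hostIdentity))  // true: redirect to cilium_host
	fmt.Println(redirectToHostDevice(false, hostIdentity)) // false: regular path to the stack
}
```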
pchaigno added a commit that referenced this pull request Jul 2, 2020
pchaigno added a commit that referenced this pull request Jul 2, 2020
pchaigno added a commit that referenced this pull request Jul 6, 2020
pchaigno added a commit that referenced this pull request Jul 6, 2020
pchaigno added a commit that referenced this pull request Jul 15, 2020
pchaigno added a commit that referenced this pull request Jul 15, 2020
pchaigno added a commit that referenced this pull request Jul 16, 2020
pchaigno added a commit that referenced this pull request Jul 16, 2020
joestringer pushed a commit that referenced this pull request Jul 16, 2020
joestringer pushed a commit that referenced this pull request Jul 16, 2020
joestringer pushed a commit that referenced this pull request Jul 16, 2020
[ upstream commit d20d905 ]

When the host firewall and vxlan are enabled, we need to send traffic from
pods to remote nodes through the tunnel to preserve the pods' security
IDs. If we don't and masquerading is enabled, those packets will be
SNATed and we will lose the source security ID.

Traffic from pods is automatically sent through the tunnel when the
tunnel_endpoint value in the ipcache is set. Thus, this commit ensures
that value is set to the node's IP for all remote nodes.

Before:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 0.0.0.0
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

After:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 192.168.33.11
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

I tested this change with the dev VMs, with vxlan and the host firewall
enabled and a host-level L4 policy loaded. Traffic from a pod on k8s1
was successfully sent through the tunnel to k8s2 and rejected by the host
policies on k8s2. Connections allowed by policies took the same path and
were successfully established.
Since the host firewall is enabled in all Jenkins CI jobs, passing tests
should also ensure this change does not break connectivity in other
scenarios.

When kube-proxy is enabled, this change makes the host firewall
incompatible with externalTrafficPolicy=Local services and portmap
chaining. These incompatibilities will require additional fixes.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
joestringer pushed a commit that referenced this pull request Jul 16, 2020
[ upstream commit 576028d ]

When traffic from a pod is destined to the local host, on egress from the
container, it is passed to the stack and doesn't go through the host
device (e.g., cilium_host). This results in a host firewall bypass on
ingress.

To fix this, we redirect traffic egressing pods to the host device when
the host firewall is enabled and the destination ID is that of the host.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
@pchaigno pchaigno added the area/host-firewall Impacts the host firewall or the host endpoint. label Jul 20, 2020
pchaigno added a commit that referenced this pull request Jul 21, 2020
pchaigno added a commit that referenced this pull request Jul 21, 2020
rolinh pushed a commit that referenced this pull request Jul 21, 2020
rolinh pushed a commit that referenced this pull request Jul 21, 2020