Conversation

@pchaigno pchaigno (Member) commented May 12, 2020

This pull request adds network policies (CIDR, labels, and ports) for the host, enforced through the new host endpoint. It also introduces a new nodeSelector field in our internal JSON policy.

As a summary:

  1. Introduce a new --enable-host-firewall option.
  2. Prepare BPF policy functions to accept a localID as an argument.
  3. Introduce host network policies on the Go side.
  4. Enforce IPv4 policies for the host on the BPF side.
  5. Enforce IPv6 policies for the host on the BPF side.
  6. Support egress LB.
  7. Bypass host policies on paths from/to the proxy.
  8. Reduce BPF verifier complexity using relax_verifier() to improve state pruning.
  9. To reduce complexity, disable egress LB on older kernels when host policies are enabled.

The last piece, watching and updating the node's labels, will come in a separate pull request. That piece also depends on the host endpoint PR, but is independent of the present one.

Fixes #9915
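For readers unfamiliar with the internal JSON policy format, the sketch below shows roughly what a host policy selected via nodeSelector could look like. The types are simplified stand-ins for illustration only, not the actual definitions from pkg/policy/api, and the labels and values are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-in types; the real rule structure in pkg/policy/api
// carries many more fields.
type Selector struct {
	MatchLabels map[string]string `json:"matchLabels,omitempty"`
}

type PortRule struct {
	Port     string `json:"port"`
	Protocol string `json:"protocol"`
}

type IngressRule struct {
	FromCIDR []string   `json:"fromCIDR,omitempty"`
	ToPorts  []PortRule `json:"toPorts,omitempty"`
}

type Rule struct {
	// A nodeSelector (instead of an endpointSelector) marks the rule as a
	// host policy, enforced by the new host endpoint.
	NodeSelector Selector      `json:"nodeSelector"`
	Ingress      []IngressRule `json:"ingress,omitempty"`
}

func main() {
	// Hypothetical example: allow TCP/6443 from 10.0.0.0/8 on worker nodes.
	r := Rule{
		NodeSelector: Selector{MatchLabels: map[string]string{"node-role": "worker"}},
		Ingress: []IngressRule{{
			FromCIDR: []string{"10.0.0.0/8"},
			ToPorts:  []PortRule{{Port: "6443", Protocol: "TCP"}},
		}},
	}
	out, _ := json.MarshalIndent(r, "", "  ")
	fmt.Println(string(out))
}
```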

@pchaigno pchaigno requested reviews from several teams, including as code owners, May 12, 2020 20:57
@pchaigno pchaigno added kind/feature This introduces new functionality. release-note/major This PR introduces major new functionality to Cilium. labels May 12, 2020
@coveralls commented May 12, 2020

Coverage Status

Coverage increased (+0.03%) to 37.035% when pulling 56f9713661a593579170d81addb5f232c731b878 on pr/pchaigno/host-policies into 13bcf96 on master.

@joestringer joestringer (Member) left a comment

I made a first pass; the feedback below is mostly high-level discussion points, though I couldn't pick many nits anyway as it overall looks really clean!

I didn't look closely at the last 3 patches (the IPv6 and complexity/LB patches). Several of my points below are appropriate for separate follow-up later.

@pchaigno pchaigno changed the base branch from master to pr/pchaigno/host-fw May 13, 2020 08:26
@pchaigno pchaigno added the dont-merge/blocked Another PR must be merged before this one. label May 13, 2020
@aanm aanm (Member) left a comment

Only small nits; overall LGTM. I only reviewed the control-plane sections.

@@ -70,6 +83,63 @@ func NewRule() *Rule {
return &Rule{}
}

// MarshalJSON returns the JSON encoding of Rule r. We need to overwrite it to
// enforce omitempty on the EndpointSelector nested structures.
func (r *Rule) MarshalJSON() ([]byte, error) {
Member

Discussed offline

Member Author

Added a note to rework this in follow-ups because it's less trivial than we thought. Using reflect, I might have a solution that doesn't require as many code changes and is easier to maintain, but it still needs some work.

@christarazi christarazi (Member) commented Jul 31, 2020

I just noticed that this PR introduced custom JSON marshaling for Rule. This has some implications for #11607. Specifically, one of the reasons I forked controller-tools is to remove a constraint which says: any type implementing custom JSON marshaling will have its validation schema replaced with type: Any. In the upstream, kubernetes-sigs/controller-tools#427 is responsible for this. In the fork, I've reverted that support.

My question is: why do we need this? I asked on K8s Slack about opting out of this feature in controller-tools. It's possible that we can keep it behind a knob in controller-tools, or solve the problem another way.

Member Author

We need this to properly implement the omitempty tags of NodeSelector and EndpointSelector. Because they are not pointers, without the custom marshaller, marshalling then unmarshalling would create a new field where none existed. That is, when marshalling, encoding/json would check r.EndpointSelector (a struct value, so never considered empty) instead of r.EndpointSelector.LabelSelector, and thus always emit the corresponding JSON entry.

You can easily reproduce this by removing the custom marshalling and running the unit tests on the package. One of the tests checks that json.Unmarshal(json.Marshal(rule)) round-trips back to the original rule.
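A minimal, self-contained sketch of the problem described above, using stand-in types rather than the actual Cilium definitions: encoding/json ignores omitempty on struct values, so without a custom MarshalJSON both selectors are always emitted, and a marshal/unmarshal round trip gains entries the original rule never had.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Stand-in types for illustration: EndpointSelector wraps a *LabelSelector
// and is embedded by value in Rule, mirroring the situation described above.
type LabelSelector struct {
	MatchLabels map[string]string `json:"matchLabels,omitempty"`
}

type EndpointSelector struct {
	*LabelSelector
}

type Rule struct {
	EndpointSelector EndpointSelector `json:"endpointSelector,omitempty"`
	NodeSelector     EndpointSelector `json:"nodeSelector,omitempty"`
}

// Without this method, encoding/json ignores omitempty on the struct-valued
// selectors and always emits "endpointSelector":{} and "nodeSelector":{}.
// Checking the nested *LabelSelector lets us drop unset selectors.
func (r Rule) MarshalJSON() ([]byte, error) {
	type alias struct {
		EndpointSelector *EndpointSelector `json:"endpointSelector,omitempty"`
		NodeSelector     *EndpointSelector `json:"nodeSelector,omitempty"`
	}
	var a alias
	if r.EndpointSelector.LabelSelector != nil {
		a.EndpointSelector = &r.EndpointSelector
	}
	if r.NodeSelector.LabelSelector != nil {
		a.NodeSelector = &r.NodeSelector
	}
	return json.Marshal(a)
}

func main() {
	r := Rule{NodeSelector: EndpointSelector{LabelSelector: &LabelSelector{}}}
	out, _ := json.Marshal(r)
	fmt.Println(string(out)) // {"nodeSelector":{}} — endpointSelector is omitted
}
```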

@pchaigno pchaigno force-pushed the pr/pchaigno/host-fw branch from bf8ca9f to 76c8466 Compare May 13, 2020 09:08
@pchaigno pchaigno force-pushed the pr/pchaigno/host-policies branch from cdcf227 to 5a9fd36 Compare May 13, 2020 09:12
@pchaigno pchaigno force-pushed the pr/pchaigno/host-fw branch from 76c8466 to b489494 Compare May 13, 2020 16:10
@pchaigno pchaigno force-pushed the pr/pchaigno/host-policies branch from 5a9fd36 to 980fee7 Compare May 13, 2020 16:10
@pchaigno pchaigno force-pushed the pr/pchaigno/host-fw branch 2 times, most recently from 11821ee to f8e8b7b Compare May 14, 2020 10:32
@pchaigno pchaigno force-pushed the pr/pchaigno/host-policies branch from 980fee7 to ee404f4 Compare May 14, 2020 10:33
@pchaigno pchaigno force-pushed the pr/pchaigno/host-fw branch from f8e8b7b to 1cc2f9c Compare May 14, 2020 11:05
@pchaigno pchaigno force-pushed the pr/pchaigno/host-policies branch 2 times, most recently from 455e5fe to 1e0cbec Compare May 14, 2020 14:17
pchaigno added a commit that referenced this pull request Jul 1, 2020
When the host firewall and vxlan are enabled, we need to send traffic from
pods to remote nodes through the tunnel to preserve the pods' security
IDs. Traffic from pods is automatically sent through the tunnel when the
tunnel_endpoint value in the ipcache is set. Thus, this commit ensures
that value is set to the node's IP for all remote nodes.

Before:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 0.0.0.0
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

After:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 192.168.33.11
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

I tested this change with the dev VMs, with vxlan and the host firewall
enabled and a host-level L4 policy loaded. Traffic from a pod on k8s1
was successfully sent through the tunnel to k8s2 and rejected by the host
policies on k8s2. Connections allowed by policies took the same path and
were successfully established.
Since the host firewall is enabled in all Jenkins CI jobs, passing tests
should also ensure this change does not break connectivity in other
scenarios.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
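To make the ipcache output above easier to parse, here is a rough Go model of what `cilium bpf ipcache get` prints (identity, encrypt key, tunnel endpoint). It is an illustrative sketch, not the actual BPF map value layout or Cilium's Go types.

```go
package main

import (
	"fmt"
	"net"
)

// Illustrative model of an ipcache entry as printed above: identity,
// encrypt key, tunnel endpoint (not the real map value layout).
type ipcacheEntry struct {
	Identity       uint32
	EncryptKey     uint8
	TunnelEndpoint net.IP
}

// Pod traffic only takes the vxlan tunnel, preserving the pod's security
// identity, when the tunnel endpoint is set to the remote node's IP.
func (e ipcacheEntry) usesTunnel() bool {
	return e.TunnelEndpoint != nil && !e.TunnelEndpoint.IsUnspecified()
}

func main() {
	before := ipcacheEntry{Identity: 6, TunnelEndpoint: net.ParseIP("0.0.0.0")}
	after := ipcacheEntry{Identity: 6, TunnelEndpoint: net.ParseIP("192.168.33.11")}
	fmt.Println(before.usesTunnel(), after.usesTunnel()) // false true
}
```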
pchaigno added a commit that referenced this pull request Jul 1, 2020
When traffic from a pod is destined to the local host, on egress from the
container, it is passed to the stack and doesn't go through the host
device (e.g., cilium_host). This results in a host firewall bypass on
ingress.

To fix this, we redirect traffic egressing pods to the host device when
the host firewall is enabled and the destination ID is that of the host.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
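The decision the datapath makes here is small; the hedged Go sketch below merely restates it (the actual check lives in the BPF egress program, not in Go). Identity 1 is Cilium's reserved host identity, as also visible in the ipcache output above.

```go
package main

import "fmt"

// Reserved identity for the local host in Cilium (also visible in the
// ipcache output above, where 192.168.33.12 maps to identity 1).
const hostIdentity = 1

// redirectToHostDevice restates the datapath decision from the commit
// message: with the host firewall enabled, pod traffic whose destination
// identity is the host must go via the host device (e.g. cilium_host)
// instead of being handed straight to the stack, so ingress host policies
// are not bypassed.
func redirectToHostDevice(hostFirewallEnabled bool, dstIdentity uint32) bool {
	return hostFirewallEnabled && dstIdentity == hostIdentity
}

func main() {
	fmt.Println(redirectToHostDevice(true, hostIdentity))  // true: redirect to cilium_host
	fmt.Println(redirectToHostDevice(false, hostIdentity)) // false: regular path to the stack
}
```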
pchaigno added a commit that referenced this pull request Jul 2, 2020
pchaigno added a commit that referenced this pull request Jul 2, 2020
pchaigno added a commit that referenced this pull request Jul 6, 2020
pchaigno added a commit that referenced this pull request Jul 6, 2020
pchaigno added a commit that referenced this pull request Jul 15, 2020
pchaigno added a commit that referenced this pull request Jul 15, 2020
pchaigno added a commit that referenced this pull request Jul 16, 2020
pchaigno added a commit that referenced this pull request Jul 16, 2020
joestringer pushed a commit that referenced this pull request Jul 16, 2020
joestringer pushed a commit that referenced this pull request Jul 16, 2020
joestringer pushed a commit that referenced this pull request Jul 16, 2020
[ upstream commit d20d905 ]

When the host firewall and vxlan are enabled, we need to send traffic from
pods to remote nodes through the tunnel to preserve the pods' security
IDs. If we don't and masquerading is enabled, those packets will be
SNATed and we will lose the source security ID.

Traffic from pods is automatically sent through the tunnel when the
tunnel_endpoint value in the ipcache is set. Thus, this commit ensures
that value is set to the node's IP for all remote nodes.

Before:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 0.0.0.0
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

After:

    $ sudo cilium bpf ipcache get 192.168.33.11
    192.168.33.11 maps to identity 6 0 192.168.33.11
    $ sudo cilium bpf ipcache get 192.168.33.12
    192.168.33.12 maps to identity 1 0 0.0.0.0

I tested this change with the dev VMs, with vxlan and the host firewall
enabled and a host-level L4 policy loaded. Traffic from a pod on k8s1
was successfully sent through the tunnel to k8s2 and rejected by the host
policies on k8s2. Connections allowed by policies took the same path and
were successfully established.
Since the host firewall is enabled in all Jenkins CI jobs, passing tests
should also ensure this change does not break connectivity in other
scenarios.

When kube-proxy is enabled, this change makes the host firewall
incompatible with externalTrafficPolicy=Local services and portmap
chaining. These incompatibilities will require additional fixes.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
joestringer pushed a commit that referenced this pull request Jul 16, 2020
[ upstream commit 576028d ]

When traffic from a pod is destined to the local host, on egress from the
container, it is passed to the stack and doesn't go through the host
device (e.g., cilium_host). This results in a host firewall bypass on
ingress.

To fix this, we redirect traffic egressing pods to the host device when
the host firewall is enabled and the destination ID is that of the host.

Fixes: #11507
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
@pchaigno pchaigno added the area/host-firewall Impacts the host firewall or the host endpoint. label Jul 20, 2020
pchaigno added a commit that referenced this pull request Jul 21, 2020
pchaigno added a commit that referenced this pull request Jul 21, 2020
rolinh pushed a commit that referenced this pull request Jul 21, 2020
rolinh pushed a commit that referenced this pull request Jul 21, 2020