policy: consistent enablement in agent and operator #36167
Could you move the detailed explanation for safe upgrade into the docs? The release notes in the PR description are expected to be formatted as a single-line notification that users can then click to learn more from the PR. Consider that the release note in the PR description ends up as just one bullet point in a list like this one: https://github.com/cilium/cilium/releases/tag/v1.17.0-pre.2

One reason I suggest this is that the tooling doesn't properly format multiline release notes, so it will likely render a bit broken unless the release manager catches and fixes it. The other reason is that release notes are generally a bit less accessible than a dedicated writeup in the main docs pages.

EDIT: Actually, on second thought, I'm not sure we want to be documenting this for users. There are too many footguns with this configuration and not enough tests; see also my comments here. If a developer follows a PR like this to figure out how to make this configuration work for them, then great. What I don't want is for mainstream Cilium users to see the release note, think this is some great optimization, enable it, break a bunch of common functionality, and then come to Slack asking why everything is broken. Maybe let's avoid documenting this for users via release note or docs, and just leave that content as regular text in the PR description?
Do these settings apply only to CRD-based mode? Can users enable network policy with etcd as the backend and still keep
A few nits.
Response to comment (#36167 (comment))
Yes, it would be safer to do so. I updated the PR. We will not advertise this specific configuration to all users; rather, those who come with an interest in scaling Cilium for other features, beyond the scale that network policy supports, can learn about it.
Response to comment (#36167 (comment))
Currently, it works only with CRD-based mode, because the path is simpler there: IPCache is populated via the CRDs that are disabled based on the existing configuration. If there is interest in extending this to the etcd backend, we need to ensure we have configuration that sets up the system in a similar way to get the same benefits (stop syncing IPCache entries for all pods to all nodes).
Thanks! Left a comment about operator options binding.
The network policy enforcement system for K8s, Cilium and Cilium Clusterwide network policies can be disabled when not used, to reduce the resource footprint of Cilium and improve scalability and performance.

Conditions for policy disabled:
- EnablePolicy=never
- DisableCiliumEndpointCRD=true
- EnableK8sNetworkPolicy=false
- EnableCiliumNetworkPolicy=false
- EnableCiliumClusterwideNetworkPolicy=false
- IdentityAllocationMode=crd

**Important note**

When enabling, or re-enabling, network policies (going from a disabled to an enabled configuration), avoid any traffic disruption by following one of these two-step processes:
- First enable the network policy configuration and update / restart the cilium-agent pods, before creating any network policies in the cluster.
- If network policies are already created in the cluster, avoid enabling all the configuration at once. First change DisableCiliumEndpointCRD to false and update / restart the cilium-agent pods, before applying the rest of the configuration. Do not change EnablePolicy, EnableK8sNetworkPolicy, EnableCiliumNetworkPolicy or EnableCiliumClusterwideNetworkPolicy before first changing DisableCiliumEndpointCRD to false and updating / restarting the cilium-agent pods.

This is especially important for larger clusters, which can take longer to update the cilium-agent daemonset and reconcile the system. The reason is that the network policy enforcement system in cilium-agent requires the Cilium Endpoint CRD to enforce policies for pods on remote nodes. Therefore, if network policies and the Cilium Endpoint CRD are enabled at the same time, some traffic between nodes can be disrupted (dropped), because the rolling update / restart of cilium-agent takes time, and a remote node's Cilium Endpoint might not yet have been created while the local node is already enforcing a policy that requires it.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
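The six conditions in the commit message act as a single conjunction: policy enforcement is fully disabled only when every one of them holds, and flipping any single option keeps enforcement on. A minimal Go sketch of that predicate is below; the `Config` struct and field names mirror the option names quoted above for illustration, and are not Cilium's actual option package types.

```go
package main

import "fmt"

// Config mirrors the agent/operator options named in the PR description.
// Illustrative only; not the real Cilium option types.
type Config struct {
	EnablePolicy                         string // "never", "default", "always"
	DisableCiliumEndpointCRD             bool
	EnableK8sNetworkPolicy               bool
	EnableCiliumNetworkPolicy            bool
	EnableCiliumClusterwideNetworkPolicy bool
	IdentityAllocationMode               string // "crd" or "kvstore"
}

// PolicyDisabled reports whether the network policy enforcement system
// is fully disabled: every listed condition must hold simultaneously.
func PolicyDisabled(c Config) bool {
	return c.EnablePolicy == "never" &&
		c.DisableCiliumEndpointCRD &&
		!c.EnableK8sNetworkPolicy &&
		!c.EnableCiliumNetworkPolicy &&
		!c.EnableCiliumClusterwideNetworkPolicy &&
		c.IdentityAllocationMode == "crd"
}

func main() {
	off := Config{
		EnablePolicy:             "never",
		DisableCiliumEndpointCRD: true,
		IdentityAllocationMode:   "crd",
	}
	fmt.Println(PolicyDisabled(off)) // true: all six conditions hold

	// Flipping any single condition keeps policy enforcement enabled.
	off.EnableCiliumNetworkPolicy = true
	fmt.Println(PolicyDisabled(off)) // false
}
```

This all-or-nothing shape is also why the two-step re-enable procedure matters: changing the options piecemeal (in particular, re-enabling policies before `DisableCiliumEndpointCRD` is set back to false cluster-wide) leaves the agents in a mixed state during the rollout.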
/test
🚀
Code looks OK to me. I have no idea whether this is a direction the project wants to take, but I'll happily defer to a maintainer.
Just to chime in a bit here from one perspective: there is a developing set of use cases from some heavy users of Cilium who may choose to implement network policy in ways external to Cilium, so we're exploring how to enable those deployments to be more resource efficient.

Right now I'm relying on folks like @dlapcevic and @tamilmani1989 to explore these options and help maintain Cilium in this configuration, and I consider it to be a power-user configuration that we avoid explicitly documenting. I don't personally have a good picture of exactly how we will ensure that the core of Cilium continues to be maintainable in this mode, but if we don't try then we won't build that knowledge and expertise.

However, if folks are using this mode, I expect them to be heavily involved in its development and to actively improve Cilium with these configuration parameters. This is a bit of a cautious approach, but it seems like a reasonable middle ground for now.
Ref: #33360