High-Scale: Modifying the ns label may cause the apiserver to crash. #38030

@orange30

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.17.1 and lower than v1.18.0

What happened?

In a cluster of nearly 3,000 nodes, we modified the label of a namespace that contained about 20,000 pods and more than 5,000 CiliumIdentity objects. This instantly generated a large volume of CiliumIdentity and CiliumEndpoint traffic, which caused the apiserver to crash.

  1. In a cluster where Cilium is deployed, changing a namespace label instantly generates many new CiliumIdentity objects when the namespace contains many pods with different labels.
  2. After these CiliumIdentity events are pushed to the API server, they are distributed in full to every node.
  3. The CiliumIdentity changes in turn produce a large number of CiliumEndpoint update events.

Therefore, when a namespace contains many pods with different labels and the cluster has a large number of nodes, changing the namespace label can easily put significant pressure on the API server and, in extreme cases, may cause it to crash.
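To give a sense of scale, the three steps above can be put into a rough back-of-envelope model. This is only a sketch under simplifying assumptions that are not confirmed by the report: every affected identity is replaced exactly once, every node receives one watch event per new CiliumIdentity, and every pod's CiliumEndpoint is updated exactly once. The function name `estimate_fanout` is hypothetical.

```python
def estimate_fanout(nodes, identities, pods):
    """Rough event counts after one namespace label change (simplified model)."""
    # Assumption: each affected identity is replaced by one new CiliumIdentity.
    identity_writes = identities
    # Assumption: every node watches CiliumIdentity objects, so each new
    # object is delivered once per node.
    identity_watch_events = identity_writes * nodes
    # Assumption: each pod's CiliumEndpoint is updated once to point at
    # its new identity.
    endpoint_writes = pods
    return {
        "identity_writes": identity_writes,
        "identity_watch_events": identity_watch_events,
        "endpoint_writes": endpoint_writes,
    }


if __name__ == "__main__":
    # Figures taken from the report: ~3,000 nodes, 5,000+ identities, ~20,000 pods.
    print(estimate_fanout(nodes=3000, identities=5000, pods=20000))
```

With the figures from this report, the model yields on the order of 15 million identity watch deliveries for a single label change, which is consistent with the fan-out being the dominant cost rather than the writes themselves.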

How can we reproduce the issue?

  1. Use a cluster with as many nodes as possible.
  2. Place as many pods with distinct labels as possible in a single namespace.

Then modify the namespace label. The resulting traffic on the apiserver is proportional to the product of the two factors above.
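One possible way to set up the second factor is to generate pods that each carry a unique label value. The sketch below builds minimal Pod manifests and prints them as a JSON `List`; the namespace name `scale-test`, the `label-test-` name prefix, and the `app` label key are all hypothetical choices for illustration, not taken from the report.

```python
import json


def pod_manifests(namespace, count):
    """Build `count` minimal Pod manifests, each with a unique label value."""
    pods = []
    for i in range(count):
        pods.append({
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {
                "name": f"label-test-{i}",
                "namespace": namespace,
                # A distinct label value per pod, so each pod has its own label set.
                "labels": {"app": f"label-test-{i}"},
            },
            "spec": {
                "containers": [
                    {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
                ]
            },
        })
    return pods


def as_list(pods):
    """Wrap manifests in a v1 List so they can be applied in one call."""
    return {"apiVersion": "v1", "kind": "List", "items": pods}


if __name__ == "__main__":
    print(json.dumps(as_list(pod_manifests("scale-test", 3)), indent=2))
```

The output could be piped to `kubectl apply -f -`, after which `kubectl label namespace scale-test test=1 --overwrite` would change the namespace label. Raise `count` gradually and only on a disposable test cluster, since the point of the report is that this can overload the apiserver.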

Cilium Version

We use v1.13.11, but all versions have the same problem.

Kernel Version

5.10

Kubernetes Version

v1.30

Regression

No response

Sysdump

No response

Relevant log output

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees

Labels

  • area/agent: Cilium agent related.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  • sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests