Description
Is your proposed feature related to a problem?
When running Cilium in CRD mode in large clusters, cilium-agent can sometimes overload the apiserver. This can create a vicious cycle: cilium-agent LIST requests overload the apiserver, the LIST requests then fail, cilium-agent resends them, and the loop repeats, so the cluster never recovers.
Describe the feature you'd like
Update the Cilium Helm chart to configure Kubernetes client-go exponential backoff in cilium-agent by default.
(Optional) Describe your proposed solution
In AKS, we have been recommending that customers running Cilium at large scale in CRD mode set the following environment variables on the cilium-agent DaemonSet:
- name: KUBE_CLIENT_BACKOFF_BASE
  value: "1"
- name: KUBE_CLIENT_BACKOFF_DURATION
  value: "120"
We have used this technique to successfully mitigate many production incidents, and we enable it by default in AKS-managed Cilium.
However, most Cilium users don't know that this option exists, and they need to go out of their way to configure it.
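Until the chart sets this by default, here is a rough sketch of how a user could opt in through Helm values. It assumes the chart's `extraEnv` value is passed to the cilium-agent container; that assumption has not been verified against every chart version. As far as I understand client-go, both values are read as whole seconds, with the first as the initial backoff and the second as the cap.

```yaml
# values.yaml (sketch; assumes `extraEnv` applies to the cilium-agent container)
extraEnv:
  # Assumed to be the initial client-go backoff, in seconds
  - name: KUBE_CLIENT_BACKOFF_BASE
    value: "1"
  # Assumed to be the maximum client-go backoff, in seconds
  - name: KUBE_CLIENT_BACKOFF_DURATION
    value: "120"
```

The numbers are the same ones we recommend in AKS. Other clusters may need different values depending on apiserver capacity.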