Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.16.0 and lower than v1.17.0
What happened?
We have discovered that when churning (creating and deleting) identities by starting and stopping pods, some identities are never properly cleaned up by the identity GC. We have tracked this down to a race between the syncLocalKeys function, which runs in the background on an interval, and the release of a node's usage of an identity.
We correlate this with the "Re-created missing slave key" log line, and the keys mentioned in these log lines are the ones that leak.
We only see this happen once in a while, but with long-lived hosts it can add up to a lot of leaked identities in large clusters.
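For illustration only, here is a minimal Go sketch of the suspected interleaving. The names (store, local, snapshot) and the flow are ours, not Cilium's actual code; it only shows how a periodic resync acting on a stale view of the local keys can re-create a slave key that the release path has already deleted.

package main

import "fmt"

// Hypothetical sketch of the race between a periodic resync (syncLocalKeys-style)
// and the identity release path. Not Cilium's real data structures.
func main() {
	// The slave key exists both in the kvstore and in the node's local set.
	store := map[string]bool{"slaveKey": true} // kvstore keys, kept alive by the node's lease
	local := map[string]bool{"slaveKey": true} // keys this node believes it still owns

	// 1. The periodic resync snapshots the local keys it intends to ensure exist.
	snapshot := []string{}
	for k := range local {
		snapshot = append(snapshot, k)
	}

	// 2. Concurrently, the identity is released: the last local user is gone,
	//    so the slave key is deleted from the kvstore and from the local set.
	delete(store, "slaveKey")
	delete(local, "slaveKey")

	// 3. The resync now acts on its stale snapshot: the key is "missing" from
	//    the kvstore, so it is re-created ("Re-created missing slave key") and
	//    kept alive by the node's lease, with nothing left to ever delete it.
	for _, k := range snapshot {
		if !store[k] {
			store[k] = true
		}
	}

	fmt.Println("leaked slave key still present:", store["slaveKey"]) // true
}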
How can we reproduce the issue?
Install Cilium with kvstore/etcd mode, e.g. in kind. Set kvstore-periodic-sync to a low value, e.g. 10ms, to make the issue easier to reproduce.
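For example, something along these lines could work (a sketch only; the exact chart values depend on your etcd setup and chart version, and extraArgs is only used here to pass the agent flag):

# values.yaml (sketch; point the endpoints at the etcd used as kvstore)
etcd:
  enabled: true
  endpoints:
    - http://<etcd-address>:2379
extraArgs:
  - --kvstore-periodic-sync=10ms

$ helm install cilium cilium/cilium -n kube-system -f values.yaml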
Then create and delete pods with unique labels in a loop, e.g.:
counter=0
while true; do
  kubectl run --rm -it -n default --image alpine testing-${counter} -- sleep 1
  counter=$((counter+1))
  sleep .1
done
Then check the logs of the agent(s) the pods run on:
$ kubectl logs -n kube-system -l k8s-app=cilium -f | grep "slave"
time="2024-10-21T07:35:49.925108545Z" level=warning msg="Re-created missing slave key" key="cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-0;/172.18.0.2" subsys=kvstorebackend
time="2024-10-21T07:35:51.674576796Z" level=warning msg="Re-created missing slave key" key="cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-1;/172.18.0.2" subsys=kvstorebackend
time="2024-10-21T07:35:57.025689132Z" level=warning msg="Re-created missing slave key" key="cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-2;/172.18.0.2" subsys=kvstorebackend
Then check whether any identity node values / slave keys are leaked, even after all the pods have been deleted:
$ kubectl exec -n kube-system daemonset/cilium -- cilium kvstore get --recursive "cilium/state/identities/v1/value/" 2>&1 | grep testing-1
cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-1;/172.18.0.2 => 36729
cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-16;/172.18.0.2 => 7225
cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-12;/172.18.0.2 => 4687
[...]
This means these identities won't be GCed by the cilium-operator, since there are still references to them. The slave keys are written with the node's lease, so even though they are no longer updated, the lease keeps them alive.
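This can be checked directly against the etcd instance used as kvstore; for example (a sketch, with the leaked key path abbreviated), a leaked entry shows a non-zero lease ID:

$ etcdctl get --write-out=json "cilium/state/identities/v1/value/<leaked key>" | jq '.kvs[].lease'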
The only mitigation so far is either cleaning up the stale keys manually in etcd, or restarting the cilium-agent so it gets a new lease, and then waiting for the original lease to expire so the leaked keys are removed.
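As an illustration of the manual cleanup (a sketch only; the key must match the leaked entry exactly and is abbreviated here):

$ kubectl exec -n kube-system daemonset/cilium -- \
    cilium kvstore delete "cilium/state/identities/v1/value/<leaked key>"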
Cilium Version
Seems to affect all versions. We are seeing this on both v1.13 and the latest master. The latest commit we reproduced it on is 1ede9d9, so all versions appear to be affected.
Kernel Version
n/a
Kubernetes Version
n/a
Regression
No response
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct