kvstore implementation leaks cilium identities during identity churn #35451

@odinuge

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.16.0 and lower than v1.17.0

What happened?

We have discovered that when churning (creating and deleting) identities by starting and stopping pods, some identities are never properly cleaned up by the identity GC. We have tracked this down to a race between the syncLocalKeys function, which runs in the background on an interval, and the release of a node's usage of identities.

We correlate this with the "Re-created missing slave key" log line, and the keys mentioned in these log lines are the ones that leak.

We only see this happen once in a while, but on long-lived hosts this can add up to a lot of leaked identities in large clusters.
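
The suspected interleaving can be sketched with a minimal local simulation (no cluster required; the file names and the key name are illustrative, and the real code paths in Cilium are more involved than this):

```shell
# Plain files stand in for the kvstore and the agent's local key registry.
kvstore=$(mktemp)
local_keys=$(mktemp)

# 1. An identity is allocated: its slave key exists in both places.
echo "identity-A" >>"$kvstore"
echo "identity-A" >>"$local_keys"

# 2. Release begins: the slave key is deleted from the kvstore first.
sed -i '/identity-A/d' "$kvstore"

# 3. The periodic sync fires in between: it still sees identity-A in the
#    local registry, finds it missing upstream, and re-creates it
#    (the "Re-created missing slave key" warning).
while read -r key; do
  grep -qx "$key" "$kvstore" || echo "$key" >>"$kvstore"
done <"$local_keys"

# 4. Release finishes by dropping the key from the local registry.
sed -i '/identity-A/d' "$local_keys"

# The kvstore entry now has no local owner, but the node lease keeps it
# alive: a leaked identity.
cat "$kvstore"
```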

How can we reproduce the issue?

Install Cilium in kvstore / etcd mode, e.g. in kind. To reproduce more easily, set kvstore-periodic-sync to a low value, e.g. 10ms.
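
One way to lower the interval on an existing install is sketched below, assuming the agent's --kvstore-periodic-sync flag is exposed through a cilium-config ConfigMap key of the same name (not verified against every chart version):

```shell
# Set the sync interval to 10ms and restart the agents to pick it up.
kubectl -n kube-system patch configmap cilium-config \
  --type merge -p '{"data":{"kvstore-periodic-sync":"10ms"}}'
kubectl -n kube-system rollout restart daemonset/cilium
```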

Then create and delete pods with unique labels in a loop, e.g. like:

counter=0
while true; do
  kubectl run --rm -it -n default --image alpine testing-${counter} -- sleep 1
  counter=$((counter+1))
  sleep .1
done

Then check the logs of the agent(s) on the node(s) the pods run on:

$ kubectl logs -n kube-system -l k8s-app=cilium -f | grep "slave"
time="2024-10-21T07:35:49.925108545Z" level=warning msg="Re-created missing slave key" key="cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-0;/172.18.0.2" subsys=kvstorebackend
time="2024-10-21T07:35:51.674576796Z" level=warning msg="Re-created missing slave key" key="cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-1;/172.18.0.2" subsys=kvstorebackend
time="2024-10-21T07:35:57.025689132Z" level=warning msg="Re-created missing slave key" key="cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-2;/172.18.0.2" subsys=kvstorebackend

Then check whether any identity node values / slave keys have leaked, even after all the pods are deleted:

$ kubectl exec -n kube-system daemonset/cilium -- cilium kvstore get --recursive "cilium/state/identities/v1/value/" 2>&1 | grep testing-1
cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-1;/172.18.0.2 => 36729
cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-16;/172.18.0.2 => 7225
cilium/state/identities/v1/value/k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:run=testing-12;/172.18.0.2 => 4687
[...]

This means these identities won't be GCed by the cilium-operator, since there are still references to them. The keys are written with the node's lease, so even though they are no longer "updated", the lease will keep them alive.

The only mitigations so far are cleaning up the keys manually in etcd, or restarting the cilium-agent so that it gets a new lease, and then waiting for the original lease to expire and the keys to be cleaned up.

Cilium Version

Seems to affect all versions. We are seeing this on both v1.13 and the latest master. Most recently reproduced on 1ede9d9, so all versions appear to be affected.

Kernel Version

n/a

Kubernetes Version

n/a

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct


    Labels

  • area/agent: Cilium agent related.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  • needs/triage: This issue requires triaging to establish severity and next steps.
