
CiliumEndpoint references a removed CiliumIdentity #19877

@ysksuzuki

Description


Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We encountered an issue in which a CiliumEndpoint referenced a CiliumIdentity that had already been removed by the cilium-operator GC. As a result, the corresponding pod could not communicate with any pods other than those running on the same node. While the problem was occurring, the warning "Unable to release newly allocated identity again" appeared in the cilium-agent log.

$ kubectl get ciliumidentities.cilium.io 19805
Error from server (NotFound): ciliumidentities.cilium.io "19805" not found

$ kubectl -n app-mysql get ciliumendpoints.cilium.io
NAME            ENDPOINT ID   IDENTITY ID   INGRESS ENFORCEMENT   EGRESS ENFORCEMENT   VISIBILITY POLICY   ENDPOINT STATE   IPV4           IPV6
moco-0          2001          19805    
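The dangling reference shown above can be checked for generically. Below is a minimal sketch, assuming kubectl access to the affected cluster; the helper name `dangling_identities` is hypothetical. It collects every identity ID referenced by a CiliumEndpoint and prints those with no matching CiliumIdentity object.

```shell
# Hypothetical helper: print identity IDs that are referenced by a
# CiliumEndpoint but have no matching CiliumIdentity object.
dangling_identities() {
  # $1: newline-separated identity IDs referenced by CiliumEndpoints
  # $2: newline-separated names of existing CiliumIdentity objects
  # comm -23 keeps lines that appear only in the first (sorted) input.
  comm -23 <(printf '%s\n' "$1" | sort -u) <(printf '%s\n' "$2" | sort -u)
}

# Gather the live data from the cluster (assumes kubectl access).
eps=$(kubectl get cep -A -o jsonpath='{range .items[*]}{.status.identity.id}{"\n"}{end}')
ids=$(kubectl get ciliumidentities.cilium.io -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
dangling_identities "$eps" "$ids"
```

With the cluster state from this issue, the helper would print 19805.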

cilium-agent log

2022-05-11 21:01:03  level=warning msg="Unable to release newly allocated identity again" containerID= datapathPolicyRevision=37 desiredPolicyRevision=37 endpointID=2371 error="identity sync was cancelled: context canceled" identity=19805 identityLabels="k8s:app.kubernetes.io/created-by=moco,k8s:app.kubernetes.io/instance=stage0-ocean-1,k8s:app.kubernetes.io/name=mysql,k8s:io.cilium.k8s.namespace.labels.accurate.cybozu.com/parent=team-dbre,k8s:io.cilium.k8s.namespace.labels.cybozu.com/alert-group=dbre,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=app-cybozu-com-mysql,k8s:io.cilium.k8s.namespace.labels.pod-security.cybozu.com/policy=traceable,k8s:io.cilium.k8s.namespace.labels.team=dbre,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=moco-stage0-ocean-1,k8s:io.kubernetes.pod.namespace=app-cybozu-com-mysql,k8s:moco.cybozu.com/role=replica,k8s:statefulset.kubernetes.io/pod-name=moco-stage0-ocean-1-1" ipv4= ipv6= k8sPodName=/ subsys=endpoint

operator-log

2022-05-11 21:12:45	level=info msg="Garbage collected identity" identity=19805 subsys=cilium-operator-generic

hubble log

  "destination": {
    "identity": 19805,
    "namespace": "app-mysql",
    "pod_name": "moco-0"
  },
  "Type": "L3_L4",
  "node_name": "10.69.3.9",
  "event_type": {
    "type": 5
  },
  "traffic_direction": "EGRESS",
  "drop_reason_desc": "POLICY_DENIED",
  "Summary": "TCP Flags: SYN"

How to reproduce

  1. Create a stateful pod using the following manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: suzuki-test
  namespace: dctest
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: suzuki-test
  serviceName: suzuki-test
  template:
    metadata:
      labels:
        app.kubernetes.io/name: suzuki-test
    spec:
      containers:
      - image: quay.io/cybozu/testhttpd:0
        name: testhttpd
        volumeMounts:
        - name: www
          mountPath: /usr/share/suzuki-test/html
      restartPolicy: Always
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: www
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: topolvm-provisioner
      volumeMode: Filesystem
  2. Taint the node where the pod is running, and add a label to the pod.
kubectl taint nodes ${node} node.cybozu.io/node-not-ready=true:NoExecute
kubectl -n dctest label pods suzuki-test-0 suzuki-test.cybozu.com/role=primary
  3. If all goes as expected, the following log will appear. The resolve-identity controller allocates a new global key. However, it then stops short of setting the identity and tries to release it, but the release fails because the endpoint is in the terminating state and the controller's context has been canceled; the controller is then removed.
level=warning msg="Unable to release newly allocated identity again" containerID= datapathPolicyRevision=41 desiredPolicyRevision=41 endpointID=74 error="initial global identity sync was cancelled: context canceled" identity=46330 identityLabels="k8s:app.kubernetes.io/name=suzuki-test,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=dctest,k8s:io.cilium.k8s.namespace.labels.pod-security.cybozu.com/policy=privileged,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=default,k8s:io.kubernetes.pod.namespace=dctest,k8s:statefulset.kubernetes.io/pod-name=suzuki-test-0,k8s:suzuki-test.cybozu.com/role=primary" ipv4= ipv6= k8sPodName=/ subsys=endpoint
  4. Wait for a while until the operator garbage-collects the allocated identity (the pod remains Pending).
  5. Untaint the node and re-add the label. The newly created endpoint then references the already-removed identity.
kubectl taint nodes ${node} node.cybozu.io/node-not-ready=true:NoExecute-
kubectl -n dctest label pods suzuki-test-0 suzuki-test.cybozu.com/role=primary
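After step 5, the stale reference can be confirmed with the same kind of check used in "What happened?". A minimal sketch, assuming kubectl access; the helper name `cep_identity_exists` is hypothetical:

```shell
# Hypothetical check: does the identity referenced by a CiliumEndpoint
# still exist as a CiliumIdentity object?
cep_identity_exists() {
  # $1: namespace, $2: CiliumEndpoint name
  local id
  id=$(kubectl -n "$1" get cep "$2" -o jsonpath='{.status.identity.id}') || return 2
  kubectl get ciliumidentities.cilium.io "$id" >/dev/null 2>&1
}

cep_identity_exists dctest suzuki-test-0 \
  || echo "suzuki-test-0 references a removed CiliumIdentity"
```

If the reproduction succeeded, the lookup fails with NotFound and the message is printed.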

Cilium Version

cilium v1.11.5

Kernel Version

Linux rack0-cs4 5.15.37-flatcar #1 SMP Wed May 4 13:53:25 -00 2022 x86_64 AMD EPYC 7413 24-Core Processor AuthenticAMD GNU/Linux

Kubernetes Version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:38:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"archive", BuildDate:"2022-03-23T08:04:34Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}

Sysdump


My sysdump is too big to upload

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

area/agent (Cilium agent related), kind/bug (This is a bug in the Cilium logic.)
