Closed
Labels
- affects/main: This issue affects main branch
- affects/v1.18: This issue affects v1.18 branch
- area/agent: Cilium agent related.
- area/cni: Impacts the Container Networking Interface between Cilium and the orchestrator.
- ci/quarantine: Track for quarantine to minimize developer impact
- kind/bug: This is a bug in the Cilium logic.
Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.17.3 and lower than v1.18.0
What happened?
Hit in a CI run: https://github.com/cilium/cilium/actions/runs/14863456177/job/41734848140
Essentially, one of the Cilium agents is stuck attempting to restore the endpoint of a job that most likely completed right before or while the Cilium agent on the hosting node got restarted, and fails because the corresponding link no longer exists.
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                    IPv6                  IPv4          STATUS
           ENFORCEMENT        ENFORCEMENT
1362       Disabled           Disabled          16756690   k8s:batch.kubernetes.io/job-name=clustermesh-apiserver-generate-certs          fd00:10:244:2::917d   10.244.2.44   not-ready
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system
                                                           k8s:io.cilium.k8s.policy.cluster=cluster2-with-long-name-01234567
                                                           k8s:io.cilium.k8s.policy.serviceaccount=clustermesh-apiserver-generate-certs
                                                           k8s:io.kubernetes.pod.namespace=kube-system
                                                           k8s:job-name=clustermesh-apiserver-generate-certs
                                                           k8s:k8s-app=clustermesh-apiserver-generate-certs
2025-05-06T15:36:41.438279873Z level=error source=/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:610 msg="Error while reloading endpoint BPF program" ipv4=10.244.2.44 containerInterface=eth0 k8sPodName=kube-system/clustermesh-apiserver-generate-certs-4kgkm containerID=5072ef6b95 desiredPolicyRevision=1 endpointID=1362 datapathPolicyRevision=0 ipv6=fd00:10:244:2::917d ciliumEndpointName=kube-system/clustermesh-apiserver-generate-certs-4kgkm identity=16756690 subsys=endpoint error="retrieving device lxc863c5237af4c: Link not found"
2025-05-06T15:36:41.438747716Z level=warn source=/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:607 msg="Regeneration of endpoint failed" ipv4=10.244.2.44 containerInterface=eth0 k8sPodName=kube-system/clustermesh-apiserver-generate-certs-4kgkm containerID=5072ef6b95 desiredPolicyRevision=1 endpointID=1362 datapathPolicyRevision=0 ipv6=fd00:10:244:2::917d ciliumEndpointName=kube-system/clustermesh-apiserver-generate-certs-4kgkm identity=16756690 subsys=endpoint reason="syncing state to host" waitingForPolicyRepository=361ns prepareBuild=28.113µs total=12.088630966s waitingForLock=4.809098796s proxyConfiguration=0s proxyPolicyCalculation=980.03µs mapSync=0s bpfCompilation=6.854898416s bpfWaitForELF=6.870777505s waitingForCTClean=391ns selectorPolicyCalculation=0s endpointPolicyCalculation=0s proxyWaitForAck=0s policyCalculation=413.032µs bpfLoadProg=405.58053ms bpfCompilation=6.854898416s bpfWaitForELF=6.870777505s bpfLoadProg=405.58053ms error="retrieving device lxc863c5237af4c: Link not found"
2025-05-06T15:36:41.439164485Z level=error source=/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:803 msg="endpoint regeneration failed" ipv4=10.244.2.44 containerInterface=eth0 k8sPodName=kube-system/clustermesh-apiserver-generate-certs-4kgkm containerID=5072ef6b95 desiredPolicyRevision=1 endpointID=1362 datapathPolicyRevision=0 ipv6=fd00:10:244:2::917d ciliumEndpointName=kube-system/clustermesh-apiserver-generate-certs-4kgkm identity=16756690 subsys=endpoint error="retrieving device lxc863c5237af4c: Link not found"
2025-05-06T15:37:10.223490385Z level=warning msg="Unable to assert if endpoint BPF programs need to be reloaded" source="/go/src/github.com/cilium/cilium/daemon/cmd/watchdogs.go:115" ciliumEndpointName=kube-system/clustermesh-apiserver-generate-certs-4kgkm endpoint=lxc863c5237af4c endpointID=1362 error="retrieving device lxc863c5237af4c: Link not found" subsys=daemon
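For triage, the affected endpoint and device can be pulled out of log lines like the ones above. The following is a minimal sketch, not part of Cilium: `parseFields` and `missingDevice` are hypothetical helpers, and the matching relies only on the `error="retrieving device <name>: Link not found"` text that appears in these logs. Escaped quotes inside quoted values are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFields splits a key=value log line into a map.
// Values may be bare (key=value) or double-quoted (key="some text").
// Tokens without '=' (such as the leading timestamp) are skipped.
func parseFields(line string) map[string]string {
	fields := make(map[string]string)
	rest := line
	for {
		rest = strings.TrimLeft(rest, " ")
		if rest == "" {
			return fields
		}
		eq := strings.IndexByte(rest, '=')
		sp := strings.IndexByte(rest, ' ')
		if eq < 0 || (sp >= 0 && sp < eq) {
			// Current token has no '=': drop it and continue.
			if sp < 0 {
				return fields
			}
			rest = rest[sp+1:]
			continue
		}
		key := rest[:eq]
		rest = rest[eq+1:]
		var val string
		if strings.HasPrefix(rest, `"`) {
			// Quoted value: read up to the next double quote.
			end := strings.IndexByte(rest[1:], '"')
			if end < 0 {
				val, rest = rest[1:], ""
			} else {
				val, rest = rest[1:1+end], rest[end+2:]
			}
		} else {
			// Bare value: read up to the next space.
			end := strings.IndexByte(rest, ' ')
			if end < 0 {
				val, rest = rest, ""
			} else {
				val, rest = rest[:end], rest[end+1:]
			}
		}
		fields[key] = val
	}
}

// missingDevice reports the endpoint ID and device name when the line
// carries a "Link not found" error in the form seen in this issue.
func missingDevice(line string) (endpointID, device string, ok bool) {
	f := parseFields(line)
	const prefix = "retrieving device "
	const suffix = ": Link not found"
	e := f["error"]
	if !strings.HasPrefix(e, prefix) || !strings.HasSuffix(e, suffix) {
		return "", "", false
	}
	device = strings.TrimSuffix(strings.TrimPrefix(e, prefix), suffix)
	return f["endpointID"], device, true
}

func main() {
	line := `2025-05-06T15:37:10.223490385Z level=warning msg="Unable to assert if endpoint BPF programs need to be reloaded" endpoint=lxc863c5237af4c endpointID=1362 error="retrieving device lxc863c5237af4c: Link not found" subsys=daemon`
	if id, dev, ok := missingDevice(line); ok {
		fmt.Printf("endpoint %s is missing device %s\n", id, dev)
	}
}
```

Feeding the agent log through `missingDevice` line by line gives a quick list of endpoints whose device has disappeared, which can then be compared against the `not-ready` entries in the endpoint list.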
How can we reproduce the issue?
Haven't attempted reproducing locally yet, but I assume restarting the Cilium agent while a job completes should trigger it.
Cilium Version
Hit on tip of main, unsure whether this affects any stable version as well.
Sysdump
Code of Conduct
- I agree to follow this project's Code of Conduct