Endpoint restoration/regeneration of completed job stuck in infinite loop (error="retrieving device lxc*: Link not found") #39370

@giorio94

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.17.3 and lower than v1.18.0

What happened?

Hit in a CI run: https://github.com/cilium/cilium/actions/runs/14863456177/job/41734848140

Essentially, one of the Cilium agents is stuck attempting to restore the endpoint of a job that most likely completed right before or while the Cilium agent on the hosting node got restarted, and fails because the corresponding link no longer exists.

ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                    IPv6                  IPv4           STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                                                                      
1362       Disabled           Disabled          16756690   k8s:batch.kubernetes.io/job-name=clustermesh-apiserver-generate-certs          fd00:10:244:2::917d   10.244.2.44    not-ready   
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system                                                      
                                                           k8s:io.cilium.k8s.policy.cluster=cluster2-with-long-name-01234567                                                               
                                                           k8s:io.cilium.k8s.policy.serviceaccount=clustermesh-apiserver-generate-certs                                                    
                                                           k8s:io.kubernetes.pod.namespace=kube-system                                                                                     
                                                           k8s:job-name=clustermesh-apiserver-generate-certs                                                                               
                                                           k8s:k8s-app=clustermesh-apiserver-generate-certs                                                                                
2025-05-06T15:36:41.438279873Z level=error source=/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:610 msg="Error while reloading endpoint BPF program" ipv4=10.244.2.44 containerInterface=eth0 k8sPodName=kube-system/clustermesh-apiserver-generate-certs-4kgkm containerID=5072ef6b95 desiredPolicyRevision=1 endpointID=1362 datapathPolicyRevision=0 ipv6=fd00:10:244:2::917d ciliumEndpointName=kube-system/clustermesh-apiserver-generate-certs-4kgkm identity=16756690 subsys=endpoint error="retrieving device lxc863c5237af4c: Link not found"
2025-05-06T15:36:41.438747716Z level=warn source=/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:607 msg="Regeneration of endpoint failed" ipv4=10.244.2.44 containerInterface=eth0 k8sPodName=kube-system/clustermesh-apiserver-generate-certs-4kgkm containerID=5072ef6b95 desiredPolicyRevision=1 endpointID=1362 datapathPolicyRevision=0 ipv6=fd00:10:244:2::917d ciliumEndpointName=kube-system/clustermesh-apiserver-generate-certs-4kgkm identity=16756690 subsys=endpoint reason="syncing state to host" waitingForPolicyRepository=361ns prepareBuild=28.113µs total=12.088630966s waitingForLock=4.809098796s proxyConfiguration=0s proxyPolicyCalculation=980.03µs mapSync=0s bpfCompilation=6.854898416s bpfWaitForELF=6.870777505s waitingForCTClean=391ns selectorPolicyCalculation=0s endpointPolicyCalculation=0s proxyWaitForAck=0s policyCalculation=413.032µs bpfLoadProg=405.58053ms bpfCompilation=6.854898416s bpfWaitForELF=6.870777505s bpfLoadProg=405.58053ms error="retrieving device lxc863c5237af4c: Link not found"
2025-05-06T15:36:41.439164485Z level=error source=/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:803 msg="endpoint regeneration failed" ipv4=10.244.2.44 containerInterface=eth0 k8sPodName=kube-system/clustermesh-apiserver-generate-certs-4kgkm containerID=5072ef6b95 desiredPolicyRevision=1 endpointID=1362 datapathPolicyRevision=0 ipv6=fd00:10:244:2::917d ciliumEndpointName=kube-system/clustermesh-apiserver-generate-certs-4kgkm identity=16756690 subsys=endpoint error="retrieving device lxc863c5237af4c: Link not found"
2025-05-06T15:37:10.223490385Z level=warning msg="Unable to assert if endpoint BPF programs need to be reloaded" source="/go/src/github.com/cilium/cilium/daemon/cmd/watchdogs.go:115" ciliumEndpointName=kube-system/clustermesh-apiserver-generate-certs-4kgkm endpoint=lxc863c5237af4c endpointID=1362 error="retrieving device lxc863c5237af4c: Link not found" subsys=daemon

How can we reproduce the issue?

I haven't attempted to reproduce this locally yet, but I assume restarting the Cilium agent while a job completes should trigger it.

Cilium Version

Hit on tip of main, unsure whether this affects any stable version as well.

Sysdump

cilium-sysdump-context2-final-2-disabled-dual-wireguard-none-clustermesh-cronJob-migration-migration-511-disabled.zip

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • affects/main: This issue affects main branch
  • affects/v1.18: This issue affects v1.18 branch
  • area/agent: Cilium agent related.
  • area/cni: Impacts the Container Networking Interface between Cilium and the orchestrator.
  • ci/quarantine: Track for quarantine to minimize developer impact
  • kind/bug: This is a bug in the Cilium logic.

Projects

Status

Done
