Nodes hang while updating their cluster node list #24574

@maximumG

Description

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We recently updated cilium from v1.10.8 to v1.13.0 on our EKS cluster running v1.24.10. Our setup leverages AWS spot instances, so nodes come and go in the cluster.

We started experiencing inter-node connectivity issues while running cilium v1.13.0. The symptom is that pods hosted on some nodes are unable to communicate with pods hosted on other, recently joined nodes.

After some investigation, we found that the cilium agent running on one specific node was unable to update its node list. As a result, some old nodes still appeared in the cilium status --verbose output, but with an unreachable status. Of course, these nodes were no longer part of the cluster and had been replaced by new ones. On the other hand, the new nodes were not referenced at all in cilium status --verbose. Consequently, pods hosted on this stuck node were unable to communicate with pods running on any of the new nodes.
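For anyone hitting the same symptom, one way to confirm the stale view is to compare the nodes known to Kubernetes against what the affected agent reports. A rough sketch, assuming cilium runs in the kube-system namespace and using <cilium-pod> as a placeholder for the affected agent pod:

$ # Nodes (and their CiliumNode objects) currently registered in Kubernetes
$ kubectl get nodes
$ kubectl get ciliumnodes
$ # Node list as seen by the suspect agent
$ kubectl -n kube-system exec <cilium-pod> -- cilium status --verbose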

In the meantime, cilium node list on the affected node was timing out with the following message, while every other cilium command returned the expected output.

$ cilium node list
Cilium API client timeout exceeded
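Since the API call hangs rather than failing outright, capturing goroutine stacks from the stuck agent might show where the handler is blocked. A sketch, assuming the gops binary shipped in the official agent image is available, with <cilium-pod> again a placeholder:

$ # Dump goroutine stacks of the running cilium-agent process
$ kubectl -n kube-system exec <cilium-pod> -- sh -c 'gops stack $(pidof cilium-agent)'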

As a workaround, we deleted the cilium agent pod on the affected node; the newly created agent was able to catch up with every node in the cluster and restored full connectivity.
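For reference, the workaround amounted to deleting the agent pod so the DaemonSet recreates it, along these lines (the pod name suffix is a placeholder):

$ kubectl -n kube-system delete pod cilium-<suffix>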

Cilium Version

1.13.0 c9723a8 2023-02-15T14:18:31+01:00 go version go1.19.6 linux/amd64

Kernel Version

5.10.167-147.601

Kubernetes Version

v1.24.10-eks-48e63af

Sysdump

I can upload the sysdump for the whole cluster on demand if anybody is interested.
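The sysdump was collected with the cilium-cli; assuming the CLI is installed and kubectl points at the affected cluster, roughly:

$ # Collect a cluster-wide troubleshooting archive
$ cilium sysdump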

Relevant log output

$ cilium node list
Cilium API client timeout exceeded

$ cilium status --verbose
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.24+ (v1.24.10-eks-48e63af) [linux/amd64]
Kubernetes APIs:        ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumLocalRedirectPolicy", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Strict   [eth0 10.80.6.245]
Host firewall:          Disabled
CNI Chaining:           none
CNI Config file:        CNI configuration file management disabled
Cilium:                 Ok   1.13.0 (v1.13.0-c9723a8d)
NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok   
IPAM:                   IPv4: 8/254 allocated from 100.64.4.0/24, 
Allocated addresses:
  100.64.4.107 (kube-system/node-local-dns-8r6np[restored])
  100.64.4.124 (router)
  100.64.4.125 (kube-monitoring/datadog-agent-lpw8v[restored])
  100.64.4.191 (teleport-kube-agent/teleport-kube-agent-0[restored])
  100.64.4.241 (nginx-ingress-internal/nginx-ingress-internal-ingress-nginx-controller-7c75db779dcmxp8[restored])
  100.64.4.249 (health)
  100.64.4.25 (nginx-ingress/nginx-ingress-ingress-nginx-controller-7f8fc5d6fd-8smbg[restored])
  100.64.4.28 (kube-system/ebs-csi-node-vplnz[restored])
IPv6 BIG TCP:           Disabled
BandwidthManager:       Disabled
Host Routing:           Legacy
Masquerading:           IPTables [IPv4: Enabled, IPv6: Disabled]
Clock Source for BPF:   ktime
Controller Status:      49/49 healthy
  Name                                          Last success   Last error     Count   Message
  bpf-map-sync-cilium_lxc                       5s ago         never          0       no error   
  cilium-health-ep                              33s ago        never          0       no error   
  dns-garbage-collector-job                     49s ago        never          0       no error   
  endpoint-2005-regeneration-recovery           never          never          0       no error   
  endpoint-2201-regeneration-recovery           never          never          0       no error   
  endpoint-2409-regeneration-recovery           never          never          0       no error   
  endpoint-3268-regeneration-recovery           never          never          0       no error   
  endpoint-3894-regeneration-recovery           never          never          0       no error   
  endpoint-406-regeneration-recovery            never          never          0       no error   
  endpoint-847-regeneration-recovery            never          never          0       no error   
  endpoint-958-regeneration-recovery            never          never          0       no error   
  endpoint-gc                                   2m51s ago      never          0       no error   
  ipcache-inject-labels                         2h37m26s ago   52h2m50s ago   0       no error   
  k8s-heartbeat                                 12s ago        never          0       no error   
  metricsmap-bpf-prom-sync                      5s ago         never          0       no error   
  resolve-identity-3268                         2m37s ago      never          0       no error   
  resolve-identity-406                          2m36s ago      never          0       no error   
  restoring-ep-identity (2005)                  52h2m37s ago   never          0       no error   
  restoring-ep-identity (2201)                  52h2m37s ago   never          0       no error   
  restoring-ep-identity (2409)                  52h2m37s ago   never          0       no error   
  restoring-ep-identity (3268)                  52h2m37s ago   never          0       no error   
  restoring-ep-identity (3894)                  52h2m37s ago   never          0       no error   
  restoring-ep-identity (847)                   52h2m37s ago   never          0       no error   
  restoring-ep-identity (958)                   52h2m37s ago   never          0       no error   
  sync-endpoints-and-host-ips                   1h17m34s ago   never          0       no error   
  sync-lb-maps-with-k8s-services                52h2m37s ago   never          0       no error   
  sync-policymap-2005                           25s ago        never          0       no error   
  sync-policymap-2201                           25s ago        never          0       no error   
  sync-policymap-2409                           25s ago        never          0       no error   
  sync-policymap-3268                           25s ago        never          0       no error   
  sync-policymap-3894                           25s ago        never          0       no error   
  sync-policymap-406                            25s ago        never          0       no error   
  sync-policymap-847                            25s ago        never          0       no error   
  sync-policymap-958                            25s ago        never          0       no error   
  sync-to-k8s-ciliumendpoint (2005)             5s ago         never          0       no error   
  sync-to-k8s-ciliumendpoint (2201)             5s ago         never          0       no error   
  sync-to-k8s-ciliumendpoint (2409)             6s ago         never          0       no error   
  sync-to-k8s-ciliumendpoint (3268)             5s ago         never          0       no error   
  sync-to-k8s-ciliumendpoint (3894)             5s ago         never          0       no error   
  sync-to-k8s-ciliumendpoint (406)              5s ago         never          0       no error   
  sync-to-k8s-ciliumendpoint (847)              5s ago         never          0       no error   
  sync-to-k8s-ciliumendpoint (958)              5s ago         never          0       no error   
  template-dir-watcher                          never          never          0       no error   
  waiting-initial-global-identities-ep (2005)   52h2m37s ago   never          0       no error   
  waiting-initial-global-identities-ep (2201)   52h2m37s ago   never          0       no error   
  waiting-initial-global-identities-ep (3268)   52h2m37s ago   never          0       no error   
  waiting-initial-global-identities-ep (3894)   52h2m37s ago   never          0       no error   
  waiting-initial-global-identities-ep (847)    52h2m37s ago   never          0       no error   
  waiting-initial-global-identities-ep (958)    52h2m37s ago   never          0       no error   
Proxy Status:            No managed proxy redirect
Global Identity Range:   min 256, max 65535
Hubble:                  Disabled
KubeProxyReplacement Details:
  Status:                 Strict
  Socket LB:              Enabled
  Socket LB Tracing:      Enabled
  Devices:                eth0 10.80.6.245
  Mode:                   SNAT
  Backend Selection:      Maglev (Table Size: 16381)
  Session Affinity:       Enabled
  Graceful Termination:   Enabled
  NAT46/64 Support:       Disabled
  XDP Acceleration:       Disabled
  Services:
  - ClusterIP:      Enabled
  - NodePort:       Enabled (Range: 30000-32767) 
  - LoadBalancer:   Enabled 
  - externalIPs:    Enabled 
  - HostPort:       Enabled
BPF Maps:   dynamic sizing: on (ratio: 0.002500)
  Name                          Size
  Non-TCP connection tracking   65536
  TCP connection tracking       131072
  Endpoint policy               65535
  Events                        2
  IP cache                      512000
  IP masquerading agent         16384
  IPv4 fragmentation            8192
  IPv4 service                  65536
  IPv6 service                  65536
  IPv4 service backend          65536
  IPv6 service backend          65536
  IPv4 service reverse NAT      65536
  IPv6 service reverse NAT      65536
  Metrics                       1024
  NAT                           131072
  Neighbor table                131072
  Global policy                 16384
  Per endpoint policy           65536
  Session affinity              65536
  Signal                        2
  Sockmap                       65535
  Sock reverse NAT              65536
  Tunnel                        65536
Encryption:                                               Wireguard        [cilium_wg0 (Pubkey: B06RhcrQEXD49uQvMoGEnRmzGszTmfaclaxayOUU9lo=, Port: 51871, Peers: 11)]
Cluster health:                                           9/12 reachable   (2023-03-24T17:18:13Z)
  Name                                                    IP               Node          Endpoints
  ip-10-80-6-245.eu-west-1.compute.internal (localhost)   10.80.6.245      reachable     reachable
  ip-10-80-4-101.eu-west-1.compute.internal               10.80.4.101      reachable     reachable
  ip-10-80-4-143.eu-west-1.compute.internal               10.80.4.143      unreachable   unreachable # node gone for a while
  ip-10-80-4-229.eu-west-1.compute.internal               10.80.4.229      reachable     reachable
  ip-10-80-4-67.eu-west-1.compute.internal                10.80.4.67       unreachable   unreachable # node gone for a while
  ip-10-80-5-13.eu-west-1.compute.internal                10.80.5.13       reachable     reachable
  ip-10-80-5-169.eu-west-1.compute.internal               10.80.5.169      reachable     reachable
  ip-10-80-5-220.eu-west-1.compute.internal               10.80.5.220      reachable     reachable
  ip-10-80-5-223.eu-west-1.compute.internal               10.80.5.223      reachable     reachable
  ip-10-80-6-118.eu-west-1.compute.internal               10.80.6.118      reachable     reachable
  ip-10-80-6-127.eu-west-1.compute.internal               10.80.6.127      reachable     reachable
  ip-10-80-6-96.eu-west-1.compute.internal                10.80.6.96       unreachable   unreachable # node gone for a while

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • area/agent: Cilium agent related.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
  • needs/triage: This issue requires triaging to establish severity and next steps.
