Is there an existing issue for this?
- I have searched the existing issues
What happened?
We recently updated Cilium from v1.10.8 to v1.13.0 on our EKS cluster running Kubernetes v1.24.10. Our setup leverages AWS spot instances, so nodes come and go in the cluster.
We started experiencing inter-node connectivity issues while running Cilium v1.13.0. The symptom is that pods hosted on some nodes are not able to communicate with pods hosted on other, recently joined nodes.
After some investigation, we found that a specific Cilium agent was not able to update its node list. As a result, some old nodes were still present in the cilium status --verbose output, but with an unreachable status. These nodes were no longer part of the cluster and had been replaced by new nodes. Conversely, the new nodes were not referenced at all in cilium status --verbose. Consequently, pods hosted on this stuck node were not able to communicate with pods running on any of the new nodes.
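Roughly, this is how we cross-checked the stuck agent's view against the actual cluster state (cilium-xxxxx is a placeholder for the agent pod on the affected node; exact commands may need adjusting for your installation):
$ kubectl get nodes -o wide            # nodes Kubernetes currently knows about
$ kubectl get ciliumnodes              # CiliumNode objects, expected to match the list above
$ kubectl -n kube-system exec cilium-xxxxx -- cilium status --verbose   # the agent's own view, including "Cluster health"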
Meanwhile, cilium node list on the affected node was timing out with the following message, while any other cilium command gave the expected output.
$ cilium node list
Cilium API client timeout exceeded
As a workaround, we deleted the Cilium agent pod on the affected node; the newly created agent was able to catch up with every node in the cluster and restored full cluster connectivity.
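For reference, the workaround amounted to something along these lines (the label selector assumes a standard Cilium install, <affected-node-name> is the node with the stuck agent, and cilium-xxxxx is a placeholder for the pod name returned by the first command):
# find the agent pod scheduled on the affected node
$ kubectl -n kube-system get pods -l k8s-app=cilium --field-selector spec.nodeName=<affected-node-name>
# delete it; the DaemonSet recreates it and the fresh agent picks up the current node list
$ kubectl -n kube-system delete pod cilium-xxxxx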
Cilium Version
1.13.0 c9723a8 2023-02-15T14:18:31+01:00 go version go1.19.6 linux/amd64
Kernel Version
5.10.167-147.601
Kubernetes Version
v1.24.10-eks-48e63af
Sysdump
I can upload the sysdump for the whole cluster on demand if anybody is interested.
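In case it helps, a sysdump like this can be collected with the Cilium CLI roughly as follows (flags may differ depending on the CLI version):
$ cilium sysdump --output-filename cilium-sysdump-cluster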
Relevant log output
$ cilium node list
Cilium API client timeout exceeded
$ cilium status --verbose
KVStore: Ok Disabled
Kubernetes: Ok 1.24+ (v1.24.10-eks-48e63af) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumLocalRedirectPolicy", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [eth0 10.80.6.245]
Host firewall: Disabled
CNI Chaining: none
CNI Config file: CNI configuration file management disabled
Cilium: Ok 1.13.0 (v1.13.0-c9723a8d)
NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 8/254 allocated from 100.64.4.0/24,
Allocated addresses:
100.64.4.107 (kube-system/node-local-dns-8r6np[restored])
100.64.4.124 (router)
100.64.4.125 (kube-monitoring/datadog-agent-lpw8v[restored])
100.64.4.191 (teleport-kube-agent/teleport-kube-agent-0[restored])
100.64.4.241 (nginx-ingress-internal/nginx-ingress-internal-ingress-nginx-controller-7c75db779dcmxp8[restored])
100.64.4.249 (health)
100.64.4.25 (nginx-ingress/nginx-ingress-ingress-nginx-controller-7f8fc5d6fd-8smbg[restored])
100.64.4.28 (kube-system/ebs-csi-node-vplnz[restored])
IPv6 BIG TCP: Disabled
BandwidthManager: Disabled
Host Routing: Legacy
Masquerading: IPTables [IPv4: Enabled, IPv6: Disabled]
Clock Source for BPF: ktime
Controller Status: 49/49 healthy
Name Last success Last error Count Message
bpf-map-sync-cilium_lxc 5s ago never 0 no error
cilium-health-ep 33s ago never 0 no error
dns-garbage-collector-job 49s ago never 0 no error
endpoint-2005-regeneration-recovery never never 0 no error
endpoint-2201-regeneration-recovery never never 0 no error
endpoint-2409-regeneration-recovery never never 0 no error
endpoint-3268-regeneration-recovery never never 0 no error
endpoint-3894-regeneration-recovery never never 0 no error
endpoint-406-regeneration-recovery never never 0 no error
endpoint-847-regeneration-recovery never never 0 no error
endpoint-958-regeneration-recovery never never 0 no error
endpoint-gc 2m51s ago never 0 no error
ipcache-inject-labels 2h37m26s ago 52h2m50s ago 0 no error
k8s-heartbeat 12s ago never 0 no error
metricsmap-bpf-prom-sync 5s ago never 0 no error
resolve-identity-3268 2m37s ago never 0 no error
resolve-identity-406 2m36s ago never 0 no error
restoring-ep-identity (2005) 52h2m37s ago never 0 no error
restoring-ep-identity (2201) 52h2m37s ago never 0 no error
restoring-ep-identity (2409) 52h2m37s ago never 0 no error
restoring-ep-identity (3268) 52h2m37s ago never 0 no error
restoring-ep-identity (3894) 52h2m37s ago never 0 no error
restoring-ep-identity (847) 52h2m37s ago never 0 no error
restoring-ep-identity (958) 52h2m37s ago never 0 no error
sync-endpoints-and-host-ips 1h17m34s ago never 0 no error
sync-lb-maps-with-k8s-services 52h2m37s ago never 0 no error
sync-policymap-2005 25s ago never 0 no error
sync-policymap-2201 25s ago never 0 no error
sync-policymap-2409 25s ago never 0 no error
sync-policymap-3268 25s ago never 0 no error
sync-policymap-3894 25s ago never 0 no error
sync-policymap-406 25s ago never 0 no error
sync-policymap-847 25s ago never 0 no error
sync-policymap-958 25s ago never 0 no error
sync-to-k8s-ciliumendpoint (2005) 5s ago never 0 no error
sync-to-k8s-ciliumendpoint (2201) 5s ago never 0 no error
sync-to-k8s-ciliumendpoint (2409) 6s ago never 0 no error
sync-to-k8s-ciliumendpoint (3268) 5s ago never 0 no error
sync-to-k8s-ciliumendpoint (3894) 5s ago never 0 no error
sync-to-k8s-ciliumendpoint (406) 5s ago never 0 no error
sync-to-k8s-ciliumendpoint (847) 5s ago never 0 no error
sync-to-k8s-ciliumendpoint (958) 5s ago never 0 no error
template-dir-watcher never never 0 no error
waiting-initial-global-identities-ep (2005) 52h2m37s ago never 0 no error
waiting-initial-global-identities-ep (2201) 52h2m37s ago never 0 no error
waiting-initial-global-identities-ep (3268) 52h2m37s ago never 0 no error
waiting-initial-global-identities-ep (3894) 52h2m37s ago never 0 no error
waiting-initial-global-identities-ep (847) 52h2m37s ago never 0 no error
waiting-initial-global-identities-ep (958) 52h2m37s ago never 0 no error
Proxy Status: No managed proxy redirect
Global Identity Range: min 256, max 65535
Hubble: Disabled
KubeProxyReplacement Details:
Status: Strict
Socket LB: Enabled
Socket LB Tracing: Enabled
Devices: eth0 10.80.6.245
Mode: SNAT
Backend Selection: Maglev (Table Size: 16381)
Session Affinity: Enabled
Graceful Termination: Enabled
NAT46/64 Support: Disabled
XDP Acceleration: Disabled
Services:
- ClusterIP: Enabled
- NodePort: Enabled (Range: 30000-32767)
- LoadBalancer: Enabled
- externalIPs: Enabled
- HostPort: Enabled
BPF Maps: dynamic sizing: on (ratio: 0.002500)
Name Size
Non-TCP connection tracking 65536
TCP connection tracking 131072
Endpoint policy 65535
Events 2
IP cache 512000
IP masquerading agent 16384
IPv4 fragmentation 8192
IPv4 service 65536
IPv6 service 65536
IPv4 service backend 65536
IPv6 service backend 65536
IPv4 service reverse NAT 65536
IPv6 service reverse NAT 65536
Metrics 1024
NAT 131072
Neighbor table 131072
Global policy 16384
Per endpoint policy 65536
Session affinity 65536
Signal 2
Sockmap 65535
Sock reverse NAT 65536
Tunnel 65536
Encryption: Wireguard [cilium_wg0 (Pubkey: B06RhcrQEXD49uQvMoGEnRmzGszTmfaclaxayOUU9lo=, Port: 51871, Peers: 11)]
Cluster health: 9/12 reachable (2023-03-24T17:18:13Z)
Name IP Node Endpoints
ip-10-80-6-245.eu-west-1.compute.internal (localhost) 10.80.6.245 reachable reachable
ip-10-80-4-101.eu-west-1.compute.internal 10.80.4.101 reachable reachable
ip-10-80-4-143.eu-west-1.compute.internal 10.80.4.143 unreachable unreachable # node gone for a while
ip-10-80-4-229.eu-west-1.compute.internal 10.80.4.229 reachable reachable
ip-10-80-4-67.eu-west-1.compute.internal 10.80.4.67 unreachable unreachable # node gone for a while
ip-10-80-5-13.eu-west-1.compute.internal 10.80.5.13 reachable reachable
ip-10-80-5-169.eu-west-1.compute.internal 10.80.5.169 reachable reachable
ip-10-80-5-220.eu-west-1.compute.internal 10.80.5.220 reachable reachable
ip-10-80-5-223.eu-west-1.compute.internal 10.80.5.223 reachable reachable
ip-10-80-6-118.eu-west-1.compute.internal 10.80.6.118 reachable reachable
ip-10-80-6-127.eu-west-1.compute.internal 10.80.6.127 reachable reachable
ip-10-80-6-96.eu-west-1.compute.internal 10.80.6.96 unreachable unreachable # node gone for a while
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct