Description
Is there an existing issue for this?
- [x] I have searched the existing issues
What happened?
In an environment with 3 control-plane (master) nodes and 3 worker nodes, the communication from pods running on the master nodes to the kubernetes service endpoints is broken after upgrading k8s version from 1.23.5 to 1.23.6 (or from 1.22.2 to 1.22.3). The connectivity still works from the worker (non-master) nodes. The issue can be recovered by restarting the cilium pods running on the master nodes.
Nodes setup:

```
$ kubectl get nodes -o wide
NAME                                           STATUS   ROLES                  AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
ip-172-31-141-179.eu-west-3.compute.internal   Ready    control-plane,master   91m   v1.23.6   172.31.141.179   <redacted>    Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-141-32.eu-west-3.compute.internal    Ready    <none>                 84m   v1.23.5   172.31.141.32    <redacted>    Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-142-157.eu-west-3.compute.internal   Ready    control-plane,master   90m   v1.23.6   172.31.142.157   <redacted>    Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-142-232.eu-west-3.compute.internal   Ready    <none>                 84m   v1.23.5   172.31.142.232   <redacted>    Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-143-187.eu-west-3.compute.internal   Ready    control-plane,master   89m   v1.23.6   172.31.143.187   <redacted>    Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-143-91.eu-west-3.compute.internal    Ready    <none>                 84m   v1.23.5   172.31.143.91    <redacted>    Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
```
kubernetes service endpoints:

```
$ kubectl get endpoints kubernetes
NAME         ENDPOINTS                                                      AGE
kubernetes   172.31.141.179:6443,172.31.142.157:6443,172.31.143.187:6443   101m
```
The kube-apiserver pods are running in the host network namespace of the master nodes:

```
NAMESPACE     NAME                                                          READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
kube-system   kube-apiserver-ip-172-31-141-179.eu-west-3.compute.internal   1/1     Running   0          93m   172.31.141.179   ip-172-31-141-179.eu-west-3.compute.internal   <none>           <none>
kube-system   kube-apiserver-ip-172-31-142-157.eu-west-3.compute.internal   1/1     Running   0          88m   172.31.142.157   ip-172-31-142-157.eu-west-3.compute.internal   <none>           <none>
kube-system   kube-apiserver-ip-172-31-143-187.eu-west-3.compute.internal   1/1     Running   0          72m   172.31.143.187   ip-172-31-143-187.eu-west-3.compute.internal   <none>           <none>
```
Pods running on the master nodes cannot connect to the kubernetes service after the kubernetes version upgrade:

```
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   109m
```

```
$ curl --connect-timeout 5 --insecure https://10.96.0.1:443
curl: (28) Connection timeout after 5001 ms
```
Connecting directly to any of the endpoint IPs does not work either:

```
$ curl --connect-timeout 5 --insecure https://172.31.141.179:6443
curl: (28) Connection timeout after 5001 ms
```
Connectivity to other pods and services works fine from the same pod.
Pods running on worker nodes can connect to the kubernetes service (and its endpoints) successfully. The 403 response below is expected for an anonymous request and confirms that the connection itself works:

```
$ curl --connect-timeout 5 --insecure https://10.96.0.1:443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
$ curl --connect-timeout 5 --insecure https://172.31.141.179:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
```
After restarting the cilium pod on the affected master node, the issue is resolved and the broken pods can communicate with the kubernetes service again.
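For reference, this is a sketch of the workaround, assuming the default `k8s-app=cilium` DaemonSet label and the `kube-system` namespace; the node name is illustrative (one of the affected master nodes):

```shell
# Delete the cilium pod running on the affected master node.
# The cilium DaemonSet recreates it automatically; after the restart,
# pods on that node can reach the kubernetes service again.
kubectl -n kube-system delete pod \
  -l k8s-app=cilium \
  --field-selector spec.nodeName=ip-172-31-141-179.eu-west-3.compute.internal
```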
Cilium Version
v1.11.1
Kernel Version
```
Linux ip-172-31-141-179 5.13.0-1022-aws #24~20.04.1-Ubuntu SMP Thu Apr 7 22:10:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
```
Kubernetes Version
v1.23.6
Sysdump
cilium-sysdump-20220510-154929.zip
Relevant log output
No response
Anything else?
`cilium monitor --type drop` does not show any drops on the affected node.
Code of Conduct
- [x] I agree to follow this project's Code of Conduct