
Connectivity to kubernetes service endpoints broken after k8s version upgrade #19761

@rastislavs

Description


Is there an existing issue for this?

  • I have searched the existing issues

What happened?

In an environment with 3 control-plane (master) nodes and 3 worker nodes, communication from pods running on the master nodes to the kubernetes service endpoints breaks after upgrading the k8s version from 1.23.5 to 1.23.6 (or from 1.22.2 to 1.22.3). Connectivity still works from the worker (non-master) nodes. The issue can be resolved by restarting the cilium pods running on the master nodes.

Nodes setup:

$ kubectl get nodes -o wide
NAME                                           STATUS   ROLES                  AGE   VERSION   INTERNAL-IP      EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
ip-172-31-141-179.eu-west-3.compute.internal   Ready    control-plane,master   91m   v1.23.6   172.31.141.179   <redacted>       Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-141-32.eu-west-3.compute.internal    Ready    <none>                 84m   v1.23.5   172.31.141.32    <redacted>       Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-142-157.eu-west-3.compute.internal   Ready    control-plane,master   90m   v1.23.6   172.31.142.157   <redacted>       Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-142-232.eu-west-3.compute.internal   Ready    <none>                 84m   v1.23.5   172.31.142.232   <redacted>       Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-143-187.eu-west-3.compute.internal   Ready    control-plane,master   89m   v1.23.6   172.31.143.187   <redacted>       Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13
ip-172-31-143-91.eu-west-3.compute.internal    Ready    <none>                 84m   v1.23.5   172.31.143.91    <redacted>       Ubuntu 20.04.4 LTS   5.13.0-1022-aws   containerd://1.4.13

kubernetes service endpoints:

$ kubectl get endpoints kubernetes                                                                                                                                                                                  
NAME         ENDPOINTS                                                     AGE
kubernetes   172.31.141.179:6443,172.31.142.157:6443,172.31.143.187:6443   101m

The kube-apiserver pods are running in the host network namespace of the master nodes:

NAMESPACE     NAME                                                                   READY   STATUS             RESTARTS         AGE    IP               NODE                                           NOMINATED NODE   READINESS GATES
kube-system   kube-apiserver-ip-172-31-141-179.eu-west-3.compute.internal            1/1     Running            0                93m    172.31.141.179   ip-172-31-141-179.eu-west-3.compute.internal   <none>           <none>
kube-system   kube-apiserver-ip-172-31-142-157.eu-west-3.compute.internal            1/1     Running            0                88m    172.31.142.157   ip-172-31-142-157.eu-west-3.compute.internal   <none>           <none>
kube-system   kube-apiserver-ip-172-31-143-187.eu-west-3.compute.internal            1/1     Running            0                72m    172.31.143.187   ip-172-31-143-187.eu-west-3.compute.internal   <none>           <none>

Pods running on the master nodes cannot connect to the kubernetes service after the kubernetes version upgrade:

NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   109m
curl --connect-timeout 5 --insecure https://10.96.0.1:443
curl: (28) Connection timeout after 5001 ms
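For reference, the failing check above can be reproduced by pinning a throwaway curl pod to one of the master nodes. This is a sketch under assumptions not stated in the report: the `curlimages/curl` image and the standard `node-role.kubernetes.io/master` taint are placeholders to adjust for the actual cluster.

```shell
# Hypothetical reproduction sketch: schedule a curl pod onto a master node.
# The node name is taken from this cluster; image and toleration key are
# assumptions — adjust to match your environment.
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl \
  --overrides='{
    "spec": {
      "nodeName": "ip-172-31-141-179.eu-west-3.compute.internal",
      "tolerations": [
        {"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}
      ]
    }
  }' \
  -- curl --connect-timeout 5 --insecure https://10.96.0.1:443
```

On an affected node this times out; on a healthy node it returns the 403 `system:anonymous` response shown below.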

Connecting directly to any of the endpoint IPs does not work either:

curl --connect-timeout 5 --insecure https://172.31.141.179:6443
curl: (28) Connection timeout after 5001 ms

Connectivity to other pods and services works fine from the same pod.

Pods running on worker nodes can connect to the kubernetes service (and its endpoints) successfully:

curl --connect-timeout 5 --insecure https://10.96.0.1:443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

curl --connect-timeout 5 --insecure https://172.31.141.179:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

After restarting the cilium pod on the affected master node, the issue is resolved and the affected pods can communicate with the kubernetes service again.
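The restart workaround can be done per node as sketched below. The `k8s-app=cilium` label is the upstream default for the agent DaemonSet; the node name is taken from this cluster — both should be verified against the actual deployment.

```shell
# Hypothetical workaround sketch: delete (and thereby restart) the cilium
# agent pod on one affected master node. The DaemonSet recreates it.
kubectl -n kube-system delete pod \
  -l k8s-app=cilium \
  --field-selector spec.nodeName=ip-172-31-141-179.eu-west-3.compute.internal
```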

Cilium Version

v1.11.1

Kernel Version

Linux ip-172-31-141-179 5.13.0-1022-aws #24~20.04.1-Ubuntu SMP Thu Apr 7 22:10:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

v1.23.6

Sysdump

cilium-sysdump-20220510-154929.zip

Relevant log output

No response

Anything else?

cilium monitor --type drop does not show anything on the affected node.
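Since no drops are visible, it may also help triage to compare the agent's view of the kubernetes service backends on an affected vs. a healthy node. A sketch, assuming the Cilium 1.11 agent CLI inside the agent pod (the node name is from this cluster):

```shell
# Hypothetical triage sketch: find the cilium agent pod on the affected
# node, then dump its service and BPF load-balancer state.
CILIUM_POD=$(kubectl -n kube-system get pod -l k8s-app=cilium \
  --field-selector spec.nodeName=ip-172-31-141-179.eu-west-3.compute.internal \
  -o jsonpath='{.items[0].metadata.name}')

# Service-to-backend mapping as seen by the agent:
kubectl -n kube-system exec "$CILIUM_POD" -- cilium service list

# BPF load-balancer map contents (should list the :6443 endpoints):
kubectl -n kube-system exec "$CILIUM_POD" -- cilium bpf lb list
```

Stale or missing :6443 backends in either output would point at the agent's datapath state rather than at packet drops.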

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

kind/bug (This is a bug in the Cilium logic.), needs/triage (This issue requires triaging to establish severity and next steps.)
