Kubelet probes fail for EKS pods with AWS security groups when cilium is deployed in CNI chaining mode with AWS VPC CNI. #24943

@vgrigoruk

Description

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

  1. Create an EKS cluster on AWS following this guide via eksctl (the CLI commands for the steps in this list are sketched right after it):
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    
    metadata:
      name: vgrygoruk-2935
      region: eu-west-1
    
    managedNodeGroups:
    - name: ng-1
      desiredCapacity: 2
      privateNetworking: true
      taints:
      - key: "node.cilium.io/agent-not-ready"
        value: "true"
        effect: "NoExecute"
    
  2. Install the "AWS VPC CNI", "CoreDNS" and "kube-proxy" EKS add-ons.
  3. Install cilium in "CNI Chaining" mode onto the AWS EKS cluster following the official documentation.
    helm install cilium cilium/cilium --version 1.13.2 \
    --namespace kube-system \
    --set cni.chainingMode=aws-cni \
    --set cni.exclusive=false \
    --set enableIPv4Masquerade=false \
    --set tunnel=disabled \
    --set endpointRoutes.enabled=true
    
  4. Run cilium connectivity test
  5. Create a SecurityGroupPolicy to attach a pod security group to the echo pods in the cilium-test namespace (as they have probes defined):
    ---
    apiVersion: vpcresources.k8s.aws/v1beta1
    kind: SecurityGroupPolicy
    metadata:
      name: test-app-psgp
      namespace: cilium-test
    spec:
      podSelector:
        matchLabels:
          kind: echo
      securityGroups:
        groupIds:
          - sg-08b53279c80ec19d9
    
  6. Delete the echo-other-node-* pod in the cilium-test namespace, so that the new pod uses its own ENI with the security group attached to it, and inspect the events of the replacement pod:
        Events:
      Type     Reason                  Age                    From                     Message
      ----     ------                  ----                   ----                     -------
      Normal   Scheduled               18m                    default-scheduler        Successfully assigned cilium-test/echo-other-node-78f77b57f8-lg8xg to ip-192-168-120-241.eu-west-1.compute.internal
      Normal   SecurityGroupRequested  18m                    vpc-resource-controller  Pod will get the following Security Groups [sg-08b53279c80ec19d9]
  Warning  FailedCreatePodSandBox  18m                    kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7dedc1e9a5dd13278502f2592cdc8e8d82276dab1f8aad1bbcb6380f72b8f415": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
  Normal   ResourceAllocated       18m                    vpc-resource-controller  Allocated [{"eniId":"eni-0b923d932ea2fe0af","ifAddress":"06:a4:ac:cd:ce:1b","privateIp":"192.168.103.13","vlanId":3,"subnetCidr":"192.168.96.0/19"}] to the pod
      Normal   Pulled                  18m                    kubelet                  Container image "quay.io/cilium/json-mock:v1.3.3@sha256:f26044a2b8085fcaa8146b6b8bb73556134d7ec3d5782c6a04a058c945924ca0" already present on machine
      Normal   Created                 18m                    kubelet                  Created container echo-other-node
      Normal   Started                 18m                    kubelet                  Started container echo-other-node
      Normal   Pulled                  18m                    kubelet                  Container image "docker.io/coredns/coredns:1.10.0@sha256:017727efcfeb7d053af68e51436ce8e65edbc6ca573720afb4f79c8594036955" already present on machine
      Normal   Created                 18m                    kubelet                  Created container dns-test-server
      Normal   Started                 18m                    kubelet                  Started container dns-test-server
      Warning  Unhealthy               18m (x9 over 18m)      kubelet                  Readiness probe failed: Get "http://192.168.103.13:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      Warning  Unhealthy               3m38s (x448 over 18m)  kubelet                  Readiness probe failed: Get "http://192.168.103.13:8181/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    

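For reference, the commands behind the steps above look roughly like the sketch below. The add-on names are the standard EKS managed add-on names, the file names and pod name placeholders are illustrative, and exact flags may differ across eksctl/cilium-cli versions:

# step 1: create the cluster from the ClusterConfig shown above (saved e.g. as cluster.yaml)
eksctl create cluster -f cluster.yaml

# step 2: install the managed EKS add-ons (default versions)
eksctl create addon --cluster vgrygoruk-2935 --region eu-west-1 --name vpc-cni
eksctl create addon --cluster vgrygoruk-2935 --region eu-west-1 --name coredns
eksctl create addon --cluster vgrygoruk-2935 --region eu-west-1 --name kube-proxy

# step 4: run the Cilium CLI connectivity test (creates the cilium-test namespace and the echo pods)
cilium connectivity test

# step 5: apply the SecurityGroupPolicy shown above (saved e.g. as test-app-psgp.yaml)
kubectl apply -f test-app-psgp.yaml

# step 6: delete the old echo-other-node pod and inspect the events of its replacement
kubectl -n cilium-test delete pod echo-other-node-<old-pod-id>
kubectl -n cilium-test describe pod echo-other-node-<new-pod-id>
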
Expected result: the echo-other-node-* pod runs successfully and kubelet probes pass.
Actual result: the echo-other-node-* pod is started, but kubelet probes keep failing.

Notes:

  • the security group referenced in the SecurityGroupPolicy allows ALL traffic on all ports (both ingress and egress) from/to the VPC CIDR, so the issue is not related to the rules in the AWS security group (a quick way to verify this with the AWS CLI is sketched after these notes).
  • the echo-other-node-* pod runs successfully if I uninstall cilium from the cluster and re-create the pod:
    helm uninstall cilium --namespace kube-system
    # re-create the pod and inspect pod events:
        Events:
      Type     Reason                  Age   From                     Message
      ----     ------                  ----  ----                     -------
      Normal   Scheduled               50s   default-scheduler        Successfully assigned cilium-test/echo-other-node-78f77b57f8-rz78p to ip-192-168-120-241.eu-west-1.compute.internal
      Normal   SecurityGroupRequested  50s   vpc-resource-controller  Pod will get the following Security Groups [sg-08b53279c80ec19d9]
      Warning  FailedCreatePodSandBox  50s   kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d54d00652c58ec0ff2b08b11e0e72227f6c7bade61f1b1ff8224915bce227943": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
      Normal   ResourceAllocated       49s   vpc-resource-controller  Allocated [{"eniId":"eni-0ba69eee6c7826dfe","ifAddress":"06:82:8f:17:be:e7","privateIp":"192.168.112.70","vlanId":1,"subnetCidr":"192.168.96.0/19"}] to the pod
      Normal   Pulled                  49s   kubelet                  Container image "quay.io/cilium/json-mock:v1.3.3@sha256:f26044a2b8085fcaa8146b6b8bb73556134d7ec3d5782c6a04a058c945924ca0" already present on machine
      Normal   Created                 49s   kubelet                  Created container echo-other-node
      Normal   Started                 49s   kubelet                  Started container echo-other-node
      Normal   Pulled                  49s   kubelet                  Container image "docker.io/coredns/coredns:1.10.0@sha256:017727efcfeb7d053af68e51436ce8e65edbc6ca573720afb4f79c8594036955" already present on machine
      Normal   Created                 49s   kubelet                  Created container dns-test-server
      Normal   Started                 49s   kubelet                  Started container dns-test-server
    
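To back up the first note, the rules of the referenced security group can be dumped with the AWS CLI; a minimal sketch, using the group ID from the SecurityGroupPolicy above:

aws ec2 describe-security-group-rules \
  --region eu-west-1 \
  --filters Name=group-id,Values=sg-08b53279c80ec19d9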

Cilium Version

  • v1.13.1
  • v1.14.0-snapshot.0

Kernel Version

Linux ip-XXX-XXX-XXX-XXX.eu-west-1.compute.internal 5.10.173-154.642.amzn2.x86_64 #1 SMP Wed Mar 15 00:26:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

v1.25.8-eks-ec5523e

Sysdump

cilium-sysdump-20230418-105354.zip

Relevant log output

No response

Anything else?

I've managed to get the probes working by installing the cilium Helm chart with --set endpointRoutes.enabled=false:

helm upgrade --install cilium cilium/cilium --version 1.13.1 \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set cni.exclusive=false \
  --set enableIPv4Masquerade=false \
  --set tunnel=disabled \
  --set endpointRoutes.enabled=false
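
If the value is changed on an existing installation, it may also be necessary to restart the agents and re-create the affected pod so the new routing mode takes effect. A minimal sketch, assuming the release name and namespace used above and the default cilium DaemonSet / cilium-config ConfigMap names (enable-endpoint-routes may simply be absent from the ConfigMap when the option is disabled):

helm -n kube-system get values cilium
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i endpoint-routes
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n cilium-test delete pod -l kind=echo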

However, I'm not sure whether this is a limitation of deploying in CNI chaining mode (and the documentation needs to be updated), or whether I've simply found a workaround for a bug (that should be fixed).

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • area/agent: Cilium agent related.
  • area/cni: Impacts the Container Networking Interface between Cilium and the orchestrator.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  • stale: The stale bot thinks this issue is old. Add the "pinned" label to prevent this from becoming stale.
