Kubelet probes fail for EKS pods with AWS security groups when cilium is deployed in CNI chaining mode with AWS VPC CNI. #24943

@vgrigoruk

Description

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

  1. Create an EKS cluster on AWS following this guide via eksctl (the CLI commands for the steps in this list are sketched right after it):
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    
    metadata:
      name: vgrygoruk-2935
      region: eu-west-1
    
    managedNodeGroups:
    - name: ng-1
      desiredCapacity: 2
      privateNetworking: true
      taints:
      - key: "node.cilium.io/agent-not-ready"
        value: "true"
        effect: "NoExecute"
    
  2. Install the "AWS VPC CNI", "CoreDNS" and "kube-proxy" EKS add-ons.
  3. Install cilium in "CNI Chaining" mode onto the AWS EKS cluster following the official documentation.
    helm install cilium cilium/cilium --version 1.13.2 \
    --namespace kube-system \
    --set cni.chainingMode=aws-cni \
    --set cni.exclusive=false \
    --set enableIPv4Masquerade=false \
    --set tunnel=disabled \
    --set endpointRoutes.enabled=true
    
  4. Run cilium connectivity test
  5. Create a SecurityGroupPolicy to attach a pod security group to the echo pods in the cilium-test namespace (as they have probes defined):
    ---
    apiVersion: vpcresources.k8s.aws/v1beta1
    kind: SecurityGroupPolicy
    metadata:
      name: test-app-psgp
      namespace: cilium-test
    spec:
      podSelector:
        matchLabels:
          kind: echo
      securityGroups:
        groupIds:
          - sg-08b53279c80ec19d9
    
  6. Delete the echo-other-node-* pod in the cilium-test namespace, so that the new pod uses its own ENI with the security group attached to it, and inspect the events of the replacement pod:
        Events:
      Type     Reason                  Age                    From                     Message
      ----     ------                  ----                   ----                     -------
      Normal   Scheduled               18m                    default-scheduler        Successfully assigned cilium-test/echo-other-node-78f77b57f8-lg8xg to ip-192-168-120-241.eu-west-1.compute.internal
      Normal   SecurityGroupRequested  18m                    vpc-resource-controller  Pod will get the following Security Groups [sg-08b53279c80ec19d9]
  Warning  FailedCreatePodSandBox  18m                    kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7dedc1e9a5dd13278502f2592cdc8e8d82276dab1f8aad1bbcb6380f72b8f415": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
  Normal   ResourceAllocated       18m                    vpc-resource-controller  Allocated [{"eniId":"eni-0b923d932ea2fe0af","ifAddress":"06:a4:ac:cd:ce:1b","privateIp":"192.168.103.13","vlanId":3,"subnetCidr":"192.168.96.0/19"}] to the pod
      Normal   Pulled                  18m                    kubelet                  Container image "quay.io/cilium/json-mock:v1.3.3@sha256:f26044a2b8085fcaa8146b6b8bb73556134d7ec3d5782c6a04a058c945924ca0" already present on machine
      Normal   Created                 18m                    kubelet                  Created container echo-other-node
      Normal   Started                 18m                    kubelet                  Started container echo-other-node
      Normal   Pulled                  18m                    kubelet                  Container image "docker.io/coredns/coredns:1.10.0@sha256:017727efcfeb7d053af68e51436ce8e65edbc6ca573720afb4f79c8594036955" already present on machine
      Normal   Created                 18m                    kubelet                  Created container dns-test-server
      Normal   Started                 18m                    kubelet                  Started container dns-test-server
      Warning  Unhealthy               18m (x9 over 18m)      kubelet                  Readiness probe failed: Get "http://192.168.103.13:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      Warning  Unhealthy               3m38s (x448 over 18m)  kubelet                  Readiness probe failed: Get "http://192.168.103.13:8181/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    

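For reference, the commands behind the steps above look roughly like the sketch below. The add-on names are the standard EKS managed add-on names, the file names and pod name placeholders are illustrative, and exact flags may differ across eksctl/cilium-cli versions:

# step 1: create the cluster from the ClusterConfig shown above (saved e.g. as cluster.yaml)
eksctl create cluster -f cluster.yaml

# step 2: install the managed EKS add-ons (default versions)
eksctl create addon --cluster vgrygoruk-2935 --region eu-west-1 --name vpc-cni
eksctl create addon --cluster vgrygoruk-2935 --region eu-west-1 --name coredns
eksctl create addon --cluster vgrygoruk-2935 --region eu-west-1 --name kube-proxy

# step 4: run the Cilium CLI connectivity test (creates the cilium-test namespace and the echo pods)
cilium connectivity test

# step 5: apply the SecurityGroupPolicy shown above (saved e.g. as test-app-psgp.yaml)
kubectl apply -f test-app-psgp.yaml

# step 6: delete the old echo-other-node pod and inspect the events of its replacement
kubectl -n cilium-test delete pod echo-other-node-<old-pod-id>
kubectl -n cilium-test describe pod echo-other-node-<new-pod-id>
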
Expected result: the echo-other-node-* pod runs successfully and kubelet probes pass.
Actual result: the echo-other-node-* pod is started, but kubelet probes keep failing.

Notes:

  • the security group referenced in the SecurityGroupPolicy allows ALL traffic on all ports (both ingress and egress) from/to the VPC CIDR, so the issue is not related to the rules in the AWS security group (a quick way to verify this with the AWS CLI is sketched after these notes).
  • the echo-other-node-* pod runs successfully if I uninstall cilium from the cluster and re-create the pod:
    helm uninstall cilium --namespace kube-system
    # re-create the pod and inspect pod events:
        Events:
      Type     Reason                  Age   From                     Message
      ----     ------                  ----  ----                     -------
      Normal   Scheduled               50s   default-scheduler        Successfully assigned cilium-test/echo-other-node-78f77b57f8-rz78p to ip-192-168-120-241.eu-west-1.compute.internal
      Normal   SecurityGroupRequested  50s   vpc-resource-controller  Pod will get the following Security Groups [sg-08b53279c80ec19d9]
      Warning  FailedCreatePodSandBox  50s   kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d54d00652c58ec0ff2b08b11e0e72227f6c7bade61f1b1ff8224915bce227943": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
      Normal   ResourceAllocated       49s   vpc-resource-controller  Allocated [{"eniId":"eni-0ba69eee6c7826dfe","ifAddress":"06:82:8f:17:be:e7","privateIp":"192.168.112.70","vlanId":1,"subnetCidr":"192.168.96.0/19"}] to the pod
      Normal   Pulled                  49s   kubelet                  Container image "quay.io/cilium/json-mock:v1.3.3@sha256:f26044a2b8085fcaa8146b6b8bb73556134d7ec3d5782c6a04a058c945924ca0" already present on machine
      Normal   Created                 49s   kubelet                  Created container echo-other-node
      Normal   Started                 49s   kubelet                  Started container echo-other-node
      Normal   Pulled                  49s   kubelet                  Container image "docker.io/coredns/coredns:1.10.0@sha256:017727efcfeb7d053af68e51436ce8e65edbc6ca573720afb4f79c8594036955" already present on machine
      Normal   Created                 49s   kubelet                  Created container dns-test-server
      Normal   Started                 49s   kubelet                  Started container dns-test-server
    
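To back up the first note, the rules of the referenced security group can be dumped with the AWS CLI; a minimal sketch, using the group ID from the SecurityGroupPolicy above:

aws ec2 describe-security-group-rules \
  --region eu-west-1 \
  --filters Name=group-id,Values=sg-08b53279c80ec19d9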

Cilium Version

  • v1.13.1
  • v1.14.0-snapshot.0

Kernel Version

Linux ip-XXX-XXX-XXX-XXX.eu-west-1.compute.internal 5.10.173-154.642.amzn2.x86_64 #1 SMP Wed Mar 15 00:26:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

v1.25.8-eks-ec5523e

Sysdump

cilium-sysdump-20230418-105354.zip

Relevant log output

No response

Anything else?

I've managed to get the probes working by installing the cilium Helm chart with --set endpointRoutes.enabled=false:

helm upgrade --install cilium cilium/cilium --version 1.13.1 \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set cni.exclusive=false \
  --set enableIPv4Masquerade=false \
  --set tunnel=disabled \
  --set endpointRoutes.enabled=false
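
If the value is changed on an existing installation, it may also be necessary to restart the agents and re-create the affected pod so the new routing mode takes effect. A minimal sketch, assuming the release name and namespace used above and the default cilium DaemonSet / cilium-config ConfigMap names (enable-endpoint-routes may simply be absent from the ConfigMap when the option is disabled):

helm -n kube-system get values cilium
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i endpoint-routes
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n cilium-test delete pod -l kind=echo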

However, I'm not sure whether this is a limitation of deploying in CNI chaining mode (and the documentation needs to be updated), or whether I've simply found a workaround for a bug (that should be fixed).

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • area/agent: Cilium agent related.
  • area/cni: Impacts the Container Networking Interface between Cilium and the orchestrator.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  • stale: The stale bot thinks this issue is old. Add the "pinned" label to prevent this from becoming stale.
