Labels: area/datapath, kind/bug, kind/community-report, needs/triage
Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
On an EKS cluster with Managed Node Groups.
Deployment
- Activate Security Groups For Pods (SGFP) following the tutorial described at https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html#security-groups-pods-deployment.
The AWS CNI DaemonSet is configured as described in https://docs.cilium.io/en/v1.13/installation/cni-chaining-aws-cni/#enabling-security-groups-for-pods-eks with the following environment variables relevant to SGFP (see the command-line sketch after this list):
# for init containers
DISABLE_TCP_EARLY_DEMUX=true
# for main container
ENABLE_POD_ENI=true
POD_SECURITY_GROUP_ENFORCING_MODE=standard
AWS_VPC_K8S_CNI_EXTERNALSNAT=true
ENABLE_PREFIX_DELEGATION=false
- Install the Cilium Helm chart in VPC CNI chaining mode with the following values, as described in the documentation at https://docs.cilium.io/en/v1.13/installation/cni-chaining-aws-cni/#chaining-aws-cni (also reflected in the sketch after this list):
cni:
  chainingMode: aws-cni
  exclusive: false
enableIPv4Masquerade: false
tunnel: disabled
endpointRoutes:
  enabled: true
- Apply the taint node.cilium.io/agent-not-ready=true:NoExecute to the EKS Managed Node Groups so that application pods are only scheduled once Cilium is ready to manage them.
- Run cilium connectivity test - it passes successfully.
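For reproducibility, here is a minimal command-line sketch of the steps above. It assumes the default aws-node DaemonSet (with its aws-vpc-cni-init init container) in kube-system and the official cilium Helm repository; it is illustrative only, not the exact commands used:

```sh
# Enable Security Groups for Pods on the AWS VPC CNI
# (assumes the default aws-node DaemonSet and its aws-vpc-cni-init init container)
kubectl -n kube-system set env daemonset/aws-node \
  ENABLE_POD_ENI=true \
  POD_SECURITY_GROUP_ENFORCING_MODE=standard \
  AWS_VPC_K8S_CNI_EXTERNALSNAT=true \
  ENABLE_PREFIX_DELEGATION=false
kubectl -n kube-system patch daemonset aws-node --type=strategic \
  -p '{"spec":{"template":{"spec":{"initContainers":[{"name":"aws-vpc-cni-init","env":[{"name":"DISABLE_TCP_EARLY_DEMUX","value":"true"}]}]}}}}'

# Install Cilium 1.13.1 in AWS VPC CNI chaining mode with the values listed above
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.13.1 \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set cni.exclusive=false \
  --set enableIPv4Masquerade=false \
  --set tunnel=disabled \
  --set endpointRoutes.enabled=true

# Sanity check after installation
cilium connectivity test
```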
Problem
Scenario:
- Pods are running on the same EKS worker node.
- Client application is running as a K8s pod with a Security Group For Pods (SGFP) attached based on a SecurityGroupPolicy definition. The attachment is applied via a ServiceAccount selector matching the SA used by the pod (a hypothetical example follows this list). The problem is reproduced regardless of whether the pod has only the dedicated SecurityGroup attached (and no EKS Worker Node SecurityGroup) or both, i.e., it uses a branch ENI.
- Service application is also running as a K8s pod on the same node and does not have a matching SecurityGroupPolicy. As a result, it gets the regular EKS Worker Node SecurityGroup applied, i.e., it shares the ENI of the node.
- The Cilium pod on the node starts before, or at approximately the same time as, the AWS vpc-cni pod. This is mostly relevant for newly created MNG worker nodes brought up by cluster scaling events.
- The regular application workloads are scheduled onto this node later, once Cilium clears the taint from the node.
- The problem occurs for pods running on the same node when one of them uses a regular ENI and the other uses a branch ENI.
- The pod with SGFP is able to communicate with pods running on other worker nodes, regardless of whether it has a matching CiliumNetworkPolicy.
- If a CiliumNetworkPolicy is applied to the pod with SGFP attached, connectivity to pods on other nodes works with L4, L7, and DNS-based rules, but it does not work towards a pod from the same deployment or workload running on the same node.
- The problem is fixed if the pods are restarted in the following order: AWS VPC-CNI ==> cilium ==> all application pods.
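For illustration, a minimal sketch of the kind of SecurityGroupPolicy used for the client pod; the namespace, selector labels, and security group ID below are hypothetical placeholders, not values from the actual cluster:

```sh
# Hypothetical SecurityGroupPolicy: attaches a dedicated SG to pods whose
# ServiceAccount matches the selector (placeholder names and SG ID)
kubectl apply -f - <<'EOF'
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: client-sgp
  namespace: demo
spec:
  serviceAccountSelector:
    matchLabels:
      app: client
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0
EOF
```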
Observations:
- If the Service pod also gets a SecurityGroup attached (i.e., it also starts to use a branch ENI) and the rules on those SGs allow the traffic both on the client (egress) and the service (ingress) side, connectivity works.
- If we restart the aws vpc-cni and cilium pods in the correct order (first aws, wait for it to be ready, then cilium) and afterwards restart all pods that were scheduled to that node so that they run on it again, connectivity works (see the sketch after this list).
- The problem happens regardless of whether one or both of the pods have a CiliumNetworkPolicy applicable to them.
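A sketch of the restart workaround from the observations above, assuming the default DaemonSet names aws-node and cilium in kube-system; the application namespace and node name are placeholders:

```sh
# 1. Restart the AWS VPC CNI and wait until it is ready again
kubectl -n kube-system rollout restart daemonset/aws-node
kubectl -n kube-system rollout status daemonset/aws-node

# 2. Restart Cilium and wait until it is ready again
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium

# 3. Recreate the application pods already scheduled on the affected node
#    (placeholder namespace and node name)
kubectl -n demo delete pods --field-selector spec.nodeName=<node-name>
```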
Setting the value endpointRoutes.enabled=false fixes the problem when using L4-based policies but completely breaks L7 policies.
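For reference, a sketch of how endpointRoutes.enabled can be toggled on an existing Helm release, assuming the release is named cilium in kube-system; such a config change generally requires restarting the Cilium agents to take effect:

```sh
helm upgrade cilium cilium/cilium --version 1.13.1 \
  --namespace kube-system \
  --reuse-values \
  --set endpointRoutes.enabled=false
kubectl -n kube-system rollout restart daemonset/cilium
```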
Cilium Version
1.13.1
Kernel Version
5.10.184-175.731.amzn2.x86_64
Kubernetes Version
Runtime details:
- EKS ControlPlane version: v1.24.15-eks-a5565ad
- EKS WorkerNode AMI version: 1.24.10-20230304
- Kubelet version: v1.24.10-eks-48e63af
- Container runtime: containerd://1.6.6, containerd://1.6.19
- AWS CNI Version: 1.11.4-eksbuild.1, v1.12.6-eksbuild.1
Sysdump
cilium-sysdump-20230729-234124.zip
Relevant log output
No response
Anything else?
The following GitHub issues look somewhat related, but are not exactly the same:
- Cilium on EKS with Bottlerocket and AWS VPC CNI chaining mode does not work without setting hostLegacyRouting to true #20677
- Kubelet probes fail for EKS pods with AWS security groups when cilium is deployed in CNI chaining mode with AWS VPC CNI. #24943
Code of Conduct
- I agree to follow this project's Code of Conduct