
Communication does not work on EKS with AWS VPC CNI and Cilium deployed in chaining mode for pods running on the same node when Security Groups for Pods is activated and applied to one of the communicating pods #27152

@GolubevV

Description


Is there an existing issue for this?

  • I have searched the existing issues

What happened?

On an EKS cluster with Managed Node Groups, the environment was deployed as follows.

Deployment

  1. Activate Security Groups for Pods (SGFP) following the tutorial described at https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html#security-groups-pods-deployment.
    The AWS CNI DaemonSet is configured as described in https://docs.cilium.io/en/v1.13/installation/cni-chaining-aws-cni/#enabling-security-groups-for-pods-eks with the following environment variables relevant to SGFP (a sketch of how these appear on the aws-node DaemonSet follows this list):
# for init containers
DISABLE_TCP_EARLY_DEMUX=true

# for main container
ENABLE_POD_ENI=true
POD_SECURITY_GROUP_ENFORCING_MODE=standard
AWS_VPC_K8S_CNI_EXTERNALSNAT=true
ENABLE_PREFIX_DELEGATION=false
  2. Install the Cilium Helm chart in AWS VPC CNI chaining mode with the following values, as described in the documentation at https://docs.cilium.io/en/v1.13/installation/cni-chaining-aws-cni/#chaining-aws-cni:
cni:
  chainingMode: aws-cni
  exclusive: false

enableIPv4Masquerade: false
tunnel: disabled
endpointRoutes:
  enabled: true
  3. Apply the taint node.cilium.io/agent-not-ready=true:NoExecute to the EKS Managed Node Groups so that application pods are only scheduled once Cilium is ready to manage them (a minimal eksctl sketch follows this list).

  4. Run cilium connectivity test - it passes successfully.
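For clarity, a minimal sketch of how the variables from step 1 would appear on the aws-node DaemonSet; the container names are assumptions based on common amazon-vpc-cni manifests, not copied from this cluster:

# init container (assumed name: aws-vpc-cni-init)
env:
  - name: DISABLE_TCP_EARLY_DEMUX
    value: "true"

# main container (assumed name: aws-node)
env:
  - name: ENABLE_POD_ENI
    value: "true"
  - name: POD_SECURITY_GROUP_ENFORCING_MODE
    value: "standard"
  - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
    value: "true"
  - name: ENABLE_PREFIX_DELEGATION
    value: "false"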
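A minimal eksctl sketch of the node group taint from step 3; eksctl is an assumption here, and the cluster, region, and node group names are placeholders:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster          # placeholder cluster name
  region: eu-west-1         # placeholder region
managedNodeGroups:
  - name: workers           # placeholder node group name
    taints:
      # keep application pods off the node until Cilium is ready to manage them
      - key: node.cilium.io/agent-not-ready
        value: "true"
        effect: NoExecute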

Problem

Scenario:

  • Pods are running on the same EKS worker node.
  • The client application runs as a K8s pod with a Security Group for Pods (SGFP) attached based on a SecurityGroupPolicy definition; the attachment is applied via a ServiceAccount selector matching the SA used by the pod (a minimal example follows this list). The problem reproduces regardless of whether the pod has only the dedicated SecurityGroup attached (and no EKS worker node SecurityGroup) or both (i.e., it uses a branch ENI).
  • The service application also runs as a K8s pod on the same node and has no matching SecurityGroupPolicy. As a result, it gets the regular EKS worker node SecurityGroup applied (i.e., it shares the node's ENI).
  • The Cilium pod on the node starts before, or at approximately the same time as, the AWS VPC CNI pod. This is mostly relevant to newly created MNG worker nodes brought up by cluster scaling events.
  • Regular application workloads are assigned to this node later, once Cilium clears the taint from the node.
  • The problem occurs for pods running on the same node when one of them uses a regular ENI and the other uses a branch ENI.
  • The pod with SGFP is able to communicate with pods running on other worker nodes, regardless of whether it has a matching CiliumNetworkPolicy or not.
  • If a Cilium network policy is applied to the pod with SGFP attached, connectivity to pods on other nodes works with L4, L7, and DNS-based rules, but it does not work for a pod from the same deployment or workload running on the same node.
  • The problem is fixed if the pods are restarted in the order: AWS VPC CNI ==> Cilium ==> all application pods.
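For illustration, a minimal SecurityGroupPolicy sketch matching the setup described above; the name, namespace, ServiceAccount label, and security group ID are placeholders, not the actual values used:

apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: client-sgp                # placeholder name
  namespace: my-namespace         # placeholder namespace
spec:
  serviceAccountSelector:
    matchLabels:
      app: client                 # assumed label on the client ServiceAccount
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0      # placeholder dedicated security group ID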

Observations:

  • If the service pod also gets a SecurityGroup attached (i.e., it also starts using a branch ENI), and the rules on those SecurityGroups allow the traffic both at the client (egress) and at the service (ingress), connectivity works.
  • If we restart the aws-node (VPC CNI) and Cilium pods in the correct order (first the VPC CNI, wait for it to be ready, then Cilium) and later restart all pods that were scheduled to that node so that they run on it again, connectivity works.
  • The problem happens regardless of whether one or both of the pods have a CiliumNetworkPolicy applicable to them.
  • Setting endpointRoutes.enabled=false fixes the problem when using L4-based policies but completely breaks L7 policies (an example L7 policy is sketched below).
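For illustration, a minimal sketch of an L7 CiliumNetworkPolicy of the kind referred to above; the names, labels, port, and HTTP rule are placeholders and assumptions, not the actual policy used:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: client-egress-l7          # placeholder name
  namespace: my-namespace         # placeholder namespace
spec:
  endpointSelector:
    matchLabels:
      app: client                 # assumed label of the SGFP client pod
  egress:
    - toEndpoints:
        - matchLabels:
            app: service          # assumed label of the service pod on the same node
      toPorts:
        - ports:
            - port: "8080"        # assumed service port
              protocol: TCP
          rules:
            http:                 # L7 rule enforced via the Cilium proxy
              - method: GET
                path: "/.*"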

Cilium Version

1.13.1

Kernel Version

5.10.184-175.731.amzn2.x86_64

Kubernetes Version

Runtime details:

  • EKS ControlPlane version: v1.24.15-eks-a5565ad
  • EKS WorkerNode AMI version: 1.24.10-20230304
  • Kubelet version: v1.24.10-eks-48e63af
  • Container runtime: containerd://1.6.6, containerd://1.6.19
  • AWS CNI Version: 1.11.4-eksbuild.1, v1.12.6-eksbuild.1

Sysdump

cilium-sysdump-20230729-234124.zip

Relevant log output

No response

Anything else?

The following GitHub issues look somewhat related but are not exactly the same:

Code of Conduct

  • I agree to follow this project's Code of Conduct
