cilium-operator 1.17.3 fails to start on EKS w/ ENI allocator #39106

@mhulscher

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.17.3 and lower than v1.18.0

What happened?

Upgrading the Cilium Helm chart from 1.17.2 to 1.17.3 with the exact same values results in the cilium-operator crashing with the following error:

time="2025-04-23T09:49:25.524358585Z" level=info msg="Starting ENI allocator..." subsys=ipam-allocator-aws
time="2025-04-23T09:49:25.796209131Z" level=warning msg="Unable to synchronize EC2 interface list" error="operation error EC2: DescribeNetworkInterfaces, https response error StatusCode: 400, RequestID: 6a455e57-a9f6-4524-aadc-ce3ede4f490a, api error InvalidParameterCombination: The parameter NetworkInterfaceIds cannot be used with the parameter MaxResults" subsys=eni

How can we reproduce the issue?

Install Cilium with the following values.yaml (an example helm command follows the values):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: eks.amazonaws.com/compute-type
          operator: NotIn
          values:
          - fargate
agentNotReadyTaintKey: startup-taint.cluster-autoscaler.kubernetes.io/cilium-not-ready
bandwidthManager:
  bbr: true
  enabled: true
bpf:
  masquerade: false
  tproxy: true
bpfClockProbe: true
certgen:
  image:
    repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/certgen
cluster:
  name: aio-gfpw
clustermesh:
  apiserver:
    image:
      repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/clustermesh-apiserver
cni:
  chainingMode: none
dnsProxy:
  endpointMaxIpPerHostname: 4000
  minTtl: 0
enableIPv4BIGTCP: true
enableIPv4Masquerade: false
enableIPv6BIGTCP: true
enableIPv6Masquerade: false
encryption:
  enabled: false
  nodeEncryption: true
  type: wireguard
endpointHealthChecking:
  enabled: false
eni:
  awsEnablePrefixDelegation: true
  ec2APIEndpoint: ec2.eu-west-1.amazonaws.com
  enabled: true
  eniTags: {}
  iamRole: arn:aws:iam::123456789012:role/aio-gfpw-cilium-operator
  instanceTagsFilter:
  - aws:eks:cluster-name=aio-gfpw
  updateEC2AdapterLimitViaAPI: true
envoy:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: eks.amazonaws.com/compute-type
            operator: NotIn
            values:
            - fargate
  enabled: true
  image:
    repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/cilium-envoy
  nodeSelector:
    kubernetes.io/os: linux
  priorityClassName: system-node-critical
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: false
      labels:
        system: "true"
  resources:
    limits: null
    requests:
      cpu: 50m
      memory: 100Mi
  rollOutPods: true
  tolerations:
  - operator: Exists
  updateStrategy:
    type: OnDelete
healthChecking: false
hubble:
  enabled: true
  eventBufferCapacity: "8191"
  metrics:
    enableOpenMetrics: true
    enabled:
    - dns:query;labelsContext=source_namespace,source_workload
    - httpV2:exemplars=true;labelsContext=source_namespace,source_workload,source_app,destination_namespace,destination_workload,destination_app,traffic_direction
    serviceMonitor:
      enabled: false
      labels:
        system: "true"
  relay:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values:
              - hubble-relay
          topologyKey: topology.kubernetes.io/zone
    enabled: true
    image:
      repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/hubble-relay
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1
    replicas: 3
    resources:
      requests:
        cpu: 25m
    rollOutPods: true
    updateStrategy:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 1
      type: RollingUpdate
  ui:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values:
              - hubble-ui
          topologyKey: topology.kubernetes.io/zone
    backend:
      image:
        repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/hubble-ui-backend
    enabled: true
    frontend:
      image:
        repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/hubble-ui
    ingress:
      className: ingress-nginx
      enabled: true
      hosts:
      - hubble.aio-gfpw.aws.example.com
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1
    replicas: 1
    resources:
      requests:
        cpu: 25m
    rollOutPods: true
    updateStrategy:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 1
      type: RollingUpdate
image:
  repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/cilium
ipam:
  mode: eni
k8sServiceHost: 51B4364E34F9C9DD6668F765127898E1.gr7.eu-west-1.eks.amazonaws.com
k8sServicePort: 443
kubeProxyReplacement: true
l7Proxy: true
labels: k8s:!job-name k8s:!controller-uid
loadBalancer:
  l7:
    backend: envoy
  serviceTopology: true
localRedirectPolicy: true
nodeinit:
  image:
    repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/startup-script
operator:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: name
            operator: In
            values:
            - cilium-operator
        topologyKey: topology.kubernetes.io/zone
  extraArgs:
  - --unmanaged-pod-watcher-interval=0
  image:
    repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/operator
  podDisruptionBudget:
    enabled: true
    maxUnavailable: 1
  priorityClassName: system-cluster-critical
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: false
      labels:
        system: "true"
  replicas: 2
  resources:
    requests:
      cpu: 25m
  rollOutPods: true
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
  - key: startup-taint.cluster-autoscaler.kubernetes.io/cilium-not-ready
    operator: Exists
  - key: startup-taint.cluster-autoscaler.kubernetes.io/dns-not-ready
    operator: Exists
  - key: efs.csi.aws.com/agent-not-ready
    operator: Exists
pmtuDiscovery:
  enabled: true
policyEnforcementMode: default
preflight:
  image:
    repository: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/aio-gfpw/quay.io/cilium/cilium
priorityClassName: system-node-critical
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
    labels:
      system: "true"
    metricRelabelings:
    - action: keep
      regex: cilium_operator_ces_sync_errors_total|cilium_controllers_failing|cilium_errors_warnings_total|cilium_ipcache_errors_total|cilium_policy_import_errors_total|cilium_policy_l7_parse_errors_total|cilium_bpf_map_pressure
      sourceLabels:
      - __name__
resources:
  limits: null
  requests:
    cpu: 50m
    memory: 300Mi
routingMode: native
socketLB:
  enabled: true
  terminatePodConnections: true
svcSourceRangeCheck: false
updateStrategy:
  type: OnDelete
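
For reference, reproducing amounts to applying the values above with a standard Helm upgrade; the release name cilium and namespace kube-system are assumptions, adjust to your environment:

helm upgrade --install cilium cilium/cilium \
  --version 1.17.3 \
  --namespace kube-system \
  --values values.yaml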

Cilium Version

1.17.3

Kernel Version

Bottlerocket OS 1.36.0 (aws-k8s-1.31) 6.1.131

Kubernetes Version

Client Version: v1.32.4
Server Version: v1.31.7-eks-bcf3d70

Regression

1.17.2

Sysdump

No response

Relevant log output

Anything else?

Rolling back to 1.17.2 immediately fixes the problem.
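
Pinning the chart version back to 1.17.2 with unchanged values is enough, for example (release name and namespace are assumptions, as above):

helm upgrade cilium cilium/cilium \
  --version 1.17.2 \
  --namespace kube-system \
  --values values.yaml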

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees

No one assigned

    Labels

    area/agent: Cilium agent related.
    kind/bug: This is a bug in the Cilium logic.
    kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
    needs/triage: This issue requires triaging to establish severity and next steps.
