CI: ConformanceEKS/installation-and-connectivity/Install Cilium: Error: Unable to install Cilium: timeout while waiting for status to become successful: context deadline exceeded: Error: Process completed with exit code 1 #24774

@christarazi

Description

CI failure

Background

Investigating the sysdump, we see that the coredns pods are not in the Running state; they are stuck in Pending. On EKS, Cilium runs in ENI IPAM mode, which requires the Operator to connect to the AWS API to pull down the ENI information and populate the CiliumNode resource, which the Agent uses for IPAM. In the Operator logs we see:

2023-04-05T15:05:46.388096278Z level=warning msg="Unable to synchronize EC2 VPC list" error="operation error EC2: DescribeVpcs, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post \"https://ec2.us-east-2.amazonaws.com/\": dial tcp: lookup ec2.us-east-2.amazonaws.com on 10.100.0.10:53: write udp 192.168.97.147:60949->10.100.0.10:53: write: operation not permitted" subsys=eni
2023-04-05T15:05:46.388128823Z level=fatal msg="Unable to start eni allocator" error="Initial synchronization with instances API failed" subsys=cilium-operator-aws

The DNS resolution likely failed because 10.100.0.10 is the kube-dns service IP, and since the coredns pods are stuck in Pending, the service has no backends.
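This state can be confirmed against the cluster (or reconstructed from the sysdump). A minimal sketch, assuming the default `k8s-app=kube-dns` label and `kube-system` namespace used by EKS coredns; it requires kubectl and cluster access:

```shell
# Sketch: verify the DNS failure mode described above (requires cluster access).
check_dns_backends() {
  # coredns pods expected to show Pending here
  kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
  # kube-dns Service endpoints; empty output means no backends behind 10.100.0.10
  kubectl -n kube-system get endpoints kube-dns \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}
```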

This causes the Agent to hang for ~5m, until kubelet kicks in and restarts Cilium:

2023-04-05T15:04:31.189152528Z level=info msg="Waiting for IPs to become available in CRD-backed allocation pool" available=0 helpMessage="Check if cilium-operator pod is running and does not have any warnings or error messages." name=ip-192-168-97-147.us-east-2.compute.internal required=2 subsys=ipam

Here we have a chicken-and-egg problem: the Operator needs DNS to reach the AWS API and populate the IPAM pool, but DNS needs Cilium-managed coredns pods to be running.
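The Agent side of the deadlock is visible in the CiliumNode resource: the Operator never fills in the IPAM pool, so the Agent log above keeps reporting available=0. A hedged sketch for inspecting it (requires cluster access; the jsonpath assumes the `cilium.io/v2` CiliumNode schema):

```shell
# Sketch: show the per-node IPAM pool the Operator is supposed to populate.
show_ciliumnode_pool() {
  node="$1"   # e.g. the node name from the Agent log above
  # An empty pool here corresponds to "available=0" in the Agent log.
  kubectl get ciliumnode "$node" -o jsonpath='{.spec.ipam.pool}'
}
```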

Output

Run cilium install --cluster-name=cilium-cilium-4619701977 --chart-directory=install/kubernetes/cilium --helm-set=image.repository=quay.io/cilium/cilium-ci --helm-set=image.useDigest=false --helm-set=image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff --helm-set=operator.image.repository=quay.io/cilium/operator --helm-set=operator.image.suffix=-ci --helm-set=operator.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff --helm-set=operator.image.useDigest=false --helm-set=clustermesh.apiserver.image.repository=quay.io/cilium/clustermesh-apiserver-ci --helm-set=clustermesh.apiserver.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff --helm-set=clustermesh.apiserver.image.useDigest=false --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci --helm-set=hubble.relay.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff --helm-set loadBalancer.l7.backend=envoy --helm-set tls.secretsBackend=k8s --wait=false --rollback=false --config monitor-aggregation=none --version=
Flag --cluster-name has been deprecated, This can now be overridden via `helm-set` (Helm value: `cluster.name`).
🔮 Auto-detected Kubernetes kind: EKS
ℹ️  Using Cilium version 1.13.90
🔮 Auto-detected datapath mode: aws-eni
🔮 Auto-detected kube-proxy has been installed
ℹ️  helm template --namespace kube-system cilium "install/kubernetes/cilium" --version 1.13.90 --set cluster.id=0,cluster.name=cilium-cilium-4619701977,clustermesh.apiserver.image.repository=quay.io/cilium/clustermesh-apiserver-ci,clustermesh.apiserver.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff,clustermesh.apiserver.image.useDigest=false,egressMasqueradeInterfaces=eth0,encryption.nodeEncryption=false,eni.enabled=true,extraConfig.monitor-aggregation=none,hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci,hubble.relay.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff,image.repository=quay.io/cilium/cilium-ci,image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff,image.useDigest=false,ipam.mode=eni,kubeProxyReplacement=disabled,loadBalancer.l7.backend=envoy,nodeinit.enabled=true,operator.image.repository=quay.io/cilium/operator,operator.image.suffix=-ci,operator.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff,operator.image.useDigest=false,operator.replicas=1,serviceAccounts.cilium.name=cilium,serviceAccounts.operator.name=cilium-operator,tls.secretsBackend=k8s,tunnel=disabled
ℹ️  Storing helm values file in kube-system/cilium-cli-helm-values Secret
🚀 Creating ConfigMap for Cilium version 1.13.90...
ℹ️  Manual overwrite in ConfigMap: monitor-aggregation=none
🔑 Created CA in secret cilium-ca
🔑 Generating certificates for Hubble...
🚀 Creating Service accounts...
🚀 Creating Cluster roles...
🚀 Creating ConfigMap for Cilium version 1.13.90...
ℹ️  Manual overwrite in ConfigMap: monitor-aggregation=none
🚀 Creating EKS Node Init DaemonSet...
🚀 Creating Agent DaemonSet...
🚀 Creating Operator Deployment...
⌛ Waiting for Cilium to be installed and ready...

    /¯¯\
Error: Unable to install Cilium: timeout while waiting for status to become successful: context deadline exceeded
 /¯¯\__/¯¯\    Cilium:          3 errors
 \__/¯¯\__/    Operator:        OK
 /¯¯\__/¯¯\    Hubble Relay:    disabled
 \__/¯¯\__/    ClusterMesh:     disabled
    \__/

DaemonSet         cilium             Desired: 2, Unavailable: 2/2
Deployment        cilium-operator    Desired: 1, Ready: 1/1, Available: 1/1
Containers:       cilium             Running: 2
                  cilium-operator    Running: 1
Cluster Pods:     0/2 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium-ci:3e5fd58ca2f32adb7ffedca07b439f8c525beeff: 2
                  cilium-operator    quay.io/cilium/operator-aws-ci:3e5fd58ca2f32adb7ffedca07b439f8c525beeff: 1
Errors:           cilium             cilium-hcqt8    unable to retrieve cilium status: command terminated with exit code 1
                  cilium             cilium-cs8zq    unable to retrieve cilium status: command terminated with exit code 1
                  cilium             cilium          2 pods of DaemonSet cilium are not ready
ℹ️  Rollback disabled with '--rollback=false', leaving installed resources behind
Error: Process completed with exit code 1.

cilium-sysdump-final.zip

Potential solutions

(Thanks to @chancez, @marseel, and @joestringer for brainstorming)

  1. Modify the Operator to use the node's DNS, as it depends on cluster DNS today
  2. Ensure the in-cluster DNS pods are ready before installing Cilium
  3. Make the Operator resolve the AWS API with a different resolver/configuration

Option (1) only works if the Operator doesn't need in-cluster DNS. AFAIK, the Operator talks to:

  • K8s
  • kvstore
  • cloud-managed API for IPAM (aka AWS)
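Since none of those endpoints require in-cluster DNS, option (1) amounts to running the Operator with the node's resolver. A hedged sketch of what that could look like as a patch; `dnsPolicy: Default` is the standard Kubernetes setting that makes a pod inherit the node's /etc/resolv.conf, but this is an illustration of the idea, not the adopted fix:

```shell
# Sketch: switch the Operator to the node's DNS instead of cluster DNS.
use_node_dns_for_operator() {
  # dnsPolicy "Default" = inherit the node's resolv.conf, bypassing kube-dns.
  kubectl -n kube-system patch deployment cilium-operator --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/dnsPolicy","value":"Default"}]'
}
```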

Metadata

Labels

  • area/CI: Continuous Integration testing issue or flake
  • area/agent: Cilium agent related.
  • area/operator: Impacts the cilium-operator component
  • ci/flake: This is a known failure that occurs in the tree. Please investigate me!
  • integration/cloud: Related to integration with cloud environments such as AKS, EKS, GKE, etc.
  • kind/bug: This is a bug in the Cilium logic.
