Description
CI failure
Background
Investigating the sysdump, we see that the coredns pods are not in the Running state and are instead Pending. In EKS, Cilium runs in ENI IPAM mode, which requires the Operator to connect to the AWS API to pull down ENI information and populate the CiliumNode resource, which the Agent uses for IPAM. According to the Operator logs:
2023-04-05T15:05:46.388096278Z level=warning msg="Unable to synchronize EC2 VPC list" error="operation error EC2: DescribeVpcs, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post \"https://ec2.us-east-2.amazonaws.com/\": dial tcp: lookup ec2.us-east-2.amazonaws.com on 10.100.0.10:53: write udp 192.168.97.147:60949->10.100.0.10:53: write: operation not permitted" subsys=eni
2023-04-05T15:05:46.388128823Z level=fatal msg="Unable to start eni allocator" error="Initial synchronization with instances API failed" subsys=cilium-operator-aws
This is likely because DNS resolution failed: 10.100.0.10 is the kube-dns service IP, and with the coredns pods Pending the service has no backends to answer the query.
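A quick way to confirm this against the cluster (a hypothetical check, assuming the standard EKS k8s-app=kube-dns label and service name) would be:

```
# coredns pods should show Pending here
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# the kube-dns service (10.100.0.10) should have no ready endpoints
kubectl -n kube-system get endpoints kube-dns
```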
This causes the Agent to hang for ~5m, until kubelet kicks in and restarts Cilium:
2023-04-05T15:04:31.189152528Z level=info msg="Waiting for IPs to become available in CRD-backed allocation pool" available=0 helpMessage="Check if cilium-operator pod is running and does not have any warnings or error messages." name=ip-192-168-97-147.us-east-2.compute.internal required=2 subsys=ipam
Here, we have a chicken-and-egg problem if DNS is not available: the Operator needs DNS to reach the AWS API and populate the IPAM pool, while coredns cannot come up until Cilium can provide it with networking.
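The IPAM side of the deadlock can be seen on the CiliumNode resource for the node from the log above, which the Operator would normally populate with ENI IPs (a sketch; the field path is assumed from CRD-backed IPAM):

```
# Expected to be empty, since the Operator never synchronized with the EC2 API
kubectl get ciliumnode ip-192-168-97-147.us-east-2.compute.internal \
  -o jsonpath='{.spec.ipam.pool}'
```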
Output
Run cilium install --cluster-name=cilium-cilium-4619701977 --chart-directory=install/kubernetes/cilium --helm-set=image.repository=quay.io/cilium/cilium-ci --helm-set=image.useDigest=false --helm-set=image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff --helm-set=operator.image.repository=quay.io/cilium/operator --helm-set=operator.image.suffix=-ci --helm-set=operator.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff --helm-set=operator.image.useDigest=false --helm-set=clustermesh.apiserver.image.repository=quay.io/cilium/clustermesh-apiserver-ci --helm-set=clustermesh.apiserver.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff --helm-set=clustermesh.apiserver.image.useDigest=false --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci --helm-set=hubble.relay.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff --helm-set loadBalancer.l7.backend=envoy --helm-set tls.secretsBackend=k8s --wait=false --rollback=false --config monitor-aggregation=none --version=
Flag --cluster-name has been deprecated, This can now be overridden via `helm-set` (Helm value: `cluster.name`).
🔮 Auto-detected Kubernetes kind: EKS
ℹ️ Using Cilium version 1.13.90
🔮 Auto-detected datapath mode: aws-eni
🔮 Auto-detected kube-proxy has been installed
ℹ️ helm template --namespace kube-system cilium "install/kubernetes/cilium" --version 1.13.90 --set cluster.id=0,cluster.name=cilium-cilium-4619701977,clustermesh.apiserver.image.repository=quay.io/cilium/clustermesh-apiserver-ci,clustermesh.apiserver.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff,clustermesh.apiserver.image.useDigest=false,egressMasqueradeInterfaces=eth0,encryption.nodeEncryption=false,eni.enabled=true,extraConfig.monitor-aggregation=none,hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci,hubble.relay.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff,image.repository=quay.io/cilium/cilium-ci,image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff,image.useDigest=false,ipam.mode=eni,kubeProxyReplacement=disabled,loadBalancer.l7.backend=envoy,nodeinit.enabled=true,operator.image.repository=quay.io/cilium/operator,operator.image.suffix=-ci,operator.image.tag=3e5fd58ca2f32adb7ffedca07b439f8c525beeff,operator.image.useDigest=false,operator.replicas=1,serviceAccounts.cilium.name=cilium,serviceAccounts.operator.name=cilium-operator,tls.secretsBackend=k8s,tunnel=disabled
ℹ️ Storing helm values file in kube-system/cilium-cli-helm-values Secret
🚀 Creating ConfigMap for Cilium version 1.13.90...
ℹ️ Manual overwrite in ConfigMap: monitor-aggregation=none
🔑 Created CA in secret cilium-ca
🔑 Generating certificates for Hubble...
🚀 Creating Service accounts...
🚀 Creating Cluster roles...
🚀 Creating ConfigMap for Cilium version 1.13.90...
ℹ️ Manual overwrite in ConfigMap: monitor-aggregation=none
🚀 Creating EKS Node Init DaemonSet...
🚀 Creating Agent DaemonSet...
🚀 Creating Operator Deployment...
⌛ Waiting for Cilium to be installed and ready...
/¯¯\
Error: Unable to install Cilium: timeout while waiting for status to become successful: context deadline exceeded
/¯¯\__/¯¯\ Cilium: 3 errors
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Hubble Relay: disabled
\__/¯¯\__/ ClusterMesh: disabled
\__/
DaemonSet cilium Desired: 2, Unavailable: 2/2
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
Containers: cilium Running: 2
cilium-operator Running: 1
Cluster Pods: 0/2 managed by Cilium
Image versions cilium quay.io/cilium/cilium-ci:3e5fd58ca2f32adb7ffedca07b439f8c525beeff: 2
cilium-operator quay.io/cilium/operator-aws-ci:3e5fd58ca2f32adb7ffedca07b439f8c525beeff: 1
Errors: cilium cilium-hcqt8 unable to retrieve cilium status: command terminated with exit code 1
cilium cilium-cs8zq unable to retrieve cilium status: command terminated with exit code 1
cilium cilium 2 pods of DaemonSet cilium are not ready
ℹ️ Rollback disabled with '--rollback=false', leaving installed resources behind
Error: Process completed with exit code 1.
- https://github.com/cilium/cilium/actions/runs/4619701977/jobs/8168912428
- https://github.com/cilium/cilium/actions/runs/4618219691/jobs/8172578395
Potential solutions
(Thanks to @chancez, @marseel, and @joestringer for brainstorming)
1. Modify the Operator to use the node's DNS resolver, as it depends on cluster DNS today
2. Ensure the in-cluster DNS pods are ready before installing Cilium (see the sketch below)
3. Make the Operator resolve the AWS API with a different resolver/configuration
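For option (2), the CI workflow could gate the install on DNS readiness, roughly like the following (a sketch, assuming coredns carries the standard k8s-app=kube-dns label and is schedulable before Cilium is installed):

```
# Wait for at least one ready coredns pod before running `cilium install`
kubectl -n kube-system wait --for=condition=Ready pod -l k8s-app=kube-dns --timeout=5m
```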
Option (1) only works if the Operator doesn't need in-cluster DNS. AFAIK, the Operator talks to:
- K8s
- kvstore
- cloud-managed API for IPAM (aka AWS)
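If none of those require resolving in-cluster service names, option (1) could be as simple as switching the Operator pod to the node's resolver via its dnsPolicy. This is only an illustration, not a confirmed chart option, so it is shown as a raw patch:

```
# dnsPolicy "Default" makes the pod use the node's /etc/resolv.conf instead of kube-dns
kubectl -n kube-system patch deployment cilium-operator --type merge \
  -p '{"spec":{"template":{"spec":{"dnsPolicy":"Default"}}}}'
```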