Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Following these instructions as-is, including creating the EKS cluster with eksctl but not including SGs for pods: https://docs.cilium.io/en/v1.12/gettingstarted/cni-chaining-aws-cni/#chaining-aws-cni, the resulting EKS cluster is functional and the connectivity tests pass. However, if I create the cluster and install Cilium in exactly the same way, but on the Bottlerocket AMI, then all probes fail with a timeout and other connectivity is broken too. I can see in conntrack and tcpdump that kubelet sends a SYN, but no reply from the pod makes it back. There are no packet drops/rejects in iptables or in the cilium monitor logs.
I was comparing the two clusters side by side and found a diff in the cilium status --verbose output: on Amazon Linux 2 it is Host Routing: Legacy, while on Bottlerocket it is Host Routing: BPF. I then created a third cluster (cilium-test-br-lr), in the same way as the second but with bpf.hostLegacyRouting=true, and the probes worked and the connectivity tests passed.
For the first two clusters I did not set this flag explicitly, and it is false by default. I assume that Cilium determines at runtime whether the system supports BPF and sets the mode accordingly, but it seems that this decision was wrong here and something that BPF host routing needs is not actually supported. That's why I am opening this issue, and I am happy to continue troubleshooting it to get to the real root cause, I just don't know how to proceed. Or is there additional config that I should apply in order to make a Bottlerocket cluster work with BPF host routing?
Other issues that are related, but are not quite the same case or didn't solve this issue:
- bottlerocket-os/bottlerocket#1405 - not relevant, because node-init is disabled by default and this issue is about a minimal repro. The reference cluster (AL2) also ran without node-init.
- #15393 - clone of the same issue. The proposed solution bottlerocket-os/bottlerocket#1405 (comment) is exactly what I am doing (was node-init enabled at that time?).
- bottlerocket-os/bottlerocket#1367 - there is indeed a difference in the rp_filter config between Amazon Linux 2 EKS and Bottlerocket; however, setting all rp_filter values to 0 and re-applying the sysctl settings did not solve the issue, the probes were still failing (I validated that sysctl -a | grep -w rp_filter showed 0 for all settings).
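For completeness, this is roughly what I ran on each Bottlerocket node (via the admin container, which shares the host kernel) to rule rp_filter out; the loop is just a convenience over every rp_filter key that sysctl reports:
# set every rp_filter knob to 0, then read them all back
for key in $(sysctl -a 2>/dev/null | grep -w rp_filter | awk -F' = ' '{print $1}'); do
  sysctl -w "${key}=0"
done
sysctl -a | grep -w rp_filter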
Reference cluster cilium-test:
$ cat install-eks-al2.sh
export NAME="cilium-test"
cat <<EOF >eks-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${NAME}
  region: ap-southeast-2
managedNodeGroups:
  - name: ng-1
    desiredCapacity: 2
    privateNetworking: true
    # taint nodes so that application pods are
    # not scheduled/executed until Cilium is deployed.
    # Alternatively, see the note below.
    taints:
      - key: "node.cilium.io/agent-not-ready"
        value: "true"
        effect: "NoExecute"
EOF
eksctl create cluster -f ./eks-config.yaml
Then install Cilium:
helm install cilium cilium/cilium --version 1.12.0 \
--namespace kube-system \
--set cni.chainingMode=aws-cni \
--set enableIPv4Masquerade=false \
--set tunnel=disabled
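After the install I verified the agent and ran the test suite with cilium-cli (version listed under "Cilium Version" below); roughly:
$ cilium status --wait
$ cilium connectivity test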
Broken cluster, Bottlerocket, cilium-test-br:
$ diff ../test-eks/install-eks-al2.sh install-eks-bottlerocket.sh
1c1
< export NAME="cilium-test"
---
> export NAME="cilium-test-br"
13a14,15
> amiFamily: "Bottlerocket"
> instanceType: r5.large
The cilium install is the same.
The coredns pods are still Pending from cluster creation time and start Running when the cilium agent removes the taints. I have also rotated the nodes and restarted the pods a few times, and nothing worked. This is not a coredns issue; it just happens to be the only non-hostNetwork workload on the test cluster (see the checks right after this paragraph).
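To confirm it is the probes failing rather than scheduling, I checked along these lines (plain kubectl; the label/column names are the usual EKS defaults):
# the agent-not-ready taint is gone once Cilium is up
$ kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
# the coredns restarts are driven by probe timeouts
$ kubectl -n kube-system get events --field-selector reason=Unhealthy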
Third cluster, Bottlerocket, with legacy host routing, cilium-test-br-lr: EKS is deployed in the same way as for cilium-test-br; for the cilium install I added bpf.hostLegacyRouting=true:
helm install cilium cilium/cilium --version 1.12.0 \
--namespace kube-system \
--set cni.chainingMode=aws-cni \
--set enableIPv4Masquerade=false \
--set bpf.hostLegacyRouting=true \
--set tunnel=disabled
On the first and third clusters all pods are healthy, readiness/liveness probes work, and the connectivity tests pass (two tests fail, but pass when run manually; presumably a cilium-cli bug, there is already a similar issue about exactly these 2 tests). On the second cluster the probes don't work, and curling from one pod to another pod IP (on the same or another node) does not work either; see the manual repro after the pod listing below.
On the second cluster:
$ k get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-47fdz 1/1 Running 0 26h
kube-system aws-node-8xjbm 1/1 Running 0 26h
kube-system cilium-bcnnh 1/1 Running 0 26h
kube-system cilium-gl2jf 1/1 Running 0 26h
kube-system cilium-operator-598c495f5f-665d5 1/1 Running 0 26h
kube-system cilium-operator-598c495f5f-wlpvh 1/1 Running 0 26h
kube-system coredns-964b95965-969rf 0/1 Running 359 (79s ago) 26h
kube-system coredns-964b95965-xnsck 0/1 CrashLoopBackOff 107 (35s ago) 7h36m
kube-system kube-proxy-7mpv8 1/1 Running 0 26h
kube-system kube-proxy-bpptv 1/1 Running 0 26h
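The manual pod-to-pod repro, using a throwaway client pod (the image and target IP are just the ones I happened to use; the port is the pod's probe port):
# times out on the broken cluster, connects on the other two
$ kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
    curl -sv --max-time 5 http://192.168.149.81:8080/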
Cilium Version
$ cilium version
cilium-cli: 0.11.11 compiled with go1.18.3 on darwin/amd64
cilium image (default): v1.11.6
cilium image (stable): v1.12.0
cilium image (running): v1.12.0
It happens with 1.11.6 too.
Kernel Version
Bottlerocket cluster (amazon/bottlerocket-aws-k8s-1.22-x86_64-v1.8.0-a6233c22):
$ uname -a
Linux ip-192-168-106-102.ap-southeast-2.compute.internal 5.10.118 #1 SMP Thu Jun 9 01:24:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Reference cluster (amazon/amazon-eks-node-1.22-v20220629)
$ uname -a
Linux ip-192-168-112-230.ap-southeast-2.compute.internal 5.4.196-108.356.amzn2.x86_64 #1 SMP Thu May 26 12:49:47 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
$ k version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.24.1
Kustomize Version: v4.5.4
Server Version: v1.22.10-eks-84b4fe6
WARNING: version difference between client (1.24) and server (1.22) exceeds the supported minor version skew of +/-1
Sysdump
cilium-sysdump-20220728-175804.zip
Relevant log output
# Note that these IPs can be different from the ones in the sysdump because I rotated pods/nodes a few times.
# Probes from kubelet to a pod remain in UNREPLIED
bash-5.1# conntrack -E -d 192.168.149.81
[NEW] tcp 6 120 SYN_SENT src=192.168.154.54 dst=192.168.149.81 sport=44354 dport=8080 [UNREPLIED] src=192.168.149.81 dst=192.168.154.54 sport=8080 dport=44354
[NEW] tcp 6 120 SYN_SENT src=192.168.154.54 dst=192.168.149.81 sport=44356 dport=8080 [UNREPLIED] src=192.168.149.81 dst=192.168.154.54 sport=8080 dport=44356
[NEW] tcp 6 120 SYN_SENT src=192.168.154.54 dst=192.168.149.81 sport=57922 dport=8080 [UNREPLIED] src=192.168.149.81 dst=192.168.154.54 sport=8080 dport=57922
## tcpdump example (host: `192.168.106.102`, pod: `192.168.108.32`): the pod does reply with SYN-ACKs (and keeps retransmitting them), but the handshake never completes and the conntrack entries above stay UNREPLIED:
03:42:43.607849 IP 192.168.106.102.39394 > 192.168.108.32.8080: Flags [S], seq 1650401920, win 62727, options [mss 8961,sackOK,TS val 2146721601 ecr 0,nop,wscale 7], length 0
03:42:43.607849 IP 192.168.106.102.39396 > 192.168.108.32.8080: Flags [S], seq 3000395580, win 62727, options [mss 8961,sackOK,TS val 2146721601 ecr 0,nop,wscale 7], length 0
03:42:43.607878 IP 192.168.108.32.8080 > 192.168.106.102.39396: Flags [S.], seq 183656348, ack 3000395581, win 62643, options [mss 8961,sackOK,TS val 2382368077 ecr 2146721601,nop,wscale 7], length 0
03:42:43.607878 IP 192.168.108.32.8080 > 192.168.106.102.39394: Flags [S.], seq 1269762911, ack 1650401921, win 62643, options [mss 8961,sackOK,TS val 2382368077 ecr 2146721601,nop,wscale 7], length 0
03:42:43.607890 IP 192.168.108.32.8080 > 192.168.106.102.39396: Flags [S.], seq 183656348, ack 3000395581, win 62643, options [mss 8961,sackOK,TS val 2382368077 ecr 2146721601,nop,wscale 7], length 0
03:42:43.607890 IP 192.168.108.32.8080 > 192.168.106.102.39394: Flags [S.], seq 1269762911, ack 1650401921, win 62643, options [mss 8961,sackOK,TS val 2382368077 ecr 2146721601,nop,wscale 7], length 0
03:42:44.616940 IP 192.168.108.32.8080 > 192.168.106.102.39394: Flags [S.], seq 1269762911, ack 1650401921, win 62643, options [mss 8961,sackOK,TS val 2382369086 ecr 2146721601,nop,wscale 7], length 0
03:42:44.616951 IP 192.168.106.102.39396 > 192.168.108.32.8080: Flags [S], seq 3000395580, win 62727, options [mss 8961,sackOK,TS val 2146722610 ecr 0,nop,wscale 7], length 0
03:42:44.616959 IP 192.168.108.32.8080 > 192.168.106.102.39394: Flags [S.], seq 1269762911, ack 1650401921, win 62643, options [mss 8961,sackOK,TS val 2382369086 ecr 2146721601,nop,wscale 7], length 0
03:42:44.616983 IP 192.168.108.32.8080 > 192.168.106.102.39396: Flags [S.], seq 183656348, ack 3000395581, win 62643, options [mss 8961,sackOK,TS val 2382369086 ecr 2146721601,nop,wscale 7], length 0
03:42:44.616986 IP 192.168.108.32.8080 > 192.168.106.102.39396: Flags [S.], seq 183656348, ack 3000395581, win 62643, options [mss 8961,sackOK,TS val 2382369086 ecr 2146721601,nop,wscale 7], length 0
03:42:44.616988 IP 192.168.108.32.8080 > 192.168.106.102.39396: Flags [S.], seq 183656348, ack 3000395581, win 62643, options [mss 8961,sackOK,TS val 2382369086 ecr 2146721601,nop,wscale 7], length 0
03:42:44.616991 IP 192.168.108.32.8080 > 192.168.106.102.39396: Flags [S.], seq 183656348, ack 3000395581, win 62643, options [mss 8961,sackOK,TS val 2382369086 ecr 2146721601,nop,wscale 7], length 0
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct