Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.16.0 and lower than v1.17.0
What happened?
After upgrading Cilium from 1.15 to 1.16 on GKE, all externally exposed services became inaccessible. Traffic inside the cluster was unaffected, and Cilium status reported everything healthy. This affected plain LoadBalancer Services, Ingresses, and Gateway API routes alike.
How can we reproduce the issue?
- Install Cilium with Helm on GKE (we're on 1.30). The cluster should use the legacy datapath (not Dataplane V2).
- Values:

  ```yaml
  agentNotReadyTaintKey: ignore-taint.cluster-autoscaler.kubernetes.io/cilium-agent-not-ready
  aksbyocni:
    enabled: false
  authentication:
    mutual:
      spire:
        enabled: true
        install:
          enabled: true
          existingNamespace: true
          namespace: kube-system
  bpf:
    masquerade: true
  cni:
    binPath: /home/kubernetes/bin
  devices: eth+
  encryption:
    enabled: true
    nodeEncryption: true
    type: wireguard
  envoy:
    enabled: true
  gatewayAPI:
    enableAlpn: true
    enableAppProtocol: true
    enabled: true
    secretsNamespace:
      create: false
      name: kube-system
  hubble:
    listenAddress: ":4244"
    metrics:
      enabled:
        - dns
        - drop
        - tcp
        - flow
        - icmp
        - http
    relay:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: pool
                    operator: In
                    values:
                      - default
                      - control
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  k8s-app: cilium
              topologyKey: kubernetes.io/hostname
      enabled: true
    tls:
      auto:
        method: cronJob
    ui:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: pool
                    operator: In
                    values:
                      - default
                      - control
      enabled: true
  ingressController:
    enabled: false
  ipam:
    mode: kubernetes
  k8sServiceHost: REDACTED
  k8sServicePort: 443
  kubeProxyReplacement: true
  l7Proxy: true
  loadBalancer:
    serviceTopology: true
  localRedirectPolicy: true
  nodeinit:
    enabled: true
    reconfigureKubelet: true
    removeCbrBridge: true
  operator:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: pool
                  operator: In
                  values:
                    - default
                    - control
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                io.cilium/app: operator
            topologyKey: kubernetes.io/hostname
    prometheus:
      enabled: true
  prometheus:
    enabled: true
  upgradeCompatibility: "1.9"
  wellKnownIdentities:
    enabled: true
  ```
- Create an nginx Deployment with a LoadBalancer Service in front of it, then attempt to curl the LoadBalancer IP.
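For reference, a minimal manifest along the lines of the last step above (names, labels, and the image tag are illustrative, not our exact objects):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
```

Any externally exposed Service should do; this is just the simplest case that reproduces the failure.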
Cilium Version
This was reproduced with 1.16.1 and 1.16.3. Once I downgraded to 1.15.9 the problem immediately went away.
Kernel Version
Linux 6.1.100+ #1 SMP PREEMPT_DYNAMIC Sat Aug 24 16:19:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
v1.30.5-gke.1014003
This was confirmed on two separate GKE clusters with different external services.
Regression
This is a regression. The exact same config worked on 1.15.9.
Sysdump
I'll grab this later; I had to fix the cluster because of a deadline.
Relevant log output
I was not able to find any logs that appeared relevant.
Anything else?
I did find that the external service ports were open (tested with `nc -zv`), but `curl` resulted in a connection refused error.
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct