All traffic inbound to the cluster fails after upgrading from 1.15 to 1.16 (GKE-only) #35977

@thejosephstevens

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.16.0 and lower than v1.17.0

What happened?

After upgrading Cilium from 1.15 to 1.16 on GKE, all externally exposed services became inaccessible. Traffic inside the cluster was not impacted, and Cilium reported fully healthy status. This affected plain LoadBalancer Services, Ingresses, and Gateway API routes.
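
For reference, agent health can be confirmed along these lines (a minimal sketch; it assumes the Cilium CLI is installed and that Cilium runs in kube-system):

```bash
# Overall cluster-wide health as reported by the Cilium CLI
cilium status --wait

# Per-agent status straight from a cilium pod
# (cilium-dbg is the in-pod agent binary name as of 1.16)
kubectl -n kube-system exec ds/cilium -- cilium-dbg status --brief
```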

How can we reproduce the issue?

  1. Install Cilium with Helm on GKE (we're on 1.30). This should be a Legacy Datapath cluster (not Dataplane V2).
  2. Use the following Helm values:

```yaml
agentNotReadyTaintKey: ignore-taint.cluster-autoscaler.kubernetes.io/cilium-agent-not-ready
aksbyocni:
  enabled: false
authentication:
  mutual:
    spire:
      enabled: true
      install:
        enabled: true
        existingNamespace: true
        namespace: kube-system
bpf:
  masquerade: true
cni:
  binPath: /home/kubernetes/bin
devices: eth+
encryption:
  enabled: true
  nodeEncryption: true
  type: wireguard
envoy:
  enabled: true
gatewayAPI:
  enableAlpn: true
  enableAppProtocol: true
  enabled: true
  secretsNamespace:
    create: false
    name: kube-system
hubble:
  listenAddress: :4244
  metrics:
    enabled:
    - dns
    - drop
    - tcp
    - flow
    - icmp
    - http
  relay:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: pool
              operator: In
              values:
              - default
              - control
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              k8s-app: cilium
          topologyKey: kubernetes.io/hostname
    enabled: true
  tls:
    auto:
      method: cronJob
  ui:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: pool
              operator: In
              values:
              - default
              - control
    enabled: true
ingressController:
  enabled: false
ipam:
  mode: kubernetes
k8sServiceHost: REDACTED
k8sServicePort: 443
kubeProxyReplacement: true
l7Proxy: true
loadBalancer:
  serviceTopology: true
localRedirectPolicy: true
nodeinit:
  enabled: true
  reconfigureKubelet: true
  removeCbrBridge: true
operator:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: pool
            operator: In
            values:
            - default
            - control
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            io.cilium/app: operator
        topologyKey: kubernetes.io/hostname
  prometheus:
    enabled: true
prometheus:
  enabled: true
upgradeCompatibility: "1.9"
wellKnownIdentities:
  enabled: true
```

  3. Create an nginx Deployment and a LoadBalancer Service in front of it, then attempt to curl the LoadBalancer IP (see the sketch below this list).
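
A minimal sketch of step 3; the names, namespace, and image are illustrative assumptions, not taken from the original report:

```yaml
# Minimal nginx Deployment plus a LoadBalancer Service in front of it.
# The name "nginx-test" and the nginx:stable image are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      containers:
      - name: nginx
        image: nginx:stable
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-test
spec:
  type: LoadBalancer
  selector:
    app: nginx-test
  ports:
  - port: 80
    targetPort: 80
```

Once the Service is assigned an external IP (`kubectl get svc nginx-test`), `curl http://<EXTERNAL-IP>/` should return the nginx welcome page; on the affected 1.16 clusters it did not.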

Cilium Version

This was reproduced with 1.16.1 and 1.16.3. Once I downgraded to 1.15.9, the problem immediately went away.
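
For reference, a downgrade of this kind can be done with Helm roughly as follows (a sketch; the release name `cilium`, the kube-system namespace, and the values file path are assumptions):

```bash
# Pin the chart back to 1.15.9, reusing the same values file
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --version 1.15.9 \
  -f values.yaml

# Restart the agents so every node picks up the downgraded image
kubectl -n kube-system rollout restart daemonset/cilium
```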

Kernel Version

Linux 6.1.100+ #1 SMP PREEMPT_DYNAMIC Sat Aug 24 16:19:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

v1.30.5-gke.1014003

This was confirmed on two separate GKE clusters with different external services.

Regression

This is a regression. The exact same config worked on 1.15.9.

Sysdump

I'll grab this later; I had to fix the cluster because of a deadline.

Relevant log output

I was not able to find any logs that appeared relevant.

Anything else?

I did find that the external service ports were open (tested with nc -zv), but curl failed with a refused connection.
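
Concretely, the two checks behaved along these lines (a sketch; the address and port are placeholders for the redacted LoadBalancer):

```bash
# TCP connect scan: reports the LoadBalancer port as open
nc -zv <LB_IP> 80

# ...yet an HTTP request to the same endpoint fails with a refused connection
curl -v http://<LB_IP>/
```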

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct


Labels

  • area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
  • info-completed: The GH issue has received a reply from the author.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  • needs/triage: This issue requires triaging to establish severity and next steps.
