trafficDistribution failing when no endpoint in current zone #41022

@martin31821

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.18.0 and lower than v1.19.0

What happened?

We've enabled topology-aware routing on one of our clusters and set trafficDistribution: PreferClose on our Services, which, according to the Kubernetes docs, should prefer endpoints in the same zone and fall back to the default behavior if no such endpoints exist.

However, we've observed that connections to the Kubernetes Service IP fail with ECONNREFUSED when no endpoints exist in the client's zone.
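
To make the expected behavior concrete, here is a minimal sketch of the selection rule as described above: use same-zone endpoints when there are any, otherwise fall back to all endpoints. It is an illustration of the semantics this report expects, not Cilium's or kube-proxy's actual implementation; the `Endpoint` type, the `select_backends` function, and the zone names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical stand-in for a Service backend and its zone."""
    address: str
    zone: str


def select_backends(endpoints, client_zone):
    """Return the backends a client in `client_zone` is expected to reach.

    Same-zone endpoints are preferred. If the client's zone has none,
    the full endpoint list is returned instead of an empty one.
    """
    local = [ep for ep in endpoints if ep.zone == client_zone]
    return local if local else list(endpoints)
```

With backends only in `zone-a` and `zone-b`, a client in `zone-c` would be expected to get both backends back. The failure reported here corresponds to that client getting no usable backend at all.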

How can we reproduce the issue?

  • Set up a Kubernetes 1.33 cluster running Cilium 1.18.0 with nodes in 3 AZs (in our case it's EKS, but the bug seems to be independent of that).
  • I've provided some manifests to deploy a test server and a debug shell.
    If both run with replicas: 3, everything works as expected. However, when the server is scaled down so that its pods cover only 2 or 1 AZs, only debug shells in an AZ that still has a server pod can connect to a backend.
apiVersion: v1
kind: ConfigMap
metadata:
  name: python-script
data:
  entrypoint.py: |
    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class PodNameHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            pod_name = os.environ.get('POD_NAME', 'unknown-pod')
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain; charset=utf-8')
            self.end_headers()
            self.wfile.write(f"Pod name: {pod_name}\n".encode('utf-8'))

    def run(server_class=HTTPServer, handler_class=PodNameHandler, port=8080):
        server_address = ('', port)
        httpd = server_class(server_address, handler_class)
        print(f"Starting server on port {port}, serving Pod name...")
        httpd.serve_forever()

    if __name__ == '__main__':
        run()

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: python-deployment
  labels:
    app: python
spec:
  replicas: 3
  selector:
    matchLabels:
      app: python
  template:
    metadata:
      labels:
        app: python
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - python
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: python
          image: python:3.13.5-alpine
          command: ["python", "/app/entrypoint.py"]
          ports:
            - containerPort: 8080
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: python-script
              mountPath: /app
      volumes:
        - name: python-script
          configMap:
            name: python-script
---
apiVersion: v1
kind: Service
metadata:
  name: python-service
spec:
  trafficDistribution: PreferClose
  selector:
    app: python
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
  type: ClusterIP
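
For checking the symptom from a debug shell in each AZ, a small probe along the following lines may help. It is only a sketch: it assumes Python is available in the debug shell and that the shell runs in the same namespace as the Service, so that `python-service:8080` (taken from the manifest above) resolves. The `probe` function and `SERVICE_URL` constant are hypothetical names.

```python
import urllib.request

# Assumed URL: Service name and port from the manifest, same namespace.
SERVICE_URL = "http://python-service:8080/"


def probe(url, attempts=10, timeout=2.0):
    """Send `attempts` GET requests and summarize the outcome.

    Returns (ok_count, failed_count, bodies), where `bodies` is the set
    of distinct response bodies, i.e. the backend pod names that answered.
    """
    ok = 0
    failed = 0
    bodies = set()
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                bodies.add(resp.read().decode("utf-8").strip())
            ok += 1
        except OSError:
            # urllib.error.URLError subclasses OSError, so this also
            # covers refused connections (ECONNREFUSED) and timeouts.
            failed += 1
    return ok, failed, bodies
```

Calling `probe(SERVICE_URL)` from a shell in an AZ without a server pod should, with the behavior described in this report, return only failures, while a shell in an AZ with a server pod should return successes and that pod's name.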

Cilium Version

1.18.0

Kernel Version

6.12.37-61.105.amzn2023.x86_64

Kubernetes Version

v1.33.3-eks-3abbec1

Regression

no regression

Sysdump

Will add later.

Relevant log output

No logs in cilium relevant to the issue.

Anything else?

Possibly related to #40883

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees

Labels

  • affects/main: This issue affects main branch
  • affects/v1.18: This issue affects v1.18 branch
  • area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
