
Conversation

joamaki
Contributor

@joamaki joamaki commented Aug 6, 2025

This fixes handling of terminating endpoints that are marked as not serving. These were skipped by ParseEndpointSliceV1, which led to them being removed from the backend BPF maps and connections being disrupted too early when a terminating pod stopped signalling readiness (conditions became {ready: false, serving: false, terminating: true}).

To fix this we:

  1. introduce BackendStateTerminatingNotServing for terminating backends that are to be kept in BPF maps, but not load-balanced to.
  2. remove the skipping of non-serving backends from ParseEndpointSliceV1. In case the conditions are the unexpected {ready: false, serving: false, terminating: false}, the backend is simply quarantined.
Kubernetes endpoints that are terminating are retained in the backends BPF state regardless of the "serving" condition to avoid connection disruptions when a pod no longer signals readiness to process new connections.
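
For illustration, here is a minimal, self-contained sketch of the condition-to-state mapping described above. This is not Cilium's actual code: only BackendStateTerminatingNotServing is a name introduced by this PR, and stateForConditions is a hypothetical helper (the real logic lives in ParseEndpointSliceV1 and Cilium's loadbalancer package).

package main

import "fmt"

// Illustrative state constants; the real ones live in Cilium's loadbalancer package.
type BackendState string

const (
	BackendStateActive                BackendState = "active"
	BackendStateTerminating           BackendState = "terminating"
	BackendStateTerminatingNotServing BackendState = "terminating-not-serving" // added by this PR
	BackendStateQuarantined           BackendState = "quarantined"
)

// EndpointConditions mirrors the EndpointSlice v1 endpoint conditions.
type EndpointConditions struct {
	Ready, Serving, Terminating bool
}

// stateForConditions is a hypothetical helper sketching the mapping described
// above; the actual logic is part of ParseEndpointSliceV1 and may differ.
func stateForConditions(c EndpointConditions) BackendState {
	switch {
	case c.Terminating && c.Serving:
		return BackendStateTerminating // still eligible as a fallback backend
	case c.Terminating:
		// Kept in the backends BPF map so existing connections survive,
		// but never load-balanced to.
		return BackendStateTerminatingNotServing
	case c.Ready:
		return BackendStateActive
	default:
		// Unexpected {ready: false, serving: false, terminating: false}:
		// quarantine instead of dropping the backend.
		return BackendStateQuarantined
	}
}

func main() {
	fmt.Println(stateForConditions(EndpointConditions{Terminating: true}))
	// prints: terminating-not-serving
}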

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 6, 2025
@joamaki
Contributor Author

joamaki commented Aug 6, 2025

/test

@joamaki joamaki force-pushed the pr/joamaki/lb-graceful-termination-of-not-serving branch 6 times, most recently from 4ab9dc9 to 3a56eac on August 6, 2025 at 16:45
@joamaki joamaki force-pushed the pr/joamaki/lb-graceful-termination-of-not-serving branch from 3a56eac to 5fe283a on August 8, 2025 at 10:04
@joamaki joamaki added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Aug 8, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 8, 2025
@joamaki joamaki added the needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch label Aug 8, 2025
@joamaki
Contributor Author

joamaki commented Aug 8, 2025

/test

@joamaki joamaki force-pushed the pr/joamaki/lb-graceful-termination-of-not-serving branch from 5fe283a to 42fd2c8 on August 8, 2025 at 12:37
@joamaki
Contributor Author

joamaki commented Aug 8, 2025

/test

@joamaki joamaki marked this pull request as ready for review August 11, 2025 09:30
@joamaki joamaki requested review from a team as code owners August 11, 2025 09:30
@joamaki joamaki enabled auto-merge August 11, 2025 09:31
@borkmann borkmann requested a review from aojea August 11, 2025 15:10
@joamaki
Contributor Author

joamaki commented Aug 12, 2025

Here is a more detailed explanation of what's going on, using a simple HTTP server that gracefully shuts down on SIGTERM, deployed with a readiness probe.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: unready-pod
  labels:
    app: unready
spec:
  replicas: 1
  selector:
    matchLabels:
      app: unready
  template:
    metadata:
      labels:
        app: unready
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: unready
        image: test.jussi/testpod:0.1
        readinessProbe:
          periodSeconds: 1
          httpGet:
            path: /ready
            port: 8080
        livenessProbe:
          # always succeeds, so only the readiness probe changes during termination
          exec:
            command:
            - "true"
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: svc-unready
spec:
  selector:
    app: unready
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

The HTTP server:

package main

import (
        "context"
        "fmt"
        "log/slog"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
)

func main() {
        srv := &http.Server{Addr: ":8080"}
        http.HandleFunc("/sleep", sleeper)
        http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
                w.WriteHeader(http.StatusOK)
        })
        slog.Info("Listening on :8080")
        go srv.ListenAndServe()

        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGTERM)
        <-sigs
        slog.Info("SIGTERM received, gracefully shutting down")
        // Shutdown closes the listener first (which makes the readiness probe
        // fail) and then waits up to 60 seconds for in-flight requests.
        ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()
        srv.Shutdown(ctx)
}

func sleeper(w http.ResponseWriter, r *http.Request) {
        slog.Info("Request received", "method", r.Method, "url", r.URL, "remote", r.RemoteAddr)
        fmt.Fprintf(w, "sleeping 60 seconds...\n")
        w.(http.Flusher).Flush()
        time.Sleep(60 * time.Second)
}

I'm curl'ing the server from a pod on another node:

kubectl run --image=gcr.io/kubernetes-e2e-test-images/agnhost:2.6 --overrides='{"apiVersion": "v1", "spec": {"nodeSelector": { "kubernetes.io/hostname": "kind-control-plane" }}}'  test
time kubectl exec test -- curl --keepalive-time 1 'svc-unready:80/sleep'

When I scale down the deployment to 0 replicas, the HTTP server receives a SIGTERM, which closes the listening
socket and thus fails the readiness probe. This in turn flips serving to false (evidenced by kubectl get endpointslices/svc-unready-hdt4z -o yaml -w | grep -A3 'conditions:').

Cilium without this fix removes the backend from both the services and backends maps when serving=false, leaving a dangling reference in the conntrack entry; this results in an RST packet when the TCP keepalive hits it, e.g. curl terminates before 60s with curl: (56) Recv failure: Connection refused.

Cilium with this fix marks the backend as quarantined as soon as it has serving=false and keeps it in the backends map without changes, which allows the existing connections to continue.
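
To make the difference concrete, here is a toy model of the lookup path, with plain Go maps standing in for the BPF services, backends and conntrack maps. This is not Cilium's datapath code, just an illustration of why deleting the backend entry resets an established connection while merely excluding it from selection does not.

package main

import "fmt"

type BackendID uint32

// Toy stand-ins for the BPF maps involved.
type Datapath struct {
	ServicesMap  map[string][]BackendID // service -> backends eligible for new connections
	BackendsMap  map[BackendID]string   // backend ID -> pod address
	ConntrackMap map[string]BackendID   // established flow -> chosen backend
}

// ForwardEstablished models a packet on an existing connection: the conntrack
// entry only records a backend ID, so the backends map entry must still exist.
func (d *Datapath) ForwardEstablished(flow string) (string, bool) {
	id, ok := d.ConntrackMap[flow]
	if !ok {
		return "", false
	}
	addr, ok := d.BackendsMap[id]
	return addr, ok // !ok means a dangling reference: the connection is reset
}

func main() {
	d := &Datapath{
		ServicesMap:  map[string][]BackendID{"svc-unready": {1}},
		BackendsMap:  map[BackendID]string{1: "10.0.0.1:8080"},
		ConntrackMap: map[string]BackendID{"client->svc-unready/sleep": 1},
	}

	// With this fix: the backend is only removed from service selection, so
	// new connections won't pick it, but the established flow keeps working.
	d.ServicesMap["svc-unready"] = nil
	fmt.Println(d.ForwardEstablished("client->svc-unready/sleep")) // 10.0.0.1:8080 true

	// Without the fix the backends map entry was deleted too, leaving the
	// conntrack entry dangling; the next keepalive is then answered with a reset.
	delete(d.BackendsMap, 1)
	fmt.Println(d.ForwardEstablished("client->svc-unready/sleep")) // "" false
}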

@joamaki joamaki force-pushed the pr/joamaki/lb-graceful-termination-of-not-serving branch 3 times, most recently from 47faf02 to 3189234 on August 12, 2025 at 09:23
Contributor

@marseel marseel left a comment


Thanks, looks great now.

@joamaki
Contributor Author

joamaki commented Aug 12, 2025

/test

@joamaki joamaki force-pushed the pr/joamaki/lb-graceful-termination-of-not-serving branch 2 times, most recently from 2dc7912 to d85ce8f on August 12, 2025 at 14:57
@joamaki
Contributor Author

joamaki commented Aug 12, 2025

/test

Member

@aditighag aditighag left a comment


It's a clean implementation, mostly LGTM!
The main change in this PR is to exclude backends that are terminating but not serving from load balancing, even if they are the only available backend.
I wish Kubernetes integrated such conditions in the service endpoint updates, eliminating the need to add another "terminating-not-serving" state. Couple of follow-ups:
(1) Do we need to persist this new state in the BPF LB map?
(2) Can you extend the documentation - https://docs.cilium.io/en/latest/network/kubernetes/kubeproxy-free/#graceful-termination?

@joamaki
Contributor Author

joamaki commented Aug 13, 2025

  1. No need to persist anything differently. Essentially the difference from "terminating" is just that we don't pick the backend up as a fallback if no active backends exist (i.e. it won't go into the services map); see the sketch below.

  2. Ah good point! I'll update it with the new information learned during this.
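
For (1), a sketch of the fallback rule with hypothetical names (this is not the actual selection code in Cilium): active backends are preferred, terminating-but-serving backends are used only as a fallback, and terminating-not-serving backends are never selected and remain only in the backends map.

package main

import "fmt"

type BackendState string

const (
	BackendStateActive                BackendState = "active"
	BackendStateTerminating           BackendState = "terminating"
	BackendStateTerminatingNotServing BackendState = "terminating-not-serving"
)

type Backend struct {
	Addr  string
	State BackendState
}

// selectServiceBackends is a hypothetical helper, not Cilium's code: it picks
// the backends that should go into the services map.
func selectServiceBackends(all []Backend) []Backend {
	var active, fallback []Backend
	for _, b := range all {
		switch b.State {
		case BackendStateActive:
			active = append(active, b)
		case BackendStateTerminating:
			fallback = append(fallback, b) // serving while terminating: usable as fallback
		}
		// TerminatingNotServing backends are intentionally never selected.
	}
	if len(active) > 0 {
		return active
	}
	return fallback
}

func main() {
	fmt.Println(selectServiceBackends([]Backend{
		{Addr: "10.0.0.1:8080", State: BackendStateTerminatingNotServing},
		{Addr: "10.0.0.2:8080", State: BackendStateTerminating},
	}))
	// only 10.0.0.2:8080 is eligible for new connections
}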

@joamaki joamaki force-pushed the pr/joamaki/lb-graceful-termination-of-not-serving branch from d85ce8f to c7c0dd6 on August 13, 2025 at 07:18
@joamaki joamaki requested a review from a team as a code owner August 13, 2025 07:18
@joamaki joamaki requested a review from hemanthmalla August 13, 2025 07:18
Member

@borkmann borkmann left a comment


lgtm! thanks Jussi

@joamaki joamaki force-pushed the pr/joamaki/lb-graceful-termination-of-not-serving branch from c7c0dd6 to 803742a on August 13, 2025 at 14:45
@joamaki
Contributor Author

joamaki commented Aug 13, 2025

Saw at least one failure around https://github.com/kubernetes/kubernetes/blob/790393ae92e97262827d4f1fba24e8ae65bbada0/test/e2e/network/service.go#L3041C1-L3042C1 in the k8s e2e which seems very suspicious given the changes in this PR. I think we might be implementing the terminating endpoints fallback incorrectly. Though looking at the test, I think there's a chance that it would try to connect before Cilium has processed the EndpointSlice change, in which case it wouldn't error out. Will need to dive into this before I dare merge this PR (even though it seems that it is unrelated).

EDIT: Ah, looks like we started seeing this failure when we bumped the version: https://github.com/cilium/cilium/actions/runs/16915128150. The failing test case was added 3 years ago already, so it's something else they've changed that started introducing this issue. It could still be that we have a bug in how we handle the terminating endpoint fallback, but it's not related to this PR. I'll look into that issue separately rather than have it block this one.

@joamaki
Contributor Author

joamaki commented Aug 14, 2025

/test

A terminating pod that no longer signals readiness will be seen as an
endpoint with {ready: false, serving: false, terminating: true}. These
were skipped, leading to the backend being removed from the backends map,
which caused connection disruptions.

Fix this by keeping the non-serving terminating backends in the backends BPF
map, but not using them as fallback backends if only terminating backends exist.

This is implemented by:
- Including all conditions in [k8s.Backend], not just Terminating.
- Adding a new backend state (TerminatingNotServing) used with Terminating&!Serving conditions.
- Not considering TerminatingNotServing backends when selecting the backends.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki force-pushed the pr/joamaki/lb-graceful-termination-of-not-serving branch from 803742a to 1059edd on August 15, 2025 at 09:31
@joamaki
Contributor Author

joamaki commented Aug 15, 2025

/test

@joamaki joamaki added this pull request to the merge queue Aug 15, 2025
Merged via the queue into cilium:main with commit 6f41c98 Aug 15, 2025
68 checks passed
@joamaki joamaki deleted the pr/joamaki/lb-graceful-termination-of-not-serving branch August 15, 2025 12:57
@giorio94 giorio94 removed the needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch label Aug 18, 2025
@giorio94
Member

Dropped the needs-backport label for the moment, given #41199
