@foyerunix foyerunix commented Feb 19, 2025

Hello,

Fixes: #37577

Following my issue, I continued my investigations to understand why the UDP sockets weren't closed.

I first added a log line at the beginning of TerminateUDPConnectionsToBackend to check whether the function was entered at all, since I didn't see the "handling udp connections to deleted backend" log.

What I discovered is that the l3n4Addr.Protocol field wasn't set to "UDP" but to "ANY", which I understood to be due to the service having both a TCP and a UDP port.

I therefore added a check for "ANY" on l3n4Addr.Protocol and, in that case, set the protocol variable to UDP.

This resolved the issue in my environment and, as far as I understand the code, cannot cause any new issues.

Best Regards.

Fix connections to deleted service backends not getting terminated in certain cases involving services with multiple protocol ports.

@foyerunix foyerunix requested a review from a team as a code owner February 19, 2025 13:06
@foyerunix foyerunix requested a review from aditighag February 19, 2025 13:06
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 19, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Feb 19, 2025
@joestringer joestringer added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Feb 20, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 20, 2025
@aditighag
Member

/test

@foyerunix
Author

Hello @aditighag,

Thanks for running the tests. I've reviewed the failures:

"The action 'Wait for images to be available (downgrade)' has timed out after 10 minutes."

  • This seems to be a build error unrelated to my PR.

"Failed to execute tcpdump on cilium-test-3/host-netns-wtqnd (kind-worker2): command failed (pod=cilium-test-3/host-netns-wtqnd, container=): error reading from error stream: next reader: websocket: close 1006 (abnormal closure): unexpected EOF"

  • This appears to be an error during the test setup. Tcpdump needs to run to check if the captured packets match some condition.

"command 'curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code}\n --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fc00:c111::5]:30755' failed: command failed (pod=cilium-test-4/host-netns-non-cilium-pjg8p, container=host-netns-non-cilium): command terminated with exit code 28"

  • This might indicate my PR is breaking TCP connections.

"Expected to capture packets, but none found. This check might be broken."

  • This could be a flaky test.

"timeout=5s error='rpc error: code = Unavailable desc = dns: A record lookup error: lookup hubble-peer.kube-system.svc.cluster.local. on 10.100.0.10:53: read udp 192.168.122.27:33793->10.100.0.10:53: i/o timeout' subsys=hubble-relay"

  • This might be due to my patch closing sockets too aggressively.

Could you help me distinguish between CI issues and those caused by my fix?

Additionally, I'd appreciate your feedback on the code itself. It works in my environment without issues, but I'm open to reworking my PR if you see a better solution.

Best regards.

Member

@aditighag aditighag left a comment


Could you rebase your PR? There were certain CI issues that were fixed upstream recently.
I wonder if this is a regression introduced by the protocol differentiation changes - https://github.com/cilium/cilium/pull/33434/commits. Are you able to reproduce this issue by setting the config flag bpf-lb-proto-diff to false?
Also, I've added a release note to the PR description - https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#submitting-a-pull-request.

@aditighag aditighag added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Feb 28, 2025
@foyerunix foyerunix force-pushed the fix-socket-force-close branch 2 times, most recently from dc3fe8c to d9c69e1 Compare March 3, 2025 06:54
@foyerunix
Author

Hello @aditighag,

Thanks for fixing the release note; I had assumed my PR title was suitable as a release note.

I made the following test:

  • deploy Cilium 1.17.0
  • set bpf-lb-proto-diff to false in the Cilium configmap
  • restart the Cilium daemonset
  • restart the daemonset of the service that is receiving packets from long lived UDP connections

After looking at Hubble, I still see the symptomatic drops of UDP packets going to non-existing backends.

I rebased the PR and accepted your change.

Best Regards.

@aditighag
Member

/test

@aditighag aditighag removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 3, 2025
@aditighag
Member

@foyerunix All tests passed. Can you drop my sign-off from the commit? Once you've force pushed to the PR, we need not run the full CI test suite before merging the PR.

@aditighag
Member

@foyerunix I wasn't able to reproduce the issue locally. See snippets from my testing below.

Can you share the repro steps including your service manifest?

$ k get svc
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
echo            ClusterIP   10.96.132.231   <none>        80/TCP,69/UDP     119s

$ ks logs cilium-mtskr | grep -i handling
time="2025-03-04T21:31:24.278942184Z" level=info msg="handling udp connections to deleted backend 10.244.1.54:80/TCP" subsys=service
time="2025-03-04T21:31:24.279089498Z" level=info msg="handling udp connections to deleted backend 10.244.1.54:69/UDP" subsys=service

diff --git a/pkg/service/connections.go b/pkg/service/connections.go
index 292d1ba4db..b12465a541 100644
--- a/pkg/service/connections.go
+++ b/pkg/service/connections.go
@@ -25,6 +25,7 @@ import (
 var opSupported = true

 func (s *Service) TerminateUDPConnectionsToBackend(l3n4Addr *lb.L3n4Addr) {
+       log.Infof("handling udp connections to deleted backend %v", l3n4Addr)
        // With socket-lb, existing client applications can continue to connect to
        // deleted backends. Destroy any client sockets connected to the deleted backend.
        if !(option.Config.EnableSocketLB || option.Config.BPFSocketLBHostnsOnly) {

@foyerunix foyerunix force-pushed the fix-socket-force-close branch from d9c69e1 to 4791a1e Compare March 6, 2025 05:15
@foyerunix
Author

Hello @aditighag,

I dropped your sign-off from the commit.

I haven't built a minimal reproducer that allows reproducing the issue without installing the Datadog agent.

I have looked at your snippets and:

time="2025-03-04T21:31:24.278942184Z" level=info msg="handling udp connections to deleted backend 10.244.1.54:80/TCP" subsys=service
time="2025-03-04T21:31:24.279089498Z" level=info msg="handling udp connections to deleted backend 10.244.1.54:69/UDP" subsys=service

That is not what I got when I added debugging. You should see: 10.244.1.54:69/ANY.

Here is my service spec:

spec:
  clusterIP: [redacted]
  clusterIPs:
  - [redacted]
  internalTrafficPolicy: Local
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: dogstatsdport
    port: 8125
    protocol: UDP
    targetPort: 8125
  - name: traceport
    port: 8126
    protocol: TCP
    targetPort: 8126
  selector:
    app: datadog
  sessionAffinity: None
  type: ClusterIP

Best Regards.

@aditighag
Member

What cilium version have you installed on your cluster? Can you try installing the latest version?

@julianwiedmann julianwiedmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/loadbalancing Impacts load-balancing and Kubernetes service implementations feature/socket-lb Impacts the Socket-LB part of Cilium's kube-proxy replacement. labels Mar 7, 2025
@joestringer
Member

Seems like this stalled - are we looking for further confirmation from @foyerunix or do we think the change is right? As far as I follow it seems like the calling code doesn't discriminate on the protocol of the service, and ANY could mean there are UDP connections. Or do we already expand out ANY to each individual protocol in l3n4Addr?

I couldn't immediately find tests for this functionality but I guess if we extended the testsuite that could help to prove the correct behavior.

@foyerunix
Author

foyerunix commented Mar 25, 2025

Hello @joestringer,

I saw that @aditighag asked me to test with the latest Cilium version. However, I don't have the Datadog agent installed in a sandbox environment, so I would need either to start a Cilium upgrade cycle on my clusters or to risk running the latest Cilium agent as-is on an impacted cluster. I'm still unsure what to do.

I run Cilium 1.17.0.

Best Regards.

@aditighag
Member

Hi @foyerunix We are getting more test coverage as part of this PR - #37600. /cc @tommyp1ckles
@tommyp1ckles Would you recommend that @foyerunix cherry-pick your commit and check whether the issue he reported still persists?

@foyerunix
Author

Hello @aditighag,

I can confirm that I have the same issue with Cilium 1.17.2.

@borkmann
Member

borkmann commented Apr 2, 2025

I have looked at your snippets and:

time="2025-03-04T21:31:24.278942184Z" level=info msg="handling udp connections to deleted backend 10.244.1.54:80/TCP" subsys=service
time="2025-03-04T21:31:24.279089498Z" level=info msg="handling udp connections to deleted backend 10.244.1.54:69/UDP" subsys=service

Does this mean that we also attempt to close TCP from the above, but eventually fail since the socket cookie + TCP combination doesn't exist?
I think we should land this as a temporary fix, but IIRC @tommyp1ckles might be following up with actually terminating TCP as well, which should then also land by 1.18.

@aditighag
Member

/test

@aditighag
Member

Hi @foyerunix Could you rebase your PR?

@aditighag aditighag added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Apr 14, 2025
@foyerunix foyerunix force-pushed the fix-socket-force-close branch from 4791a1e to 9ca3361 Compare April 16, 2025 05:42
Add a check for services whose protocol is "ANY" to close their UDP
connections too.

Fixes: cilium#37577

Signed-off-by: foyerunix <foyerunix@foyer.lu>
@foyerunix foyerunix force-pushed the fix-socket-force-close branch from 9ca3361 to 7a9a085 Compare April 16, 2025 06:12
@foyerunix
Author

Hi @aditighag,

Done.

Best Regards.

@aditighag aditighag removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Apr 16, 2025
@aditighag
Member

/test

@aditighag
Member

aditighag commented Apr 16, 2025

🟥 failed to flush ct entries: %w command failed (pod=kube-system/cilium-zds6t, container=cilium-agent): "time="2025-04-16T15:13:14.440649371Z" level=info msg="completed ctmap gc pass" aliveEntries=0 deleted=3611 family=ipv4 proto=TCP skipped=0 subsys=map-ct\ntime="2025-04-16T15:13:14.444586669Z" level=info msg="completed ctmap gc pass" aliveEntries=0 deleted=422 family=ipv4 proto=non-TCP skipped=0 subsys=map-ct\ntime="2025-04-16T15:13:14.446563167Z" level=info msg="completed ctmap gc pass" aliveEntries=0 deleted=120 family=ipv6 proto=TCP skipped=0 subsys=map-ct\ntime="2025-04-16T15:13:14.447961547Z" level=info msg="completed ctmap gc pass" aliveEntries=0 deleted=66 family=ipv6 proto=non-TCP skipped=0 subsys=map-ct\n"

Hit unrelated CI issue.

Context - #38956.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 16, 2025
@aditighag aditighag added this pull request to the merge queue Apr 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 16, 2025
@julianwiedmann julianwiedmann added this pull request to the merge queue Apr 17, 2025
Merged via the queue into cilium:main with commit d3c80fa Apr 17, 2025
67 checks passed
@deadok22

Hey @foyerunix, thanks for the fix!

We just ran into this very issue!

Could you help me understand the new behavior?

Once s.backendConnectionHandler.Destroy(...) is called, what is going to happen, from the perspective of a client attempting writes through a connected udp socket? Do such clients get to continue using the existing socket fd or do they need to make a new one?

@foyerunix
Author

Hello @deadok22,

My PR doesn't change the behavior of socket-LB socket termination. It just ensures that UDP connections are also terminated when the LB protocol is "ANY". As I understand it, the client needs to open a new socket, as the kernel forcefully closes the existing one.

@julianwiedmann
Member

👋 @foyerunix @aditighag curious, is this essentially caused by the termination code assuming that all services are created with L4 protocol differentiation (-> #33434), and we didn't consider services that were created on a pre-v1.17 cluster with the ANY protocol?

If so I think we should backport this fix to v1.17. And add some tests to confirm that we're handling such legacy service definitions as expected.

@foyerunix
Author

Hello @julianwiedmann,

I can confirm that my Datadog service was created before upgrading to Cilium 1.17.

@ysksuzuki ysksuzuki added the needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch label May 12, 2025
@ysksuzuki
Member

When loadBalancer.protocolDifferentiation.enabled=false, the service protocol is treated as ANY. In this case, TerminateUDPConnectionsToBackend does not terminate UDP sockets. Since v1.17 supports protocol differentiation but it may be disabled, we should backport this fix.

if !option.Config.LoadBalancerProtocolDifferentiation {
oldSvc = stripServiceProtocol(oldSvc)
svc = stripServiceProtocol(svc)
endpoints = stripEndpointsProtocol(endpoints)
}

@nbusseneau nbusseneau mentioned this pull request May 15, 2025
9 tasks
@nbusseneau nbusseneau added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels May 15, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels May 19, 2025
Successfully merging this pull request may close these issues.

Long lived unidirectional UDP connections trough a service failing to reach new endpoints