-
Notifications
You must be signed in to change notification settings - Fork 3.4k
SocketLB: Terminate connections for services with mixed protocols #37745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/test |
Hello @aditighag, Thanks for running the tests. I've reviewed the failures: "The action 'Wait for images to be available (downgrade)' has timed out after 10 minutes."
"Failed to execute tcpdump on cilium-test-3/host-netns-wtqnd (kind-worker2): command failed (pod=cilium-test-3/host-netns-wtqnd, container=): error reading from error stream: next reader: websocket: close 1006 (abnormal closure): unexpected EOF"
"command 'curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code}\n --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://[fc00:c111::5]:30755' failed: command failed (pod=cilium-test-4/host-netns-non-cilium-pjg8p, container=host-netns-non-cilium): command terminated with exit code 28"
"Expected to capture packets, but none found. This check might be broken."
"timeout=5s error='rpc error: code = Unavailable desc = dns: A record lookup error: lookup hubble-peer.kube-system.svc.cluster.local. on 10.100.0.10:53: read udp 192.168.122.27:33793->10.100.0.10:53: i/o timeout' subsys=hubble-relay"
Could you help me distinguish between CI issues and those caused by my fix? Additionally, I'd appreciate your feedback on the code itself. It works in my environment without issues, but I'm open to reworking my PR if you see a better solution. Best regards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you rebase your PR? There were certain CI issues that were fixed upstream recently.
I wonder if this is a regression introduced by the protocol differentiation changes - https://github.com/cilium/cilium/pull/33434/commits. Are you able to reproduce this issue by setting the config flag bpf-lb-proto-diff
to false
?
Also, I've added a release note to the PR description - https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#submitting-a-pull-request.
dc3fe8c
to
d9c69e1
Compare
Hello @aditighag, Thank for fixing the release note, I thought my PR title was suitable as a release note. I made the following test:
After looking at Hubble, I still see the symptomatic drops of UDP packets going to non-existing backends. The PR was rebased and I accepted your change. Best Regards. |
/test |
@foyerunix All tests passed. Can you drop my sign-off from the commit? Once you've force pushed to the PR, we need not run the full CI test suite before merging the PR. |
@foyerunix I wasn't able to reproduce the issue locally. See snippets from my testing below. Can you share the repro steps including your service manifest?
|
d9c69e1
to
4791a1e
Compare
Hello @aditighag, I dropped your sign-off from the commit. I didn't build a minimal reproducer that allow to reproduce the issue without having to install the Datadog agent. I have looked at your snippets and:
Is not what I got when I added debugging. You should have: Here is my service spec:
Best Regards. |
What cilium version have you installed on your cluster? Can you try installing the latest version? |
Seems like this stalled - are we looking for further confirmation from @foyerunix or do we think the change is right? As far as I follow it seems like the calling code doesn't discriminate on the protocol of the service, and ANY could mean there are UDP connections. Or do we already expand out ANY to each individual protocol in I couldn't immediately find tests for this functionality but I guess if we extended the testsuite that could help to prove the correct behavior. |
Hello @joestringer, I saw that @aditighag asked me to do a test with the latest Cilium version. Meanwhile I don't have the Datadog agent installed in a sandbox environment. I need either to start a Cilium upgrade cycle on my clusters, or to take the risk to just launch the latest Cilium agent on an impacted cluster as is. I'm still unsure about what to do. I run Cilium 1.17.0. Best Regards. |
Hi @foyerunix We are getting more test coverage as part of this PR - #37600. /cc @tommyp1ckles |
Hello @aditighag, I can confirm that I have the same issue with Cilium 1.17.2. |
Does this mean that we also attempt to close TCP from above, but eventually fail since socket cookie + TCP doesn't exist? |
/test |
Hi @foyerunix Could you rebase your PR? |
4791a1e
to
9ca3361
Compare
Add a check for services whose protocol is "ANY" to close their UDP connections too. Fixes: cilium#37577 Signed-off-by: foyerunix <foyerunix@foyer.lu>
9ca3361
to
7a9a085
Compare
Hi @aditighag, Done. Best Regards. |
/test |
Hit unrelated CI issue. Context - #38956. |
Hey @foyerunix, thanks for the fix! We just ran into this very issue! Could you help me understand the new behavior? Once |
Hello @deadok22, My PR doesn't change the behavior of socketLB socket termination. It just ensure that UDP connections are terminated when the LB type is "ANY". As I understand, the client need to open a new socket as the kernel forcefully close the existing one. |
👋 @foyerunix @aditighag curious, is this essentially caused by the termination code assuming that all services are created with L4 protocol differentation (-> #33434) - and we didn't consider services that were created on a pre-v1.17 cluster with ANY protocol? If so I think we should backport this fix to v1.17. And add some tests to confirm that we're handling such legacy service definitions as expected. |
Hello @julianwiedmann, I can confirm that my Datadog service was created before upgrading to Cilium 1.17. |
When cilium/pkg/k8s/watchers/service.go Lines 554 to 558 in 7206f0f
|
Hello,
Fixes: #37577
Following my issue, I continued my investigations to understand why the UDP sockets weren't closed.
I first added a log line at the beginning of
TerminateUDPConnectionsToBackend
to check if I enter the function as I ddin't see the "handling udp connections to deleted backend" log.What I discovered is that the
l3n4Addr.Protocol
field wasn't set to "UDP" but to "ANY", which I understood being due to having both a TCP and an UDP port on my service.I therefore added a check for "ANY" in
l3n4Addr.Protocol
and then set theprotocol
variable to UDP.It resolved my issue in my environment, and, as I understand the code, cannot cause any new issue.
Best Regards.