Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.17.0 and lower than v1.18.0
What happened?
Hello,
We use Cilium 1.17.0 with kube-proxy replacement mode enabled.
We recently installed the Datadog agent on our clusters. The Datadog agent exposes a Datadog service, and clients send a continuous UDP stream (using the statsd protocol) to this service.
At first, everything seemed to work well, but after a few days we noticed an extreme increase in the number of dropped packets reported by the Hubble exporter.
After investigation, we found that these packets were dropped because the destination IP was no longer the IP of an existing Datadog pod (one that would be registered as an endpoint of the Datadog service).
Once we restart the Datadog DaemonSet, the statsd protocol stops working.
I then tried to understand how Cilium ended up in this state. My theory is that when we open a socket to a service IP, the Cilium eBPF program replaces the service IP with an endpoint IP in the following kernel function (on RHEL 9):
https://elixir.bootlin.com/linux/v5.14/source/kernel/bpf/cgroup.c#L1066
As this happens at socket creation, each time the Datadog agent uses the write syscall on the socket file descriptor, the endpoint IP stored in the sock structure is used.
This works very well for TCP connections, as they are broken when the old endpoints are stopped and are then reopened later. It also causes no issue with short-lived UDP connections. But it cannot work for long-lived unidirectional UDP connections, as the source has no way to know that the socket must be reopened.
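A minimal sketch of the socket behaviour described above, runnable on loopback without Cilium. It only shows that a UDP `connect()` records the destination once and that later `send()` calls carry no address. The helper name `connected_udp_peer` and the loopback "backend" are illustrative assumptions, not part of Cilium or the Datadog agent.

```python
import socket


def connected_udp_peer(host, port):
    """Open a UDP socket, connect() it once, and return (socket, pinned peer)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # connect() on a UDP socket sends nothing; it only records the destination.
    sock.connect((host, port))
    return sock, sock.getpeername()


if __name__ == "__main__":
    # Stand-in for a backend pod: a UDP listener on an ephemeral loopback port.
    backend = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    backend.bind(("127.0.0.1", 0))

    client, peer = connected_udp_peer(*backend.getsockname())
    backend.close()  # the "pod" goes away

    # send() takes no destination: the datagram still goes to the pinned peer.
    client.send(b"example.metric:1|c")
    print("still pinned to", client.getpeername(), "originally", peer)
    client.close()
```

With TCP, the equivalent situation ends in a reset or timeout that forces the client to reconnect. With one-way UDP, nothing comes back through the socket API when the packets are silently dropped.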
Obviously there are some simple protocol changes that would fix the issue on the Datadog side, and I could also deploy Datadog using hostNetwork. Nevertheless, I'm reporting this issue because this is a legitimate network communication pattern that Cilium does not properly support at the moment.
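One possible client-side change, sketched under assumptions: the sender re-creates its UDP socket after a maximum age, so that the service IP goes through address translation again. Whether a fresh `connect()` actually picks a live backend in this setup is an assumption, not something verified here. The class `ReconnectingUdpSender` and its parameters are hypothetical and do not reflect how any Datadog client works.

```python
import socket
import time


class ReconnectingUdpSender:
    """UDP sender that re-creates its socket once it is `max_age` seconds old."""

    def __init__(self, host, port, max_age=30.0, clock=time.monotonic):
        self.addr = (host, port)
        self.max_age = max_age
        self.clock = clock          # injectable so the age logic can be tested
        self.reconnects = 0         # number of times the socket was replaced
        self._sock = None
        self._opened_at = 0.0

    def _ensure_fresh_socket(self):
        now = self.clock()
        if self._sock is not None and now - self._opened_at < self.max_age:
            return
        if self._sock is not None:
            self._sock.close()
            self.reconnects += 1
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # A new connect() is the point where the destination is resolved again.
        self._sock.connect(self.addr)
        self._opened_at = now

    def send(self, payload):
        self._ensure_fresh_socket()
        return self._sock.send(payload)

    def close(self):
        if self._sock is not None:
            self._sock.close()
            self._sock = None
```

The trade-off is that packets sent between the backend change and the next reconnect are still lost, so `max_age` bounds the outage rather than removing it.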
Best Regards.
How can we reproduce the issue?
- Install Cilium 1.17.0 with kube-proxy replacement mode
- Create a deployment of pods exposing a UDP server
- Create a service with these pods as endpoints
- Create a deployment continuously sending packets to this service
- Restart the server deployment
- The UDP packets no longer reach the servers
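The server and sender used in the steps above could look roughly like the following sketch. The function names, the payload format, and the idea of running both from the same file are assumptions made for illustration; any UDP server and any client that keeps a single connected socket open should behave the same way.

```python
import itertools
import socket
import time


def bind_udp(port, host="0.0.0.0"):
    """Create a UDP socket bound to (host, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def serve(sock, max_packets=None):
    """Print each datagram received on `sock`; return how many were read."""
    received = 0
    while max_packets is None or received < max_packets:
        data, peer = sock.recvfrom(2048)
        received += 1
        print(f"{peer[0]}:{peer[1]} -> {data!r}")
    return received


def send_stream(host, port, count=None, interval=1.0):
    """Send numbered datagrams over ONE connected UDP socket; return how many were sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect((host, port))  # the destination is fixed here, once
    sent = 0
    try:
        for seq in (itertools.count() if count is None else range(count)):
            try:
                sock.send(f"repro.counter:{seq}|c".encode())
                sent += 1
            except OSError:
                # For example an ICMP error surfaced as ECONNREFUSED; keep streaming.
                pass
            time.sleep(interval)
    finally:
        sock.close()
    return sent
```

In the server pods this would be run as `serve(bind_udp(8125))`, and in the client pods as `send_stream("<service-name>", 8125)`, where the port and the service name are placeholders. After the server deployment is restarted, the server output is expected to stop while the sender keeps running without errors.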
Cilium Version
1.17.0
Kernel Version
5.14.0-503.16.1.el9_5.x86_64
Kubernetes Version
1.31.2
Regression
No response
Sysdump
No response
Relevant log output
Anything else?
No response
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct