Long-lived unidirectional UDP connections through a service fail to reach new endpoints #37577

### Is there an existing issue for this?

- [x] I have searched the existing issues

### Version

equal or higher than v1.17.0 and lower than v1.18.0

### What happened?

Hello,

We use Cilium 1.17.0 with kube-proxy replacement mode enabled.

We recently installed the Datadog agent on our clusters. The Datadog agent exposes a Datadog service, and the clients send a continuous UDP stream (using the `statsd` protocol) to this service.

At first, everything seemed to work well, but after a few days we noticed an extreme increase in the number of dropped packets reported by the Hubble exporter.

After investigation, we found that these packets were dropped because the destination IP was no longer the IP of an existing Datadog pod (one that would be registered as an endpoint of the Datadog service).

Once we restart the Datadog DaemonSet, the `statsd` protocol stops working.

I tried to understand how Cilium ends up in this state. My theory is that when we open a socket to a service IP, Cilium's eBPF program replaces the service IP with an endpoint IP in the following kernel function (on RHEL 9):

[https://elixir.bootlin.com/linux/v5.14/source/kernel/bpf/cgroup.c#L1066](https://elixir.bootlin.com/linux/v5.14/source/kernel/bpf/cgroup.c#L1066)

As this happens at socket creation, each time the Datadog agent uses the `write` syscall on the socket file descriptor, the endpoint IP stored in the sock structure is used.

This works very well for TCP connections, as they are broken when the old endpoints are stopped and are then reopened later. It also causes no issue with short-lived UDP connections. But it cannot work for long-lived unidirectional UDP connections, as the source has no way to know that the socket must be reopened.
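To illustrate the behaviour I suspect, here is a minimal sketch using plain Python sockets on localhost. It does not involve Cilium at all: the two local sockets and the `service_target` variable are hypothetical stand-ins for the old pod, the new pod and the service. It only shows the general property that a connected UDP socket keeps the destination chosen at `connect()` time, which is what I assume happens to the translated endpoint IP.

```python
import socket


def make_backend():
    """Open a local UDP socket that plays the role of one backend pod."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))  # let the OS pick a free port
    s.settimeout(0.5)
    return s


old_backend = make_backend()
new_backend = make_backend()
old_addr = old_backend.getsockname()

# Hypothetical stand-in for the service: it currently points to the old backend.
service_target = old_addr

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.connect(service_target)  # the destination is fixed on the socket here

# The "service" is updated afterwards, but the client socket is not told.
service_target = new_backend.getsockname()

client.send(b"example.metric:1|c")

peer_after_update = client.getpeername()
received_by_old = old_backend.recvfrom(1024)[0]
try:
    new_backend.recvfrom(1024)
    new_received = True
except socket.timeout:
    new_received = False

print("peer still the old backend:", peer_after_update == old_addr)
print("new backend received data:", new_received)

for s in (client, old_backend, new_backend):
    s.close()
```

In the real cluster the old pod no longer exists, so instead of being delivered to the old backend the datagrams are dropped, and a sender that never reads from the socket gets no signal that anything is wrong.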

Obviously, there are some simple protocol changes that would fix the issue on the Datadog side, and I could also deploy Datadog using hostNetwork. Nevertheless, I'm reporting this issue because this is a legitimate network communication pattern that Cilium does not properly support at the moment.
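As an example of the kind of client-side change I have in mind, here is a rough sketch of a sender that reopens its UDP socket once it is older than a maximum age. The class name `ReconnectingUDPSender` and its parameters are hypothetical and are not part of any Datadog client. Whether a fresh `connect()` actually picks up the new endpoint under Cilium is an assumption that follows from my theory above; I have not verified it.

```python
import socket
import time


class ReconnectingUDPSender:
    """Connected UDP sender that replaces its socket after max_age_seconds."""

    def __init__(self, host, port, max_age_seconds=60.0, clock=time.monotonic):
        self._addr = (host, port)
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the ageing logic can be tested
        self._sock = None
        self._opened_at = 0.0
        self.sockets_opened = 0  # counts every socket created, including the first

    def _ensure_fresh_socket(self):
        now = self._clock()
        if self._sock is not None and now - self._opened_at < self._max_age:
            return  # current socket is still young enough
        if self._sock is not None:
            self._sock.close()
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # A new connect() gives the destination a chance to be resolved again.
        self._sock.connect(self._addr)
        self._opened_at = now
        self.sockets_opened += 1

    def send(self, payload: bytes):
        self._ensure_fresh_socket()
        self._sock.send(payload)

    def close(self):
        if self._sock is not None:
            self._sock.close()
            self._sock = None
```

The trade-off is that packets sent between the endpoint change and the next reopen are still lost, so the maximum age bounds the outage rather than removing it.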

Best Regards.

### How can we reproduce the issue?

1. Install Cilium 1.17.0 with kube-proxy replacement mode
2. Create a deployment of pods exposing a UDP server
3. Create a service with these pods as endpoints
4. Create a deployment continuously sending packets to this service
5. Restart the server deployment
6. The UDP packets no longer reach the servers
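For steps 2 and 4, something like the following sketch could serve as the UDP server and the continuous sender. The function names, the port 8125 and the `repro.counter` metric name are placeholders I made up for illustration; the important part is that the sender keeps one connected socket open for its whole lifetime.

```python
import socket
import time


def run_server(bind_host="0.0.0.0", port=8125, max_packets=None):
    """Print every datagram received; stop after max_packets if it is set."""
    received = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as srv:
        srv.bind((bind_host, port))
        while max_packets is None or len(received) < max_packets:
            data, peer = srv.recvfrom(2048)
            received.append(data)
            print(f"{peer[0]}:{peer[1]} -> {data!r}", flush=True)
    return received


def run_sender(host, port=8125, interval=1.0, count=None):
    """Send statsd-style counters over ONE long-lived connected UDP socket."""
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((host, port))  # done once, never repeated
        while count is None or sent < count:
            sock.send(f"repro.counter:{sent}|c".encode())
            sent += 1
            time.sleep(interval)
    return sent
```

The server pods would call `run_server()` and the client pods would call `run_sender()` with the service name or ClusterIP as `host`. After step 5, the server logs of the new pods should stay empty if the issue is reproduced.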

### Cilium Version

1.17.0

### Kernel Version

5.14.0-503.16.1.el9_5.x86_64

### Kubernetes Version

1.31.2

### Regression

_No response_

### Sysdump

_No response_

### Relevant log output

```shell

```

### Anything else?

_No response_

### Cilium Users Document

- [ ] Are you a user of Cilium? Please add yourself to the [Users doc](https://github.com/cilium/cilium/blob/main/USERS.md)

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct
