L2 Neighbor Discovery failures lead to unbounded Cilium memory growth in v1.16.0+ #34020

@cnmcavoy

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

Equal to or higher than v1.16.0 and lower than v1.17.0

What happened?

After upgrading clusters to Cilium v1.16.0 we noticed dramatic memory growth. The growth appears to scale with cluster size (node count): on our cluster with 1k+ nodes, the cilium-agent used 10GB+ of memory (previously <1GB on v1.15.7).
[Screenshot: Datadog "Cilium Overview v2" dashboard showing the agent memory growth]

Investigation with pprof revealed that most of the in-use memory is held by error objects:

File: cilium-agent
Type: inuse_space
Time: Jul 25, 2024 at 12:52pm (CDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 4156.41MB, 94.32% of 4406.62MB total
Dropped 593 nodes (cum <= 22.03MB)
Showing top 10 nodes out of 51
      flat  flat%   sum%        cum   cum%
 3978.99MB 90.30% 90.30%  3978.99MB 90.30%  errors.(*joinError).Error
   44.09MB  1.00% 91.30%    44.09MB  1.00%  reflect.mapassign_faststr0
   41.50MB  0.94% 92.24%    41.50MB  0.94%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
   31.51MB  0.72% 92.95%    31.51MB  0.72%  github.com/cilium/cilium/pkg/k8s/slim/k8s/apis/meta/v1.(*ObjectMeta).Unmarshal
   30.51MB  0.69% 93.65%    30.51MB  0.69%  github.com/cilium/cilium/pkg/node/manager.(*manager).StartNeighborRefresh.func1
   22.50MB  0.51% 94.16%    22.50MB  0.51%  reflect.cvtBytesString
    2.80MB 0.064% 94.22%    25.68MB  0.58%  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sCiliumEndpointsWatcher).ciliumEndpointsInit.func2
    2.50MB 0.057% 94.28%    22.38MB  0.51%  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sCiliumEndpointsWatcher).endpointUpdated
       1MB 0.023% 94.30%    23.54MB  0.53%  github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.(*CiliumNode).DeepCopy
       1MB 0.023% 94.32%    42.78MB  0.97%  k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop

Debug logs also show the problem: an error from L2 neighbor discovery is joined 1000+ times into a massive string:


time="2024-07-25T18:43:22Z" level=debug msg="Controller run failed" consecutiveErrors=24 error="unable to determine next hop IPv4 address for eth1 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.217.244): remote node IP is non-routable\nunable to determine next
...

(Truncated, but it went on for thousands more bytes)
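
To illustrate why pprof attributes the retained memory to errors.(*joinError).Error, here is a minimal Go sketch. This is not Cilium's actual code: the nextHopErr helper, the refreshAllNeighbors loop, and the retained slice are hypothetical, and the retention site is an assumption. It only shows how joining one error per (node, interface) pair and holding on to the rendered string across failing controller runs scales memory with node count and retry count:

package main

import (
	"errors"
	"fmt"
)

// nextHopErr is a hypothetical stand-in for the real next-hop lookup failure.
func nextHopErr(iface, nodeIP string) error {
	return fmt.Errorf("unable to determine next hop IPv4 address for %s (%s): remote node IP is non-routable", iface, nodeIP)
}

// refreshAllNeighbors joins one error per (node IP, interface) pair, the same
// shape as the joined error seen in the debug log above.
func refreshAllNeighbors(nodeIPs, ifaces []string) error {
	var errs []error
	for _, ip := range nodeIPs {
		for _, ifc := range ifaces {
			errs = append(errs, nextHopErr(ifc, ip))
		}
	}
	// errors.Join keeps every sub-error; calling Error() on the result
	// concatenates all messages with newlines, so the rendered string grows
	// linearly with nodes × interfaces.
	return errors.Join(errs...)
}

func main() {
	// ~1k nodes, as in the affected cluster, each reached over 4 ENI interfaces.
	nodeIPs := make([]string, 1000)
	for i := range nodeIPs {
		nodeIPs[i] = fmt.Sprintf("10.115.%d.%d", 190+i/250, i%250)
	}
	ifaces := []string{"eth1", "eth2", "eth3", "eth4"}

	// Hypothetical retention: assume every rendered error string stays
	// reachable (e.g. via controller status or log buffers) across retries.
	var retained []string
	for retry := 0; retry < 24; retry++ { // consecutiveErrors=24 in the log
		if err := refreshAllNeighbors(nodeIPs, ifaces); err != nil {
			retained = append(retained, err.Error())
		}
	}
	last := retained[len(retained)-1]
	fmt.Printf("retained %d error strings, each ~%d bytes\n", len(retained), len(last))
}

Each Error() call re-renders the entire joined message (roughly 80 bytes × 4 interfaces × 1000 nodes, i.e. hundreds of KB per render), so anything that keeps those strings alive across consecutive failing runs will grow with both node count and the number of retries.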

How can we reproduce the issue?

Cilium installed with Helm, with kube-proxy replacement enabled (kubeProxyReplacement=true) and AWS ENI IPAM enabled.

Cilium Version

v1.16.0

Kernel Version

5.10.219-208.866.amzn2.x86_64

Kubernetes Version

v1.29.3-eks-ae9a62a

Regression

v1.15.7

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • area/agent — Cilium agent related.
  • kind/bug — This is a bug in the Cilium logic.
  • kind/community-report — This was reported by a user in the Cilium community, e.g. via Slack.
  • kind/performance — There is a performance impact of this.
  • needs/triage — This issue requires triaging to establish severity and next steps.
