L2 Neighbor Discovery failures lead to unbounded Cilium memory growth in v1.16.0+ #34020

@cnmcavoy

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

Equal to or higher than v1.16.0 and lower than v1.17.0

What happened?

After upgrading clusters to Cilium v1.16.0 we noticed dramatic memory growth. The growth appears to scale with cluster size (node count): on our cluster with 1k+ nodes, the cilium-agent used 10GB+ of memory (previously <1GB on v1.15.7).
[Screenshot: Datadog "Cilium Overview v2" dashboard showing the agent memory growth]

Investigation with pprof revealed that most of the in-use memory is held by error objects:

File: cilium-agent
Type: inuse_space
Time: Jul 25, 2024 at 12:52pm (CDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 4156.41MB, 94.32% of 4406.62MB total
Dropped 593 nodes (cum <= 22.03MB)
Showing top 10 nodes out of 51
      flat  flat%   sum%        cum   cum%
 3978.99MB 90.30% 90.30%  3978.99MB 90.30%  errors.(*joinError).Error
   44.09MB  1.00% 91.30%    44.09MB  1.00%  reflect.mapassign_faststr0
   41.50MB  0.94% 92.24%    41.50MB  0.94%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
   31.51MB  0.72% 92.95%    31.51MB  0.72%  github.com/cilium/cilium/pkg/k8s/slim/k8s/apis/meta/v1.(*ObjectMeta).Unmarshal
   30.51MB  0.69% 93.65%    30.51MB  0.69%  github.com/cilium/cilium/pkg/node/manager.(*manager).StartNeighborRefresh.func1
   22.50MB  0.51% 94.16%    22.50MB  0.51%  reflect.cvtBytesString
    2.80MB 0.064% 94.22%    25.68MB  0.58%  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sCiliumEndpointsWatcher).ciliumEndpointsInit.func2
    2.50MB 0.057% 94.28%    22.38MB  0.51%  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sCiliumEndpointsWatcher).endpointUpdated
       1MB 0.023% 94.30%    23.54MB  0.53%  github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.(*CiliumNode).DeepCopy
       1MB 0.023% 94.32%    42.78MB  0.97%  k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop

Debug logs also show the problem: an error from L2 neighbor discovery is joined 1000+ times into a massive string:


time="2024-07-25T18:43:22Z" level=debug msg="Controller run failed" consecutiveErrors=24 error="unable to determine next hop IPv4 address for eth1 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.217.244): remote node IP is non-routable\nunable to determine next
...

(Truncated, but it went on for thousands more bytes)
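
To illustrate why pprof attributes the retained memory to errors.(*joinError).Error, here is a minimal Go sketch. This is not Cilium's actual code: the nextHopErr helper, the refreshAllNeighbors loop, and the retained slice are hypothetical, and the retention site is an assumption. It only shows how joining one error per (node, interface) pair and holding on to the rendered string across failing controller runs scales memory with node count and retry count:

package main

import (
	"errors"
	"fmt"
)

// nextHopErr is a hypothetical stand-in for the real next-hop lookup failure.
func nextHopErr(iface, nodeIP string) error {
	return fmt.Errorf("unable to determine next hop IPv4 address for %s (%s): remote node IP is non-routable", iface, nodeIP)
}

// refreshAllNeighbors joins one error per (node IP, interface) pair, the same
// shape as the joined error seen in the debug log above.
func refreshAllNeighbors(nodeIPs, ifaces []string) error {
	var errs []error
	for _, ip := range nodeIPs {
		for _, ifc := range ifaces {
			errs = append(errs, nextHopErr(ifc, ip))
		}
	}
	// errors.Join keeps every sub-error; calling Error() on the result
	// concatenates all messages with newlines, so the rendered string grows
	// linearly with nodes × interfaces.
	return errors.Join(errs...)
}

func main() {
	// ~1k nodes, as in the affected cluster, each reached over 4 ENI interfaces.
	nodeIPs := make([]string, 1000)
	for i := range nodeIPs {
		nodeIPs[i] = fmt.Sprintf("10.115.%d.%d", 190+i/250, i%250)
	}
	ifaces := []string{"eth1", "eth2", "eth3", "eth4"}

	// Hypothetical retention: assume every rendered error string stays
	// reachable (e.g. via controller status or log buffers) across retries.
	var retained []string
	for retry := 0; retry < 24; retry++ { // consecutiveErrors=24 in the log
		if err := refreshAllNeighbors(nodeIPs, ifaces); err != nil {
			retained = append(retained, err.Error())
		}
	}
	last := retained[len(retained)-1]
	fmt.Printf("retained %d error strings, each ~%d bytes\n", len(retained), len(last))
}

Each Error() call re-renders the entire joined message (roughly 80 bytes × 4 interfaces × 1000 nodes, i.e. hundreds of KB per render), so anything that keeps those strings alive across consecutive failing runs will grow with both node count and the number of retries.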

How can we reproduce the issue?

Cilium installed with Helm, with kube-proxy replacement enabled (kubeProxyReplacement=true) and AWS ENI IPAM enabled.

Cilium Version

v1.16.0

Kernel Version

5.10.219-208.866.amzn2.x86_64

Kubernetes Version

v1.29.3-eks-ae9a62a

Regression

v1.15.7

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • area/agent — Cilium agent related.
  • kind/bug — This is a bug in the Cilium logic.
  • kind/community-report — This was reported by a user in the Cilium community, e.g. via Slack.
  • kind/performance — There is a performance impact of this.
  • needs/triage — This issue requires triaging to establish severity and next steps.
