Closed
Labels
area/agent (Cilium agent related), kind/bug (This is a bug in the Cilium logic), kind/community-report (This was reported by a user in the Cilium community, eg via Slack), kind/performance (There is a performance impact of this), needs/triage (This issue requires triaging to establish severity and next steps)
Description
Is there an existing issue for this?
- I have searched the existing issues
Version
Equal to or higher than v1.16.0 and lower than v1.17.0
What happened?
After upgrading clusters to Cilium v1.16.0 we noticed dramatic memory growth in the cilium-agent. The growth appears to scale with cluster size (node count): our cluster with 1k+ nodes saw the agent using 10GB+ of memory (previously <1GB on v1.15.7).
Investigation with pprof revealed that most of the retained objects are error objects:
File: cilium-agent
Type: inuse_space
Time: Jul 25, 2024 at 12:52pm (CDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 4156.41MB, 94.32% of 4406.62MB total
Dropped 593 nodes (cum <= 22.03MB)
Showing top 10 nodes out of 51
flat flat% sum% cum cum%
3978.99MB 90.30% 90.30% 3978.99MB 90.30% errors.(*joinError).Error
44.09MB 1.00% 91.30% 44.09MB 1.00% reflect.mapassign_faststr0
41.50MB 0.94% 92.24% 41.50MB 0.94% sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
31.51MB 0.72% 92.95% 31.51MB 0.72% github.com/cilium/cilium/pkg/k8s/slim/k8s/apis/meta/v1.(*ObjectMeta).Unmarshal
30.51MB 0.69% 93.65% 30.51MB 0.69% github.com/cilium/cilium/pkg/node/manager.(*manager).StartNeighborRefresh.func1
22.50MB 0.51% 94.16% 22.50MB 0.51% reflect.cvtBytesString
2.80MB 0.064% 94.22% 25.68MB 0.58% github.com/cilium/cilium/pkg/k8s/watchers.(*K8sCiliumEndpointsWatcher).ciliumEndpointsInit.func2
2.50MB 0.057% 94.28% 22.38MB 0.51% github.com/cilium/cilium/pkg/k8s/watchers.(*K8sCiliumEndpointsWatcher).endpointUpdated
1MB 0.023% 94.30% 23.54MB 0.53% github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.(*CiliumNode).DeepCopy
1MB 0.023% 94.32% 42.78MB 0.97% k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
Further debug logs also show the problem: an error from L2 neighbor discovery was joined 1000+ times, producing a massive string:
time="2024-07-25T18:43:22Z" level=debug msg="Controller run failed" consecutiveErrors=24 error="unable to determine next hop IPv4 address for eth1 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.193.254): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.194.75): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.213.40): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth2 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth3 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth4 (10.115.214.19): remote node IP is non-routable\nunable to determine next hop IPv4 address for eth1 (10.115.217.244): remote node IP is non-routable\nunable to determine next
...
(Truncated; the message continued for thousands more bytes.)
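For illustration, here is a minimal Go sketch (not Cilium's actual code; the node and interface counts are assumptions based on the cluster size and the eth1-eth4 interfaces in the log above) of how accumulating one error per interface per remote node with errors.Join retains every wrapped error and renders into one enormous string, consistent with the errors.(*joinError).Error allocations in the profile:

package main

import (
	"errors"
	"fmt"
)

func main() {
	nodes := 1000 // assumption: roughly the reported cluster size
	ifaces := 4   // eth1..eth4, as seen in the debug log above

	var joined error
	for n := 0; n < nodes; n++ {
		for i := 1; i <= ifaces; i++ {
			err := fmt.Errorf("unable to determine next hop IPv4 address for eth%d (node %d): remote node IP is non-routable", i, n)
			// errors.Join keeps every wrapped error alive for as long as the
			// joined error is referenced.
			joined = errors.Join(joined, err)
		}
	}

	// Rendering the joined error concatenates all 4,000 messages,
	// newline-separated, into a single huge string, much like the
	// multi-kilobyte "Controller run failed" debug entry above.
	fmt.Printf("joined error renders to %d bytes\n", len(joined.Error()))
}

If such a joined error is also retained and re-joined across consecutive controller runs (note consecutiveErrors=24 in the log), the retained memory would keep growing with each failure, which would match the per-node growth we observed.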
How can we reproduce the issue?
Cilium installed with Helm, with kube-proxy replacement enabled (kubeProxyReplacement=true) and AWS ENI IPAM; an approximate Helm invocation is sketched below.
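For reference, an approximate Helm invocation for that setup (a sketch only; the exact chart values we used may differ):

helm install cilium cilium/cilium --version 1.16.0 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set eni.enabled=true \
  --set ipam.mode=eni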
Cilium Version
v1.16.0
Kernel Version
5.10.219-208.866.amzn2.x86_64
Kubernetes Version
v1.29.3-eks-ae9a62a
Regression
v1.15.7
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct