Skip to content

Conversation

mhofstetter
Copy link
Member

@mhofstetter mhofstetter commented May 6, 2025

Currently, starting the Cilium Agent can result in the following
warning when collecting metrics on the ratelimit metrics map starts
before the daemon initialized the ratelimit metrics map.

2025-05-06T10:17:39.944482772Z time="2025-05-06T10:17:39.941569631Z" level=warning msg="Failed to read ratelimit metrics from BPF map" source="/go/src/github.com/cilium/cilium/pkg/maps/ratelimitmap/ratelimitmap.go:203" error="loading pinned map /sys/fs/bpf/tc/globals/cilium_ratelimit_metrics: no such file or directory" subsys=ratelimit-map

The reason is that the map gets initialized via the daemon init logic
(that calls ratelimitmap.InitMaps - while the metrics collector
gets registered via Hive lifecycle.

Therefore, this commit initializes the ratelimit metrics map
with a Hive lifecycle hook and lets the metrics registration depend
on the map. This way it's ensured that the map is created when
the metrics collection starts.


Seen on #39357 in the ci-ginkgo tests (e.g. https://github.com/cilium/cilium/actions/runs/14856075257/job/41713009262). It's likely that the change of Daemon dependencies in that PR lead to exposing this issue. But the actual issue exists in the current codebase.

Agent logs from a sysdump

2025-05-06T10:17:36.972737837Z time=2025-05-06T10:17:36Z level=debug source=/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/job/oneshot.go:134 msg="Starting one-shot job" module=agent.infra.metrics name=collect func=metrics.(*sampler).collectLoop
2025-05-06T10:17:36.972752705Z time=2025-05-06T10:17:36Z level=debug source=/go/src/github.com/cilium/cilium/pkg/hive/health/provider.go:93 msg="upserting health status" module=health lastLevel=none reporter-id=agent.infra.metrics.job-collect status="agent.infra.metrics.job-collect: [OK] Running"
2025-05-06T10:17:36.972894950Z time="2025-05-06T10:17:36.972734997Z" level=warning msg="Failed to read ratelimit metrics from BPF map" source="/go/src/github.com/cilium/cilium/pkg/maps/ratelimitmap/ratelimitmap.go:203" error="loading pinned map /sys/fs/bpf/tc/globals/cilium_ratelimit_metrics: no such file or directory" subsys=ratelimit-map
...
2025-05-06T10:17:38.575349928Z time="2025-05-06T10:17:38.575265248Z" level=info msg="Initializing daemon" source="/go/src/github.com/cilium/cilium/daemon/cmd/daemon_main.go:1546" subsys=daemon

-> daemon init starts after ratelimit metrics collection

@mhofstetter mhofstetter added kind/bug This is a bug in the Cilium logic. area/modularization Relates to code modularization and maintenance. labels May 6, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 6, 2025
@mhofstetter mhofstetter added the release-note/misc This PR makes changes that have no direct user impact. label May 6, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 6, 2025
@mhofstetter mhofstetter requested a review from joamaki May 6, 2025 11:46
@mhofstetter mhofstetter marked this pull request as ready for review May 6, 2025 11:46
@mhofstetter mhofstetter requested a review from a team as a code owner May 6, 2025 11:46
Currently, the ratelimit map is initialized by the daemon init
logic that is calling `ratelimitmap.InitMaps`.

This commit refactors the ratelimitmap cell to init the map with
a Hive lifecycle hook.

Note: this is a preparation to do the same with the ratelimit metrics
map where this refactoring is required to prevent races.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Currently, starting the Cilium Agent can result in the following
warning when collecting metrics on the ratelimit metrics map starts
before the daemon initialized the ratelimit metrics map.

```
2025-05-06T10:17:39.944482772Z time="2025-05-06T10:17:39.941569631Z" level=warning msg="Failed to read ratelimit metrics from BPF map" source="/go/src/github.com/cilium/cilium/pkg/maps/ratelimitmap/ratelimitmap.go:203" error="loading pinned map /sys/fs/bpf/tc/globals/cilium_ratelimit_metrics: no such file or directory" subsys=ratelimit-map
```

The reason is that the map gets initialized via the daemon init logic
(that calls `ratelimitmap.InitMaps` - while the metrics collector
gets registered via Hive lifecycle.

Therefore, this commit initializes the ratelimit metrics map
with a Hive lifecycle hook and lets the metrics registration depend
on the map. This way it's ensured that the map is created when
the metrics collection starts.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
@mhofstetter mhofstetter force-pushed the pr/mhofstetter/fix-ratelimitmap-lifecycle branch from cb1bc60 to c591efb Compare May 6, 2025 11:53
@mhofstetter mhofstetter changed the title ratelimitmap: init maps with Hive lifecycle hook ratelimitmap: init maps before metrics collection May 6, 2025
@mhofstetter
Copy link
Member Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 7, 2025
@joamaki joamaki added this pull request to the merge queue May 7, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 7, 2025
@joamaki joamaki added this pull request to the merge queue May 7, 2025
Merged via the queue into cilium:main with commit 1ec1b11 May 7, 2025
69 of 70 checks passed
@mhofstetter mhofstetter deleted the pr/mhofstetter/fix-ratelimitmap-lifecycle branch May 7, 2025 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/modularization Relates to code modularization and maintenance. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants