Skip to content

Conversation

mvisonneau
Copy link
Member

@mvisonneau mvisonneau commented Dec 19, 2023

This change introduces two new metrics to monitor the amount of bpf maps and programs currently defined on the system.

We recently discovered that abnormally large amounts of maps could be found on some of our nodes.

Additionally, this change introduces request concurrency limit as well as dynamic definition of the metrics list throughout the Describe() function.

pkg/metrics/bpf: new bpf_maps & bpf_progs metrics

@mvisonneau mvisonneau requested a review from a team as a code owner December 19, 2023 11:54
@mvisonneau mvisonneau requested a review from derailed December 19, 2023 11:54
@maintainer-s-little-helper
Copy link

Commits 72016fa, f5f1ab4 do not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Dec 19, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Dec 19, 2023
@maintainer-s-little-helper
Copy link

Commit 72016fa does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Dec 19, 2023
Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvisonneau LGTM Thank you for this update!

@ti-mo ti-mo added area/metrics Impacts statistics / metrics gathering, eg via Prometheus. release-note/misc This PR makes changes that have no direct user impact. labels Feb 7, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Feb 7, 2024
@ti-mo ti-mo requested a review from a team February 7, 2024 14:04
@mvisonneau
Copy link
Member Author

mvisonneau commented Feb 8, 2024 via email

@ti-mo
Copy link
Contributor

ti-mo commented Feb 22, 2024

@mvisonneau Some tests and linters are failing, please take a look.

@mvisonneau
Copy link
Member Author

mvisonneau commented Feb 22, 2024

👋 hey @ti-mo, indeed my bad 🙈 , I've rebased and hopefully also have sorted all failing tests!

@ldelossa
Copy link
Contributor

ldelossa commented Mar 1, 2024

/test

@ldelossa
Copy link
Contributor

ldelossa commented Mar 1, 2024

@mvisonneau

There seems to be two consistently failing tests here:
https://github.com/cilium/cilium/actions/runs/8003368791?pr=29984
https://github.com/cilium/cilium/actions/runs/8114768581
https://github.com/cilium/cilium/actions/runs/8114769384

If you feel your PR is not the cause of these please attempt a rebase and we will run tests again.

@joestringer
Copy link
Member

joestringer commented Mar 4, 2024

The smoke test failures report an issue with the new metrics in the PR:

cilium_bpf_maps_count non-histogram and non-summary metrics should not have "_count" suffix
cilium_bpf_progs_count non-histogram and non-summary metrics should not have "_count" suffix
Error: Process completed with exit code 3.

(Maybe _total is a better suffix? @cilium/metrics folks could provide better guidance, but this is using standard prometheus tooling to validate the naming)

@mvisonneau mvisonneau requested a review from a team as a code owner March 6, 2024 12:51
@mvisonneau mvisonneau requested a review from tklauser March 6, 2024 12:51
@mvisonneau
Copy link
Member Author

I believe to have sorted most suggestions/issues. Let me know if there is anything else I can do! 👍 😄

@aanm
Copy link
Member

aanm commented May 2, 2024

@mvisonneau it seems that the CI is failing:

  Running [/home/runner/golangci-lint-1.57.2-linux-amd64/golangci-lint run --out-format=github-actions --out-${NO_FUTURE}format colored-line-number --verbose --modules-download-mode=vendor] in [] ...
  pkg/metrics/bpf.go:10: File is not `goimports`-ed with -local github.com/cilium/cilium/ (goimports)
  	"github.com/prometheus/client_golang/prometheus"
  	"github.com/sirupsen/logrus"
  	"golang.org/x/sync/singleflight"

@mvisonneau
Copy link
Member Author

argh 🙈 I ran make golangci-lint-fix, hopefully this time all goes through! 😄

@aanm aanm enabled auto-merge May 3, 2024 14:13
@aanm
Copy link
Member

aanm commented May 3, 2024

/test

@aanm aanm requested a review from ti-mo May 3, 2024 14:13
@aanm aanm removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label May 3, 2024
Copy link
Contributor

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever, just keep in mind this will miss programs that don't refer to maps, if we ever add those. For now, this will do. 👍

auto-merge was automatically disabled May 14, 2024 08:32

Head branch was pushed to by a user without write access

@mvisonneau
Copy link
Member Author

thanks, I've just rebased, hope that all of tests will go through this time! 🤞

@joestringer
Copy link
Member

/test

@joestringer
Copy link
Member

A variety of tests have failed, I suggest triaging them to identify any potential links to the PR. You are of course welcome to rebase and retrigger CI again, but if there is an underlying issue in the PR itself then that course of action will have no impact.

Until the PR is passing CI, I'll mark this as draft. Feel free to reach out on Slack if you would like to discuss #testing issues or you need someone to retrigger the full CI. When CI is passing you can mark it "ready for review".

@joestringer joestringer marked this pull request as draft May 30, 2024 19:20
Copy link

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jun 30, 2024
Copy link

This pull request has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this Jul 14, 2024
@hemanthmalla hemanthmalla reopened this Nov 26, 2024
This change introduces two new metrics to monitor the amount of bpf maps and programs currently defined on the system.

We recently discovered that abnormally large amounts of maps could be found on some of our nodes.

Additionally, this change introduces request concurrency limit as well as dynamic definition of the metrics list throughout the Describe() function.

Signed-off-by: Maxime Visonneau <maxime.visonneau@gmail.com>
@github-actions github-actions bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Nov 27, 2024
@mvisonneau mvisonneau marked this pull request as ready for review November 27, 2024 07:56
@mvisonneau
Copy link
Member Author

Apologies for not having thoroughly followed-up on this! Seems like CI tests are passing now although we likely need a /test from a maintainer to run all of them I assume.

@aanm
Copy link
Member

aanm commented Nov 28, 2024

/test

@aanm aanm enabled auto-merge November 28, 2024 12:55
@aanm aanm disabled auto-merge November 29, 2024 09:22
@aanm aanm merged commit 33407f7 into cilium:main Nov 29, 2024
63 of 64 checks passed
@mvisonneau mvisonneau deleted the bpf_metrics branch November 29, 2024 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/community-contribution This was a contribution made by a community member. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants