Skip to content

Conversation

eqvinox
Copy link
Contributor

@eqvinox eqvinox commented Feb 11, 2025

Refactor the "timer getting executed too late" warning:

  • warning threshold is now adjustable
  • check is performed when event actually executes, rather than when it's thrown on ready list (makes more sense really)
  • ignore_late_timer replaced with threshold = 0
  • system load averages printed in log message
  • warning ratelimited to once per 10s rather than once per poll()
  • use it for BFD TX timers (sets the threshold to half TX interval)

  • freebie threadevent in user-visible formatting

Among other things this is intended to distinguish BFD failures in CI due to high system load. Doesn't fix the issue but at least makes clear what is going on.

Example warning now looks like this:

2025/02/11 11:54:31 ZEBRA: [ZXB3P-3AVT0][EC 100663315] CPU starvation: {(event *)0x7ffd081b2600 arg=0x55acad04e020 ready               rtadv_timer() &rtadv->ra_timer from zebra/rtadv.c:1741} getting called 7169ms late, warning threshold 4000ms. System load: 1.11, 2.95, 2.68

(System load average is the thing printed in uptime. The %pTHD output is a bit lengthy, but better to have that info.)

Refactor the "timer getting executed too late" warning:

- warning threshold is now adjustable
- check is performed when event actually executes, rather than when it's
  thrown on ready list
- ignore_late_timer replaced with threshold = 0
- system load averages printed in log message
- warning ratelimited to once per 10s rather than once per poll()

Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
FRR's event loop prints a warning if a timer is executed more than some
given threshold late, default being 4s.  Set a more appropriate
threshold for BFD TX timers (half TX interval).

Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Make it consistent and call it `event` when formatting something to
display.  Much less confusing for some user seeing it too, since threads
aren't involved.

Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
@@ -107,6 +107,7 @@ void sbfd_init_xmttimer_update(struct bfd_session *bs, uint64_t jitter)
tv_normalize(&tv);

event_add_timer_tv(master, sbfd_init_xmt_cb, bs, &tv, &bs->xmttimer_ev);
event_set_tardy_threshold(bs->xmttimer_ev, bs->xmt_TO / 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this better as a --enable-dev-build type thing? I'm afraid that a bunch of people will start complaining about this. We already get a fair number of starvation questions. This is probably going to raise the bar on that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed on call, I think the consensus is that it should stay and users need to see it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add something like please buy a faster CPU to the log message 😆

(though to be fair the CPU might be super fast and the system is just overloaded…)

@riw777 riw777 self-requested a review February 11, 2025 16:36
@donaldsharp donaldsharp merged commit 9623612 into FRRouting:master Feb 12, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants