mergify[bot] commented Jun 24, 2025

Fix stale NHG entries left in the kernel.

Issue1:

  1. zebra creates an nhe and sets the 'initial delay' flag for the nexthop received along with a kernel/connected route; this route is a v6 route.
  2. Later, zebra receives an intf_address event for the interface that the nhe created above belongs to, but this is a v4 event. zebra then iterates through the nhe set linked to this interface and ends up installing this nhe in the kernel.

As a result, we install the NHG in the kernel for connected/kernel routes, which deviates from the expected behaviour. This happens because, on an interface event, we attempt a reinstall of all the NHGs associated with that intf. If the 'initial delay' flag is already set for an NHG, we can skip the reinstall for it.
This change adds that skip.
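
The skip described for Issue1 can be sketched roughly as below. This is a minimal, self-contained illustration and not the actual zebra code: `struct nhe_sketch`, the `NHE_FLAG_*` bits, and both function names are hypothetical stand-ins chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; not zebra's real definitions. */
#define NHE_FLAG_INSTALLED     (1u << 0)
#define NHE_FLAG_INITIAL_DELAY (1u << 1)

/* Hypothetical, trimmed-down stand-in for a nexthop hash entry (nhe). */
struct nhe_sketch {
	uint32_t id;
	uint32_t flags;
};

/* An nhe that still carries the 'initial delay' flag must not be
 * reinstalled just because an interface event arrived. */
static bool nhe_should_reinstall(const struct nhe_sketch *nhe)
{
	assert(nhe != NULL);
	return (nhe->flags & NHE_FLAG_INITIAL_DELAY) == 0;
}

/* Walk the nhes linked to one interface after an interface event.
 * Marks the eligible ones as installed and returns how many were
 * (re)installed; flagged entries are left untouched. */
static size_t reinstall_on_intf_event(struct nhe_sketch *nhes, size_t count)
{
	size_t installed = 0;

	for (size_t i = 0; i < count; i++) {
		if (!nhe_should_reinstall(&nhes[i]))
			continue; /* 'initial delay' set: skip */
		nhes[i].flags |= NHE_FLAG_INSTALLED;
		installed++;
	}
	return installed;
}
```

In this sketch the address family of the event does not matter: the check is made per nhe, so a v4 intf_address event cannot cause an nhe flagged for a v6 kernel/connected route to be installed.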

Issue2:
During an FRR restart, nexthop-group entries are not cleaned up in the
following scenario.

  1. An NHG's refcnt is decremented and reaches zero. We start a timer
    for this NHG before deleting it in zebra/kernel, so the NHG stays
    intact in the kernel until the timer expires.
  2. While the timer is running, FRR is restarted. All other NHGs are
    cleaned up in the kernel, but the one with the running timer
    remains installed in the kernel.

During zebra shutdown, check whether any NHG has a timer running and
remove it from the kernel.
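
The shutdown check for Issue2 might look something like the following. This is a hedged sketch under simplifying assumptions, not zebra's implementation: `struct nhg_sketch`, its fields, and `shutdown_flush_pending` are hypothetical names, and the timer and kernel state are modelled as plain booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of an NHG tracked by zebra. */
struct nhg_sketch {
	uint32_t id;
	uint32_t refcnt;
	bool timer_running; /* delayed-delete timer is pending */
	bool in_kernel;     /* entry is currently installed in the kernel */
};

/* At shutdown, any NHG whose delayed-delete timer is still pending
 * would otherwise outlive the daemon in the kernel. Cancel the timer
 * and remove the entry right away. Returns the number of kernel
 * entries removed. */
static size_t shutdown_flush_pending(struct nhg_sketch *nhgs, size_t count)
{
	size_t removed = 0;

	assert(nhgs != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		struct nhg_sketch *nhg = &nhgs[i];

		if (!nhg->timer_running)
			continue; /* handled by the normal cleanup path */

		nhg->timer_running = false; /* stand-in for cancelling the timer */
		if (nhg->in_kernel) {
			nhg->in_kernel = false; /* stand-in for the kernel delete */
			removed++;
		}
	}
	return removed;
}
```

Entries without a pending timer are deliberately left alone here, on the assumption that the regular shutdown cleanup already covers them.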


This is an automatic backport of pull request #18899 done by Mergify.

I see this issue with the following sequence of events:
1. zebra creates an nhe and sets the 'initial delay' flag for the nexthop
   received along with a kernel/connected route; this route is a v6
   route.
2. Later, zebra receives an intf_address event for the interface that
   the nhe created above belongs to, but this is a v4 event. zebra then
   iterates through the nhe set linked to this interface and ends up
   installing this nhe in the kernel.

As a result, we install the NHG in the kernel for connected/kernel routes,
which deviates from the expected behaviour.
This happens because, on an interface event, we attempt a reinstall of
all the NHGs associated with that intf. If the 'initial delay' flag is
already set for an NHG, we can skip the reinstall for it.
This change adds that skip.

Signed-off-by: Krishnasamy <krishnasamyr@nvidia.com>
(cherry picked from commit d7f6d95)
During an FRR restart, nexthop-group entries are not cleaned up in the
following scenario.

1. An NHG's refcnt is decremented and reaches zero. We start a timer
for this NHG before deleting it in zebra/kernel, so the NHG stays
intact in the kernel until the timer expires.
2. While the timer is running, FRR is restarted. All other NHGs are
cleaned up in the kernel, but the one with the running timer
remains installed in the kernel.

During zebra shutdown, check whether any NHG has a timer running and
remove it from the kernel.

Signed-off-by: Krishnasamy <krishnasamyr@nvidia.com>
(cherry picked from commit 0743cca)
@Jafaral Jafaral merged commit 5362da4 into dev/10.4 Jun 24, 2025
15 checks passed
@Jafaral Jafaral deleted the mergify/bp/dev/10.4/pr-18899 branch July 31, 2025 15:06