zebra: fix stale NHG in kernel (backport #18899) #19085

mergify · 2025-06-24T16:23:44Z

Fix two cases in which a stale NHG is left installed in the kernel.

Issue1:

zebra creates an nhe and sets the 'initial delay' flag for the nexthop received along with a kernel/connected route; this route is a v6 route.
Later, zebra receives an intf_address event for the interface that this nhe belongs to, but it is a v4 event. zebra then iterates through the nhe set linked to this interface and ends up installing this nhe in the kernel.

So we install the NHG in the kernel for connected/kernel routes, which deviates from the expected behaviour. This happens because, on receiving an interface event, we attempt a reinstall of all the NHGs associated with that intf. If the 'initial delay' flag is already set for an NHG, we can skip it.
This change skips the reinstall in that case.
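
The skip described above can be sketched as follows. This is a minimal, self-contained illustration and not zebra's actual code: `struct nhe_sketch`, the `NHE_FLAG_*` bits and `reinstall_nhes_for_interface()` are hypothetical names invented for this example, and "installing" is modelled as setting a flag rather than talking to the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only (not zebra's names). */
#define NHE_FLAG_INSTALLED     (1u << 0)
#define NHE_FLAG_INITIAL_DELAY (1u << 1)

/* Hypothetical stand-in for a nexthop hash entry (nhe). */
struct nhe_sketch {
	uint32_t id;
	uint32_t flags;
};

/*
 * Walk the nhe set linked to an interface after an interface event.
 * Entries that carry the 'initial delay' flag are skipped, so an
 * unrelated (e.g. v4) address event does not install them.
 * Returns the number of entries marked as installed.
 */
size_t reinstall_nhes_for_interface(struct nhe_sketch *nhes, size_t count)
{
	size_t installed = 0;

	for (size_t i = 0; i < count; i++) {
		if (nhes[i].flags & NHE_FLAG_INITIAL_DELAY)
			continue; /* install was deliberately delayed; leave it */

		nhes[i].flags |= NHE_FLAG_INSTALLED;
		installed++;
	}

	return installed;
}
```

With three entries of which one has the 'initial delay' flag set, the function would mark only the other two as installed.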

Issue2:
During an FRR restart, nexthop-group entries are not cleaned up in the
following scenario.

Say an NHG's refcnt is decremented and becomes zero. We start a timer
for this NHG before deleting it from zebra/kernel, so the NHG stays
intact in the kernel until the timer expires.
Now FRR is restarted while the timer is still running. All the other
NHGs are cleaned up in the kernel, but the one with the running timer
remains installed.

During zebra shutdown, check whether any NHG has a timer running and
remove it from the kernel.
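
A rough sketch of the shutdown check, under stated assumptions: `struct nhg_sketch`, its fields and `nhg_shutdown_cleanup()` are hypothetical names for this example only, the pending delete timer is modelled as a boolean, and kernel removal is modelled as clearing an `installed` flag. It does not reproduce zebra's real data structures or event API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an NHG tracked by zebra. */
struct nhg_sketch {
	uint32_t id;
	int refcnt;
	bool installed;     /* present in the (simulated) kernel */
	bool timer_running; /* delayed-delete timer is pending */
};

/*
 * Shutdown-time cleanup. An NHG whose refcnt dropped to zero may still
 * be installed because its delete timer has not fired yet; cancel the
 * timer and remove the entry instead of leaving it behind.
 * Returns the number of entries removed from the simulated kernel.
 */
size_t nhg_shutdown_cleanup(struct nhg_sketch *nhgs, size_t count)
{
	size_t removed = 0;

	for (size_t i = 0; i < count; i++) {
		struct nhg_sketch *nhg = &nhgs[i];

		if (nhg->timer_running)
			nhg->timer_running = false; /* cancel pending delete */

		if (nhg->installed) {
			nhg->installed = false; /* simulated kernel removal */
			removed++;
		}
	}

	return removed;
}
```

The point of the sketch is that the entry with `refcnt == 0` and a running timer goes through the same removal path as every other installed entry, rather than waiting for a timer that will never fire once the daemon exits.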

This is an automatic backport of pull request #18899 done by Mergify.

I see this issue during the following sequence of events:
1. zebra creates an nhe and sets the 'initial delay' flag for the nexthop received along with a kernel/connected route; this route is a v6 route.
2. Later, zebra receives an intf_address event for the interface that this nhe belongs to, but it is a v4 event. zebra then iterates through the nhe set linked to this interface and ends up installing this nhe in the kernel.
So we install the NHG in the kernel for connected/kernel routes, which deviates from the expected behaviour. This happens because, on receiving an interface event, we attempt a reinstall of all the NHGs associated with that intf. If the 'initial delay' flag is already set for an NHG, we can skip it. This change skips the reinstall in that case.
Signed-off-by: Krishnasamy <krishnasamyr@nvidia.com>
(cherry picked from commit d7f6d95)

During an FRR restart, nexthop-group entries are not cleaned up in the following scenario:
1. Say an NHG's refcnt is decremented and becomes zero. We start a timer for this NHG before deleting it from zebra/kernel, so the NHG stays intact in the kernel until the timer expires.
2. Now FRR is restarted while the timer is still running. All the other NHGs are cleaned up in the kernel, but the one with the running timer remains installed.
During zebra shutdown, check whether any NHG has a timer running and remove it from the kernel.
Signed-off-by: Krishnasamy <krishnasamyr@nvidia.com>
(cherry picked from commit 0743cca)

krishna-samy added 2 commits June 24, 2025 16:23

mergify bot mentioned this pull request Jun 24, 2025

zebra: fix stale NHG in kernel #18899

Merged

frrbot bot added bugfix zebra labels Jun 24, 2025

github-actions bot added size/M dev/10.4 labels Jun 24, 2025

Jafaral merged commit 5362da4 into dev/10.4 Jun 24, 2025
15 checks passed

Jafaral deleted the mergify/bp/dev/10.4/pr-18899 branch July 31, 2025 15:06
