
Conversation

raja-rajasekar
Contributor

@raja-rajasekar raja-rajasekar commented Apr 24, 2025

NHG improvements + fixes along with

zebra: Re-install all route entries upon kernel NHG action/Quick intf flaps

When operations such as the following happen:

  • ip nexthop flush
  • ip nexthop del id <>
  • quick flap of interface

there are cases where routes get re-installed while the nexthop-group
is not yet installed. This leads to route install failures that are never
retried, thereby leaving the kernel and zebra out of sync.

Take ip nexthop flush as an example:
Say we have 10 routes using NHG 100 [50,60]. Upon flush, the kernel cleans
up all the route and nexthop entries but notifies zebra only about the
nexthop delete events. The sequence of events could be:

  • Kernel sends 100 NHG delete
  • Kernel sends 50 NHG delete
  • Kernel sends 60 NHG delete

When we receive the NHG 100 delete, zebra immediately sees "Oh! routes still
point to it, let me reinstall NHG 100", but the kernel does not know about
NHGs 50 and 60 yet (so the NHG 100 installation fails, as expected).
Only after NHGs 50 and 60 are installed is NHG 100 installed
successfully. In this timeframe, however, route installation (new/updated)
fails since NHG 100 was never installed, and the install is never
re-attempted.

The fix is to maintain an RB tree of all the route entries that use an
nhe, with the head of the tree kept in struct nhg_hash_entry.
When the nhe is installed, we walk this tree and reinstall all the re's
that were SELECTED for install at that point in time. After the NHG
install, the route installation then succeeds.

It also takes care of unsetting the INSTALLED flag on the re for events
where an NHG is being reinstalled or is going away.

For future purposes, this nhe->re_head tree can be used as needed.
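
For illustration, here is a minimal sketch of this tracking tree using FRR's typesafe RB-tree macros. The trimmed structures, the compare function, the selected_for_install field and the reinstall_route() helper are stand-ins for the sketch, not the PR's actual code:

/*
 * Sketch: an nhe -> route_entry back-tracking tree built with FRR's
 * typesafe RB-tree macros.  Everything below is illustrative.
 */
#include "typesafe.h"   /* FRR's typesafe container macros */

PREDECL_RBTREE_UNIQ(nhe_re_tree);

/* Trimmed stand-in for zebra's struct route_entry. */
struct route_entry {
        struct nhe_re_tree_item re_item; /* hook into the owning nhe's tree */
        bool selected_for_install;       /* stand-in for the SELECTED check */
};

static int nhe_re_cmp(const struct route_entry *a, const struct route_entry *b)
{
        /* Any stable, unique ordering works; pointer order as a stand-in. */
        if (a == b)
                return 0;
        return a < b ? -1 : 1;
}

DECLARE_RBTREE_UNIQ(nhe_re_tree, struct route_entry, re_item, nhe_re_cmp);

/* Trimmed stand-in for zebra's struct nhg_hash_entry. */
struct nhg_hash_entry {
        struct nhe_re_tree_head re_head; /* all re's currently using this nhe */
};

extern void reinstall_route(struct route_entry *re); /* hypothetical helper */

/* Once the dataplane reports the NHG as installed, walk every route_entry
 * hanging off this nhe and re-push the ones that were selected for
 * installation, so earlier install failures are retried. */
static void nhe_reinstall_dependent_res(struct nhg_hash_entry *nhe)
{
        struct route_entry *re;

        frr_each (nhe_re_tree, &nhe->re_head, re) {
                if (re->selected_for_install)
                        reinstall_route(re);
        }
}

In zebra itself this walk would be triggered from the dataplane NHG result handling, which is what produces the "Route re-install for re ..." lines in the sample logs further down.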

Also added a new show command

r1# sh nexthop-group rib 40 routes
Routes using nexthop group 40:
Route Entry                    Status          Protocol    AFI      SAFI
------------------------------ --------------- ----------  -------  -------
4.5.6.10/32                    installed       static        IPv4     unicast
4.5.6.11/32                    installed       static        IPv4     unicast
4.5.6.15/32                    not installed   static        IPv4     unicast
4.5.6.16/32                    installed       static        IPv4     unicast
4.5.6.17/32                    installed       static        IPv4     unicast

@frrbot frrbot bot added the zebra label Apr 24, 2025
@raja-rajasekar raja-rajasekar marked this pull request as draft April 24, 2025 18:53
@raja-rajasekar
Contributor Author

While writing topotests for this in tests/topotests/all_protocol_startup/, I came across an issue with staticd-zebra interaction. Working on this issue...

Staticd has a route 1.1.1.1/32 and a nexthop 1.1.1.1/32 (used by route 6.6.6.0).
Quick interface flap / nexthop flush:

  • staticd deletes and re-adds 1.1.1.1/32 to zebra, which is ACKed back to staticd (the owner notification we saw)
  • but from zebra's point of view there is no nexthop change for 1.1.1.1/32 (because of the fast flap), so zebra never sends an NHT notification to staticd (the "notifying client" NHT log is not seen)
  • so staticd's route 6.6.6.0, which depends on nexthop 1.1.1.1/32, is never updated

Worst case, I will write a new topotest rather than modifying the existing one.

@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch 3 times, most recently from a88b62d to 4e0ba30 Compare April 24, 2025 19:44
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch 3 times, most recently from dd0de7c to 6765423 Compare May 2, 2025 15:20

github-actions bot commented May 2, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch from 6765423 to 6a19156 Compare May 6, 2025 21:52
@github-actions github-actions bot removed the conflicts label May 6, 2025
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch 8 times, most recently from 336c6cf to c23e450 Compare May 9, 2025 17:18
@raja-rajasekar raja-rajasekar marked this pull request as ready for review May 9, 2025 17:18
@raja-rajasekar raja-rajasekar requested a review from ton31337 May 9, 2025 17:20
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch from c23e450 to b4ec853 Compare May 9, 2025 18:04
@raja-rajasekar
Contributor Author

Sample Logs

2025/05/09 05:07:24.117009 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (11[if 2 vrfid 0]) that we are still using for a route, sending it back down
2025/05/09 05:07:24.117023 ZEBRA: [GH501-R95T3] zebra_nhg_handle_kernel_state_change Kernel Event: Unsetting all dependent re's INSTALLED flag for NHE 0xaf6054721680 (11[if 2 vrfid 0]) flags (0x3)
2025/05/09 05:07:24.117055 ZEBRA: [PJXNX-HZST8] zebra_nhg_install_kernel: valid flag set for nh 11[if 2 vrfid 0]
2025/05/09 05:07:24.120230 ZEBRA: [GA77T-1SQSD] zebra_nhg_dplane_result Route re-install for re 0xaf6054717f20 (192.168.0.0/24, connected) on nhe 11 (0xaf6054721680)
2025/05/09 05:07:24.120233 ZEBRA: [PJXNX-HZST8] zebra_nhg_install_kernel: valid flag set for nh 11[if 2 vrfid 0]
2025/05/09 05:07:24.120599 ZEBRA: [GA77T-1SQSD] zebra_nhg_dplane_result Route re-install for re 0xaf6054718240 (192.168.0.0/24, connected) on nhe 11 (0xaf6054721680)
2025/05/09 05:07:24.120616 ZEBRA: [PJXNX-HZST8] zebra_nhg_install_kernel: valid flag set for nh 11[if 2 vrfid 0]
2025/05/09 05:07:24.120619 ZEBRA: [GA77T-1SQSD] zebra_nhg_dplane_result Route re-install for re 0xaf605461ef70 (192.168.0.1/32, local) on nhe 11 (0xaf6054721680)
2025/05/09 05:07:24.120620 ZEBRA: [PJXNX-HZST8] zebra_nhg_install_kernel: valid flag set for nh 11[if 2 vrfid 0]
2025/05/09 05:07:24.120621 ZEBRA: [GA77T-1SQSD] zebra_nhg_dplane_result Route re-install for re 0xaf605461f230 (192.168.0.1/32, local) on nhe 11 (0xaf6054721680)
2025/05/09 05:07:24.120622 ZEBRA: [PJXNX-HZST8] zebra_nhg_install_kernel: valid flag set for nh 11[if 2 vrfid 0]
2025/05/09 05:07:24.120623 ZEBRA: [GA77T-1SQSD] zebra_nhg_dplane_result Route re-install for re 0xaf605460af20 (4.5.6.12/32, static) on nhe 11 (0xaf6054721680)
2025/05/09 05:07:24.120624 ZEBRA: [PJXNX-HZST8] zebra_nhg_install_kernel: valid flag set for nh 11[if 2 vrfid 0]
root@rajasekarr:/tmp/topotests/all_protocol_startup.test_all_protocol_startup/r1# 

Member

@ton31337 ton31337 left a comment


Please go over your code and fix all the similar minor nits I marked (there are more, but I didn't mark all of them).

@@ -175,8 +178,35 @@ struct route_entry {
*/
struct nexthop_group fib_ng;
struct nexthop_group fib_backup_ng;

/* Selected re using an nhe are in its re RB tree */
struct nhe_re_tree_item re_item;
Contributor


For state keeping of route<->nexthop, is keeping route->nhe better than nhe->re? Wouldn't nhe->re have fewer iterations?

/* This is the first time it's being attached */
route_entry_attach_ref(re, new_nhghe);
nhe_add_or_del_re_tree(new_nhghe, re, __func__, false);
Contributor


is this tracked only on failure or for every re install?

@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch from d016979 to f52ceec Compare June 6, 2025 04:47
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch from f52ceec to c58e911 Compare June 12, 2025 21:56
@github-actions github-actions bot added size/XXL and removed size/XL labels Jun 12, 2025
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch from c58e911 to 412ed81 Compare June 12, 2025 22:02
Ability to have a back pointer from the re to its rn. To be used by future commits.

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
… flaps

When operations such as the following happen:
 - ip nexthop flush
 - ip nexthop del id <>
 - quick flap of interface

there are cases where routes get re-installed while the nexthop-group
is not yet installed. This leads to route install failures that are never
retried, thereby leaving the kernel and zebra out of sync.

Take ip nexthop flush as an example:
Say we have 10 routes using NHG 100 [50,60]. Upon flush, the kernel cleans
up all the route and nexthop entries but notifies zebra only about the
nexthop delete events. The sequence of events could be:
 - Kernel sends 100 NHG delete
 - Kernel sends 50 NHG delete
 - Kernel sends 60 NHG delete

When we receive the NHG 100 delete, zebra immediately sees "Oh! routes still
point to it, let me reinstall NHG 100", but the kernel does not know about
NHGs 50 and 60 yet (so the NHG 100 installation fails, as expected).
Only after NHGs 50 and 60 are installed is NHG 100 installed
successfully. In this timeframe, however, route installation (new/updated)
fails since NHG 100 was never installed, and the install is never
re-attempted.

The fix is to maintain an RB tree of all the route entries that use an
nhe, with the head of the tree kept in struct nhg_hash_entry.
When the nhe is installed, we walk this tree (and its dependent NHE tree
if needed) and reinstall all the re's that were SELECTED for install at
that point in time. After the NHG install, the route installation then
succeeds.

It also takes care of unsetting the INSTALLED flag on the re for events
such as ip nexthop delete/flush or quick link flaps.

For future purposes, this nhe->re_head tree can be used as needed.

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
Add a new "routes" option to the existing show command,
"show nexthop-group rib <ID> routes", which
walks all the re's that use that particular NHG and displays the route
node, protocol and install status.

Examples:
r1# sh nexthop-group rib 40 routes
Routes using nexthop group 40:
Route Entry                    Status          Protocol    AFI      SAFI
------------------------------ --------------- ----------  -------  -------
4.5.6.10/32                    installed       static        IPv4     unicast
4.5.6.11/32                    installed       static        IPv4     unicast
4.5.6.15/32                    not installed   static        IPv4     unicast
4.5.6.16/32                    installed       static        IPv4     unicast
4.5.6.17/32                    installed       static        IPv4     unicast

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
1) Add NHG flags to debugs in a few places
2) Gate some debugs on IS_ZEBRA_DEBUG_NHG_DETAIL instead of
   IS_ZEBRA_DEBUG_NHG

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
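
For context, the kind of change this covers is roughly the following; the log text and variable names are illustrative, not the PR's exact lines:

/* Before: no flags in the message, gated on the base NHG debug knob. */
if (IS_ZEBRA_DEBUG_NHG)
        zlog_debug("%s: nhe %u", __func__, nhe->id);

/* After: flags included, and demoted to the more verbose detail knob. */
if (IS_ZEBRA_DEBUG_NHG_DETAIL)
        zlog_debug("%s: nhe %u flags 0x%x", __func__, nhe->id, nhe->flags);
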
Add the re's flags and status to the show ip/ipv6 route <> (json) command output.

r1# sh ip route 3.3.3.1/32
Routing entry for 3.3.3.1/32
  Known via "sharp", distance 150, metric 0, best
  Last update 00:00:49 ago
  Flags: Recursion Selected
  Status: Installed
    2.2.2.1 (recursive), weight 1
  *   1.1.1.1, via r1-eth1 onlink, weight 1
  *   1.1.1.2, via r1-eth2 onlink, weight 1

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
Add a few other flags to be displayed in the show nexthop-group rib
command.

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
Add macros to tell whether an NHE is a singleton (recursive or direct), a
direct singleton, or a recursive singleton.

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
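
A rough sketch of what such classification macros could look like; the macro names and the exact checks are assumptions for illustration, not the PR's definitions:

/* Sketch: a singleton NHE wraps exactly one nexthop; whether it is
 * "direct" or "recursive" depends on that nexthop's RECURSIVE flag. */
#define NHE_IS_SINGLETON(nhe)                                                  \
        ((nhe)->nhg.nexthop != NULL && (nhe)->nhg.nexthop->next == NULL)

#define NHE_IS_DIRECT_SINGLETON(nhe)                                           \
        (NHE_IS_SINGLETON(nhe) &&                                              \
         !CHECK_FLAG((nhe)->nhg.nexthop->flags, NEXTHOP_FLAG_RECURSIVE))

#define NHE_IS_RECURSIVE_SINGLETON(nhe)                                        \
        (NHE_IS_SINGLETON(nhe) &&                                              \
         CHECK_FLAG((nhe)->nhg.nexthop->flags, NEXTHOP_FLAG_RECURSIVE))

The next commit then uses these macros in place of open-coded checks.
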
To make life a bit easier, use the NHG SINGLETON macros where applicable.

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
…kernel in sync

Consider the following scenario
Initial state:
 - NHG 112[92/94], 92(eth1) and 94(eth2)
 - 3.3.3.1/32 uses NHG 112 , so zebra and kernel has 112 with 2
   nexthops (92 and 94)

Trigger: Flap eth1-4 quickly.

Final state: (Without fix):
 - 3.3.3.1/32 uses NHG 112 and kernel has 112 with 1 nexthop (92) while
   zebra has NHG 112 with 2 nexthops (92 and 94)

Below is the sequence of events
 1. Upon eth1 and eth2 going down, we mark 92 and 94 Invalid.
 2. eth1 comes up
 3. 92 and 112 have REINSTALL flag set
 4. 92 installed into kernel
 5. dplane nhg result of 92 will queue install of 112[92 Valid, 94 Invalid]
 6. 112 is Installed into kernel (has 112 with 1 NH (92))
 7. Before 112[92 Valid] install result is handled,  eth2 comes up
    7a) So we mark 94 as valid. Also, at this point, 94 and 112 have
        REINSTALL and VALID flag set --> 112[92 Valid, 94 Valid]
 8. NHG install result of 112[92V, 94I] (step 6) is handled and we
    unset REINSTALL and QUEUED flags and set INSTALLED flag.
 9. Now 94 is installed and as part of NHG install result handling, we
     try to install 112.
10. However, since 112 does not have REINSTALL set and does have the
    INSTALLED flag set, it won't be queued for install.

This means zebra thinks it installed 112[92V, 94V] while the kernel only has 112[92].

Logs:
2025/05/30 07:19:14.153717 ZEBRA: [NVFT0-HS1EX] INTF_INSTALL for r1-eth1(3)
2025/05/30 07:19:14.154678 ZEBRA: [ZHFB4-0MJKD] zebra_interface_nhg_reinstall install nhe 92[1.1.1.1 if 3 vrfid 0] nh type 3 flags 0x5
2025/05/30 07:19:14.154678 ZEBRA: [PJXNX-HZST8] zebra_nhg_install_kernel: valid flag set for nh 92[1.1.1.1 if 3 vrfid 0]
2025/05/30 07:19:14.154681 ZEBRA: [ZJTBR-T51KC] zebra_interface_nhg_reinstall dependent nhe 112[92/94] flags (0x100) Setting Reinstall flag
2025/05/30 07:19:14.154810 ZEBRA: [J7K9Z-9M7DT] Nexthop dplane ctx 0xaaaab8422640, op NH_INSTALL, nexthop ID (92), result SUCCESS
2025/05/30 07:19:14.154840 ZEBRA: [MVMV2-45SG2] zebra_nhg_handle_install nh id 92 (flags 0x3) associated dependent NHG 112[92/94] install
2025/05/30 07:19:14.157060 ZEBRA: [NVFT0-HS1EX] INTF_INSTALL for r1-eth2(4)

//Installed 112 with 92 Valid
2025/05/30 07:19:14.157521 ZEBRA: [JAS4D-NCWGP] nlmsghdr [len=44 type=(104) NEWNEXTHOP flags=(0x0501) {REQUEST,DUMP,(ROOT|REPLACE|CAPPED),(ATOMIC|CREATE)} seq=1345 pid=3546608047]
2025/05/30 07:19:14.157521 ZEBRA: [WCX94-SW894]   nhm [family=(0) AF_UNSPEC scope=(0) UNIVERSE protocol=(11) ZEBRA flags=0x00000000 {}]
2025/05/30 07:19:14.157521 ZEBRA: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(1) ID]
2025/05/30 07:19:14.157522 ZEBRA: [Z4E9C-GD9EP]       112
2025/05/30 07:19:14.157522 ZEBRA: [KFBSR-XYJV1]     rta [len=12 (payload=8) type=(2) GROUP]
2025/05/30 07:19:14.157522 ZEBRA: [HD3EC-5WE2D]       id 92 weight 0

2025/05/30 07:19:14.157841 ZEBRA: [PJXNX-HZST8] zebra_nhg_install_kernel: valid flag set for nh 94[1.1.1.2 if 4 vrfid 0]
2025/05/30 07:19:14.157844 ZEBRA: [ZJTBR-T51KC] zebra_interface_nhg_reinstall dependent nhe 112[92/94] flags (0x105) Setting Reinstall flag
2025/05/30 07:19:14.157918 ZEBRA: [J7K9Z-9M7DT] Nexthop dplane ctx 0xaaaab8411e80, op NH_INSTALL, nexthop ID (112), result SUCCESS

//Now 94's installation kicks off 112's installation, which is ignored due to the flags state
2025/05/30 07:19:14.158048 ZEBRA: [J7K9Z-9M7DT] Nexthop dplane ctx 0xaaaab84231c0, op NH_INSTALL, nexthop ID (94), result SUCCESS
2025/05/30 07:19:14.158112 ZEBRA: [MVMV2-45SG2] zebra_nhg_handle_install nh id 94 (flags 0x3) associated dependent NHG 112[92/94] install

Fix:
Force-install dependent NHGs in the dplane NHG result handling of a
singleton install, in all cases. This can lead to reinstalling an already
installed NHG with no change, but that is acceptable as long as zebra and
the kernel stay in sync.

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
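
A sketch of the shape of that fix in the dataplane result path. The function name is made up, the iteration assumes zebra's existing nhg_connected dependency tree, and the zebra_nhg_install_kernel prototype is simplified for the sketch:

/* Sketch only, in the context of zebra's zebra_nhg.c. */
void zebra_nhg_install_kernel(struct nhg_hash_entry *nhe); /* simplified prototype */

static void nhg_singleton_install_result(struct nhg_hash_entry *nhe)
{
        struct nhg_connected *rb_node_dep;

        /* Walk every group NHE that depends on this singleton... */
        frr_each (nhg_connected_tree, &nhe->nhg_dependents, rb_node_dep) {
                /*
                 * ...and always push it down again, even if it already has
                 * the INSTALLED flag set and no REINSTALL flag.  This may
                 * reinstall an unchanged NHG, but it guarantees zebra and
                 * the kernel converge on the same member set.
                 */
                zebra_nhg_install_kernel(rb_node_dep->nhe);
        }
}
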
The current code has issues where (for example) when an interface goes down,
we are supposed to unset ACTIVE on all the impacted nexthops in the NHG's
dependency chain and unset VALID on the NHG (if it has no more active
nhs).

However, there are bugs in the existing code. One such example:
 - NHG 91 (r1-eth1) and NHG 92 (r1-eth2)
 - NHG 114 depends on 91 and 93
 - NHG 113 depends on 114
i.e. [91,93]->114->113

If we bring down the interfaces r1-eth1 and r1-eth2, we are supposed to
unset the NHs of 113 and 114 as ACTIVE and the NHGs as no longer VALID.
However, we are comparing the wrong things. We are supposed to compare the
nexthop of 91 across the entire chain, i.e. compare the nh of 91 (1.1.1.1)
with 113, and compare the nh of 91 (1.1.1.1) with 114. But the code does
not do that.

Initial State:
root@r1:/tmp/topotests/all_protocol_startup.test_all_protocol_startup/r1# ip route show | grep 3.3.3.1
3.3.3.1 nhid 114 proto 194 metric 20

root@r1:/tmp/topotests/all_protocol_startup.test_all_protocol_startup/r1# vtysh -c "sh nexthop-group rib 114"
ID: 114 (zebra)
     Nexthop Count: 2
     Valid, Installed
     Depends: (91) (93)
           via 1.1.1.1, r1-eth1 (vrf default) onlink, weight 1
           via 1.1.1.2, r1-eth2 (vrf default) onlink, weight 1
     Dependents: (113)
root@r1:/tmp/topotests/all_protocol_startup.test_all_protocol_startup/r1# vtysh -c "sh nexthop-group rib 113"
ID: 113 (zebra)
     Valid
     Depends: (114)
        via 2.2.2.1 (vrf default) (recursive), weight 1
           via 1.1.1.1, r1-eth1 (vrf default) onlink, weight 1
           via 1.1.1.2, r1-eth2 (vrf default) onlink, weight 1
     Dependents: (121)

Trigger:
root@r1:/tmp/topotests/all_protocol_startup.test_all_protocol_startup/r1# ip link set down r1-eth1
root@r1:/tmp/topotests/all_protocol_startup.test_all_protocol_startup/r1# ip link set down r1-eth2

Final State:
root@r1:/tmp/topotests/all_protocol_startup.test_all_protocol_startup/r1# vtysh -c "sh nexthop-group rib 114"
ID: 114 (zebra)
     Depends: (91) (93)
           via 1.1.1.1, r1-eth1 (vrf default) inactive onlink, weight 1
           via 1.1.1.2, r1-eth2 (vrf default) inactive onlink, weight 1
     Dependents: (113)
root@r1:/tmp/topotests/all_protocol_startup.test_all_protocol_startup/r1# vtysh -c "sh nexthop-group rib 113"
ID: 113 (zebra)
     Valid ---> WRONG
     Depends: (114)
        via 2.2.2.1 (vrf default) (recursive), weight 1
           via 1.1.1.1, r1-eth1 (vrf default) onlink, weight 1 --> WRONG
           via 1.1.1.2, r1-eth2 (vrf default) onlink, weight 1 --> WRONG
     Dependents: (121) (259)

There are many such cases where the propagation of ACTIVE (NH) and
VALID (NHG) is broken (easily, and especially, when we have NHGs using recursive NHs).

Fix this by rewriting the zebra_nhg_check/set_valid functions to
propagate through the dependency chain for singletons (direct/recursive)
and groups, for triggers such as NHG add/del or interface flap/del.

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
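
A simplified sketch of that propagation idea. This is not the rewritten zebra_nhg_check/set_valid code; it only shows the NHG VALID side and assumes zebra's existing nhg_connected dependency trees:

/* Sketch only, in the context of zebra's zebra_nhg.c. */
static void nhe_propagate_valid(struct nhg_hash_entry *nhe)
{
        struct nhg_connected *rb_node_dep;

        /* Re-evaluate every NHE that depends on the one that just changed. */
        frr_each (nhg_connected_tree, &nhe->nhg_dependents, rb_node_dep) {
                struct nhg_hash_entry *dep = rb_node_dep->nhe;
                struct nhg_connected *member;
                bool any_valid = false;

                /* A group/recursive NHE stays VALID only while at least one
                 * of the NHEs it depends on is still valid. */
                frr_each (nhg_connected_tree, &dep->nhg_depends, member) {
                        if (CHECK_FLAG(member->nhe->flags, NEXTHOP_GROUP_VALID)) {
                                any_valid = true;
                                break;
                        }
                }

                if (any_valid)
                        SET_FLAG(dep->flags, NEXTHOP_GROUP_VALID);
                else
                        UNSET_FLAG(dep->flags, NEXTHOP_GROUP_VALID);

                /* Keep walking up the chain, e.g. 91/93 -> 114 -> 113 above. */
                nhe_propagate_valid(dep);
        }
}
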
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch 2 times, most recently from ee95583 to 5efad47 Compare June 13, 2025 02:05
…propagation

1) Add test cases for ip nexthop flush/del
2) Add test cases for quick interface flaps
3) Add test cases checking the NHG dependency chain propagation of the
   ACTIVE and VALID flags
4) Remove sleeps and add retry mechanisms
5) Add retry logic to other tests

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch from 5efad47 to 11f7bbf Compare June 13, 2025 02:06
@raja-rajasekar raja-rajasekar changed the title zebra: Re-install all route entries upon kernel NHG action/Quick intf flaps zebra: Re-install all route entries upon kernel NHG action/Quick intf flaps + NHG fixes and improvements Jun 13, 2025
In the existing topotest isis_tilfa_topo1 step 10, the existing script
immediately expects an output that matches, and hence it was passing all
along. However, if we add some sleep after the interface down,
the script fails as expected; in other words, we do not wait long enough
for zebra to handle the interface-down event and change the NHGs to
backup.

With the NHG route-reinstall logic in place, things happen quickly, and
that is how I figured out that the existing script itself has issues.

Fix the expectation of show ip/ipv6 route isis json.

Signed-off-by: Rajasekar Raja <rajasekarr@nvidia.com>
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/fix_nexthop_flush branch from 00ea389 to e0ea389 Compare June 16, 2025 14:33
@raja-rajasekar raja-rajasekar requested a review from ton31337 June 17, 2025 02:20

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@raja-rajasekar raja-rajasekar marked this pull request as draft August 29, 2025 16:25