Skip to content

Conversation

donaldsharp
Copy link
Member

@donaldsharp donaldsharp commented Mar 28, 2024

See 2nd commit for the real meat.

Effectively w/ suppress-fib-pending a withdrawal will not be sent to a peer if BGP is already in the process of waiting for a previous route install for the same route.
ie suppose you have 2 way ecmp w/ bgp-suppress-fib-pending. peer 1 withdraws, bgp marks the route for FIB_INSTALLING and does not send an update. Then if peer 2 withdraws it's route before zebra gives notice, when attempting to send the route via group_announce_route it sees the FIB_INSTALLING flag and does not send the withdrawal.

Closes: #15626

When BGP has been asked to wait for FIB installation, on route
removal a return call is likely to not have the dest since BGP
will have cleaned up the node, entirely.  Let's just note that
the prefix cannot be found if debugs are turned on and move on.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently BGP attempts to send route change information
to it's peers *before* the route is installed into zebra.
This creates a bug in suppress-fib-pending in the following
scenario:

a) bgp suppress-fib-pending and bgp has a route with
2 way ecmp.
b) bgp receives a route withdraw from peer 1.  BGP
will send the route to zebra and mark the route as
FIB_INSTALL_PENDING.
c) bgp receives a route withdraw from peer 2.  BGP
will see the route has the FIB_INSTALL_PENDING and
not send the withdrawal of the route to the peer.
bgp will then send the route deletion to zebra and
clean up the bgp_path_info's.

At this point BGP is stuck where it has not sent
a route withdrawal to downstream peers.

Let's modify the code in bgp_process_main_one to
send the route notification to zebra first before
attempting to announce the route.  The route withdrawal
will remove the FIB_INSTALL_PENDING flag from the dest
and this will allow group_announce_route to believe
it can send the route withdrawal.

For the master branch this is ok because the recent
backpressure commits are in place and nothing is going
to change from an ordering perspective in that regards.
Ostensibly this fix is also for operators of Sonic and
will be backported to the 8.5 branch as well.  This will
change the order of the send to peers to be after the
zebra installation but sonic users are using suppress-fib-pending
anyways so updates won't go out until rib ack has been
received anyways.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
@ton31337
Copy link
Member

@Mergifyio backport dev/10.0 stable/9.1 stable/9.0 stable/8.5

Copy link

mergify bot commented Mar 28, 2024

backport dev/10.0 stable/9.1 stable/9.0 stable/8.5

✅ Backports have been created

Copy link
Member

@riw777 riw777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@ton31337 ton31337 merged commit 27cc9ae into FRRouting:master Apr 3, 2024
ton31337 added a commit that referenced this pull request Apr 3, 2024
ton31337 added a commit that referenced this pull request Apr 4, 2024
ton31337 added a commit that referenced this pull request Apr 4, 2024
@f0o
Copy link

f0o commented Apr 19, 2024

I believe this added regression.

I was running FRR 8.1 until yesterday when I had to update to 10.x due to an AddPath issue crashing sessions.

However now with 10.x I do see similar symptoms as outlined in #15626 - even worse, I see that routes are added to the FIB and yet not announced to peers. (#15786 (comment))

Disabling suppress-fib-pending causes intermittent network loss (anecdotal, we received alerts around 1am tonight and are still going though logs) but fixed the issue of prefixes not being advertised to peers.

Worth to mention that 8.1 did not have this issue.

//Edit:

Further debugging of logs and bgp routing tables show that the network loss might not be related to this, we've set up continuous network probes and bgp table snapshots to get a bigger picture.

StormLiangMS pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Apr 29, 2024
Why I did it
Upgrading FRR 8.5.4 to include latest fixes.

Work item tracking
Microsoft ADO (number only):
How I did it
New patches that were added:

Patch	FRR Pull request	Issue fixed
0024-lib-use-snmp-s-large-fd-sets-for-agentx.patch	FRRouting/frr#13396	FRRouting/frr#14143
0025-bgp-community-memory-leak-fix.patch	FRRouting/frr#15466	FRRouting/frr#15459
0026-bgp-fib-suppress-announce-fix.patch	FRRouting/frr#15634	FRRouting/frr#15626
0027-lib-Do-not-convert-EVPN-prefixes-into-IPv4-IPv6-if-n.patch	FRRouting/frr#15418	FRRouting/frr#14419
Removed patches:

Patch	Upstream FRR commit that is present in 8.5.4
0019-zebra-Abstract-dplane_ctx_route_init-to-init-route-w.patch	FRRouting/frr@3f01977
0020-zebra-Fix-crash-when-dplane_fpm_nl-fails-to-process-.patch	FRRouting/frr@fe5f624
0022-bgpd-Don-t-read-the-first-byte-of-ORF-header-if-we-a.patch	FRRouting/frr@3515178
0023-bgpd-Make-sure-we-have-enough-data-to-read-two-bytes.patch	FRRouting/frr@460ee93
0024-bgpd-Do-not-process-NLRIs-if-the-attribute-length-is.patch	FRRouting/frr@f291f1e
0025-bgpd-Use-treat-as-withdraw-for-tunnel-encapsulation-.patch	FRRouting/frr@8a4a88c
0026-zebra-Add-encap-type-when-building-packet-for-FPM.patch	FRRouting/frr@f0f7b28
0028-bgpd-Check-mandatory-attributes-more-carefully-for-U.patch	FRRouting/frr@21418d6
0029-bgpd-Handle-MP_REACH_NLRI-malformed-packets-with-ses.patch	FRRouting/frr@30b5c2a
0030-bgpd-Treat-EOR-as-withdrawn-to-avoid-unwanted-handli.patch	FRRouting/frr@01f232c
0031-bgpd-Ignore-handling-NLRIs-if-we-received-MP_UNREACH.patch	FRRouting/frr@a0c4ec2
0032-zebra-Fix-fpm-multipath-encap-addition.patch	FRRouting/frr@10a9a5f
Realigned patches:

Old Patch	New patch
0005-Add-support-of-bgp-l3vni-evpn.patch	0005-Add-support-of-bgp-l3vni-evpn.patch
0021-zebra-remove-duplicated-nexthops-when-sending-fpm-msg.patch	0019-zebra-remove-duplicated-nexthops-when-sending-fpm-msg.patch
0027-zebra-Fix-non-notification-of-better-admin-won.patch	0020-zebra-Fix-non-notification-of-better-admin-won.patch
Disable-ipv6-src-address-test-in-pceplib.patch	0021-Disable-ipv6-src-address-test-in-pceplib.patch
cross-compile-changes.patch	0022-cross-compile-changes.patch
0033-zebra-The-dplane_fpm_nl-return-path-leaks-memory.patch	0023-zebra-The-dplane_fpm_nl-return-path-leaks-memory.patch
How to verify it
Running sonic-mgmt test suite.
@donaldsharp donaldsharp deleted the suppress_fib_funny_business branch July 30, 2025 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[frr 8.5.1] orphan routes with bgp suppress-fib-pending
4 participants