Description
In v1.15.0 we observed an issue where HTTP retrievals would suddenly and seemingly randomly stop for some endpoints.
The fixes are in the latest boxo release and will be included in v1.17.0. This is a small writeup from the rainbow point of view.
The root cause was unordered delivery of peer Connect/Disconnect events to the peer manager, coupled with the dual bitswap/HTTP stack. For example, after the HTTP stack had successfully connected to a peer and notified the PeerManager, an earlier or spurious disconnect event could be delivered to it, causing the peer manager to remove the peer's queues and abruptly stopping all HTTP retrievals from that peer. Since the HTTP stack still considered itself "connected", no "connect" notification was sent again to correct the issue. The fixes involved a number of improvements to ensure a unified and ordered view of connects and disconnects, and touched several components in order to harden every place that could theoretically trigger it.
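The failure mode above can be illustrated with a minimal Go sketch. This is not boxo's code or its actual fix: the `event`, `naiveManager`, `orderedManager` and `replay` names are hypothetical, and the per-peer `Seq` counter is an assumed stand-in for whatever ordering mechanism the real fix uses. It only shows how a late disconnect removes a queue in a manager that trusts arrival order, and how ignoring stale events avoids that.

```go
package main

import "fmt"

// event is a hypothetical connect/disconnect notification.
// Seq is an assumed per-peer counter assigned when the event is generated.
type event struct {
	Peer      string
	Connected bool
	Seq       uint64
}

// naiveManager applies events in whatever order they arrive.
type naiveManager struct {
	queues map[string]bool
}

func (m *naiveManager) handle(e event) {
	if e.Connected {
		m.queues[e.Peer] = true
		return
	}
	delete(m.queues, e.Peer)
}

// orderedManager ignores any event that is not newer than the
// latest one already applied for that peer.
type orderedManager struct {
	queues  map[string]bool
	lastSeq map[string]uint64
}

func (m *orderedManager) handle(e event) {
	if last, seen := m.lastSeq[e.Peer]; seen && e.Seq <= last {
		return // stale event: drop it instead of tearing down the queue
	}
	m.lastSeq[e.Peer] = e.Seq
	if e.Connected {
		m.queues[e.Peer] = true
		return
	}
	delete(m.queues, e.Peer)
}

// replay feeds the same events to both managers and reports whether
// each one still holds a queue for the given peer afterwards.
func replay(events []event, peer string) (naive, ordered bool) {
	n := &naiveManager{queues: map[string]bool{}}
	o := &orderedManager{queues: map[string]bool{}, lastSeq: map[string]uint64{}}
	for _, e := range events {
		n.handle(e)
		o.handle(e)
	}
	return n.queues[peer], o.queues[peer]
}

func main() {
	// The disconnect (Seq 1) was generated first but is delivered
	// after the HTTP connect (Seq 2).
	events := []event{
		{Peer: "peerA", Connected: true, Seq: 2},
		{Peer: "peerA", Connected: false, Seq: 1},
	}
	naive, ordered := replay(events, "peerA")
	fmt.Println("naive manager still has queue:", naive)
	fmt.Println("ordered manager still has queue:", ordered)
}
```

In the naive manager the queue is gone and, because the HTTP side believes it is still connected, nothing recreates it. The ordered variant keeps the queue because the late disconnect is recognized as older than the connect.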
The issue had been present since HTTP retrieval was enabled, but was exacerbated by the introduction of disconnects on "dont haves", which were meant to spare peers that don't have most blocks from being constantly and optimistically queried. These disconnects opened the window for the race conditions that resulted in the issue.
The issue manifested itself only in the face of such race conditions (a disconnect delivered after, or concurrently with, a connect). It was difficult to trigger (taking from days to weeks to appear) and mostly manifested itself for endpoints with high traffic, where these events happen often. It only affected specific endpoints/peers.
Unfortunately, an initial attempt to mitigate the issue introduced a bad goroutine leak in v1.16.0. The release was later withdrawn while additional testing and fixes were developed in boxo.