p2p: Make stalling timeout adaptive during IBD #25880
Conversation
Nice observation. (Brainstorm idea) How about something like doubling the timeout every time it causes a disconnection, and then reducing/resetting it when the window actually moves?
Rather than this, it might be better to track download speeds from each peer, and check the speeds of this peer after 2 seconds. For an immediate fix, though, maybe just making the timeout configurable would be a good idea? Perhaps as an interim between these two ideas, if we disconnect N stalling peers, start increasing the timeout.
We would also have to compare it to the speed of others and have some criterion what deviation would be enough to disconnect.
Thanks! These suggestions are similar, make a lot of sense to me, and don't look very invasive to implement; planning to change to this approach and update soon.
Force-pushed from ea9915b to 686936c (compare)
I now implemented the suggestion by @sipa to double the timeout and updated the OP. I tested this manually by catching up to the best chain with an ~1 month old datadir with […]
(PR title changed from "… BLOCK_STALLING_TIMEOUT timeout during IBD" to "… BLOCK_STALLING_TIMEOUT timeout adaptive during IBD", and later to the current "p2p: Make stalling timeout adaptive during IBD".)
Nice! This seems a fine improvement.
I think one way of looking at stalling is that it happens when one peer's bandwidth is less than 1/64th of the total bandwidth (64 = 1024/16 = window/max in transit) [0]. I think that means there might be a clever way of preventing slow nodes from stalling the download by reducing the in-transit limit instead -- so that instead of supplying 16 blocks in the time it takes the other peers to supply 1008 to avoid stalling, the peer only needs to supply 8 or 4 blocks in the time it takes the other peers to supply 1016 or 1020. [1]
I think adding the blocks-only nodes probably made this slightly worse, since there are now 2 extra peers, so now you need something like 25% more bandwidth in order to still have 1/64th of the total...
[0] Measured in blocks of course, so even if your bandwidth in bytes is fine, you might be unlucky to be asked for 16 blocks that are 2MB each, while everyone else is just being asked for 1008 50kB blocks (32MB total vs 7.2MB per peer).
[1] Perhaps you could implement this by keeping a global and per-peer exponential rolling average of how many blocks you download per second; then you could set the peer's in-transit limit to […]
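A rough sketch of the idea in footnote [1], under the assumption that we keep a per-peer exponential moving average of delivered blocks per second and take the global rate as the sum of those averages. All type and function names here are illustrative and are not existing Bitcoin Core APIs; only the constants 16 and 1024 come from the thread.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>

constexpr int MAX_BLOCKS_IN_TRANSIT_PER_PEER{16};
constexpr int BLOCK_DOWNLOAD_WINDOW{1024};
constexpr double EMA_ALPHA{0.1}; // weight given to the newest rate sample

struct DownloadRateTracker {
    std::map<int64_t, double> m_blocks_per_sec; // per-peer EMA, keyed by peer id

    // Feed a new rate sample whenever a block from this peer completes.
    void OnBlockDownloaded(int64_t peer, double measured_blocks_per_sec)
    {
        double& rate = m_blocks_per_sec[peer];
        rate = (1.0 - EMA_ALPHA) * rate + EMA_ALPHA * measured_blocks_per_sec;
    }

    double GlobalRate() const
    {
        double total{0.0};
        for (const auto& entry : m_blocks_per_sec) total += entry.second;
        return total;
    }

    // Give a peer roughly its share of the 1024-block window, clamped to [1, 16]:
    // a peer providing much less than 1/64 of the total rate gets fewer than 16
    // blocks in flight and therefore cannot hold the whole window hostage.
    int InTransitLimit(int64_t peer) const
    {
        const auto it = m_blocks_per_sec.find(peer);
        const double total = GlobalRate();
        if (it == m_blocks_per_sec.end() || total <= 0.0) return MAX_BLOCKS_IN_TRANSIT_PER_PEER;
        const int limit = static_cast<int>(it->second / total * BLOCK_DOWNLOAD_WINDOW);
        return std::clamp(limit, 1, MAX_BLOCKS_IN_TRANSIT_PER_PEER);
    }
};

int main()
{
    DownloadRateTracker tracker;
    tracker.OnBlockDownloaded(0, 0.2);  // a slow peer
    tracker.OnBlockDownloaded(1, 14.0); // the rest of the peers, lumped together for the demo
    std::printf("in-transit limit for the slow peer: %d\n", tracker.InTransitLimit(0));
}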
src/net_processing.cpp
Outdated
@@ -1723,6 +1729,9 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
}
}
if (m_chainman.ActiveChainstate().IsInitialBlockDownload()) {
I would think we should only reduce the timeout if the block was in fact downloaded in less time (than the would-be new timeout).
What I like about the current solution is that it is very simple, prevents the node from getting stuck, and doesn't require additional bookkeeping of historical block download times.
I think your suggestion wouldn't be completely straightforward to implement: the block being connected here might have been downloaded some time in the past, saved to disk, but only connected now (as a result of its predecessor being connected). So we'd need some data structure to keep track of download times for not-yet-connected blocks and add/remove entries from it during IBD.
If we did this, it would help us cycle through fewer peers in situations where we assign multiple blocks to a peer and halving the timeout after successfully downloading a block would lead to a stalling situation again - but note that there are also other sources of disconnections that could be improved if we kept track of this kind of data during IBD: e.g. if doubling the timeout is not sufficient and we'd need to 4x or 8x it.
So if we want something better but more complicated (with bookkeeping), my feeling is that we should go for another approach altogether, like basing the stalling timeout on a running average over the last received block times from multiple peers instead of a doubling/halving approach.
With this current code, it will stall, get a block, stall, get a block, stall, etc repeatedly...
Okay
With this current code, it will stall, get a block, stall, get a block, stall, etc repeatedly...
Yes, but only a few times until the blocks preventing the tip from moving are downloaded; then the tip advances by connecting the large number of stashed blocks from the 1024 window, ending the stalling situation. If every peer is equally slow, it doesn't matter if you download a block in 2s or 1 minute from the viewpoint of the stalling logic.
Maybe instead of halving the timeout, we should, on block connection, multiply it by a factor 0.5 < f < 1 to let it relax back more slowly?
I agree it seems better to decrease it slowly. If it was a single slow peer, then there would be many blocks coming on time afterwards.
With the latest push, I decrease it by a factor 0.85 with each connected block.
That sounds like a very interesting alternative approach. I'm not sure I understand it completely though: Are you suggesting that we assign slower peers fewer blocks simultaneously, to help prevent stalling situations from occurring in the first place? And also that we move away from the concept that a stalling situation occurs only when we can't move the 1024 block window forward, making it dependent on the other peers instead, so that we'd possibly disconnect slow peers much earlier than that if they are slower in comparison to faster ones?
Force-pushed from 686936c to 7c8c4e4 (compare)
ACK 7c8c4e4
Concept ACK
Some adaptivity seems to be warranted, because network throughput is rarely a constant.
src/net_processing.cpp
Outdated
m_block_stalling_timeout = std::min(2 * m_block_stalling_timeout.load(), MAX_BLOCK_STALLING_TIMEOUT);
LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", m_block_stalling_timeout.load().count());
This sequence is not atomic. If two threads execute the increase/decrease code concurrently, it could lead to unexpected results. Consider this alternative: it only does the inc/dec if no other thread changed the value in the meantime, and otherwise leaves it untouched:
diff --git i/src/net_processing.cpp w/src/net_processing.cpp
index 42686f0db0..fd2f22cdcd 100644
--- i/src/net_processing.cpp
+++ w/src/net_processing.cpp
@@ -1726,14 +1726,16 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
LOCK(cs_main);
for (const auto& ptx : pblock->vtx) {
m_txrequest.ForgetTxHash(ptx->GetHash());
m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
}
}
- if (m_chainman.ActiveChainstate().IsInitialBlockDownload()) {
- m_block_stalling_timeout = std::max(m_block_stalling_timeout.load() / 2, DEFAULT_BLOCK_STALLING_TIMEOUT);
+ auto stalling_timeout = m_block_stalling_timeout.load();
+ const auto new_timeout = std::max(stalling_timeout / 2, DEFAULT_BLOCK_STALLING_TIMEOUT);
+ if (m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout) && stalling_timeout != new_timeout) {
+ LogPrint(BCLog::NET, "Decreased stalling timeout to %d seconds\n", new_timeout.count());
}
}
void PeerManagerImpl::BlockDisconnected(const std::shared_ptr<const CBlock> &block, const CBlockIndex* pindex)
{
// To avoid relay problems with transactions that were previously
@@ -5231,22 +5233,25 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
}
}
if (!vInv.empty())
m_connman.PushMessage(pto, msgMaker.Make(NetMsgType::INV, vInv));
// Detect whether we're stalling
- if (state.m_stalling_since.count() && state.m_stalling_since < current_time - m_block_stalling_timeout.load()) {
+ auto stalling_timeout = m_block_stalling_timeout.load();
+ if (state.m_stalling_since.count() && state.m_stalling_since < current_time - stalling_timeout) {
// Stalling only triggers when the block download window cannot move. During normal steady state,
// the download window should be much larger than the to-be-downloaded set of blocks, so disconnection
// should only happen during initial block download.
LogPrintf("Peer=%d is stalling block download, disconnecting\n", pto->GetId());
pto->fDisconnect = true;
// Increase timeout for the next peer so that we don't disconnect multiple peers if our own
// bandwidth is insufficient.
- m_block_stalling_timeout = std::min(2 * m_block_stalling_timeout.load(), MAX_BLOCK_STALLING_TIMEOUT);
- LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", m_block_stalling_timeout.load().count());
+ const auto new_timeout = std::min(2 * stalling_timeout, MAX_BLOCK_STALLING_TIMEOUT);
+ if (m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout) && stalling_timeout != new_timeout) {
+ LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", new_timeout.count());
+ }
return true;
}
// In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
// We compensate for other peers to prevent killing off peers due to our own downstream link
// being saturated. We only count validated in-flight blocks so peers can't advertise non-existing block hashes
Thanks, I took your suggestion and added you as coauthor - I didn't know about compare_exchange_strong before.
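For readers who, like the author above, haven't used it before, here is a minimal standalone illustration of compare_exchange_strong semantics (assumes C++17; the atomic duration type is chosen to mirror how m_block_stalling_timeout is used in the diff above, not copied from the codebase):

#include <atomic>
#include <cassert>
#include <chrono>

int main()
{
    using namespace std::chrono_literals;
    std::atomic<std::chrono::seconds> timeout{2s};

    // Success: 'expected' matches the stored value, so it is replaced by
    // 'desired' and the call returns true.
    auto expected = timeout.load();
    assert(timeout.compare_exchange_strong(expected, 2 * expected));
    assert(timeout.load() == 4s);

    // Failure: someone else changed the value in the meantime (simulated by a
    // plain store). Nothing is written, false is returned, and 'expected' is
    // overwritten with the value actually found - so the caller can see what
    // happened instead of clobbering the other thread's update.
    timeout.store(8s);
    expected = 4s;
    assert(!timeout.compare_exchange_strong(expected, 16s));
    assert(expected == 8s && timeout.load() == 8s);
}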
Concept ACK. I will give a code review ACK once you resolve vasil's comments :)
Yeah, it shouldn't hold up this fix though, I think.
Yes.
Not really. I think you need to keep the 1024 block window (since increasing that hurts pruning), and I think that if the window gets full you should still call that "stalling". But I think if you change the […]
Maybe a more specific example would be worthwhile. As it stands, if your first peer will give you one block every 5 seconds, and your other 9 peers will collectively give you 14 blocks every second (on average, 7.8 times faster than the first peer, in total 70 times faster), then by the time that first peer has downloaded blocks 1..15 (which takes 75 seconds), the other peers will have given you blocks 17..1039 after 73.1 seconds, and stalling gets triggered. But if the slow peer had only queued up 8 blocks, then it would have supplied them in 40 seconds, which only gives the other peers enough time to supply 560 blocks, so they won't fill up the window.
Hmm, I guess it ought to be possible to simulate that scenario via the functional test's […]
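The arithmetic in that example can be sanity-checked in a few lines (all numbers are copied from the comment above; nothing here comes from the actual test or node code):

#include <cstdio>

int main()
{
    const double slow_rate = 1.0 / 5.0; // slow peer: one block every 5 seconds
    const double fast_rate = 14.0;      // all other peers combined, blocks per second

    const double t_slow_15 = 15 / slow_rate;     // 75 s to deliver blocks 1..15
    const double t_fast_1023 = 1023 / fast_rate; // ~73.1 s for the others to deliver blocks 17..1039
    std::printf("slow peer: %.1f s, others: %.1f s -> window fills first, stall triggers\n",
                t_slow_15, t_fast_1023);

    const double t_slow_8 = 8 / slow_rate;           // 40 s if only 8 blocks were queued to the slow peer
    const double fast_blocks = t_slow_8 * fast_rate; // 560 blocks from the others in that time
    std::printf("with 8 queued blocks: %.0f s, others deliver %.0f blocks -> window not full\n",
                t_slow_8, fast_blocks);
}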
Concept ACK. I think it would be nice if this PR also added some tests, because it looks like we didn't have any tests for the stalling mechanism in the first place.
Will address feedback soon (and work on adding a test for the stalling logic).
Force-pushed from 7c8c4e4 to a764c20 (compare)
Force-pushed from a764c20 to 48e5385 (compare)
The net processing changes look good to me, left some comments on the test.
test/functional/p2p_ibd_stalling.py
Outdated
peers[-1].block_store = block_dict
peers[-1].send_message(headers_message)
self.wait_until(lambda: self.total_blocks_sent(peers) == NUM_BLOCKS - 2)
time.sleep(0.5) # Wait until all blocks have arrived at the node
Not a fan of timeouts like this, but in this case I also don't see how to avoid it.
yes, I was looking for an RPC we could wait_until for to avoid this - but I didn't find a way of querying the number of blocks a node has downloaded (including those not connected to the chain yet).
getpeerinfo()[0]["bytesrecv_per_msg"]["block"] ?
To follow @glozow's idea, this should work:
diff --git i/test/functional/p2p_ibd_stalling.py w/test/functional/p2p_ibd_stalling.py
index 78626c003b..d593187d74 100755
--- i/test/functional/p2p_ibd_stalling.py
+++ w/test/functional/p2p_ibd_stalling.py
@@ -79,13 +79,14 @@ class P2PIBDStallingTest(BitcoinTestFramework):
with self.nodes[0].assert_debug_log([], unexpected_msgs=['Stall started']):
for id in range(8):
peers.append(node.add_outbound_p2p_connection(P2PStaller(stall_block), p2p_idx=id, connection_type="outbound-full-relay"))
peers[-1].block_store = block_dict
peers[-1].send_message(headers_message)
self.wait_until(lambda: self.total_blocks_sent(peers) == NUM_BLOCKS - 2)
- time.sleep(0.5) # Wait until all blocks have arrived at the node
+ self.wait_until(lambda: self.total_bytes_recv_for_blocks() == 172761)
+
self.log.info("Check that increasing the window beyond 1024 blocks triggers stalling logic")
headers_message.headers = [CBlockHeader(b) for b in blocks]
with self.nodes[0].assert_debug_log(expected_msgs=['Stall started peer=0']):
for p in peers:
p.send_message(headers_message)
@@ -149,9 +150,15 @@ class P2PIBDStallingTest(BitcoinTestFramework):
def total_blocks_sent(self, peers):
num_blocks = 0
for p in peers:
num_blocks += p.blocks_sent
return num_blocks
+ def total_bytes_recv_for_blocks(self):
+ total = 0
+ for info in self.nodes[0].getpeerinfo():
+ total += info["bytesrecv_per_msg"]["block"]
+ return total
+
if __name__ == '__main__':
P2PIBDStallingTest().main()
and even better if we can put some formula behind the magic number.
or, more pythonish:
self.wait_until(lambda: sum(e["bytesrecv_per_msg"]["block"] for e in self.nodes[0].getpeerinfo()) == 172761)
Thanks, that works!
We send 1023 blocks, so the formula would be 126 * 168 + 897 * 169 = 172761 (at some point the blocks get larger by 1 byte).
Not super stable because bytesrecv_per_msg includes the extra 24 bytes of magic etc. (not just the payload) - so I think the magic number would e.g. be different for BIP324 - but it's definitely better than a fixed timeout.
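A quick compile-time check of that formula, assuming (as stated above) 126 "block" messages of 168 bytes and 897 of 169 bytes on the wire, 24-byte message header included:

// Sanity check of the magic number used in the test; the byte counts are
// taken from the comment above and are specific to the v1 transport and to
// the block sizes this particular test generates.
static_assert(126 + 897 == 1023, "one message per block sent in this phase of the test");
static_assert(126 * 168 == 21168 && 897 * 169 == 151593, "per-size subtotals");
static_assert(21168 + 151593 == 172761, "expected total for the block entries of bytesrecv_per_msg");

int main() { return 0; }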
I ended up using the less pythonish version because we also need a check that the "block" field exists in "bytesrecv_per_msg", and I find that easier to read. Added a comment for the magic number.
Concept ACK
src/net_processing.cpp
Outdated
@@ -1723,6 +1730,12 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
}
}
auto stalling_timeout = m_block_stalling_timeout.load();
// In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
const auto new_timeout = std::max(std::chrono::duration_cast<std::chrono::seconds> (stalling_timeout * 0.85), DEFAULT_BLOCK_STALLING_TIMEOUT);
Is it intentional that this happens with an accuracy of 1 second?
So for example you could have the sequence 64, 54, 45, 38, 32, 27, 22, 18, 15, 12, 10, 8, 6, 5, 4, 3, 2... seconds.
The stalling_timeout != new_timeout condition can also be placed before the exchange, I think?
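To make the 1-second granularity concrete, here is a small standalone loop that applies the same 0.85 factor with duration_cast back to whole seconds (a sketch of the decay only, not the PR code itself):

#include <algorithm>
#include <chrono>
#include <cstdio>

int main()
{
    using namespace std::chrono_literals;
    constexpr auto DEFAULT_TIMEOUT = 2s;
    auto timeout = 64s; // as if the timeout had been doubled up to its maximum

    while (timeout > DEFAULT_TIMEOUT) {
        // duration_cast truncates toward zero, so the steps shrink to 1s near the bottom
        timeout = std::max(std::chrono::duration_cast<std::chrono::seconds>(timeout * 0.85),
                           DEFAULT_TIMEOUT);
        std::printf("%lld ", static_cast<long long>(timeout.count()));
    }
    // Prints: 54 45 38 32 27 22 18 15 12 10 8 6 5 4 3 2
}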
Probably should be IMO. Not sure how efficient std::atomic is with std::chrono units.
Is it intentional that this happens with an accuracy of 1 second?
It was kind of intentional: I first thought of changing m_block_stalling_timeout to microseconds, requiring conversions for the logging etc., but then I wondered whether a higher accuracy adds anything - I also thought of simply decreasing it constantly by one second per block received (instead of using a factor).
Do you or others have a preference here?
The stalling_timeout != new_timeout condition can also be placed before the exchange, I think?
Done - I changed the order within the if statements (in two places).
Reviewed for the PR club.
src/net_processing.cpp
Outdated
@@ -1723,6 +1730,12 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock | |||
m_txrequest.ForgetTxHash(ptx->GetWitnessHash()); | |||
} | |||
} | |||
auto stalling_timeout = m_block_stalling_timeout.load(); | |||
// In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value | |||
const auto new_timeout = std::max(std::chrono::duration_cast<std::chrono::seconds> (stalling_timeout * 0.85), DEFAULT_BLOCK_STALLING_TIMEOUT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should be IMO. Not sure how efficient std::atomic is with std::chrono units.
The test takes 1m45s on my laptop
That was when compiled with TSAN. Normal debug build takes only about 30 seconds. It still fails. There are some problems with the test:
- First test (should not stall): those "received: block..." messages in the log above (from the failure of the second test) are produced by the first test. The sleep(0.5) was apparently not enough, so I fixed it to sleep(5) (just for testing, not to actually have it in the final test). This means that in practice it could stall and remain undetected by the first test, because it will be happy to not see "Stall started" in the log even though it may be printed shortly after the first test has eagerly declared success. We want to check that there are 1023 "received: block" messages in the log, that afterwards the stalling logic from SendMessages() is executed, and that after that there is no "Stall started" in the log. I am not sure how to do that. Checking the bytes received for block messages seems to be better than the sleep, but could still end the wait too early.
- Second test (should stall): it fails because there is no "Stall started peer=0" message. I added sleep(10) at the end of the with... block to wait even more for the stall. Then it fails with this error:
AssertionError: [node 0] Expected messages "['Stall started peer=0']" does not partially match log:
- 2022-10-20T12:16:52.334202Z [net_processing.cpp:2806] [ProcessMessage] [net] received: headers (83028 bytes) peer=1
- 2022-10-20T12:16:55.712397Z [validation.cpp:3686] [ProcessNewBlockHeaders] Synchronizing blockheaders, height: 1025 (~0.17%)
- 2022-10-20T12:16:55.712923Z [net_processing.cpp:2806] [ProcessMessage] [net] received: headers (83028 bytes) peer=8
- 2022-10-20T12:16:59.078114Z [net_processing.cpp:5328] [SendMessages] [net] Stall started peer=1
- 2022-10-20T12:16:59.078214Z [net_processing.cpp:2806] [ProcessMessage] [net] received: headers (83028 bytes) peer=4
There is a "Stall started peer=1" message but it comes 7 seconds after "received: headers" and is for a different peer. Maybe we should instead wait for the message to appear with wait_for_debug_log() and omit the peer=0 part.
I am ok to drop the test. It is good to have tests to ensure the code works as intended. But we can't have tests for everything and there is a subjective threshold somewhere. If it is too difficult to implement properly or is more complicated than the actual code it tests, then it may be too expensive. There is maintenance cost for the test too. Developers could waste precious time investigating a sporadically failing test, fixing it, or trying to figure out whether their (seemingly unrelated) changes broke the test. I am not saying to drop the test, just that I would be ok with that.
Force-pushed from 4b0dbc0 to 9339230 (compare)
Addressed the test feedback (will get to the outstanding comment for the main commit a bit later).
I rewrote the test such that it doesn't use the log anymore, but waits until all blocks are received, syncs (so that a peer could get marked as a staller), waits for 3s, syncs again (so that a peer could get disconnected), and then checks that no peer gets disconnected.
I removed the […]
If everyone agrees, that would be ok with me. However, the stalling logic was completely untested before, which is not ideal, so the test doesn't just cover the changes from this PR.
This makes the stalling detection mechanism (previously a fixed timeout of 2s) adaptive: If we disconnect a peer for stalling, double the timeout for the next peer - and let it slowly relax back to its default value each time the tip advances. (Idea by Pieter Wuille) This makes situations more unlikely in which we'd keep on disconnecting many of our peers for stalling, even though our own bandwidth is insufficient to download a block in 2 seconds. Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Force-pushed from 9339230 to cc81679 (compare)
Force-pushed from cc81679 to aceff9e (compare)
ACK aceff9e
Force-pushed from aceff9e to 39b9364 (compare)
ACK 39b9364
ACK 39b9364
// bandwidth is insufficient.
const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", m_block_stalling_timeout.load().count());
nit: why not use new_timeout here in the log?
- It has the same meaning, unless there was some crazy concurrency, in which case the log might not make sense anyway....
- It is shorter :)
- It is probably more efficient
- Reading the code is easier
actually, you do what i suggest in the decreasing code :)
I somehow missed this, sorry, but added it to #26982.
ACK 39b9364
Strong Concept ACK 39b9364. Will do some tests ASAP.
@@ -1723,6 +1730,16 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
}
}

// In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
I'm not sure we should do this every time a block is connected. Whenever a staller got disconnected, and the missing block arrived from another peer, we may suddenly be able to connect dozens of blocks at once. Performing the timeout-reduction step here 16 times suffices to get it back from the maximum 64 to the minimum 2.
I think it'd be better to drop it just once every time the download window moves, regardless of how much it moved.
Actually, thinking more about this, I don't think that's ideal either. The window will likely move many times between would-be stalls, even when the stalling timeout has adapted to be close to the "correct" value.
We should aim to be in a position where the stalling timeout is sort of in an equilibrium between triggering occasionally but not all the time. I think the best way to achieve that is to:
- Increase it when it triggers due to being too low (reducing the probability of triggering in the future) [implemented]
- Decrease it when it didn't trigger due to being high enough. And I think we have a way of measuring that: when the stalling detection triggers, and the stalling timer starts, but then the timeout is not reached. And by seeing how long it actually took before the stalling state is resolved, we can even do better than just applying a % drop; e.g. we could set the new timeout to (old_timeout + actual_time_used) / 2 (a rough sketch of this follows below).
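A rough sketch of that follow-up policy, which is explicitly not what this PR implements; the struct and function names are made up for illustration:

#include <algorithm>
#include <chrono>

using std::chrono::seconds;

constexpr seconds TIMEOUT_DEFAULT{2};
constexpr seconds TIMEOUT_MAX{64};

struct StallTimeoutPolicy {
    seconds timeout{TIMEOUT_DEFAULT};

    // Called when the stalling timeout fires and a peer gets disconnected.
    void OnStallTimeoutHit()
    {
        timeout = std::min(2 * timeout, TIMEOUT_MAX);
    }

    // Called when a stall was detected but resolved (the critical block arrived)
    // after 'actual_time_used', i.e. before the timeout fired: move the timeout
    // halfway towards the time the resolution actually took.
    void OnStallResolved(seconds actual_time_used)
    {
        timeout = std::max((timeout + actual_time_used) / 2, TIMEOUT_DEFAULT);
    }
};

int main()
{
    StallTimeoutPolicy p;
    p.OnStallTimeoutHit();          // 2s -> 4s
    p.OnStallResolved(seconds{3});  // (4s + 3s) / 2 = 3s
    return p.timeout.count() == 3 ? 0 : 1;
}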
@sipa Do you think that your proposed change should be implemented in this PR or can it be done in a followup? From my perspective, this PR seems to be strictly an improvement even if the stalling timeout backs off too aggressively.
I haven't had time to look into this feedback closely yet - but I am planning to do that next week.
I think this could be a good follow-up, but anyway, here are my thoughts.
Decrease it when it didn't trigger due to being high enough.
I think this is a very good abstract policy.
when the stalling detection triggers, and the stalling timer starts, but then the timeout is not reached.
This sounds more efficient at doing what it's supposed to do than what's implemented in this PR currently.
; e.g. we could set the new timeout to (old_timeout + actual_time_used) / 2.
Sounds like a good concrete policy, but not going lower than 2 seconds probably. One could do some math modeling, but I don't think it's that helpful:
- with random data, there is no ground truth — one would have to rely on the human sanity check of the inputs, which we already do verbally here;
- could be tested against a couple laggy nodes too, comparing between different policies, but eh.
From my perspective, this PR seems to be strictly an improvement even if the stalling timeout backs off too aggressively.
I'd say that with its current approach, the PR improves behavior in isolated stalling situations where there are currently repeated disconnections of many/all of our peers without making any progress in getting blocks - but a previous stalling situation will not affect the behavior in future stalling situations because all memory of the previous stalling incident will be lost after a few blocks:
@sipa's suggestion would introduce a long-lasting memory of previous stalling situations
- I think that one downside of this approach is that the moving window algorithm should only lead to a stalling situation if one peer is significantly slower than the rest of the peers. We'd want to replace this first peer usually - giving it more time based on previous stalling situations is probably not beneficial, because if it was comparably fast to other peers, this would not have led to this peer being marked as a staller in the first place.
- The upside is in situations where the actual time to download a block for us is significantly larger than 2 seconds - we'd churn through multiple peers / timeout doublings again in every stalling situation right now until we reach the "correct" timeout, but wouldn't anymore with a longer memory.
- Decrease it when it didn't trigger due to being high enough. And I think we have a way of measuring that: when the stalling detection triggers, and the stalling timer starts, but then the timeout is not reached. And by seeing how long it actually took for before the stalling state is resolved we can even do better than just applying a % drop; e.g. we could set the new timeout to (old_timeout + actual_time_used) / 2.
I think this would mean moving the decrease of the stalling timeout to ProcessMessages (NetMsgType::BLOCK), where m_stalling_since is currently reset back to 0. At this point we haven't validated the block yet or connected it to the chain, so we'd likely need to at least make sure that we only decrease it after receiving the actual block that allows us to extend our chain (the peer might also have been sending us another block).
@mzumsande I see. So the point isn't so much that you're trying to build something that tries to measure and converge towards an optimal long-term stalling timeout for your network conditions, but rather want something that deliberately gives a temporary "cool down" period after a stalling kick so it doesn't result in a flurry of disconnects.
So I think something like my suggestion still makes sense, but it's perhaps an orthogonal thing, and not for this PR.
39b9364 test: add functional test for IBD stalling logic (Martin Zumsande) 0565951 p2p: Make block stalling timeout adaptive (Martin Zumsande) Pull request description: During IBD, there is the following stalling mechanism if we can't proceed with assigning blocks from a 1024 lookahead window because all of these blocks are either already downloaded or in-flight: We'll mark the peer from which we expect the current block that would allow us to advance our tip (and thereby move the 1024 window ahead) as a possible staller. We then give this peer 2 more seconds to deliver a block (`BLOCK_STALLING_TIMEOUT`) and if it doesn't, disconnect it and assign the critical block we need to another peer. Now the problem is that this second peer is immediately marked as a potential staller using the same mechanism and given 2 seconds as well - if our own connection is so slow that it simply takes us more than 2 seconds to download this block, that peer will also be disconnected (and so on...), leading to repeated disconnections and no progress in IBD. This has been described in bitcoin#9213, and I have observed this when doing IBD on slower connections or with Tor - sometimes there would be several minutes without progress, where all we did was disconnect peers and find new ones. The `2s` stalling timeout was introduced in bitcoin#4468, when blocks weren't full and before Segwit increased the maximum possible physical size of blocks - so I think it made a lot of sense back then. But it would be good to revisit this timeout now. This PR makes the timout adaptive (idea by sipa): If we disconnect a peer for stalling, we now double the timeout for the next peer (up to a maximum of 64s). If we connect a block, we half it again up to the old value of 2 seconds. That way, peers that are comparatively slower will still get disconnected, but long phases of disconnecting all peers shouldn't happen anymore. Fixes bitcoin#9213 ACKs for top commit: achow101: ACK 39b9364 RandyMcMillan: Strong Concept ACK 39b9364 vasild: ACK 39b9364 naumenkogs: ACK 39b9364 Tree-SHA512: 85bc57093b2fb1d28d7409ed8df5a91543909405907bc129de7c6285d0810dd79bc05219e4d5aefcb55c85512b0ad5bed43a4114a17e46c35b9a3f9a983d5754
def all_sync_send_with_ping(self, peers):
    for p in peers:
        if p.is_connected:
            p.sync_send_with_ping()
This fails?
https://cirrus-ci.com/task/4620776167440384?logs=ci#L2630
test 2023-01-27T16:28:47.291000Z TestFramework.p2p (DEBUG): Received message from 0:0: msg_pong(nonce=0000000f)
node0 2023-01-27T16:28:47.340846Z (mocktime: 2023-01-27T16:29:01Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554
node0 2023-01-27T16:28:47.347460Z (mocktime: 2023-01-27T16:29:01Z) [httpworker.0] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
node0 2023-01-27T16:28:47.347940Z (mocktime: 2023-01-27T16:29:01Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554
node0 2023-01-27T16:28:47.348962Z (mocktime: 2023-01-27T16:29:01Z) [httpworker.1] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=setmocktime user=__cookie__
node0 2023-01-27T16:28:47.349250Z (mocktime: 2023-01-27T16:29:03Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554
node0 2023-01-27T16:28:47.349271Z (mocktime: 2023-01-27T16:29:03Z) [httpworker.2] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
node0 2023-01-27T16:28:47.393691Z (mocktime: 2023-01-27T16:29:03Z) [msghand] [net_processing.cpp:5738] [SendMessages] Peer=1 is stalling block download, disconnecting
node0 2023-01-27T16:28:47.398607Z (mocktime: 2023-01-27T16:29:03Z) [msghand] [net_processing.cpp:5744] [SendMessages] [net] Increased stalling timeout temporarily to 16 seconds
node0 2023-01-27T16:28:47.406186Z (mocktime: 2023-01-27T16:29:03Z) [net] [net.cpp:573] [CloseSocketDisconnect] [net] disconnecting peer=1
test 2023-01-27T16:28:47.414000Z TestFramework.p2p (DEBUG): Send message to 0:0: msg_ping(nonce=00000010)
test 2023-01-27T16:28:47.414000Z TestFramework.p2p (DEBUG): Closed connection to: 0:0
node0 2023-01-27T16:28:47.414007Z (mocktime: 2023-01-27T16:29:03Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554
node0 2023-01-27T16:28:47.414037Z (mocktime: 2023-01-27T16:29:03Z) [httpworker.3] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
node0 2023-01-27T16:28:47.414073Z (mocktime: 2023-01-27T16:29:03Z) [net] [net_processing.cpp:1541] [FinalizeNode] [net] Cleared nodestate for peer=1
test 2023-01-27T16:28:47.474000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/test_framework.py", line 134, in main
self.run_test()
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/p2p_ibd_stalling.py", line 133, in run_test
self.all_sync_send_with_ping(peers)
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/p2p_ibd_stalling.py", line 154, in all_sync_send_with_ping
p.sync_send_with_ping()
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 560, in sync_send_with_ping
self.sync_with_ping()
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 570, in sync_with_ping
self.wait_until(test_function, timeout=timeout)
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 463, in wait_until
wait_until_helper(test_function, timeout=timeout, lock=p2p_lock, timeout_factor=self.timeout_factor)
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/util.py", line 267, in wait_until_helper
if predicate():
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 460, in test_function
assert self.is_connected
AssertionError
test 2023-01-27T16:28:47.489000Z TestFramework (DEBUG): Closing down network thread
looks like the p2p instance has been disconnected by bitcoind, but python hasn't received the callback yet, so it attempts to send a ping in between these events. I think counting the nodes with is_connected instead of using num_test_p2p_connections() will fix this. I'll open a PR.
fixed in #26982
b2a1e47 net_processing: simplify logging statement (Martin Zumsande) 6548ba6 test: fix intermittent errors in p2p_ibd_stalling.py (Martin Zumsande) Pull request description: Two small fixups to #25880: - Use `is_connected` instead of `num_test_p2p_connections` to avoid intermittent failures where the p2p MiniNode got disconnected but this info hasn't made it to python yet, so it fails a ping. (bitcoin/bitcoin#25880 (comment)) - Simplify a logging statement (suggested in bitcoin/bitcoin#25880 (comment)) ACKs for top commit: MarcoFalke: review ACK b2a1e47 🕧 Tree-SHA512: 337f0883bf1c94cc26301a80dfa649093ed1e211ddda1acad8449a2add5be44e5c12d6073c209df9c7aa1edb9da33ec1cfdcb0deafd76178ed78785843e80bc7
Summary: This makes the stalling detection mechanism (previously a fixed timeout of 2s) adaptive: If we disconnect a peer for stalling, double the timeout for the next peer - and let it slowly relax back to its default value each time the tip advances. (Idea by Pieter Wuille) This makes situations more unlikely in which we'd keep on disconnecting many of our peers for stalling, even though our own bandwidth is insufficient to download a block in 2 seconds. Co-authored-by: Vasil Dimov <vd@FreeBSD.org> This is a backport of [ [[bitcoin/bitcoin#25880 | core#25880]] and [[bitcoin/bitcoin#26982 | core#26982]] Note that we need to specify `version=4` when calling `create_block` in the test because we miss [[bitcoin/bitcoin#16333 | core#16333]] (which make 4 the default version for create_block) Test Plan: `ninja all check-all` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D15080
During IBD, there is the following stalling mechanism if we can't proceed with assigning blocks from a 1024 lookahead window because all of these blocks are either already downloaded or in-flight: We'll mark the peer from which we expect the current block that would allow us to advance our tip (and thereby move the 1024 window ahead) as a possible staller. We then give this peer 2 more seconds to deliver a block (BLOCK_STALLING_TIMEOUT) and if it doesn't, disconnect it and assign the critical block we need to another peer.
Now the problem is that this second peer is immediately marked as a potential staller using the same mechanism and given 2 seconds as well - if our own connection is so slow that it simply takes us more than 2 seconds to download this block, that peer will also be disconnected (and so on...), leading to repeated disconnections and no progress in IBD. This has been described in #9213, and I have observed this when doing IBD on slower connections or with Tor - sometimes there would be several minutes without progress, where all we did was disconnect peers and find new ones.
The 2s stalling timeout was introduced in #4468, when blocks weren't full and before Segwit increased the maximum possible physical size of blocks - so I think it made a lot of sense back then. But it would be good to revisit this timeout now.
This PR makes the timeout adaptive (idea by sipa):
If we disconnect a peer for stalling, we now double the timeout for the next peer (up to a maximum of 64s). If we connect a block, we decrease it again (by a factor of 0.85 per connected block) until it is back at the old value of 2 seconds. That way, peers that are comparatively slower will still get disconnected, but long phases of disconnecting all peers shouldn't happen anymore.
Fixes #9213
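Putting the description together, a condensed, self-contained sketch of the adaptive timeout (not the verbatim net_processing code; the constants, the compare-and-exchange pattern, and the 0.85 relax factor are taken from the discussion above):

#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdio>

using std::chrono::seconds;

constexpr seconds BLOCK_STALLING_TIMEOUT_DEFAULT{2};
constexpr seconds BLOCK_STALLING_TIMEOUT_MAX{64};

std::atomic<seconds> g_stalling_timeout{BLOCK_STALLING_TIMEOUT_DEFAULT};

// Called when a peer is disconnected for stalling: give the next peer more time.
void OnStallingDisconnect()
{
    auto timeout = g_stalling_timeout.load();
    const auto new_timeout = std::min(2 * timeout, BLOCK_STALLING_TIMEOUT_MAX);
    if (timeout != new_timeout && g_stalling_timeout.compare_exchange_strong(timeout, new_timeout)) {
        std::printf("Increased stalling timeout to %lld seconds\n", static_cast<long long>(new_timeout.count()));
    }
}

// Called when a block is connected: relax the timeout back towards the default.
void OnBlockConnected()
{
    auto timeout = g_stalling_timeout.load();
    const auto new_timeout = std::max(std::chrono::duration_cast<seconds>(timeout * 0.85),
                                      BLOCK_STALLING_TIMEOUT_DEFAULT);
    if (timeout != new_timeout && g_stalling_timeout.compare_exchange_strong(timeout, new_timeout)) {
        std::printf("Decreased stalling timeout to %lld seconds\n", static_cast<long long>(new_timeout.count()));
    }
}

int main()
{
    OnStallingDisconnect(); // 2s -> 4s
    OnStallingDisconnect(); // 4s -> 8s
    OnBlockConnected();     // 8s -> 6s (8 * 0.85 = 6.8, truncated to whole seconds)
    OnBlockConnected();     // 6s -> 5s
}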