p2p: Make stalling timeout adaptive during IBD #25880
Conversation
Nice observation. (Brainstorm idea) How about something like doubling the timeout every time it causes a disconnection, and then reducing/resetting it when the window actually moves?
Rather than this, it might be better to track download speeds from each peer, and check the speeds of this peer after 2 seconds. For an immediate fix, though, maybe just making the timeout configurable would be a good idea? Perhaps as an interim between these two ideas, if we disconnect N stalling peers, start increasing the timeout.
We would also have to compare it to the speed of others and have some criterion what deviation would be enough to disconnect.
Thanks! These suggestions are similar, make a lot of sense to me, and don't look very invasive to implement; planning to change to this approach and update soon.
Force-pushed from ea9915b to 686936c (compare)
I now implemented the suggestion by @sipa to double the timeout and updated the OP. I tested this manually by catching up to the best chain with an ~1 month old datadir with […]
(PR title changed from "… BLOCK_STALLING_TIMEOUT timeout during IBD" to "… BLOCK_STALLING_TIMEOUT timeout adaptive during IBD", and later to the current "p2p: Make stalling timeout adaptive during IBD".)
Nice! This seems a fine improvement.
I think one way of looking at stalling is that it happens when one peer's bandwidth is less than 1/64th of the total bandwidth (64 = 1024/16 = window/max in transit) [0]. I think that means there might be a clever way of preventing slow nodes from stalling the download by reducing the in-transit limit instead -- so that instead of supplying 16 blocks in the time it takes the other peers to supply 1008 to avoid stalling, the peer only needs to supply 8 or 4 blocks in the time it takes the other peers to supply 1016 or 1020. [1]
I think adding the blocks-only nodes probably made this slightly worse, since there are now 2 extra peers, so now you need something like 25% more bandwidth in order to still have 1/64th of the total...
[0] Measured in blocks of course, so even if your bandwidth in bytes is fine, you might be unlucky to be asked for 16 blocks that are 2MB each, while everyone else is just being asked for 1008 50kB blocks (32MB total vs 7.2MB per peer).
[1] Perhaps you could implement this by keeping a global and per-peer exponential rolling average of how many blocks you download per second; then you could set the peer's in-transit limit to […]
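A rough sketch of the idea in footnote [1], under the assumption that we keep a per-peer exponential moving average of delivered blocks per second and take the global rate as the sum of those averages. All type and function names here are illustrative and are not existing Bitcoin Core APIs; only the constants 16 and 1024 come from the thread.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>

constexpr int MAX_BLOCKS_IN_TRANSIT_PER_PEER{16};
constexpr int BLOCK_DOWNLOAD_WINDOW{1024};
constexpr double EMA_ALPHA{0.1}; // weight given to the newest rate sample

struct DownloadRateTracker {
    std::map<int64_t, double> m_blocks_per_sec; // per-peer EMA, keyed by peer id

    // Feed a new rate sample whenever a block from this peer completes.
    void OnBlockDownloaded(int64_t peer, double measured_blocks_per_sec)
    {
        double& rate = m_blocks_per_sec[peer];
        rate = (1.0 - EMA_ALPHA) * rate + EMA_ALPHA * measured_blocks_per_sec;
    }

    double GlobalRate() const
    {
        double total{0.0};
        for (const auto& entry : m_blocks_per_sec) total += entry.second;
        return total;
    }

    // Give a peer roughly its share of the 1024-block window, clamped to [1, 16]:
    // a peer providing much less than 1/64 of the total rate gets fewer than 16
    // blocks in flight and therefore cannot hold the whole window hostage.
    int InTransitLimit(int64_t peer) const
    {
        const auto it = m_blocks_per_sec.find(peer);
        const double total = GlobalRate();
        if (it == m_blocks_per_sec.end() || total <= 0.0) return MAX_BLOCKS_IN_TRANSIT_PER_PEER;
        const int limit = static_cast<int>(it->second / total * BLOCK_DOWNLOAD_WINDOW);
        return std::clamp(limit, 1, MAX_BLOCKS_IN_TRANSIT_PER_PEER);
    }
};

int main()
{
    DownloadRateTracker tracker;
    tracker.OnBlockDownloaded(0, 0.2);  // a slow peer
    tracker.OnBlockDownloaded(1, 14.0); // the rest of the peers, lumped together for the demo
    std::printf("in-transit limit for the slow peer: %d\n", tracker.InTransitLimit(0));
}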
src/net_processing.cpp
Outdated
@@ -1723,6 +1729,9 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
}
}
if (m_chainman.ActiveChainstate().IsInitialBlockDownload()) {
I would think we should only reduce the timeout if the block was in fact downloaded in less time (than the would-be new timeout).
What I like about the current solution is that it is very simple, prevents the node from getting stuck, and doesn't require additional bookkeeping of historical block download times.
I think your suggestion wouldn't be completely straightforward to implement: the block being connected here might have been downloaded some time in the past, saved to disk, but only connected now (as a result of its predecessor being connected). So we'd need some data structure to keep track of download times for not-yet-connected blocks and add/remove entries from it during IBD.
If we did this, it would help us cycle through fewer peers in situations where we assign multiple blocks to a peer and halving the timeout after successfully downloading a block would lead to a stalling situation again - but note that there are also other sources of disconnections that could be improved if we kept track of this kind of data during IBD: e.g. if doubling the timeout is not sufficient and we'd need to 4x or 8x it.
So if we want something better but more complicated (with bookkeeping), my feeling is that we should go for another approach altogether, like basing the stalling timeout on a running average over the last received block times from multiple peers instead of a doubling/halving approach.
With this current code, it will stall, get a block, stall, get a block, stall, etc repeatedly...
Okay
With this current code, it will stall, get a block, stall, get a block, stall, etc repeatedly...
Yes, but only a few times until the blocks preventing the tip from moving are downloaded; then the tip advances by connecting the large number of stashed blocks from the 1024 window, ending the stalling situation. If every peer is equally slow, it doesn't matter if you download a block in 2s or 1 minute from the viewpoint of the stalling logic.
Maybe instead of halving the timeout, we should, on block connection, multiply it by a factor 0.5 < f < 1 to let it relax back more slowly?
I agree it seems better to decrease it slowly. If it was a single slow peer, then there would be many blocks coming on time afterwards.
With the latest push, I decrease it by a factor 0.85 with each connected block.
That sounds like a very interesting alternative approach. I'm not sure I understand it completely though: Are you suggesting that we assign slower peers fewer blocks simultaneously, to help prevent stalling situations from occurring in the first place? And also that we move away from the concept that a stalling situation occurs only when we can't move the 1024 block window forward, making it dependent on the other peers instead, so that we'd possibly disconnect slow peers much earlier than that if they are slower in comparison to faster ones?
Force-pushed from 686936c to 7c8c4e4 (compare)
ACK 7c8c4e4
Concept ACK
Some adaptivity seems to be warranted, because network throughput is rarely a constant.
src/net_processing.cpp
Outdated
m_block_stalling_timeout = std::min(2 * m_block_stalling_timeout.load(), MAX_BLOCK_STALLING_TIMEOUT);
LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", m_block_stalling_timeout.load().count());
This sequence is not atomic. If two threads execute the increase/decrease code concurrently, it could lead to unexpected results. Consider this alternative: it only does the inc/dec if no other thread changed the value in the meantime, and otherwise leaves it untouched:
diff --git i/src/net_processing.cpp w/src/net_processing.cpp
index 42686f0db0..fd2f22cdcd 100644
--- i/src/net_processing.cpp
+++ w/src/net_processing.cpp
@@ -1726,14 +1726,16 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
LOCK(cs_main);
for (const auto& ptx : pblock->vtx) {
m_txrequest.ForgetTxHash(ptx->GetHash());
m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
}
}
- if (m_chainman.ActiveChainstate().IsInitialBlockDownload()) {
- m_block_stalling_timeout = std::max(m_block_stalling_timeout.load() / 2, DEFAULT_BLOCK_STALLING_TIMEOUT);
+ auto stalling_timeout = m_block_stalling_timeout.load();
+ const auto new_timeout = std::max(stalling_timeout / 2, DEFAULT_BLOCK_STALLING_TIMEOUT);
+ if (m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout) && stalling_timeout != new_timeout) {
+ LogPrint(BCLog::NET, "Decreased stalling timeout to %d seconds\n", new_timeout.count());
}
}
void PeerManagerImpl::BlockDisconnected(const std::shared_ptr<const CBlock> &block, const CBlockIndex* pindex)
{
// To avoid relay problems with transactions that were previously
@@ -5231,22 +5233,25 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
}
}
if (!vInv.empty())
m_connman.PushMessage(pto, msgMaker.Make(NetMsgType::INV, vInv));
// Detect whether we're stalling
- if (state.m_stalling_since.count() && state.m_stalling_since < current_time - m_block_stalling_timeout.load()) {
+ auto stalling_timeout = m_block_stalling_timeout.load();
+ if (state.m_stalling_since.count() && state.m_stalling_since < current_time - stalling_timeout) {
// Stalling only triggers when the block download window cannot move. During normal steady state,
// the download window should be much larger than the to-be-downloaded set of blocks, so disconnection
// should only happen during initial block download.
LogPrintf("Peer=%d is stalling block download, disconnecting\n", pto->GetId());
pto->fDisconnect = true;
// Increase timeout for the next peer so that we don't disconnect multiple peers if our own
// bandwidth is insufficient.
- m_block_stalling_timeout = std::min(2 * m_block_stalling_timeout.load(), MAX_BLOCK_STALLING_TIMEOUT);
- LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", m_block_stalling_timeout.load().count());
+ const auto new_timeout = std::min(2 * stalling_timeout, MAX_BLOCK_STALLING_TIMEOUT);
+ if (m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout) && stalling_timeout != new_timeout) {
+ LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", new_timeout.count());
+ }
return true;
}
// In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
// (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
// We compensate for other peers to prevent killing off peers due to our own downstream link
// being saturated. We only count validated in-flight blocks so peers can't advertise non-existing block hashes
Thanks, I took your suggestion and added you as coauthor - I didn't know about compare_exchange_strong before.
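For readers who, like the author above, haven't used it before, here is a minimal standalone illustration of compare_exchange_strong semantics (assumes C++17; the atomic duration type is chosen to mirror how m_block_stalling_timeout is used in the diff above, not copied from the codebase):

#include <atomic>
#include <cassert>
#include <chrono>

int main()
{
    using namespace std::chrono_literals;
    std::atomic<std::chrono::seconds> timeout{2s};

    // Success: 'expected' matches the stored value, so it is replaced by
    // 'desired' and the call returns true.
    auto expected = timeout.load();
    assert(timeout.compare_exchange_strong(expected, 2 * expected));
    assert(timeout.load() == 4s);

    // Failure: someone else changed the value in the meantime (simulated by a
    // plain store). Nothing is written, false is returned, and 'expected' is
    // overwritten with the value actually found - so the caller can see what
    // happened instead of clobbering the other thread's update.
    timeout.store(8s);
    expected = 4s;
    assert(!timeout.compare_exchange_strong(expected, 16s));
    assert(expected == 8s && timeout.load() == 8s);
}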
Concept ACK. I will give a code review ACK once you resolve vasil's comments :)
Yeah, it shouldn't hold up this fix though, I think.
Yes.
Not really. I think you need to keep the 1024 block window (since increasing that hurts pruning), and I think that if the window gets full you should still call that "stalling". But I think if you change the […]
Maybe a more specific example would be worthwhile. As it stands, if your first peer will give you one block every 5 seconds, and your other 9 peers will collectively give you 14 blocks every second (on average, 7.8 times faster than the first peer, in total 70 times faster), then by the time that first peer has downloaded blocks 1..15 (which takes 75 seconds), the other peers will have given you blocks 17..1039 after 73.1 seconds, and stalling gets triggered. But if the slow peer had only queued up 8 blocks, then it would have supplied them in 40 seconds, which only gives the other peers enough time to supply 560 blocks, so they won't fill up the window.
Hmm, I guess it ought to be possible to simulate that scenario via the functional test's […]
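The arithmetic in that example can be sanity-checked in a few lines (all numbers are copied from the comment above; nothing here comes from the actual test or node code):

#include <cstdio>

int main()
{
    const double slow_rate = 1.0 / 5.0; // slow peer: one block every 5 seconds
    const double fast_rate = 14.0;      // all other peers combined, blocks per second

    const double t_slow_15 = 15 / slow_rate;     // 75 s to deliver blocks 1..15
    const double t_fast_1023 = 1023 / fast_rate; // ~73.1 s for the others to deliver blocks 17..1039
    std::printf("slow peer: %.1f s, others: %.1f s -> window fills first, stall triggers\n",
                t_slow_15, t_fast_1023);

    const double t_slow_8 = 8 / slow_rate;           // 40 s if only 8 blocks were queued to the slow peer
    const double fast_blocks = t_slow_8 * fast_rate; // 560 blocks from the others in that time
    std::printf("with 8 queued blocks: %.0f s, others deliver %.0f blocks -> window not full\n",
                t_slow_8, fast_blocks);
}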
Concept ACK. I think it would be nice if this PR also added some tests, because it looks like we didn't have any tests for the stalling mechanism in the first place.
Will address feedback soon (and work on adding a test for the stalling logic).
Force-pushed from 7c8c4e4 to a764c20 (compare)
Force-pushed from a764c20 to 48e5385 (compare)
The net processing changes look good to me, left some comments on the test.
test/functional/p2p_ibd_stalling.py
Outdated
peers[-1].block_store = block_dict
peers[-1].send_message(headers_message)
self.wait_until(lambda: self.total_blocks_sent(peers) == NUM_BLOCKS - 2)
time.sleep(0.5) # Wait until all blocks have arrived at the node
Not a fan of timeouts like this, but in this case I also don't see how to avoid it.
yes, I was looking for an RPC we could wait_until for to avoid this - but I didn't find a way of querying the number of blocks a node has downloaded (including those not connected to the chain yet).
getpeerinfo()[0]["bytesrecv_per_msg"]["block"] ?
To follow @glozow's idea, this should work:
diff --git i/test/functional/p2p_ibd_stalling.py w/test/functional/p2p_ibd_stalling.py
index 78626c003b..d593187d74 100755
--- i/test/functional/p2p_ibd_stalling.py
+++ w/test/functional/p2p_ibd_stalling.py
@@ -79,13 +79,14 @@ class P2PIBDStallingTest(BitcoinTestFramework):
with self.nodes[0].assert_debug_log([], unexpected_msgs=['Stall started']):
for id in range(8):
peers.append(node.add_outbound_p2p_connection(P2PStaller(stall_block), p2p_idx=id, connection_type="outbound-full-relay"))
peers[-1].block_store = block_dict
peers[-1].send_message(headers_message)
self.wait_until(lambda: self.total_blocks_sent(peers) == NUM_BLOCKS - 2)
- time.sleep(0.5) # Wait until all blocks have arrived at the node
+ self.wait_until(lambda: self.total_bytes_recv_for_blocks() == 172761)
+
self.log.info("Check that increasing the window beyond 1024 blocks triggers stalling logic")
headers_message.headers = [CBlockHeader(b) for b in blocks]
with self.nodes[0].assert_debug_log(expected_msgs=['Stall started peer=0']):
for p in peers:
p.send_message(headers_message)
@@ -149,9 +150,15 @@ class P2PIBDStallingTest(BitcoinTestFramework):
def total_blocks_sent(self, peers):
num_blocks = 0
for p in peers:
num_blocks += p.blocks_sent
return num_blocks
+ def total_bytes_recv_for_blocks(self):
+ total = 0
+ for info in self.nodes[0].getpeerinfo():
+ total += info["bytesrecv_per_msg"]["block"]
+ return total
+
if __name__ == '__main__':
P2PIBDStallingTest().main()
and even better if we can put some formula behind the magic number.
or, more pythonish:
self.wait_until(lambda: sum(e["bytesrecv_per_msg"]["block"] for e in self.nodes[0].getpeerinfo()) == 172761)
Thanks, that works!
We send 1023 blocks, so the formula would be 126 * 168 + 897 * 169 = 172761 (at some point the blocks get larger by 1 byte).
Not super stable because bytesrecv_per_msg includes the extra 24 bytes of magic etc. (not just the payload) - so I think the magic number would e.g. be different for BIP324 - but it's definitely better than a fixed timeout.
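A quick compile-time check of that formula, assuming (as stated above) 126 "block" messages of 168 bytes and 897 of 169 bytes on the wire, 24-byte message header included:

// Sanity check of the magic number used in the test; the byte counts are
// taken from the comment above and are specific to the v1 transport and to
// the block sizes this particular test generates.
static_assert(126 + 897 == 1023, "one message per block sent in this phase of the test");
static_assert(126 * 168 == 21168 && 897 * 169 == 151593, "per-size subtotals");
static_assert(21168 + 151593 == 172761, "expected total for the block entries of bytesrecv_per_msg");

int main() { return 0; }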
I ended up using the less pythonish version because we also need a check that the "block" field exists in "bytesrecv_per_msg", and I find that easier to read. Added a comment for the magic number.
Concept ACK
src/net_processing.cpp
Outdated
@@ -1723,6 +1730,12 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
}
}
auto stalling_timeout = m_block_stalling_timeout.load();
// In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
const auto new_timeout = std::max(std::chrono::duration_cast<std::chrono::seconds> (stalling_timeout * 0.85), DEFAULT_BLOCK_STALLING_TIMEOUT);
Is it intentional that this happens with an accuracy of 1 second?
So for example you could have the sequence 64, 54, 45, 38, 32, 27, 22, 18, 15, 12, 10, 8, 6, 5, 4, 3, 2... seconds.
The stalling_timeout != new_timeout condition can also be placed before the exchange, I think?
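To make the 1-second granularity concrete, here is a small standalone loop that applies the same 0.85 factor with duration_cast back to whole seconds (a sketch of the decay only, not the PR code itself):

#include <algorithm>
#include <chrono>
#include <cstdio>

int main()
{
    using namespace std::chrono_literals;
    constexpr auto DEFAULT_TIMEOUT = 2s;
    auto timeout = 64s; // as if the timeout had been doubled up to its maximum

    while (timeout > DEFAULT_TIMEOUT) {
        // duration_cast truncates toward zero, so the steps shrink to 1s near the bottom
        timeout = std::max(std::chrono::duration_cast<std::chrono::seconds>(timeout * 0.85),
                           DEFAULT_TIMEOUT);
        std::printf("%lld ", static_cast<long long>(timeout.count()));
    }
    // Prints: 54 45 38 32 27 22 18 15 12 10 8 6 5 4 3 2
}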
Probably should be IMO. Not sure how efficient std::atomic is with std::chrono units.
Is it intentional that this happens with an accuracy of 1 second?
It was kind of intentional: I first thought of changing m_block_stalling_timeout to microseconds, requiring conversions for the logging etc., but then I wondered whether a higher accuracy adds anything - I also thought of simply decreasing it constantly by one second per block received (instead of using a factor).
Do you or others have a preference here?
The stalling_timeout != new_timeout condition can also be placed before the exchange, I think?
Done - I changed the order within the if statements (in two places).
Reviewed for the PR club.
src/net_processing.cpp
Outdated
@@ -1723,6 +1730,12 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock | |||
m_txrequest.ForgetTxHash(ptx->GetWitnessHash()); | |||
} | |||
} | |||
auto stalling_timeout = m_block_stalling_timeout.load(); | |||
// In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value | |||
const auto new_timeout = std::max(std::chrono::duration_cast<std::chrono::seconds> (stalling_timeout * 0.85), DEFAULT_BLOCK_STALLING_TIMEOUT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should be IMO. Not sure how efficient std::atomic is with std::chrono units.
The test takes 1m45s on my laptop
That was when compiled with TSAN. Normal debug build takes only about 30 seconds. It still fails. There are some problems with the test:
- First test (should not stall): those "received: block..." messages in the log above (from the failure of the second test) are produced by the first test. The sleep(0.5) was apparently not enough, so I fixed it to sleep(5) (just for testing, not to actually have it in the final test). This means that in practice it could stall and remain undetected by the first test, because it will be happy to not see "Stall started" in the log even though it may be printed shortly after the first test has eagerly declared success. We want to check that there are 1023 "received: block" messages in the log, that afterwards the stalling logic from SendMessages() is executed, and that after that there is no "Stall started" in the log. I am not sure how to do that. Checking the bytes received for block messages seems to be better than the sleep, but could still end the wait too early.
- Second test (should stall): it fails because there is no "Stall started peer=0" message. I added sleep(10) at the end of the with... block to wait even more for the stall. Then it fails with this error:
AssertionError: [node 0] Expected messages "['Stall started peer=0']" does not partially match log:
- 2022-10-20T12:16:52.334202Z [net_processing.cpp:2806] [ProcessMessage] [net] received: headers (83028 bytes) peer=1
- 2022-10-20T12:16:55.712397Z [validation.cpp:3686] [ProcessNewBlockHeaders] Synchronizing blockheaders, height: 1025 (~0.17%)
- 2022-10-20T12:16:55.712923Z [net_processing.cpp:2806] [ProcessMessage] [net] received: headers (83028 bytes) peer=8
- 2022-10-20T12:16:59.078114Z [net_processing.cpp:5328] [SendMessages] [net] Stall started peer=1
- 2022-10-20T12:16:59.078214Z [net_processing.cpp:2806] [ProcessMessage] [net] received: headers (83028 bytes) peer=4
There is a "Stall started peer=1" message but it comes 7 seconds after "received: headers" and is for a different peer. Maybe we should instead wait for the message to appear with wait_for_debug_log() and omit the peer=0 part.
I am ok to drop the test. It is good to have tests to ensure the code works as intended. But we can't have tests for everything and there is a subjective threshold somewhere. If it is too difficult to implement properly or is more complicated than the actual code it tests, then it may be too expensive. There is maintenance cost for the test too. Developers could waste precious time investigating a sporadically failing test, fixing it, or trying to figure out whether their (seemingly unrelated) changes broke the test. I am not saying to drop the test, just that I would be ok with that.
Force-pushed from 4b0dbc0 to 9339230 (compare)
Addressed the test feedback (will get to the outstanding comment for the main commit a bit later).
I rewrote the test such that it doesn't use the log anymore, but waits until all blocks are received, syncs (so that a peer could get marked as a staller), waits for 3s, syncs again (so that a peer could get disconnected), and then checks that no peer gets disconnected.
I removed the […]
If everyone agrees, that would be ok with me. However, the stalling logic was completely untested before, which is not ideal, so the test doesn't just cover the changes from this PR.
This makes the stalling detection mechanism (previously a fixed timeout of 2s) adaptive: If we disconnect a peer for stalling, double the timeout for the next peer - and let it slowly relax back to its default value each time the tip advances. (Idea by Pieter Wuille) This makes situations more unlikely in which we'd keep on disconnecting many of our peers for stalling, even though our own bandwidth is insufficient to download a block in 2 seconds. Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Force-pushed from 9339230 to cc81679 (compare)
Force-pushed from cc81679 to aceff9e (compare)
ACK aceff9e
Force-pushed from aceff9e to 39b9364 (compare)
ACK 39b9364
ACK 39b9364
// bandwidth is insufficient.
const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
LogPrint(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", m_block_stalling_timeout.load().count());
nit: why not use new_timeout here in the log?
- It has the same meaning, unless there was some crazy concurrency, in which case the log might not make sense anyway....
- It is shorter :)
- It is probably more efficient
- Reading the code is easier
actually, you do what i suggest in the decreasing code :)
I somehow missed this, sorry, but added it to #26982.
ACK 39b9364
Strong Concept ACK 39b9364. Will do some tests ASAP.
@@ -1723,6 +1730,16 @@ void PeerManagerImpl::BlockConnected(const std::shared_ptr<const CBlock>& pblock
m_txrequest.ForgetTxHash(ptx->GetWitnessHash());
}
}

// In case the dynamic timeout was doubled once or more, reduce it slowly back to its default value
I'm not sure we should do this every time a block is connected. Whenever a staller got disconnected, and the missing block arrived from another peer, we may suddenly be able to connect dozens of blocks at once. Performing the timeout-reduction step here 16 times suffices to get it back from the maximum 64 to the minimum 2.
I think it'd be better to drop it just once every time the download window moves, regardless of how much it moved.
Actually, thinking more about this, I don't think that's ideal either. The window will likely move many times between would-be stalls, even when the stalling timeout has adapted to be close to the "correct" value.
We should aim to be in a position where the stalling timeout is sort of in an equilibrium between triggering occasionally but not all the time. I think the best way to achieve that is to:
- Increase it when it triggers due to being too low (reducing the probability of triggering in the future) [implemented]
- Decrease it when it didn't trigger due to being high enough. And I think we have a way of measuring that: when the stalling detection triggers, and the stalling timer starts, but then the timeout is not reached. And by seeing how long it actually took before the stalling state is resolved, we can even do better than just applying a % drop; e.g. we could set the new timeout to (old_timeout + actual_time_used) / 2 (a rough sketch of this follows below).
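A rough sketch of that follow-up policy, which is explicitly not what this PR implements; the struct and function names are made up for illustration:

#include <algorithm>
#include <chrono>

using std::chrono::seconds;

constexpr seconds TIMEOUT_DEFAULT{2};
constexpr seconds TIMEOUT_MAX{64};

struct StallTimeoutPolicy {
    seconds timeout{TIMEOUT_DEFAULT};

    // Called when the stalling timeout fires and a peer gets disconnected.
    void OnStallTimeoutHit()
    {
        timeout = std::min(2 * timeout, TIMEOUT_MAX);
    }

    // Called when a stall was detected but resolved (the critical block arrived)
    // after 'actual_time_used', i.e. before the timeout fired: move the timeout
    // halfway towards the time the resolution actually took.
    void OnStallResolved(seconds actual_time_used)
    {
        timeout = std::max((timeout + actual_time_used) / 2, TIMEOUT_DEFAULT);
    }
};

int main()
{
    StallTimeoutPolicy p;
    p.OnStallTimeoutHit();          // 2s -> 4s
    p.OnStallResolved(seconds{3});  // (4s + 3s) / 2 = 3s
    return p.timeout.count() == 3 ? 0 : 1;
}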
@sipa Do you think that your proposed change should be implemented in this PR or can it be done in a followup? From my perspective, this PR seems to be strictly an improvement even if the stalling timeout backs off too aggressively.
I haven't had time to look into this feedback closely yet - but I am planning to do that next week.
I think this could be a good follow-up, but anyway, here are my thoughts.
Decrease it when it didn't trigger due to being high enough.
I think this is a very good abstract policy.
when the stalling detection triggers, and the stalling timer starts, but then the timeout is not reached.
This sounds more efficient at doing what it's supposed to do than what's implemented in this PR currently.
; e.g. we could set the new timeout to (old_timeout + actual_time_used) / 2.
Sounds like a good concrete policy, but not going lower than 2 seconds probably. One could do some math modeling, but I don't think it's that helpful:
- with random data, there is no ground truth — one would have to rely on the human sanity check of the inputs, which we already do verbally here;
- could be tested against a couple laggy nodes too, comparing between different policies, but eh.
From my perspective, this PR seems to be strictly an improvement even if the stalling timeout backs off too aggressively.
I'd say that with its current approach, the PR improves behavior in isolated stalling situations where there are currently repeated disconnections of many/all of our peers without making any progress in getting blocks - but a previous stalling situation will not affect the behavior in future stalling situations because all memory of the previous stalling incident will be lost after a few blocks:
@sipa's suggestion would introduce a long-lasting memory of previous stalling situations
- I think that one downside of this approach is that the moving window algorithm should only lead to a stalling situation if one peer is significantly slower than the rest of the peers. We'd want to replace this first peer usually - giving it more time based on previous stalling situations is probably not beneficial, because if it was comparably fast to other peers, this would not have led to this peer being marked as a staller in the first place.
- The upside is in situations where the actual time to download a block for us is significantly larger than 2 seconds - we'd churn through multiple peers / timeout doublings again in every stalling situation right now until we reach the "correct" timeout, but wouldn't anymore with a longer memory.
- Decrease it when it didn't trigger due to being high enough. And I think we have a way of measuring that: when the stalling detection triggers, and the stalling timer starts, but then the timeout is not reached. And by seeing how long it actually took for before the stalling state is resolved we can even do better than just applying a % drop; e.g. we could set the new timeout to (old_timeout + actual_time_used) / 2.
I think this would mean moving the decrease of the stalling timeout to ProcessMessages (NetMsgType::BLOCK), where m_stalling_since is currently reset back to 0. At this point we haven't validated the block yet or connected it to the chain, so we'd likely need to at least make sure that we only decrease it after receiving the actual block that allows us to extend our chain (the peer might also have been sending us another block).
@mzumsande I see. So the point isn't so much that you're trying to build something that tries to measure and converge towards an optimal long-term stalling timeout for your network conditions, but rather want something that deliberately gives a temporary "cool down" period after a stalling kick so it doesn't result in a flurry of disconnects.
So I think something like my suggestion still makes sense, but it's perhaps an orthogonal thing, and not for this PR.
39b9364 test: add functional test for IBD stalling logic (Martin Zumsande) 0565951 p2p: Make block stalling timeout adaptive (Martin Zumsande) Pull request description: During IBD, there is the following stalling mechanism if we can't proceed with assigning blocks from a 1024 lookahead window because all of these blocks are either already downloaded or in-flight: We'll mark the peer from which we expect the current block that would allow us to advance our tip (and thereby move the 1024 window ahead) as a possible staller. We then give this peer 2 more seconds to deliver a block (`BLOCK_STALLING_TIMEOUT`) and if it doesn't, disconnect it and assign the critical block we need to another peer. Now the problem is that this second peer is immediately marked as a potential staller using the same mechanism and given 2 seconds as well - if our own connection is so slow that it simply takes us more than 2 seconds to download this block, that peer will also be disconnected (and so on...), leading to repeated disconnections and no progress in IBD. This has been described in bitcoin#9213, and I have observed this when doing IBD on slower connections or with Tor - sometimes there would be several minutes without progress, where all we did was disconnect peers and find new ones. The `2s` stalling timeout was introduced in bitcoin#4468, when blocks weren't full and before Segwit increased the maximum possible physical size of blocks - so I think it made a lot of sense back then. But it would be good to revisit this timeout now. This PR makes the timout adaptive (idea by sipa): If we disconnect a peer for stalling, we now double the timeout for the next peer (up to a maximum of 64s). If we connect a block, we half it again up to the old value of 2 seconds. That way, peers that are comparatively slower will still get disconnected, but long phases of disconnecting all peers shouldn't happen anymore. Fixes bitcoin#9213 ACKs for top commit: achow101: ACK 39b9364 RandyMcMillan: Strong Concept ACK 39b9364 vasild: ACK 39b9364 naumenkogs: ACK 39b9364 Tree-SHA512: 85bc57093b2fb1d28d7409ed8df5a91543909405907bc129de7c6285d0810dd79bc05219e4d5aefcb55c85512b0ad5bed43a4114a17e46c35b9a3f9a983d5754
def all_sync_send_with_ping(self, peers):
    for p in peers:
        if p.is_connected:
            p.sync_send_with_ping()
This fails?
https://cirrus-ci.com/task/4620776167440384?logs=ci#L2630
test 2023-01-27T16:28:47.291000Z TestFramework.p2p (DEBUG): Received message from 0:0: msg_pong(nonce=0000000f)
node0 2023-01-27T16:28:47.340846Z (mocktime: 2023-01-27T16:29:01Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554
node0 2023-01-27T16:28:47.347460Z (mocktime: 2023-01-27T16:29:01Z) [httpworker.0] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
node0 2023-01-27T16:28:47.347940Z (mocktime: 2023-01-27T16:29:01Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554
node0 2023-01-27T16:28:47.348962Z (mocktime: 2023-01-27T16:29:01Z) [httpworker.1] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=setmocktime user=__cookie__
node0 2023-01-27T16:28:47.349250Z (mocktime: 2023-01-27T16:29:03Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554
node0 2023-01-27T16:28:47.349271Z (mocktime: 2023-01-27T16:29:03Z) [httpworker.2] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
node0 2023-01-27T16:28:47.393691Z (mocktime: 2023-01-27T16:29:03Z) [msghand] [net_processing.cpp:5738] [SendMessages] Peer=1 is stalling block download, disconnecting
node0 2023-01-27T16:28:47.398607Z (mocktime: 2023-01-27T16:29:03Z) [msghand] [net_processing.cpp:5744] [SendMessages] [net] Increased stalling timeout temporarily to 16 seconds
node0 2023-01-27T16:28:47.406186Z (mocktime: 2023-01-27T16:29:03Z) [net] [net.cpp:573] [CloseSocketDisconnect] [net] disconnecting peer=1
test 2023-01-27T16:28:47.414000Z TestFramework.p2p (DEBUG): Send message to 0:0: msg_ping(nonce=00000010)
test 2023-01-27T16:28:47.414000Z TestFramework.p2p (DEBUG): Closed connection to: 0:0
node0 2023-01-27T16:28:47.414007Z (mocktime: 2023-01-27T16:29:03Z) [http] [httpserver.cpp:239] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:51554
node0 2023-01-27T16:28:47.414037Z (mocktime: 2023-01-27T16:29:03Z) [httpworker.3] [rpc/request.cpp:179] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
node0 2023-01-27T16:28:47.414073Z (mocktime: 2023-01-27T16:29:03Z) [net] [net_processing.cpp:1541] [FinalizeNode] [net] Cleared nodestate for peer=1
test 2023-01-27T16:28:47.474000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/test_framework.py", line 134, in main
self.run_test()
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/p2p_ibd_stalling.py", line 133, in run_test
self.all_sync_send_with_ping(peers)
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/p2p_ibd_stalling.py", line 154, in all_sync_send_with_ping
p.sync_send_with_ping()
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 560, in sync_send_with_ping
self.sync_with_ping()
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 570, in sync_with_ping
self.wait_until(test_function, timeout=timeout)
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 463, in wait_until
wait_until_helper(test_function, timeout=timeout, lock=p2p_lock, timeout_factor=self.timeout_factor)
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/util.py", line 267, in wait_until_helper
if predicate():
File "/private/var/folders/v7/fs2b0v3s0lz1n57gj9y4xb5m0000gn/T/cirrus-ci-build/ci/scratch/build/bitcoin-arm64-apple-darwin/test/functional/test_framework/p2p.py", line 460, in test_function
assert self.is_connected
AssertionError
test 2023-01-27T16:28:47.489000Z TestFramework (DEBUG): Closing down network thread
looks like the p2p instance has been disconnected by bitcoind, but python hasn't received the callback yet, so it attempts to send a ping in between these events. I think counting the nodes with is_connected instead of using num_test_p2p_connections() will fix this. I'll open a PR.
fixed in #26982
b2a1e47 net_processing: simplify logging statement (Martin Zumsande) 6548ba6 test: fix intermittent errors in p2p_ibd_stalling.py (Martin Zumsande) Pull request description: Two small fixups to #25880: - Use `is_connected` instead of `num_test_p2p_connections` to avoid intermittent failures where the p2p MiniNode got disconnected but this info hasn't made it to python yet, so it fails a ping. (bitcoin/bitcoin#25880 (comment)) - Simplify a logging statement (suggested in bitcoin/bitcoin#25880 (comment)) ACKs for top commit: MarcoFalke: review ACK b2a1e47 🕧 Tree-SHA512: 337f0883bf1c94cc26301a80dfa649093ed1e211ddda1acad8449a2add5be44e5c12d6073c209df9c7aa1edb9da33ec1cfdcb0deafd76178ed78785843e80bc7
Summary: This makes the stalling detection mechanism (previously a fixed timeout of 2s) adaptive: If we disconnect a peer for stalling, double the timeout for the next peer - and let it slowly relax back to its default value each time the tip advances. (Idea by Pieter Wuille) This makes situations more unlikely in which we'd keep on disconnecting many of our peers for stalling, even though our own bandwidth is insufficient to download a block in 2 seconds. Co-authored-by: Vasil Dimov <vd@FreeBSD.org> This is a backport of [ [[bitcoin/bitcoin#25880 | core#25880]] and [[bitcoin/bitcoin#26982 | core#26982]] Note that we need to specify `version=4` when calling `create_block` in the test because we miss [[bitcoin/bitcoin#16333 | core#16333]] (which make 4 the default version for create_block) Test Plan: `ninja all check-all` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D15080
During IBD, there is the following stalling mechanism if we can't proceed with assigning blocks from a 1024 lookahead window because all of these blocks are either already downloaded or in-flight: We'll mark the peer from which we expect the current block that would allow us to advance our tip (and thereby move the 1024 window ahead) as a possible staller. We then give this peer 2 more seconds to deliver a block (BLOCK_STALLING_TIMEOUT) and if it doesn't, disconnect it and assign the critical block we need to another peer.
Now the problem is that this second peer is immediately marked as a potential staller using the same mechanism and given 2 seconds as well - if our own connection is so slow that it simply takes us more than 2 seconds to download this block, that peer will also be disconnected (and so on...), leading to repeated disconnections and no progress in IBD. This has been described in #9213, and I have observed this when doing IBD on slower connections or with Tor - sometimes there would be several minutes without progress, where all we did was disconnect peers and find new ones.
The 2s stalling timeout was introduced in #4468, when blocks weren't full and before Segwit increased the maximum possible physical size of blocks - so I think it made a lot of sense back then. But it would be good to revisit this timeout now.
This PR makes the timeout adaptive (idea by sipa):
If we disconnect a peer for stalling, we now double the timeout for the next peer (up to a maximum of 64s). If we connect a block, we decrease it again (by a factor of 0.85 per connected block) until it is back at the old value of 2 seconds. That way, peers that are comparatively slower will still get disconnected, but long phases of disconnecting all peers shouldn't happen anymore.
Fixes #9213
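Putting the description together, a condensed, self-contained sketch of the adaptive timeout (not the verbatim net_processing code; the constants, the compare-and-exchange pattern, and the 0.85 relax factor are taken from the discussion above):

#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdio>

using std::chrono::seconds;

constexpr seconds BLOCK_STALLING_TIMEOUT_DEFAULT{2};
constexpr seconds BLOCK_STALLING_TIMEOUT_MAX{64};

std::atomic<seconds> g_stalling_timeout{BLOCK_STALLING_TIMEOUT_DEFAULT};

// Called when a peer is disconnected for stalling: give the next peer more time.
void OnStallingDisconnect()
{
    auto timeout = g_stalling_timeout.load();
    const auto new_timeout = std::min(2 * timeout, BLOCK_STALLING_TIMEOUT_MAX);
    if (timeout != new_timeout && g_stalling_timeout.compare_exchange_strong(timeout, new_timeout)) {
        std::printf("Increased stalling timeout to %lld seconds\n", static_cast<long long>(new_timeout.count()));
    }
}

// Called when a block is connected: relax the timeout back towards the default.
void OnBlockConnected()
{
    auto timeout = g_stalling_timeout.load();
    const auto new_timeout = std::max(std::chrono::duration_cast<seconds>(timeout * 0.85),
                                      BLOCK_STALLING_TIMEOUT_DEFAULT);
    if (timeout != new_timeout && g_stalling_timeout.compare_exchange_strong(timeout, new_timeout)) {
        std::printf("Decreased stalling timeout to %lld seconds\n", static_cast<long long>(new_timeout.count()));
    }
}

int main()
{
    OnStallingDisconnect(); // 2s -> 4s
    OnStallingDisconnect(); // 4s -> 8s
    OnBlockConnected();     // 8s -> 6s (8 * 0.85 = 6.8, truncated to whole seconds)
    OnBlockConnected();     // 6s -> 5s
}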