-
Notifications
You must be signed in to change notification settings - Fork 37.7k
p2p: When close to the tip, download blocks in parallel from additional peers to prevent stalling #29664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/29664. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
41d3336
to
703a6da
Compare
Is it possible for other peers to make me believe I have a slow internet connection? |
Sorry, I missed this question. For single peers yes, but the idea is that the data would be taken from multiple outbound peers - it is hard for an attacker to control multiple of these, see eclipse attacks. Will soon come back to this PR and take it out of draft. |
703a6da
to
824c125
Compare
Cleaned up a bit, rebased and marked as ready for review! |
concept ACK |
Concept ACK |
824c125
to
1bcfcf2
Compare
The CI failed, but the logs are missing :( https://github.com/bitcoin/bitcoin/pull/29664/checks?check_run_id=30286710796
I'll re-run 10 times. |
Failed log from CI: https://api.cirrus-ci.com/v1/task/5540180679983104/logs/ci.log I'm still not sure why it failed, tried to bump a mocktime with the last push, but poking in the dark since it doesn't fail locally for me... |
1bcfcf2
to
26e3686
Compare
IIUC(? could be off-base), this could also fire at tip if there are no hb peers and the first peer requested stalls out for 30s? Once parallel compact blocks is warmed up, the slots will likely be taken by those efforts. If so, I wonder if this could allow us to f.e., not require a parallel compact slot be taken by an outbound, if we know we will try an outbound connection 30s later to defeat stalling behavior? Having tests for the interplay would be interesting too. |
b7aefca
to
10ba2bd
Compare
10ba2bd
to
ab45a06
Compare
Hopefullly fixed now, I believe that there was a missing sync in the added test.
I'm not sure what exactly you mean with throughput here - do you keeping track of historical block download times, and use this as input for an algorithm for detecting stalling? If so, I'm thinking about that option (and how to exactly design an algorithm using it), and I think it could be applied to all kinds of stalling, also when deep in IBD - but it also has some downsides (more complexity, more data to keep track of, have to be careful it's not easily gameable). So I'm thinking about it but not completely sure if that is the best way to go. In any case the current approach is more in line with the current IBD stalling detection (fixed timeouts). |
ACK ab45a06 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a bit of time to work on test coverage, with a rough branch here: https://github.com/instagibbs/bitcoin/commits/mzumsande_202403_near_tip_stalling/
last commit needs some clean up, but demonstrates some of the interplay of parallel blocks both compact and non-compact
395d13d
to
973cd01
Compare
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
Thanks! Your first commit looks great, will definitely take that (but haven't yet with the latest push). One more thing: I added the condition |
I think your location is also fine, we're not testing that it works, just that it's being requested pretty much. |
Just for brainstorming, also made a modified branch that reduces parallel compact block downloads to a count of 2, which would allow one or more backoff blocks at tip at a less aggressive pace: https://github.com/instagibbs/bitcoin/commits/mzumsande_202403_near_tip_stalling_2/ |
This is in preparation to add more subtests. Also adjust waiting condition to replace the total_bytes_recv_for_blocks magical number hack.
If we are 1024 or less blocks away from the tip and haven't requested or received a block from any peer for 30 seconds, add another peer to download the critical block from. Add up to two additional peers this way. Also adds test for the new behavior (co-authored by Greg Sanders) Co-authored-by: Greg Sanders <gsanders87@gmail.com>
When at the tip, the near-tip stalling still applies, but it will interact with compact block requests. Add tests for this.
973cd01
to
5e1ff82
Compare
That sounds reasonable. Is it a problem that on slow connections or new connections with unfilled mempools the parallel compact blocks download (which is triggered immediately instead of having a timeout) would hit more often than necessary? |
I think that's already the case, a saving grace is that since we only trigger parallel cb with high-bandwidth cb peers it won't happen until you've seen a few blocks at tip at least. |
I mean the current download speed - averaged out over the last 30 seconds should be sufficient to make a reasonably quick decision. |
🐙 This pull request conflicts with the target branch and needs rebase. |
Concept ACK Will review this once ready! |
Problem:
For stalling at the tip, we have a parallel download mechanism for compact blocks that was added in #27626.
For stalling during IBD, we have a lookahead window of 1024 blocks, and if that is exceeded, we disconnect the stalling peer.
However, if we are close to but not at the tip (<=1024 blocks), neither of these mechanisms apply. We can't do compact blocks yet, and the stalling mechanism doesn't work because the 1024 window cannot be exceeded.
As a result, we have to resort to
BLOCK_DOWNLOAD_TIMEOUT_BASE
which only disconnects a peer after 10 minutes (plus 5 minutes more for each additional peers we currently have blocks in flight). This is too long in my opinion, especially since peers get assigned up to 16 blocks (MAX_BLOCKS_IN_TRANSIT_PER_PEER
) and could repeat this process to stall us even longer if they send us a block after 10 minutes.This issue was observed in #29281 and #12291 (comment) with broken peers that didn't send us blocks.
Proposed solution:
If we are 1024 or less blocks away from the tip and haven't requested or received a block from any peer for 30 seconds, add another peer to download the critical block from that would help us advance our tip. Add up to two additional peers this way.
Other thoughts
Fixes #29281