p2p: When close to the tip, download blocks in parallel from additional peers to prevent stalling #29664

mzumsande · 2024-03-15T22:27:47Z

Problem:
For stalling at the tip, we have a parallel download mechanism for compact blocks that was added in #27626.
For stalling during IBD, we have a lookahead window of 1024 blocks, and if that is exceeded, we disconnect the stalling peer.
However, if we are close to but not at the tip (<=1024 blocks), neither of these mechanisms apply. We can't do compact blocks yet, and the stalling mechanism doesn't work because the 1024 window cannot be exceeded.

As a result, we have to resort to BLOCK_DOWNLOAD_TIMEOUT_BASE which only disconnects a peer after 10 minutes (plus 5 minutes more for each additional peers we currently have blocks in flight). This is too long in my opinion, especially since peers get assigned up to 16 blocks (MAX_BLOCKS_IN_TRANSIT_PER_PEER) and could repeat this process to stall us even longer if they send us a block after 10 minutes.
This issue was observed in #29281 and #12291 (comment) with broken peers that didn't send us blocks.

Proposed solution:
If we are 1024 or less blocks away from the tip and haven't requested or received a block from any peer for 30 seconds, add another peer to download the critical block from that would help us advance our tip. Add up to two additional peers this way.

Other thoughts

I also considered the alternative of extending the existing stalling mechanism that disconnects instead of introducing parallel downloads. This could be potentially less wasteful, but we might be over-eager to disconnect peers when really close to the tip, plus this might lead to cycling through lots of peers in extreme situations where we have a very slow internet connection.
The chosen timeout of 30 seconds could lead to inefficiencies / bandwidth waste when we have a really slow internet connection. Maybe it could make sense to track the last successful download times from existing peers and use a dynamic timeout according to that statistics, instead of setting it to a fixed value.
I will leave this PR in draft until I have tested it a bit more in the wild.

Fixes #29281

DrahtBot · 2024-03-15T22:27:50Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/29664.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept ACK	glozow, yuvicc
Stale ACK	naumenkogs

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#31859 (test: Rename send_message to send_without_ping by maflcko)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

nanlour · 2024-03-17T02:04:58Z

Maybe it could make sense to track the last successful download times from existing peers and use a dynamic timeout according to that statistics, instead of setting it to a fixed value.

Is it possible for other peers to make me believe I have a slow internet connection?

mzumsande · 2024-04-29T20:06:02Z

Is it possible for other peers to make me believe I have a slow internet connection?

Sorry, I missed this question. For single peers yes, but the idea is that the data would be taken from multiple outbound peers - it is hard for an attacker to control multiple of these, see eclipse attacks.

Will soon come back to this PR and take it out of draft.

mzumsande · 2024-05-31T19:36:53Z

Cleaned up a bit, rebased and marked as ready for review!

glozow · 2024-07-04T11:08:40Z

concept ACK

naumenkogs · 2024-09-10T12:59:02Z

Concept ACK

mzumsande · 2024-09-17T22:48:09Z

824c125 to 1bcfcf2: Rebased

maflcko · 2024-09-18T14:35:51Z

The CI failed, but the logs are missing :(

https://github.com/bitcoin/bitcoin/pull/29664/checks?check_run_id=30286710796

�[0m�[0;31mp2p_ibd_stalling.py --v2transport                        | ✖ Failed  | 2465 s

I'll re-run 10 times.

mzumsande · 2024-09-18T15:12:01Z

The CI failed, but the logs are missing :( I'll re-run 10 times.

Failed log from CI: https://api.cirrus-ci.com/v1/task/5540180679983104/logs/ci.log

I'm still not sure why it failed, tried to bump a mocktime with the last push, but poking in the dark since it doesn't fail locally for me...

instagibbs · 2024-09-18T18:24:14Z

IIUC(? could be off-base), this could also fire at tip if there are no hb peers and the first peer requested stalls out for 30s? Once parallel compact blocks is warmed up, the slots will likely be taken by those efforts.

If so, I wonder if this could allow us to f.e., not require a parallel compact slot be taken by an outbound, if we know we will try an outbound connection 30s later to defeat stalling behavior?

Having tests for the interplay would be interesting too.

mzumsande · 2024-10-08T22:50:33Z

I'm still not sure why it failed, tried to bump a mocktime with the last push, but poking in the dark since it doesn't fail locally for me...

Hopefullly fixed now, I believe that there was a missing sync in the added test.

Rather than the crude 10 minute timeout - why not actually check if the block is being downloaded from the chosen peer - i.e. work out the throughput and make a decision based on this (compared to typical throughput from other peers)?

I'm not sure what exactly you mean with throughput here - do you keeping track of historical block download times, and use this as input for an algorithm for detecting stalling? If so, I'm thinking about that option (and how to exactly design an algorithm using it), and I think it could be applied to all kinds of stalling, also when deep in IBD - but it also has some downsides (more complexity, more data to keep track of, have to be careful it's not easily gameable). So I'm thinking about it but not completely sure if that is the best way to go.

In any case the current approach is more in line with the current IBD stalling detection (fixed timeouts).

naumenkogs · 2024-10-10T10:56:43Z

ACK ab45a06

instagibbs

I took a bit of time to work on test coverage, with a rough branch here: https://github.com/instagibbs/bitcoin/commits/mzumsande_202403_near_tip_stalling/

last commit needs some clean up, but demonstrates some of the interplay of parallel blocks both compact and non-compact

test/functional/p2p_ibd_stalling.py

DrahtBot · 2024-10-10T19:32:56Z

🚧 At least one of the CI tasks failed.
_{Debug: https://github.com/bitcoin/bitcoin/runs/31374303124}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

mzumsande · 2024-10-10T19:36:24Z

I took a bit of time to work on test coverage, with a rough branch here: https://github.com/instagibbs/bitcoin/commits/mzumsande_202403_near_tip_stalling/

last commit needs some clean up, but demonstrates some of the interplay of parallel blocks both compact and non-compact

Thanks!
I also worked on this yesterday/today and wrote a similar test, but added that as an at_tip_stalling subtest to p2p_ibd_stalling.py - while your test covering the logic in a similar way is in p2p_compactblocks.py. In which location do you think that it fits better (or even both?)?

Your first commit looks great, will definitely take that (but haven't yet with the latest push).

One more thing: I added the condition m_last_block_requested.load() > 0s so that the near-tip stalling logic can't hit right after startup when we never requested a block from anyone.

instagibbs · 2024-10-10T20:00:03Z

while your test covering the logic in a similar way is in p2p_compactblocks.py. In which location do you think that it fits better (or even both?)?

I think your location is also fine, we're not testing that it works, just that it's being requested pretty much.

instagibbs · 2024-10-11T15:51:40Z

Just for brainstorming, also made a modified branch that reduces parallel compact block downloads to a count of 2, which would allow one or more backoff blocks at tip at a less aggressive pace: https://github.com/instagibbs/bitcoin/commits/mzumsande_202403_near_tip_stalling_2/

This is in preparation to add more subtests. Also adjust waiting condition to replace the total_bytes_recv_for_blocks magical number hack.

If we are 1024 or less blocks away from the tip and haven't requested or received a block from any peer for 30 seconds, add another peer to download the critical block from. Add up to two additional peers this way. Also adds test for the new behavior (co-authored by Greg Sanders) Co-authored-by: Greg Sanders <gsanders87@gmail.com>

When at the tip, the near-tip stalling still applies, but it will interact with compact block requests. Add tests for this.

mzumsande · 2024-10-11T18:27:20Z

973cd01 to 5e1ff82:

added more detailed coverage by @instagibbs (thanks!)
provided missing block and check that sync finishes in the end in near_tip_stalling subtest
remove unused total_bytes_recv_for_blocks in first commit

Just for brainstorming, also made a modified branch that reduces parallel compact block downloads to a count of 2, which would allow one or more backoff blocks at tip at a less aggressive pace:

That sounds reasonable. Is it a problem that on slow connections or new connections with unfilled mempools the parallel compact blocks download (which is triggered immediately instead of having a timeout) would hit more often than necessary?

instagibbs · 2024-10-11T18:46:38Z

Is it a problem that on slow connections or new connections with unfilled mempools the parallel compact blocks download (which is triggered immediately instead of having a timeout) would hit more often than necessary?

I think that's already the case, a saving grace is that since we only trigger parallel cb with high-bandwidth cb peers it won't happen until you've seen a few blocks at tip at least.

rebroad · 2024-10-13T00:45:16Z

I'm still not sure why it failed, tried to bump a mocktime with the last push, but poking in the dark since it doesn't fail locally for me...

Hopefullly fixed now, I believe that there was a missing sync in the added test.

Rather than the crude 10 minute timeout - why not actually check if the block is being downloaded from the chosen peer - i.e. work out the throughput and make a decision based on this (compared to typical throughput from other peers)?

I'm not sure what exactly you mean with throughput here - do you keeping track of historical block download times, and use this as input for an algorithm for detecting stalling? If so, I'm thinking about that option (and how to exactly design an algorithm using it), and I think it could be applied to all kinds of stalling, also when deep in IBD - but it also has some downsides (more complexity, more data to keep track of, have to be careful it's not easily gameable). So I'm thinking about it but not completely sure if that is the best way to go.

In any case the current approach is more in line with the current IBD stalling detection (fixed timeouts).

I mean the current download speed - averaged out over the last 30 seconds should be sufficient to make a reasonably quick decision.

DrahtBot · 2025-03-16T09:58:12Z

🐙 This pull request conflicts with the target branch and needs rebase.

yuvicc · 2025-06-10T13:47:51Z

Concept ACK

Will review this once ready!

DrahtBot added the P2P label Mar 15, 2024

mzumsande mentioned this pull request Mar 15, 2024

Post startup stalling #29281

Open

DrahtBot added the CI failed label Mar 16, 2024

mzumsande force-pushed the 202403_near_tip_stalling branch from 41d3336 to 703a6da Compare March 16, 2024 02:26

mzumsande changed the title ~~p2p: When close to the tip, download blocks in pararallel from additional peers to prevent stalling~~ p2p: When close to the tip, download blocks in parallel from additional peers to prevent stalling Mar 16, 2024

DrahtBot removed the CI failed label Mar 16, 2024

mzumsande force-pushed the 202403_near_tip_stalling branch from 703a6da to 824c125 Compare May 31, 2024 19:36

mzumsande marked this pull request as ready for review May 31, 2024 19:36

mzumsande mentioned this pull request Jun 28, 2024

Slow sync #30360

Closed

DrahtBot mentioned this pull request Aug 15, 2024

fuzz: Test headers pre-sync through p2p #30661

Merged

DrahtBot added CI failed Needs rebase and removed CI failed labels Sep 11, 2024

mzumsande force-pushed the 202403_near_tip_stalling branch from 824c125 to 1bcfcf2 Compare September 17, 2024 22:47

DrahtBot added CI failed and removed Needs rebase labels Sep 17, 2024

DrahtBot removed the CI failed label Sep 18, 2024

DrahtBot added the CI failed label Sep 18, 2024

mzumsande force-pushed the 202403_near_tip_stalling branch from 1bcfcf2 to 26e3686 Compare September 18, 2024 18:23

DrahtBot removed the CI failed label Sep 19, 2024

mzumsande force-pushed the 202403_near_tip_stalling branch from b7aefca to 10ba2bd Compare October 8, 2024 19:40

DrahtBot added the CI failed label Oct 8, 2024

mzumsande force-pushed the 202403_near_tip_stalling branch from 10ba2bd to ab45a06 Compare October 8, 2024 22:38

DrahtBot removed the CI failed label Oct 8, 2024

DrahtBot requested a review from glozow October 10, 2024 11:12

instagibbs reviewed Oct 10, 2024

View reviewed changes

test/functional/p2p_ibd_stalling.py Outdated Show resolved Hide resolved

mzumsande force-pushed the 202403_near_tip_stalling branch 2 times, most recently from 395d13d to 973cd01 Compare October 10, 2024 19:32

DrahtBot added the CI failed label Oct 10, 2024

DrahtBot removed the CI failed label Oct 11, 2024

mzumsande and others added 4 commits October 11, 2024 14:01

test: make p2p_ibd_stalling.py more modular

75668d0

This is in preparation to add more subtests. Also adjust waiting condition to replace the total_bytes_recv_for_blocks magical number hack.

test: add sub-test for near-tip stalling

c3d9881

test: Add functional test for stalling at tip

5e1ff82

When at the tip, the near-tip stalling still applies, but it will interact with compact block requests. Add tests for this.

mzumsande force-pushed the 202403_near_tip_stalling branch from 973cd01 to 5e1ff82 Compare October 11, 2024 18:20

Prabhat1308 mentioned this pull request Oct 20, 2024

Review club: Prevention of Stalling when close to the tip Bitshala/BitcoinCore-PR-Review-Club#87

Closed

mzumsande marked this pull request as draft December 6, 2024 19:17

DrahtBot mentioned this pull request Feb 13, 2025

test: Rename send_message to send_without_ping #31859

Merged

DrahtBot added the Needs rebase label Mar 16, 2025

p2p: When close to the tip, download blocks in parallel from additional peers to prevent stalling #29664

Are you sure you want to change the base?

p2p: When close to the tip, download blocks in parallel from additional peers to prevent stalling #29664

Conversation

mzumsande commented Mar 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DrahtBot commented Mar 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage & Benchmarks

Reviews

Conflicts

Uh oh!

nanlour commented Mar 17, 2024

Uh oh!

mzumsande commented Apr 29, 2024

Uh oh!

mzumsande commented May 31, 2024

Uh oh!

glozow commented Jul 4, 2024

Uh oh!

naumenkogs commented Sep 10, 2024

Uh oh!

mzumsande commented Sep 17, 2024

Uh oh!

maflcko commented Sep 18, 2024

Uh oh!

mzumsande commented Sep 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

instagibbs commented Sep 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mzumsande commented Oct 8, 2024

Uh oh!

naumenkogs commented Oct 10, 2024

Uh oh!

instagibbs left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

DrahtBot commented Oct 10, 2024

Uh oh!

mzumsande commented Oct 10, 2024

Uh oh!

instagibbs commented Oct 10, 2024

Uh oh!

instagibbs commented Oct 11, 2024

Uh oh!

mzumsande commented Oct 11, 2024

Uh oh!

instagibbs commented Oct 11, 2024

Uh oh!

rebroad commented Oct 13, 2024

Uh oh!

DrahtBot commented Mar 16, 2025

Uh oh!

yuvicc commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

mzumsande commented Mar 15, 2024 •

edited

Loading

DrahtBot commented Mar 15, 2024 •

edited

Loading

mzumsande commented Sep 18, 2024 •

edited

Loading

instagibbs commented Sep 18, 2024 •

edited

Loading

yuvicc commented Jun 10, 2025 •

edited

Loading