Skip to content

Conversation

mzumsande
Copy link
Contributor

@mzumsande mzumsande commented Mar 15, 2024

Problem:
For stalling at the tip, we have a parallel download mechanism for compact blocks that was added in #27626.
For stalling during IBD, we have a lookahead window of 1024 blocks, and if that is exceeded, we disconnect the stalling peer.
However, if we are close to but not at the tip (<=1024 blocks), neither of these mechanisms apply. We can't do compact blocks yet, and the stalling mechanism doesn't work because the 1024 window cannot be exceeded.

As a result, we have to resort to BLOCK_DOWNLOAD_TIMEOUT_BASE which only disconnects a peer after 10 minutes (plus 5 minutes more for each additional peers we currently have blocks in flight). This is too long in my opinion, especially since peers get assigned up to 16 blocks (MAX_BLOCKS_IN_TRANSIT_PER_PEER) and could repeat this process to stall us even longer if they send us a block after 10 minutes.
This issue was observed in #29281 and #12291 (comment) with broken peers that didn't send us blocks.

Proposed solution:
If we are 1024 or less blocks away from the tip and haven't requested or received a block from any peer for 30 seconds, add another peer to download the critical block from that would help us advance our tip. Add up to two additional peers this way.

Other thoughts

  • I also considered the alternative of extending the existing stalling mechanism that disconnects instead of introducing parallel downloads. This could be potentially less wasteful, but we might be over-eager to disconnect peers when really close to the tip, plus this might lead to cycling through lots of peers in extreme situations where we have a very slow internet connection.
  • The chosen timeout of 30 seconds could lead to inefficiencies / bandwidth waste when we have a really slow internet connection. Maybe it could make sense to track the last successful download times from existing peers and use a dynamic timeout according to that statistics, instead of setting it to a fixed value.
  • I will leave this PR in draft until I have tested it a bit more in the wild.

Fixes #29281

@DrahtBot
Copy link
Contributor

DrahtBot commented Mar 15, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/29664.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK glozow, yuvicc
Stale ACK naumenkogs

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #31859 (test: Rename send_message to send_without_ping by maflcko)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot DrahtBot added the P2P label Mar 15, 2024
@mzumsande mzumsande force-pushed the 202403_near_tip_stalling branch from 41d3336 to 703a6da Compare March 16, 2024 02:26
@mzumsande mzumsande changed the title p2p: When close to the tip, download blocks in pararallel from additional peers to prevent stalling p2p: When close to the tip, download blocks in parallel from additional peers to prevent stalling Mar 16, 2024
@nanlour
Copy link
Contributor

nanlour commented Mar 17, 2024

Maybe it could make sense to track the last successful download times from existing peers and use a dynamic timeout according to that statistics, instead of setting it to a fixed value.

Is it possible for other peers to make me believe I have a slow internet connection?

@mzumsande
Copy link
Contributor Author

Is it possible for other peers to make me believe I have a slow internet connection?

Sorry, I missed this question. For single peers yes, but the idea is that the data would be taken from multiple outbound peers - it is hard for an attacker to control multiple of these, see eclipse attacks.

Will soon come back to this PR and take it out of draft.

@mzumsande mzumsande force-pushed the 202403_near_tip_stalling branch from 703a6da to 824c125 Compare May 31, 2024 19:36
@mzumsande mzumsande marked this pull request as ready for review May 31, 2024 19:36
@mzumsande
Copy link
Contributor Author

Cleaned up a bit, rebased and marked as ready for review!

@mzumsande mzumsande mentioned this pull request Jun 28, 2024
@glozow
Copy link
Member

glozow commented Jul 4, 2024

concept ACK

@naumenkogs
Copy link
Member

Concept ACK

@mzumsande
Copy link
Contributor Author

824c125 to 1bcfcf2: Rebased

@maflcko
Copy link
Member

maflcko commented Sep 18, 2024

The CI failed, but the logs are missing :(

https://github.com/bitcoin/bitcoin/pull/29664/checks?check_run_id=30286710796

�[0m�[0;31mp2p_ibd_stalling.py --v2transport                        | ✖ Failed  | 2465 s

I'll re-run 10 times.

@mzumsande
Copy link
Contributor Author

mzumsande commented Sep 18, 2024

The CI failed, but the logs are missing :( I'll re-run 10 times.

Failed log from CI: https://api.cirrus-ci.com/v1/task/5540180679983104/logs/ci.log

I'm still not sure why it failed, tried to bump a mocktime with the last push, but poking in the dark since it doesn't fail locally for me...

@mzumsande mzumsande force-pushed the 202403_near_tip_stalling branch from 1bcfcf2 to 26e3686 Compare September 18, 2024 18:23
@instagibbs
Copy link
Member

instagibbs commented Sep 18, 2024

IIUC(? could be off-base), this could also fire at tip if there are no hb peers and the first peer requested stalls out for 30s? Once parallel compact blocks is warmed up, the slots will likely be taken by those efforts.

If so, I wonder if this could allow us to f.e., not require a parallel compact slot be taken by an outbound, if we know we will try an outbound connection 30s later to defeat stalling behavior?

Having tests for the interplay would be interesting too.

@mzumsande mzumsande force-pushed the 202403_near_tip_stalling branch from b7aefca to 10ba2bd Compare October 8, 2024 19:40
@mzumsande mzumsande force-pushed the 202403_near_tip_stalling branch from 10ba2bd to ab45a06 Compare October 8, 2024 22:38
@mzumsande
Copy link
Contributor Author

I'm still not sure why it failed, tried to bump a mocktime with the last push, but poking in the dark since it doesn't fail locally for me...

Hopefullly fixed now, I believe that there was a missing sync in the added test.

Rather than the crude 10 minute timeout - why not actually check if the block is being downloaded from the chosen peer - i.e. work out the throughput and make a decision based on this (compared to typical throughput from other peers)?

I'm not sure what exactly you mean with throughput here - do you keeping track of historical block download times, and use this as input for an algorithm for detecting stalling? If so, I'm thinking about that option (and how to exactly design an algorithm using it), and I think it could be applied to all kinds of stalling, also when deep in IBD - but it also has some downsides (more complexity, more data to keep track of, have to be careful it's not easily gameable). So I'm thinking about it but not completely sure if that is the best way to go.

In any case the current approach is more in line with the current IBD stalling detection (fixed timeouts).

@DrahtBot DrahtBot removed the CI failed label Oct 8, 2024
@naumenkogs
Copy link
Member

ACK ab45a06

@DrahtBot DrahtBot requested a review from glozow October 10, 2024 11:12
Copy link
Member

@instagibbs instagibbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a bit of time to work on test coverage, with a rough branch here: https://github.com/instagibbs/bitcoin/commits/mzumsande_202403_near_tip_stalling/

last commit needs some clean up, but demonstrates some of the interplay of parallel blocks both compact and non-compact

@mzumsande mzumsande force-pushed the 202403_near_tip_stalling branch 2 times, most recently from 395d13d to 973cd01 Compare October 10, 2024 19:32
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/31374303124

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@mzumsande
Copy link
Contributor Author

I took a bit of time to work on test coverage, with a rough branch here: https://github.com/instagibbs/bitcoin/commits/mzumsande_202403_near_tip_stalling/

last commit needs some clean up, but demonstrates some of the interplay of parallel blocks both compact and non-compact

Thanks!
I also worked on this yesterday/today and wrote a similar test, but added that as an at_tip_stalling subtest to p2p_ibd_stalling.py - while your test covering the logic in a similar way is in p2p_compactblocks.py. In which location do you think that it fits better (or even both?)?

Your first commit looks great, will definitely take that (but haven't yet with the latest push).

One more thing: I added the condition m_last_block_requested.load() > 0s so that the near-tip stalling logic can't hit right after startup when we never requested a block from anyone.

@instagibbs
Copy link
Member

while your test covering the logic in a similar way is in p2p_compactblocks.py. In which location do you think that it fits better (or even both?)?

I think your location is also fine, we're not testing that it works, just that it's being requested pretty much.

@instagibbs
Copy link
Member

Just for brainstorming, also made a modified branch that reduces parallel compact block downloads to a count of 2, which would allow one or more backoff blocks at tip at a less aggressive pace: https://github.com/instagibbs/bitcoin/commits/mzumsande_202403_near_tip_stalling_2/

mzumsande and others added 4 commits October 11, 2024 14:01
This is in preparation to add more subtests.
Also adjust waiting condition to replace the
total_bytes_recv_for_blocks magical number hack.
If we are 1024 or less blocks away from the tip and haven't requested or received
a block from any peer for 30 seconds, add another peer to download the critical
block from. Add up to two additional peers this way.

Also adds test for the new behavior (co-authored by Greg Sanders)

Co-authored-by: Greg Sanders <gsanders87@gmail.com>
When at the tip, the near-tip stalling still applies, but
it will interact with compact block requests. Add tests for this.
@mzumsande mzumsande force-pushed the 202403_near_tip_stalling branch from 973cd01 to 5e1ff82 Compare October 11, 2024 18:20
@mzumsande
Copy link
Contributor Author

973cd01 to 5e1ff82:

  • added more detailed coverage by @instagibbs (thanks!)
  • provided missing block and check that sync finishes in the end in near_tip_stalling subtest
  • remove unused total_bytes_recv_for_blocks in first commit

Just for brainstorming, also made a modified branch that reduces parallel compact block downloads to a count of 2, which would allow one or more backoff blocks at tip at a less aggressive pace:

That sounds reasonable. Is it a problem that on slow connections or new connections with unfilled mempools the parallel compact blocks download (which is triggered immediately instead of having a timeout) would hit more often than necessary?

@instagibbs
Copy link
Member

Is it a problem that on slow connections or new connections with unfilled mempools the parallel compact blocks download (which is triggered immediately instead of having a timeout) would hit more often than necessary?

I think that's already the case, a saving grace is that since we only trigger parallel cb with high-bandwidth cb peers it won't happen until you've seen a few blocks at tip at least.

@rebroad
Copy link
Contributor

rebroad commented Oct 13, 2024

I'm still not sure why it failed, tried to bump a mocktime with the last push, but poking in the dark since it doesn't fail locally for me...

Hopefullly fixed now, I believe that there was a missing sync in the added test.

Rather than the crude 10 minute timeout - why not actually check if the block is being downloaded from the chosen peer - i.e. work out the throughput and make a decision based on this (compared to typical throughput from other peers)?

I'm not sure what exactly you mean with throughput here - do you keeping track of historical block download times, and use this as input for an algorithm for detecting stalling? If so, I'm thinking about that option (and how to exactly design an algorithm using it), and I think it could be applied to all kinds of stalling, also when deep in IBD - but it also has some downsides (more complexity, more data to keep track of, have to be careful it's not easily gameable). So I'm thinking about it but not completely sure if that is the best way to go.

In any case the current approach is more in line with the current IBD stalling detection (fixed timeouts).

I mean the current download speed - averaged out over the last 30 seconds should be sufficient to make a reasonably quick decision.

@DrahtBot
Copy link
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

@yuvicc
Copy link
Contributor

yuvicc commented Jun 10, 2025

Concept ACK

Will review this once ready!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Post startup stalling
9 participants