Conversation

sdaftuar
Member

This fixes the download behavior to allow downloading blocks on a chain that is known only by inbound peers and not by outbound/preferred download peers, mentioned in issue #5138.

@sipa I tried implementing your suggestion (on IRC) of just guarding the getdata code block with a (preferredpeer || rand() %100 == 0). I think this works for what we were discussing (where an inbound peer announces a tip more than 1 block ahead on a chain no one else knows about), but it doesn't resolve @gavinandresen's test case in issue #5138 because we still wouldn't be sending getheaders when an inbound peer connects like we do for outbound peers. Instead, we'd have to wait for an inv that builds on the longer chain from the inbound peer before we send getheaders, and then with this code send the getdata.

One question: is it necessary to also check !pto->fOneShot with this change? I didn't think it would be possible for us to have headers from a one-shot peer, and therefore there would never be blocks we'd try to download from it, so I left the check out -- but perhaps it's preferable to include it just to be safe?

As an aside: I noticed that an additional way we could work around the regtest issue @gavinandresen raised, besides connecting nodes to each other in both directions, would be to whitelist localhost.

I've tested this code on a modified version of the test, where after connecting a node with a longer chain and observing that the tips didn't sync, I then generated one more block on that inbound peer and observed that the tips then did sync.

Finally, if we wanted to immediately download a longer chain known only by an inbound peer, I believe we could send an initial getheaders message to all inbound peers just as we do for outbound peers, with a similar small change to the code that sends the initial getheaders messages (while still preferring outbound peers for downloading blocks during initial block download). It wasn't clear to me whether such a behavior change would be desirable, though.

@laanwj laanwj added the P2P label Jan 15, 2015
@@ -4535,7 +4535,7 @@ bool SendMessages(CNode* pto, bool fSendTrickle)
// Message: getdata (blocks)
//
vector<CInv> vGetData;
-if (!pto->fDisconnect && !pto->fClient && fFetch && state.nBlocksInFlight < MAX_BLOCKS_IN_TRANSIT_PER_PEER) {
+if (!pto->fDisconnect && !pto->fClient && (fFetch || GetRandInt(100) == 0) && state.nBlocksInFlight < MAX_BLOCKS_IN_TRANSIT_PER_PEER) {
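
For context (not part of the hunk above): fFetch is computed earlier in SendMessages, roughly as the line below -- a paraphrase of the main.cpp of that era, so the exact expression may differ slightly.

bool fFetch = state.fPreferredDownload || (nPreferredDownload == 0 && !pto->fClient && !pto->fOneShot);
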
Member

Please use insecure_rand here. GetRandInt() returns cryptographic randomness and is thus quite expensive to call, and it's unnecessary here.

Member Author

Thanks -- fixed (commit squashed and amended).
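
Presumably the amended guard then looked something like the line below (a reconstruction for illustration only; the squashed intermediate commit isn't shown here, but the insecure_rand() guard is referenced later in this thread):

if (!pto->fDisconnect && !pto->fClient && (fFetch || insecure_rand() % 100 == 0) && state.nBlocksInFlight < MAX_BLOCKS_IN_TRANSIT_PER_PEER) {
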

@laanwj
Member

laanwj commented Mar 9, 2015

utACK

@gmaxwell
Contributor

Seems like a slightly kludgy approach, but better than not, and at least it's simple. Have we tested much with it set to also pull from inbound, just to make sure there are no issues with that?

WRT localhost, that's probably reasonable, but OTOH it might hide issues. I think of this restriction on inbound as temporary until we get more controls in place to manage resource usage; after that it can go away completely, I think.

@sdaftuar
Member Author

Yeah, that's pretty much how I feel too (kludgy but simple). I did some controlled regtest testing when I wrote this, but I haven't done any live network tests yet, which I agree would be worth doing, so I'll set something up and report back...

@sipa
Member

sipa commented Mar 24, 2015

utACK. It's kludgy, but fine as an interim solution.

It should even be perfectly fine to always enable fetching from any peer, except for the implicitly accepted behavior that not accepting incoming connections will reduce bandwidth usage. When headers-first is more widely deployed, we can get rid of that policy I think, and just add an actual bandwidth throttler.

@sdaftuar
Member Author

sdaftuar commented Apr 1, 2015

I tried testing initial block download with this code to see what kind of load it might put on an inbound peer (in the steady state, this code doesn't seem to trigger very often, as expected). It is apparently easier than I realized to run into a situation where a node starts up and acquires inbound peers quickly, before any outbound connections are established. In repeated testing last night, I restarted a long-running node (with -reindex and without the first blockfile, to trigger downloading the whole chain), and my node selected an inbound peer for its initial headers sync; with this code, that peer was then exclusively serving blocks for a while.

Of course that load would have come down after the headers were almost synced and parallel download started, but even at that point you might expect that this code could cause that inbound peer to be serving as much as 1/9 of the blocks to the node (that one inbound peer plus the default 8 outbound peers sharing the parallel fetch) -- probably less than that due to the insecure_rand() guard, but it would depend on download speeds from the outbound peers, I think.

And more generally, for a node doing a large download of the blockchain, even if it is downloading only from outbound peers for a while, receiving an inv for a new block from an inbound peer would cause it to retrieve headers from that inbound peer via the existing inv-processing logic, and this change would then cause that inbound peer to serve earlier blocks as well.

Perhaps this shouldn't be merged as is if we want to avoid using inbound peers for large-ish block download?

@sipa
Member

sipa commented Apr 1, 2015

@sdaftuar Any suggestions for a better strategy?

…peers

SendMessages will now call getheaders on both inbound and outbound peers,
once the headers chain is close to synced.  It will also try downloading
blocks from inbound peers once we're out of initial block download (so
inbound peers will participate in parallel block fetching for the last day
or two of blocks being downloaded).
@sdaftuar sdaftuar force-pushed the block-download-fix branch from 9bfa315 to 00dcaf4 on April 2, 2015 at 17:45
@sdaftuar
Member Author

sdaftuar commented Apr 2, 2015

I've pushed an alternate fix here; please review.

What about dropping the insecure_rand in favor of a call to IsInitialBlockDownload():

if (!pto->fDisconnect && !pto->fClient && (fFetch || !IsInitialBlockDownload()) && state.nBlocksInFlight < MAX_BLOCKS_IN_TRANSIT_PER_PEER) {

This would mean that we'd be willing to let inbound peers share the download burden on up to the last ~288 blocks (our tip is within 24 hours of our best known header, and our best known header is within 24 hours of the current time; in practice this is probably usually closer to 144 blocks). If we do this, I think we should also go ahead and send the initial getheaders messages to inbound peers as well, which I've done in this commit. This now fixes the problem Gavin reported here:
#5138 (comment)
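
For intuition on the ~288 / ~144 figures, here is a rough, self-contained paraphrase of the two conditions described above (illustrative only -- the real IsInitialBlockDownload() has additional checks and a latch, and the function name and parameters below are invented for the example):

#include <cstdint>
#include <ctime>

// Illustrative paraphrase of the IsInitialBlockDownload() conditions described above.
bool IsLikelyInitialBlockDownload(int nTipHeight, int nBestHeaderHeight, int64_t nBestHeaderTime)
{
    // Still syncing if the tip is more than ~24 hours of blocks (24 * 6 = 144)
    // behind the best known header...
    if (nTipHeight < nBestHeaderHeight - 24 * 6)
        return true;
    // ...or if the best known header is itself more than 24 hours old.
    if (nBestHeaderTime < time(nullptr) - 24 * 60 * 60)
        return true;
    return false;
}

// Once this returns false, the tip is at most ~144 blocks behind the best known
// header, and the best header is at most ~24 hours (~144 more blocks) behind the
// network tip -- hence "up to ~288 blocks, usually closer to 144" above.
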

Also, I don't think the insecure_rand was giving us much; I think the idea was that in the steady-state case it would cause us to prefer downloading blocks on a reorg chain from outbound peers rather than inbound peers, but that case is sufficiently rare and the load sufficiently low that I don't think optimizing for it is very important (especially because we're already willing to download the first new block announced by an inbound peer immediately, so why not the next block or two as well?).

I think this ought to work reasonably well when a far-behind node starts up, because it should have a number of peers by the time it gets close to catching up, meaning that the burden on inbound peers shouldn't be very large.

If a node were to come up and only be a day or two behind, though, then it's possible that a couple of inbound peers would be hit with a large fraction of those 288 blocks. If that is too much potential load, we could lighten it further by reducing MAX_BLOCKS_IN_TRANSIT_PER_PEER for inbound peers -- perhaps just allow 1 or 2 blocks in flight at a time from inbound peers here? I'm not sure how much optimizing is reasonable for this case; the existing logic already would put all the load on inbound peers if we don't have any outbound peers.
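
A hypothetical sketch of that lighter-load idea (not part of this PR; the inbound constant and helper name are invented for illustration):

// Existing per-peer cap (16 at the time); a hypothetical smaller cap for inbound peers.
static const int MAX_BLOCKS_IN_TRANSIT_PER_PEER = 16;
static const int MAX_BLOCKS_IN_TRANSIT_INBOUND = 2;   // illustrative value only

// Hypothetical helper: how many blocks we'd be willing to have in flight from a peer.
int MaxBlocksInTransit(bool fInbound)
{
    return fInbound ? MAX_BLOCKS_IN_TRANSIT_INBOUND : MAX_BLOCKS_IN_TRANSIT_PER_PEER;
}

// The getdata condition would then compare state.nBlocksInFlight against
// MaxBlocksInTransit(pto->fInbound) instead of the single constant.
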

@sdaftuar
Member Author

sdaftuar commented Apr 3, 2015

One behavior worth noting is that with this change, our maximum number of blocks in flight goes up (from 128, i.e. 8 outbound peers times MAX_BLOCKS_IN_TRANSIT_PER_PEER = 16, to 16 * the number of network peers). It's probably not generally going to exceed 288 in practice, since that's the most blocks that can still be outstanding by the time inbound peers start being used alongside the outbound peers for fetching. But if you have a stalling peer that isn't returning a block, then with the existing block download timeout logic (which depends on the number of blocks with validated headers in flight at the time of a block request), that stalling peer could have about 24 hours to deliver a block before being disconnected.

Still, there are easy workarounds (e.g. restarting a sync with listen=0 if you hit a bad peer before finishing a big sync with the network), and this change doesn't make the problem of an unresponsive peer any worse in the steady state, so I'm not sure how much optimization is worthwhile for this case. Thoughts?

@sdaftuar
Member Author

sdaftuar commented Apr 6, 2015

FYI I just opened #5976 to mitigate any issues with longer block download timeouts that the approach here might otherwise exacerbate.

@sipa
Member

sipa commented Apr 7, 2015

Concept ACK on the new approach. In the steady state, this shouldn't matter much, as almost all blocks there are fetched through the direct-fetching logic, rather than the asynchronous headers-based fetching.

@gavinandresen
Contributor

utACK.

@laanwj
Member

laanwj commented Apr 28, 2015

utACK

@laanwj laanwj merged commit 00dcaf4 into bitcoin:master Apr 28, 2015
laanwj added a commit that referenced this pull request Apr 28, 2015
00dcaf4 Change download logic to allow calling getheaders/getdata on inbound peers (Suhas Daftuar)
@sdaftuar sdaftuar mentioned this pull request Jun 3, 2015
laanwj added a commit that referenced this pull request Sep 16, 2019
…c_invalidateblock

fae961d test: Establish only one connection between nodes in rpc_invalidateblock (MarcoFalke)

Pull request description:

  Headers and block sync should eventually converge to the same result, regardless of whether the peers treat each other as "inbound" or "outbound".

  `connect_nodes_bi` has been introduced as a (temporary?) workaround for bug #5113 and #5138, which has long been fixed in #5157 and #5662.

  Thus remove the `connect_nodes_bi` workaround from the rpc_invalidateblock test.

  Conveniently, this also closes #16453. See #16444 (comment) for rationale

ACKs for top commit:
  laanwj:
    ACK fae961d

Tree-SHA512: b3614c66a205823df73f64d19cacfbec269beb5db52ff79004d746e17d7c0dfb43ab9785fdddc97e2a76fe76286c8c605b34df3dda4a2bf5be035f01169ae89a
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Sep 16, 2019
…s in rpc_invalidateblock

maflcko pushed a commit that referenced this pull request Sep 18, 2019
fadfd84 test: Remove unused connect_nodes_bi (MarcoFalke)
fa3b9ee scripted-diff: test: Replace connect_nodes_bi with connect_nodes (MarcoFalke)
faaee1e test: Use connect_nodes when connecting nodes in the test_framework (MarcoFalke)
1111bb9 test: Reformat python imports to aid scripted diff (MarcoFalke)

Pull request description:

  By default all test nodes are connected in a chain. However, instead of just a single connection between each pair of nodes, we end up with up to four connections for a "middle" node (two outbound, two inbound, from each side).

  This is generally redundant (tx and block relay should succeed with just a single connection) and confusing. For example, test timeouts after a call to `sync_` may be racy and hard to reproduce. On top of that, the test `debug.log`s are hard to read because txs and block invs may be relayed on the same connection multiple times.

  Fix this by inlining `connect_nodes_bi` in the two tests that need it, and then replace it with a single `connect_nodes` in all other tests.

  Historic background:

  `connect_nodes_bi` has been introduced as a (temporary?) workaround for bug #5113 and #5138, which has long been fixed in #5157 and #5662.

ACKs for top commit:
  laanwj:
    ACK fadfd84
  jonasschnelli:
    utACK fadfd84 - more or less a cleanup PR.
  promag:
    Tested ACK fadfd84, ran extended tests.

Tree-SHA512: 2d027a8fd150749c071b64438a0a78ec922178628a7dbb89fd1212b0fa34febd451798c940101155d3617c0426c2c4865174147709894f1f1bb6cfa336aa7e24
random-zebra referenced this pull request in PIVX-Project/PIVX Mar 31, 2020
3e9081c [Tests] Fix tests duration and sorting in test_runner.py (random-zebra)
57dd5e7 [Tests][Refactor] Move connect_nodes_bi to p2p_time_offset (random-zebra)
9b36468 test: Replace connect_nodes_bi with connect_nodes (random-zebra)
c2a701e test: Use connect_nodes when connecting nodes in the test_framework (random-zebra)
5d3f2a1 test: Reformat python imports to aid scripted diff (random-zebra)

Pull request description:

  Mostly from upstream bitcoin#16898

  > By default all test nodes are connected in a chain. However, instead of just a single connection between each pair of nodes, we end up with up to four connections for a "middle" node (two outbound, two inbound, from each side).
  >
  > This is generally redundant (tx and block relay should succeed with just a single connection) and confusing. For example, test timeouts after a call to sync_ may be racy and hard to reproduce. On top of that, the test debug.logs are hard to read because txs and block invs may be relayed on the same connection multiple times.
  >
  > Fix this by inlining connect_nodes_bi in the two tests that need it, and then replace it with a single connect_nodes in all other tests.
  >
  > Historic background:
  >
  > connect_nodes_bi has been introduced as a (temporary?) workaround for bug dashpay#5113 and dashpay#5138, which has long been fixed in dashpay#5157 and dashpay#5662.

ACKs for top commit:
  furszy:
    Tested ACK 3e9081c .

Tree-SHA512: 4af3223c7e5a4b130855d9364b7ec3df5e141151fd1e108c55ef80dd523522780b6130e69cacd467906215c15da9a50496f12298246eb67876fcaa181463cae9
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021