
Conversation

@sdaftuar sdaftuar (Member) commented Jul 27, 2022

On startup, if our headers chain is more than a day behind current time, we'll pick one peer to sync headers with until our best headers chain is caught up (at that point, we'll try to sync headers with all peers).

However, if an INV for a block is received before our headers chain is caught up, we'll then start to sync headers from each peer announcing the block. This can result in doing a big headers sync with many (if not all) of our peers simultaneously, which wastes bandwidth.

This PR would reduce that overhead by picking (at most) one new peer to try syncing headers with whenever a new block is announced, prior to our headers chain being caught up.
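As a rough illustration of the selection rule described above, here is a minimal Python sketch (names such as `should_sync_headers_on_inv` are illustrative, not the actual identifiers in net_processing.cpp):

```python
class Peer:
    """Hypothetical stand-in for per-peer state."""
    def __init__(self):
        self.sync_started = False               # already our headers-sync peer?
        self.inv_triggered_sync_before = False  # already picked via a block INV?

class Node:
    """Hypothetical stand-in for node-wide state."""
    def __init__(self):
        # Hash of the last block INV that triggered an extra headers sync.
        self.last_inv_triggering_sync = None

    def should_sync_headers_on_inv(self, peer, announced_block):
        """Return True if a block INV from `peer` should trigger getheaders."""
        if peer.sync_started:
            return True   # already syncing headers with this peer; keep going
        if peer.inv_triggered_sync_before:
            return False  # this peer already used its one INV-triggered chance
        if announced_block == self.last_inv_triggering_sync:
            return False  # another peer was already picked for this block
        # Pick this peer as the (at most) one extra sync peer for this block.
        peer.inv_triggered_sync_before = True
        self.last_inv_triggering_sync = announced_block
        return True
```

With several peers all announcing the same new block, only the first one processed triggers an extra headers sync; a later, different block can add one more peer.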

Post-merge edit: See #25720 (comment) for more context about this change.

@fanquake fanquake added the P2P label Jul 27, 2022
@sdaftuar sdaftuar force-pushed the 2022-07-reduce-headers-sync-bandwidth branch from 0ae123f to 134d8d1 Compare July 27, 2022 15:54
@sdaftuar sdaftuar force-pushed the 2022-07-reduce-headers-sync-bandwidth branch from 134d8d1 to 4a33763 Compare July 27, 2022 21:23
@maflcko maflcko (Member) left a comment

left a few questions

// as new blocks are found, we are willing to add one new peer per
// block to sync with as well, to sync quicker in the case where
// our initial peer is unresponsive (but less bandwidth than we'd
// use if we turned on sync with all peers).
Member

If we assume that the initial headers sync logic is robust, why add the additional complexity here? Wouldn't it be better to just skip the getheaders message? Or if the logic isn't assumed to be robust, lower the timeout or add a new peer on a random timeout?

@sdaftuar sdaftuar (Member, Author) commented Aug 3, 2022

I'm not sure I completely understand what you're asking, but the topic is more complicated than just an assumption around whether the initial headers sync logic is robust:

  • Ideally, we would only download the full headers chain from a single peer when we are starting up, because it saves bandwidth to do so.
  • It's possible that the peer we pick for initial headers sync could be (a) slow, (b) on a chain that is not the main chain, (c) adversarial, or some other terrible combination of those factors. So we cannot just have our logic rely on the initial peer to serve us the honest chain in a reasonable amount of time.
  • We currently have two behaviors that help protect us from choosing a bad initial peer. The main protection is that when a block INV is received, we send a getheaders to all peers that announce the block, resulting in us getting the main chain with high probability. However, this wastes bandwidth if many peers serve us an INV at the same time, which is probably the common case in the scenario where our initial peer is slow.
  • The second protection we have is that after about 20 minutes, we'll evict our initial headers-sync peer if our tip's timestamp isn't within a day of the current time. This could kick in if we have a bad initial peer and no blocks are found for a while.

I think we could do a variety of things to improve the current situation on master; adding (say) one additional headers sync peer on some kind of timer (maybe every 5 or 10 minutes) could make sense. Choosing a random peer among the set of peers announcing a block is probably better peer selection than choosing a random peer (or random outbound peer) on a timer: if a peer sends an INV, there's more reason to believe it is responsive and will be helpful in getting us the chain. Probably some combination of both would be even better.

However, the complexity I ran into when thinking about other strategies for initial sync has to do with the eviction logic. Right now, I think it's mostly good that we evict our (single) initial headers-sync peer if we can't get a recent chain tip within 20 minutes. However, triggering that logic on all our peers at the same time seems over the top to me, because there are edge-case scenarios (such as no blocks having been found on the network for a day, or the honest chain being some kind of billion-block timewarp chain that takes more than 20 minutes to download) where such logic could be badly behaved for the network: we could end up with no peers, or fall out of consensus.

I think what I'm proposing in this patch is a narrow change that exactly addresses the bandwidth problem, and maximizes the chance we find a good peer quickly, without making our behavior in those edge-case scenarios any worse. Nevertheless, a bigger overhaul of this logic that carefully considers these things could certainly be an improvement and make this whole thing easier to think about.

Contributor

@sdaftuar can you move this comment to the top of the PR? I think this context helps quite a bit (and I was going to ask for this in a comment on the PR directly).

@sdaftuar sdaftuar force-pushed the 2022-07-reduce-headers-sync-bandwidth branch from 4a33763 to 17f2822 Compare August 3, 2022 13:48
@dergoegge dergoegge (Member) left a comment

Concept ACK


def run_test(self):
    self.log.info("Adding a peer to node0")
    peer1 = self.nodes[0].add_p2p_connection(P2PInterface())
Member

Doesn't need to happen in this PR but in the future this test could also cover our preference for outbound peers for the initial headers sync.

@LarryRuane (Contributor)

Related: #8306

@pablomartin4btc pablomartin4btc (Member) left a comment

Code review ACK.

Tested ACK (unit & functional tests).

@hernanmarino hernanmarino (Contributor) left a comment

Tested ACK

// our initial peer is unresponsive (but less bandwidth than we'd
// use if we turned on sync with all peers).
CNodeState& state{*Assert(State(pfrom.GetId()))};
if (state.fSyncStarted || (!peer->m_inv_triggered_getheaders_before_sync && *best_block != m_last_block_inv_triggering_headers_sync)) {
Contributor

If there are two blocks found at the same time/height, this could still trigger a getheaders to most peers, no? (Someone sends block A, then someone else sends block B, then someone else sends A, etc) Might it be better to set m_last_extra_headers_sync = SteadyClock::now(); and check that it's been at least a minute before adding an extra one?

Member Author

That's true, but I think your suggestion is less robust overall than what I have proposed here: sending an INV to a node is free (there is no proof-of-work attached to a block hash), so an adversary could take advantage of your proposed strategy by continually connecting, sending an INV, and disconnecting, to prevent a node from trying to sync with any of its honest peers that are announcing main-chain blocks.

On the other hand, I just checked one of my long-running, well-connected nodes, and it's seen about 10 stale blocks in the last 50000. So that seems like pretty good odds that a node starting up is unlikely to run into this?
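To make the stale-block scenario above concrete, here is a small self-contained sketch of the quoted condition (`inv_triggers_sync` is an illustrative name, not the real C++ function): with fresh peers alternately announcing two competing blocks, every announcement passes the block-hash check, so only the per-peer flag limits how many peers are added.

```python
def inv_triggers_sync(sync_started, triggered_before, announced, last_triggering):
    """Mirror of the quoted condition: sync if already started with this peer,
    or if this peer hasn't yet been picked via an INV and the announced block
    differs from the last block that triggered a sync."""
    return sync_started or (not triggered_before and announced != last_triggering)

# Five fresh peers alternately announcing competing blocks A and B:
last = None
extra_sync_peers = 0
for announced in ["A", "B", "A", "B", "A"]:
    if inv_triggers_sync(False, False, announced, last):
        extra_sync_peers += 1
        last = announced

print(extra_sync_peers)  # 5: each announcement differs from the previous one
```

In the single-block common case, only the first announcement triggers a sync (the hash check fails for the repeats), which is the bandwidth saving this PR targets.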

@sdaftuar sdaftuar force-pushed the 2022-07-reduce-headers-sync-bandwidth branch from 17f2822 to 61931d7 Compare August 12, 2022 14:46
@LarryRuane LarryRuane (Contributor) left a comment

I tested this patch pretty carefully; it's working as intended. It's not a huge improvement (I know it's not expected to be), but I started IBD on mainnet and there definitely are fewer redundant getheaders. Here's a summary of what I observed by enabling the "p2p" debug category:

The node begins by getting headers from a single peer (that's the same as without this PR). When the first block announcement arrives (via inv), which usually comes (redundantly) from all of our peers, the node begins requesting headers from only one of those peers (instead of all peers that sent the same inv), which is the intended effect of this PR. A moment later, the headers reply from this peer triggers another getheaders request. This is also occurring with the original peer, so now there are two redundant "threads" (not OS threads, but threads in the sense that each headers reply launches another getheaders request to that peer).

This means that each header is received twice from here onward. If and when a second block announcement appears, a third getheaders "thread" begins, so now we're getting each header 3 times. But, again, this is much better than getting each header N times (if N is the number of our peers), which is what happens without this PR.

As others have said, this initial headers sync process can be further improved, but it's a tricky area so caution is advised! This is a step in the right direction and it seems safe to me.

Note that this isn't a huge problem in practice because, depending on network speed, the node can often download headers all the way out to the current tip before even one new block arrives, since the full header sync typically takes only a few minutes. And if a new block does arrive during this time, it's not a big problem to be receiving these redundant headers; each is only just over 80 bytes in serialized form.

I'm going to hold off on ack for now because there are a couple of changes I'd like to see in the test.

self.log.info("Check that exactly 1 of {peer2, peer3} received a getheaders in response")
count = 0
peer_receiving_headers = None
for p in [peer2, peer3]:
Contributor

Suggested change:
  for p in [peer2, peer3]:
+     p.sync_with_ping()

Member Author

Not sure if I'm missing something; I believe this sync_with_ping() is unnecessary because we use send_and_ping() in announce_random_block(). Does that sound right?

Contributor

You're right, ignore my suggestion.

if peer2 == peer_receiving_headers:
    expected_peer = peer3

expected_peer.wait_for_getheaders()
Contributor

Suggested change:
  expected_peer.wait_for_getheaders()
+ peer_receiving_headers.sync_with_ping()
+ with p2p_lock:
+     assert "getheaders" not in peer_receiving_headers.last_message

Member Author

So I intentionally did not include this test (that a peer that we've started sync with doesn't receive a subsequent getheaders in response to an INV) because I felt like that was overspecifying what I think of as reasonable behavior. I think the most important property is that we add a new peer with each new block, but whether we continue trying to sync with existing peers is (in my view) up for debate.

For instance, if the test were rewritten a bit so that the peer receiving the initial getheaders in response to an INV had a big long headers chain to serve the node, then you'd expect there to be a getheaders in flight as headers sync proceeds. Granted that is not the situation in this test right now, but I feel like it's helpful for future test maintenance to not overspecify behavior if it doesn't really matter for what we're trying to achieve.

If our headers chain is behind on startup, then if a block is found we'll try
to catch up from all peers announcing the block, in addition to our initial
headers-sync peer. This commit changes behavior so that in this situation,
we'll choose at most one peer announcing a block to additionally sync headers
from.
@sdaftuar sdaftuar force-pushed the 2022-07-reduce-headers-sync-bandwidth branch from 61931d7 to e1a14c0 Compare August 12, 2022 21:05
@sdaftuar sdaftuar force-pushed the 2022-07-reduce-headers-sync-bandwidth branch from e1a14c0 to f6a9166 Compare August 12, 2022 21:13
@LarryRuane (Contributor)

ACK f6a9166

@ajtowns ajtowns (Contributor) commented Aug 15, 2022

ACK f6a9166

@mzumsande mzumsande (Contributor) commented Aug 15, 2022

ACK f6a9166
I reviewed the code and tested this on testnet during a period where blocks were mined multiple times in a minute - each time that happened, one more peer was added to the headers sync.

I think that this would be good to get in, especially with #25717 doubling the amount of headers download data.
While testing #25717 on a slow connection on testnet, I would sometimes run into timeouts: with 2,300,000 headers (~180MB) being downloaded twice (two phases) from 10 outbound peers, it might be necessary to download up to ~3.6GB within 20 minutes, which can be a problem on slower connections.
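As a sanity check on these figures (using the 81-byte serialized header size: an 80-byte block header plus a 1-byte empty transaction count in the headers message):

```python
headers = 2_300_000
header_bytes = 81  # 80-byte block header + 1-byte empty tx count

one_pass_mb = headers * header_bytes / 1e6
print(round(one_pass_mb))  # 186, consistent with the ~180MB figure above

# Worst case with #25717: two download phases from each of 10 outbound peers.
worst_case_gb = one_pass_mb * 2 * 10 / 1e3
print(round(worst_case_gb, 1))  # 3.7, consistent with the ~3.6GB figure above
```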

While I think that this PR is a good improvement of the existing logic, longer-term I would probably prefer a deterministic mechanism over one tied to a random process: sometimes multiple blocks are found immediately after the start of headers sync, sometimes none for more than 30 minutes, and I don't really see why this randomness should determine the number of peers for our headers download, instead of something like simply adding one additional peer every 10 minutes.

@fanquake fanquake requested a review from dergoegge August 15, 2022 09:22
@dergoegge (Member)

Code review ACK f6a9166

I think I agree with Martin, that in the long-term a deterministic mechanism would be preferable but for now this is a good improvement (especially w.r.t. #25717).

@fanquake fanquake added this to the 24.0 milestone Aug 15, 2022
@achow101 (Member)

ACK f6a9166

@achow101 achow101 merged commit 22d96d7 into bitcoin:master Aug 15, 2022
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Aug 16, 2022
@bitcoin bitcoin locked and limited conversation to collaborators Aug 16, 2023