Skip to content

Conversation

furszy
Copy link
Member

@furszy furszy commented Jul 3, 2024

The 'not_found' message, introduced in protocol version 70001 (Bitcoin Core 0.8.0 - Feb 2013), serves as a response to a 'getdata' message when the receiving node does not have the requested data available for relay. Thus far, this has only been applied to transactions. This PR proposes extending its use to blocks as well.

Currently, when the remote peer does not have the requested block data, it ignores the 'getdata' inventory block request. This lack of response causes the requesting node to wait (and stall if the request was for an IBD block) for 10 minutes until the request timeout is triggered. As a result, the local node abruptly disconnects the peer due to perceived unresponsiveness in order to download the block from another source, even though the peer handled the message accordantly and may still be relaying other headers, blocks, and transactions properly.

This behavior is suboptimal for well-behaving nodes. Therefore, this PR proposes sending 'not_found' messages for blocks the remote peer does not know about or no longer stores, allowing the local node to promptly request them from another peer. Doing so will help prevent network synchronization stalling scenarios, allow us to shorten idle wait times (by reducing the block request timeout for peers supporting this feature), and ensure well-behaving peers remain connected.

Pending tasks:

  1. Write detailed commits description.
  2. Expand PR description
    1. Describe feature negotiation message. Annotate all considered options.
    2. Describe 'not_found' handler behavior for blocks closer to the tip vs historical ones.
  3. Create BIP for the new feature negotiation message.
  4. Think about increasing the peer's block download priority when it signals support for the feature.
  5. Think about disconnecting the peer when a known feature negotiation message is received prior to the version message.
  6. Complete TODOs in the code.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jul 3, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #30393 (refactor: use existing RNG object in ProcessGetBlockData by maflcko)
  • #30364 (logging: Replace LogError and LogWarning with LogAlert by ryanofsky)
  • #30233 (refactor: move m_is_inbound out of CNodeState by sr-gi)
  • #28055 (Bugfix: net_processing: Restore "Already requested" error for FetchBlock by luke-jr)
  • #27052 (test: rpc: add last block announcement time to getpeerinfo result by LarryRuane)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot DrahtBot added the P2P label Jul 3, 2024
@furszy furszy changed the title p2p: send not_found msgs for unknown, pruned or unwilling to share blocks [WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks Jul 3, 2024
@dergoegge
Copy link
Member

Currently, when the remote peer does not have the requested block data, it ignores the 'getdata' inventory block request. This lack of response causes the requesting node to wait (and stall if the request was for an IBD block) for 10 minutes until the request timeout is triggered.

Nodes advertise which blocks they have and what their current tip is, so the timeouts should only happen if the peer is misbehaving or if you disrespect the protocol by asking for things they don't have. Disconnection seems appropriate in both cases?

What is the problem this is trying to solve?

@furszy
Copy link
Member Author

furszy commented Jul 3, 2024

Nodes advertise which blocks they have and what their current tip is, so the timeouts should only happen if the peer is misbehaving or if you disrespect the protocol by asking for things they don't have.

We don't ask all peers about all the blocks they know about. If the node is in the main-chain, most of the time we only know their last advertised header only.
Once we are headers-wise sync, the block downloading procedure download blocks from different peers. It has a window of 16 in-flight blocks per peer. The peer who advertised the header might not be the one who ends up providing the final block.

This open a window for other scenarios such peers not responding when you ask for an old block that isn't in the main chain (fingerprinting protection), not responding when you ask for a block deeper than the pruning threshold (to avoid leaking the prune-height) or for a block that was just pruned (erased in-between the inv roundtrip). Which doesn't necessarily represent a bad behaving peer.

When this situations arises, well-behaving peers responding with a not_found will allow us to promptly request the block from another source instead of staying idle for 10 minutes and disconnect the well-behaving peer that is still in full compliance with the protocol specs. Preferring peers signaling this feature over others over time seems the correct path to follow to me.

@sipa
Copy link
Member

sipa commented Jul 3, 2024

The peer who advertised the header might not be the one who ends up providing the final block.

I don't think we ever ask for a block from a peer without that specific peer having announced the block or a descendant to us?

I can imagine asking a peer for a block that is not on the main chain (due to a very recent reorg) or due to just being pruned, though I don't know how often this happens.

I think this is generally a good change, but to judge how important it is, I think it's good to have numbers about whether this actually happens in production between honest nodes.

@furszy
Copy link
Member Author

furszy commented Jul 3, 2024

The peer who advertised the header might not be the one who ends up providing the final block.

I don't think we ever ask for a block from a peer without that specific peer having announced the block or a descendant to us?

Yeah. I actually spent some time checking that specifically because it would allow an easy way to partition the network by sending only the new header tip, disconnecting and expect the peer requests it to someone else who does not have it etc..

What I meant with the "The peer who advertised the header might not be the one who ends up providing the final block" is that providing a descendant of a block doesn't mean the peer has the block data in all circumstances. Thus why wrote those examples.

I think it's good to have numbers about whether this actually happens in production between honest nodes.

Ideally, yeah. But.. how would you get numbers when the disconnecting peer does not tell you the idle reason? Will think about this further.

Protocol wise, and generally speaking here, I don't see why a connection should stay idle for 10 minutes when two honest peers are connected. There should always be a response, even more when the other peer continue sending all other messages accordantly.

furszy added 9 commits July 4, 2024 19:30
Introduces a new p2p protocol message 'SENDNOTFOUND' sent during the
handshake phase to notify the remote peer we support receiving
'notfound' messages for unknown blocks.
No-behavior change.

This will allow us to be consistent in terms of when the node
will disconnect and also let us send a 'notfound' for blocks
instead of disconnecting in a simpler manner.
@furszy furszy force-pushed the 2024_p2p_notfound_block_v2 branch from 9d924a7 to 5b6317c Compare July 4, 2024 22:31
@DrahtBot
Copy link
Contributor

DrahtBot commented Jul 9, 2024

🐙 This pull request conflicts with the target branch and needs rebase.

@gmaxwell
Copy link
Contributor

gmaxwell commented Aug 3, 2024

I believed the current behavior was this: If a node is non-pruned it will have all blocks, no issue. If a node is pruned it will respond to requests for the last 288 which are protocol required and will disconnect any peer that asks for a block outside of the window. Again, no issue.

If I'm mistaken about the current behavior, I think it should be changed to match that behavior because:

If you allow peers to probe exactly what blocks you have, it allows tracking nodes around the network as they change addresses and potentially exposing confidential information regarding which blocks were kept (e.g. if a node kept blocks containing transactions of relevance to them).

This was a specific consideration in the design of the limited flag, and is why there isn't some kind of fancy signaling about what blocks a node has or doesn't have.

Kind of separately, I think that sending "not found" is a regression to an older period of protocol development-- and that generally not founds have been replaced with not requesting things that won't be found and disconnecting when things are unexpected as disconnection is the strongest protection against accidentally jamming a peer (and yet still harmless if it caused by some kind of crazy race condition). Not founds were indeed used in the past but they were found to be largely a counterproductive bandwidth waste, generally useless, and a source of vulnerabilities (particularly privacy ones). So to a first approximation I think if you find yourself thinking sending a not found is a solution to a problem, you've got a mistake somewhere else.

Copy link
Member Author

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment gmaxwell.

I think if you find yourself thinking sending a not found is a solution to a problem, you've got a mistake somewhere else.

Yeah, thats actually how this working path started #28120.

I believed the current behavior was this: If a node is non-pruned it will have all blocks, no issue. If a node is pruned it will respond to requests for the last 288 which are protocol required and will disconnect any peer that asks for a block outside of the window. Again, no issue.

Those scenarios are ok, but what about the following ones:

  • For AssumeUTXO:
    AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash. This causes honest peers trying to synchronize historical block from them to forcibly disconnect due to the lack of response while the AssumeUTXO node is synchronizing the tip chain or the background one. This scenario can be replicated just by running an AssumeUTXO node and connecting a new node -> test case exercising the disconnection can be found here.

    So, based on your comment, instead of implementing something like this, would your suggestion be to signal the AssumeUTXO sync state at the network level in some way? Because that sounds like signaling when (1) the node does not have the background chain and (2) when the node finishes syncing the background chain. Which sounds odd at first.
    Probably not a fingerprinting concern because AssumeUTXO nodes should already be distinguishable enough just by observing their block requests. But.. I'm not so sure about it.

  • High-Bandwidth Peers:
    BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.

@gmaxwell
Copy link
Contributor

gmaxwell commented Aug 17, 2024

The assumeutxo node would just be a limited node, it'll offer the most recent blocks and none of the history. (it might have more history then it offers, as is the case of any other limited node.) Signaling in a more fine grained manner would have little to no value to the network, but would enable tracking the node around as it moves from IP to IP. In any case, it's no justification for not-found even that aside as we wouldn't want nodes wasting their resources to connect to one only to be told not-found.

If they were used at all, it would be via signaling and so you'd know what they had to offer. The disadvantage of signing is all those tracking problems, but those problems exist for not-found too, so signaling is equivalent for privacy but avoids wasting time connecting to and requesting from things that can't help you.

An obvious way that might evolve is through the definition of a "less limited peer" that also has all the blocks in the current window of X blocks or whatnot. This was evaluated when limited was defined, but measurements showed that all older blocks past a VERY short interval were fetched at equal probability which resulted in the current limited definition. In a world where assumeutxo were in widespread use the fetching pattern would be different but the resulting fetching pattern would depend a lot on how assumeutxo values were chosen (as they need to be chosen in a way that avoids attack), and so at the time limited was defined it was decided to not presume anything about those future fetch patterns.

I believe the security/decenteralization considerations for assumeutxo have still just been ignored for the time being, so I don't think there are any clear answers on what fetching patterns it could expect. But it would be reasonable to assume that whatever they are they could be handled by a single additional node type.

High-Bandwidth Peers:
BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block > turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.

They don't relay just headers, they relay the block, which is why the BIP directs peers to not ban each other for sending an invalid block this way so long as it was POW valid.

furszy added a commit to furszy/bitcoin-core that referenced this pull request Sep 3, 2024
Context:
AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise
knowledge of all blocks posterior to the snapshot base block hash. This, in
conjunction with the current protocol behavior where nodes request blocks
from peers that directly advertise them or their descendants, causes honest
peers trying to synchronize historical blocks from AssumeUTXO nodes to
forcibly disconnect, after stalling for ~10 minutes, due to the lack of
response while the AssumeUTXO node is synchronizing the tip chain or the
background chain.

This lack of response happens because nodes ignore getdata requests when they
don’t have the block data available (further discussion on this can be found
in bitcoin#30385).

Fix this by refraining from signaling full-node service while the background
chain is being synced. During this period, the node will only signal
'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK') support will
be re-enabled once the background chain sync is completed.
furszy added a commit to furszy/bitcoin-core that referenced this pull request Sep 3, 2024
Context:
AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise
knowledge of all blocks posterior to the snapshot base block hash. This, in
conjunction with the current protocol behavior where nodes request blocks
from peers that directly advertise them or their descendants, causes honest
peers trying to synchronize historical blocks from AssumeUTXO nodes to
forcibly disconnect, after stalling for ~10 minutes, due to a getdata lack
of response while the AssumeUTXO node is synchronizing the tip chain or the
background chain.

This lack of response happens because nodes ignore getdata requests when they
don’t have the block data available (further discussion on this can be found
in bitcoin#30385).

Fix this by refraining from signaling full-node service support while the
background chain is being synced. During this period, the node will only
signal 'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK')
support will be re-enabled once the background chain sync is completed.
furszy added a commit to furszy/bitcoin-core that referenced this pull request Sep 4, 2024
Context:
AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise
knowledge of all blocks posterior to the snapshot base block hash. This, in
conjunction with the current protocol behavior where nodes request blocks
from peers that directly advertise them or their descendants, causes honest
peers trying to synchronize historical blocks from AssumeUTXO nodes to
forcibly disconnect, after stalling for ~10 minutes, due to a getdata lack
of response while the AssumeUTXO node is synchronizing the tip chain or the
background chain.

This lack of response happens because nodes ignore getdata requests when they
don’t have the block data available (further discussion on this can be found
in bitcoin#30385).

Fix this by refraining from signaling full-node service support while the
background chain is being synced. During this period, the node will only
signal 'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK')
support will be re-enabled once the background chain sync is completed.
Copy link
Member Author

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumeutxo node would just be a limited node, it'll offer the most recent blocks and none of the history. (it might have more history then it offers, as is the case of any other limited node.)

Yeah, this was the piece I was missing. Thanks!
Funny, I thought about a dynamically adjustable network signal but didn't consider reusing the limited service flag by enabling/disabling full-node service support based on the assumeUTXO background chain-sync status..

Created #30807 implementing it. It fixes the last honest peer disconnection scenario I had in mind.

High-Bandwidth Peers:
BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block > turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.

They don't relay just headers, they relay the block, which is why the BIP directs peers to not ban each other for sending an invalid block this way so long as it was POW valid.

Yeah ok, great.

achow101 added a commit that referenced this pull request Sep 11, 2024
…during IBD

992f83b test: add coverage for assumeUTXO honest peers disconnection (furszy)
6d5812e assumeUTXO: fix peers disconnection during sync (furszy)

Pull request description:

  Because AssumeUTXO nodes prioritize tip synchronization, they relay their local
  address through the network before completing the background chain sync.
  This, combined with the advertising of full-node service (`NODE_NETWORK`), can
  result in an honest peer in IBD connecting to the AssumeUTXO node (while syncing)
  and requesting an historical block the node does not have. This behavior leads to
  an abrupt disconnection due to perceived unresponsiveness from the AssumeUTXO
  node.

  This lack of response occurs because nodes ignore `getdata` requests when they do
  not have the block data available (further discussion can be found in #30385).

  Fix this by refraining from signaling full-node service support while the
  background chain is being synced. During this period, the node will only
  signal `NODE_NETWORK_LIMITED` support. Then, full-node (`NODE_NETWORK`)
  support will be re-enabled once the background chain sync is completed.

  Thanks mzumsande for a post-#30385 convo too.

  Testing notes:
  Just cherry-pick the second commit (bb08c22) on master.
  It will fail there, due to the IBD node requesting historical blocks to the snapshot
  node - which is bad because the snapshot node will ignore the requests and
  stall + disconnect after some time.

ACKs for top commit:
  achow101:
    ACK 992f83b
  naumenkogs:
    ACK 992f83b
  mzumsande:
    ACK 992f83b

Tree-SHA512: fef525d1cf3200c2dd89a346be9c82d77f2e28ddaaea1f490a435e180d1a47a371cadea508349777d740ab56e94be536ad8f7d61cc81f6550c58b609b3779ed3
@furszy furszy closed this Oct 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants