-
Notifications
You must be signed in to change notification settings - Fork 37.7k
[WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks #30385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
Nodes advertise which blocks they have and what their current tip is, so the timeouts should only happen if the peer is misbehaving or if you disrespect the protocol by asking for things they don't have. Disconnection seems appropriate in both cases? What is the problem this is trying to solve? |
We don't ask all peers about all the blocks they know about. If the node is in the main-chain, most of the time we only know their last advertised header only. This open a window for other scenarios such peers not responding when you ask for an old block that isn't in the main chain (fingerprinting protection), not responding when you ask for a block deeper than the pruning threshold (to avoid leaking the prune-height) or for a block that was just pruned (erased in-between the inv roundtrip). Which doesn't necessarily represent a bad behaving peer. When this situations arises, well-behaving peers responding with a not_found will allow us to promptly request the block from another source instead of staying idle for 10 minutes and disconnect the well-behaving peer that is still in full compliance with the protocol specs. Preferring peers signaling this feature over others over time seems the correct path to follow to me. |
I don't think we ever ask for a block from a peer without that specific peer having announced the block or a descendant to us? I can imagine asking a peer for a block that is not on the main chain (due to a very recent reorg) or due to just being pruned, though I don't know how often this happens. I think this is generally a good change, but to judge how important it is, I think it's good to have numbers about whether this actually happens in production between honest nodes. |
Yeah. I actually spent some time checking that specifically because it would allow an easy way to partition the network by sending only the new header tip, disconnecting and expect the peer requests it to someone else who does not have it etc.. What I meant with the "The peer who advertised the header might not be the one who ends up providing the final block" is that providing a descendant of a block doesn't mean the peer has the block data in all circumstances. Thus why wrote those examples.
Ideally, yeah. But.. how would you get numbers when the disconnecting peer does not tell you the idle reason? Will think about this further. Protocol wise, and generally speaking here, I don't see why a connection should stay idle for 10 minutes when two honest peers are connected. There should always be a response, even more when the other peer continue sending all other messages accordantly. |
Introduces a new p2p protocol message 'SENDNOTFOUND' sent during the handshake phase to notify the remote peer we support receiving 'notfound' messages for unknown blocks.
No-behavior change. This will allow us to be consistent in terms of when the node will disconnect and also let us send a 'notfound' for blocks instead of disconnecting in a simpler manner.
9d924a7
to
5b6317c
Compare
🐙 This pull request conflicts with the target branch and needs rebase. |
I believed the current behavior was this: If a node is non-pruned it will have all blocks, no issue. If a node is pruned it will respond to requests for the last 288 which are protocol required and will disconnect any peer that asks for a block outside of the window. Again, no issue. If I'm mistaken about the current behavior, I think it should be changed to match that behavior because: If you allow peers to probe exactly what blocks you have, it allows tracking nodes around the network as they change addresses and potentially exposing confidential information regarding which blocks were kept (e.g. if a node kept blocks containing transactions of relevance to them). This was a specific consideration in the design of the limited flag, and is why there isn't some kind of fancy signaling about what blocks a node has or doesn't have. Kind of separately, I think that sending "not found" is a regression to an older period of protocol development-- and that generally not founds have been replaced with not requesting things that won't be found and disconnecting when things are unexpected as disconnection is the strongest protection against accidentally jamming a peer (and yet still harmless if it caused by some kind of crazy race condition). Not founds were indeed used in the past but they were found to be largely a counterproductive bandwidth waste, generally useless, and a source of vulnerabilities (particularly privacy ones). So to a first approximation I think if you find yourself thinking sending a not found is a solution to a problem, you've got a mistake somewhere else. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment gmaxwell.
I think if you find yourself thinking sending a not found is a solution to a problem, you've got a mistake somewhere else.
Yeah, thats actually how this working path started #28120.
I believed the current behavior was this: If a node is non-pruned it will have all blocks, no issue. If a node is pruned it will respond to requests for the last 288 which are protocol required and will disconnect any peer that asks for a block outside of the window. Again, no issue.
Those scenarios are ok, but what about the following ones:
-
For AssumeUTXO:
AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash. This causes honest peers trying to synchronize historical block from them to forcibly disconnect due to the lack of response while the AssumeUTXO node is synchronizing the tip chain or the background one. This scenario can be replicated just by running an AssumeUTXO node and connecting a new node -> test case exercising the disconnection can be found here.So, based on your comment, instead of implementing something like this, would your suggestion be to signal the AssumeUTXO sync state at the network level in some way? Because that sounds like signaling when (1) the node does not have the background chain and (2) when the node finishes syncing the background chain. Which sounds odd at first.
Probably not a fingerprinting concern because AssumeUTXO nodes should already be distinguishable enough just by observing their block requests. But.. I'm not so sure about it. -
High-Bandwidth Peers:
BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.
The assumeutxo node would just be a limited node, it'll offer the most recent blocks and none of the history. (it might have more history then it offers, as is the case of any other limited node.) Signaling in a more fine grained manner would have little to no value to the network, but would enable tracking the node around as it moves from IP to IP. In any case, it's no justification for not-found even that aside as we wouldn't want nodes wasting their resources to connect to one only to be told not-found. If they were used at all, it would be via signaling and so you'd know what they had to offer. The disadvantage of signing is all those tracking problems, but those problems exist for not-found too, so signaling is equivalent for privacy but avoids wasting time connecting to and requesting from things that can't help you. An obvious way that might evolve is through the definition of a "less limited peer" that also has all the blocks in the current window of X blocks or whatnot. This was evaluated when limited was defined, but measurements showed that all older blocks past a VERY short interval were fetched at equal probability which resulted in the current limited definition. In a world where assumeutxo were in widespread use the fetching pattern would be different but the resulting fetching pattern would depend a lot on how assumeutxo values were chosen (as they need to be chosen in a way that avoids attack), and so at the time limited was defined it was decided to not presume anything about those future fetch patterns. I believe the security/decenteralization considerations for assumeutxo have still just been ignored for the time being, so I don't think there are any clear answers on what fetching patterns it could expect. But it would be reasonable to assume that whatever they are they could be handled by a single additional node type.
They don't relay just headers, they relay the block, which is why the BIP directs peers to not ban each other for sending an invalid block this way so long as it was POW valid. |
Context: AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash. This, in conjunction with the current protocol behavior where nodes request blocks from peers that directly advertise them or their descendants, causes honest peers trying to synchronize historical blocks from AssumeUTXO nodes to forcibly disconnect, after stalling for ~10 minutes, due to the lack of response while the AssumeUTXO node is synchronizing the tip chain or the background chain. This lack of response happens because nodes ignore getdata requests when they don’t have the block data available (further discussion on this can be found in bitcoin#30385). Fix this by refraining from signaling full-node service while the background chain is being synced. During this period, the node will only signal 'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK') support will be re-enabled once the background chain sync is completed.
Context: AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash. This, in conjunction with the current protocol behavior where nodes request blocks from peers that directly advertise them or their descendants, causes honest peers trying to synchronize historical blocks from AssumeUTXO nodes to forcibly disconnect, after stalling for ~10 minutes, due to a getdata lack of response while the AssumeUTXO node is synchronizing the tip chain or the background chain. This lack of response happens because nodes ignore getdata requests when they don’t have the block data available (further discussion on this can be found in bitcoin#30385). Fix this by refraining from signaling full-node service support while the background chain is being synced. During this period, the node will only signal 'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK') support will be re-enabled once the background chain sync is completed.
Context: AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash. This, in conjunction with the current protocol behavior where nodes request blocks from peers that directly advertise them or their descendants, causes honest peers trying to synchronize historical blocks from AssumeUTXO nodes to forcibly disconnect, after stalling for ~10 minutes, due to a getdata lack of response while the AssumeUTXO node is synchronizing the tip chain or the background chain. This lack of response happens because nodes ignore getdata requests when they don’t have the block data available (further discussion on this can be found in bitcoin#30385). Fix this by refraining from signaling full-node service support while the background chain is being synced. During this period, the node will only signal 'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK') support will be re-enabled once the background chain sync is completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assumeutxo node would just be a limited node, it'll offer the most recent blocks and none of the history. (it might have more history then it offers, as is the case of any other limited node.)
Yeah, this was the piece I was missing. Thanks!
Funny, I thought about a dynamically adjustable network signal but didn't consider reusing the limited service flag by enabling/disabling full-node service support based on the assumeUTXO background chain-sync status..
Created #30807 implementing it. It fixes the last honest peer disconnection scenario I had in mind.
High-Bandwidth Peers:
BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block > turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.They don't relay just headers, they relay the block, which is why the BIP directs peers to not ban each other for sending an invalid block this way so long as it was POW valid.
Yeah ok, great.
…during IBD 992f83b test: add coverage for assumeUTXO honest peers disconnection (furszy) 6d5812e assumeUTXO: fix peers disconnection during sync (furszy) Pull request description: Because AssumeUTXO nodes prioritize tip synchronization, they relay their local address through the network before completing the background chain sync. This, combined with the advertising of full-node service (`NODE_NETWORK`), can result in an honest peer in IBD connecting to the AssumeUTXO node (while syncing) and requesting an historical block the node does not have. This behavior leads to an abrupt disconnection due to perceived unresponsiveness from the AssumeUTXO node. This lack of response occurs because nodes ignore `getdata` requests when they do not have the block data available (further discussion can be found in #30385). Fix this by refraining from signaling full-node service support while the background chain is being synced. During this period, the node will only signal `NODE_NETWORK_LIMITED` support. Then, full-node (`NODE_NETWORK`) support will be re-enabled once the background chain sync is completed. Thanks mzumsande for a post-#30385 convo too. Testing notes: Just cherry-pick the second commit (bb08c22) on master. It will fail there, due to the IBD node requesting historical blocks to the snapshot node - which is bad because the snapshot node will ignore the requests and stall + disconnect after some time. ACKs for top commit: achow101: ACK 992f83b naumenkogs: ACK 992f83b mzumsande: ACK 992f83b Tree-SHA512: fef525d1cf3200c2dd89a346be9c82d77f2e28ddaaea1f490a435e180d1a47a371cadea508349777d740ab56e94be536ad8f7d61cc81f6550c58b609b3779ed3
The 'not_found' message, introduced in protocol version 70001 (Bitcoin Core 0.8.0 - Feb 2013), serves as a response to a 'getdata' message when the receiving node does not have the requested data available for relay. Thus far, this has only been applied to transactions. This PR proposes extending its use to blocks as well.
Currently, when the remote peer does not have the requested block data, it ignores the 'getdata' inventory block request. This lack of response causes the requesting node to wait (and stall if the request was for an IBD block) for 10 minutes until the request timeout is triggered. As a result, the local node abruptly disconnects the peer due to perceived unresponsiveness in order to download the block from another source, even though the peer handled the message accordantly and may still be relaying other headers, blocks, and transactions properly.
This behavior is suboptimal for well-behaving nodes. Therefore, this PR proposes sending 'not_found' messages for blocks the remote peer does not know about or no longer stores, allowing the local node to promptly request them from another peer. Doing so will help prevent network synchronization stalling scenarios, allow us to shorten idle wait times (by reducing the block request timeout for peers supporting this feature), and ensure well-behaving peers remain connected.
Pending tasks: