[WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks #30385

furszy · 2024-07-03T13:45:12Z

The 'not_found' message, introduced in protocol version 70001 (Bitcoin Core 0.8.0 - Feb 2013), serves as a response to a 'getdata' message when the receiving node does not have the requested data available for relay. Thus far, this has only been applied to transactions. This PR proposes extending its use to blocks as well.

Currently, when the remote peer does not have the requested block data, it ignores the 'getdata' inventory block request. This lack of response causes the requesting node to wait (and stall if the request was for an IBD block) for 10 minutes until the request timeout is triggered. As a result, the local node abruptly disconnects the peer due to perceived unresponsiveness in order to download the block from another source, even though the peer handled the message accordantly and may still be relaying other headers, blocks, and transactions properly.

This behavior is suboptimal for well-behaving nodes. Therefore, this PR proposes sending 'not_found' messages for blocks the remote peer does not know about or no longer stores, allowing the local node to promptly request them from another peer. Doing so will help prevent network synchronization stalling scenarios, allow us to shorten idle wait times (by reducing the block request timeout for peers supporting this feature), and ensure well-behaving peers remain connected.

Pending tasks:

Write detailed commits description.
Expand PR description
1. Describe feature negotiation message. Annotate all considered options.
2. Describe 'not_found' handler behavior for blocks closer to the tip vs historical ones.
Create BIP for the new feature negotiation message.
Think about increasing the peer's block download priority when it signals support for the feature.
Think about disconnecting the peer when a known feature negotiation message is received prior to the version message.
Complete TODOs in the code.

DrahtBot · 2024-07-03T13:45:16Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#30393 (refactor: use existing RNG object in ProcessGetBlockData by maflcko)
#30364 (logging: Replace LogError and LogWarning with LogAlert by ryanofsky)
#30233 (refactor: move m_is_inbound out of CNodeState by sr-gi)
#28055 (Bugfix: net_processing: Restore "Already requested" error for FetchBlock by luke-jr)
#27052 (test: rpc: add last block announcement time to getpeerinfo result by LarryRuane)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

dergoegge · 2024-07-03T13:59:07Z

Currently, when the remote peer does not have the requested block data, it ignores the 'getdata' inventory block request. This lack of response causes the requesting node to wait (and stall if the request was for an IBD block) for 10 minutes until the request timeout is triggered.

Nodes advertise which blocks they have and what their current tip is, so the timeouts should only happen if the peer is misbehaving or if you disrespect the protocol by asking for things they don't have. Disconnection seems appropriate in both cases?

What is the problem this is trying to solve?

furszy · 2024-07-03T14:43:39Z

Nodes advertise which blocks they have and what their current tip is, so the timeouts should only happen if the peer is misbehaving or if you disrespect the protocol by asking for things they don't have.

We don't ask all peers about all the blocks they know about. If the node is in the main-chain, most of the time we only know their last advertised header only.
Once we are headers-wise sync, the block downloading procedure download blocks from different peers. It has a window of 16 in-flight blocks per peer. The peer who advertised the header might not be the one who ends up providing the final block.

This open a window for other scenarios such peers not responding when you ask for an old block that isn't in the main chain (fingerprinting protection), not responding when you ask for a block deeper than the pruning threshold (to avoid leaking the prune-height) or for a block that was just pruned (erased in-between the inv roundtrip). Which doesn't necessarily represent a bad behaving peer.

When this situations arises, well-behaving peers responding with a not_found will allow us to promptly request the block from another source instead of staying idle for 10 minutes and disconnect the well-behaving peer that is still in full compliance with the protocol specs. Preferring peers signaling this feature over others over time seems the correct path to follow to me.

sipa · 2024-07-03T14:48:57Z

The peer who advertised the header might not be the one who ends up providing the final block.

I don't think we ever ask for a block from a peer without that specific peer having announced the block or a descendant to us?

I can imagine asking a peer for a block that is not on the main chain (due to a very recent reorg) or due to just being pruned, though I don't know how often this happens.

I think this is generally a good change, but to judge how important it is, I think it's good to have numbers about whether this actually happens in production between honest nodes.

furszy · 2024-07-03T15:20:35Z

The peer who advertised the header might not be the one who ends up providing the final block.

I don't think we ever ask for a block from a peer without that specific peer having announced the block or a descendant to us?

Yeah. I actually spent some time checking that specifically because it would allow an easy way to partition the network by sending only the new header tip, disconnecting and expect the peer requests it to someone else who does not have it etc..

What I meant with the "The peer who advertised the header might not be the one who ends up providing the final block" is that providing a descendant of a block doesn't mean the peer has the block data in all circumstances. Thus why wrote those examples.

I think it's good to have numbers about whether this actually happens in production between honest nodes.

Ideally, yeah. But.. how would you get numbers when the disconnecting peer does not tell you the idle reason? Will think about this further.

Protocol wise, and generally speaking here, I don't see why a connection should stay idle for 10 minutes when two honest peers are connected. There should always be a response, even more when the other peer continue sending all other messages accordantly.

Introduces a new p2p protocol message 'SENDNOTFOUND' sent during the handshake phase to notify the remote peer we support receiving 'notfound' messages for unknown blocks.

No-behavior change. This will allow us to be consistent in terms of when the node will disconnect and also let us send a 'notfound' for blocks instead of disconnecting in a simpler manner.

…sion msg?

DrahtBot · 2024-07-09T13:42:03Z

🐙 This pull request conflicts with the target branch and needs rebase.

gmaxwell · 2024-08-03T20:11:45Z

I believed the current behavior was this: If a node is non-pruned it will have all blocks, no issue. If a node is pruned it will respond to requests for the last 288 which are protocol required and will disconnect any peer that asks for a block outside of the window. Again, no issue.

If I'm mistaken about the current behavior, I think it should be changed to match that behavior because:

If you allow peers to probe exactly what blocks you have, it allows tracking nodes around the network as they change addresses and potentially exposing confidential information regarding which blocks were kept (e.g. if a node kept blocks containing transactions of relevance to them).

This was a specific consideration in the design of the limited flag, and is why there isn't some kind of fancy signaling about what blocks a node has or doesn't have.

Kind of separately, I think that sending "not found" is a regression to an older period of protocol development-- and that generally not founds have been replaced with not requesting things that won't be found and disconnecting when things are unexpected as disconnection is the strongest protection against accidentally jamming a peer (and yet still harmless if it caused by some kind of crazy race condition). Not founds were indeed used in the past but they were found to be largely a counterproductive bandwidth waste, generally useless, and a source of vulnerabilities (particularly privacy ones). So to a first approximation I think if you find yourself thinking sending a not found is a solution to a problem, you've got a mistake somewhere else.

furszy

Thanks for the comment gmaxwell.

I think if you find yourself thinking sending a not found is a solution to a problem, you've got a mistake somewhere else.

Yeah, thats actually how this working path started #28120.

I believed the current behavior was this: If a node is non-pruned it will have all blocks, no issue. If a node is pruned it will respond to requests for the last 288 which are protocol required and will disconnect any peer that asks for a block outside of the window. Again, no issue.

Those scenarios are ok, but what about the following ones:

For AssumeUTXO:
AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash. This causes honest peers trying to synchronize historical block from them to forcibly disconnect due to the lack of response while the AssumeUTXO node is synchronizing the tip chain or the background one. This scenario can be replicated just by running an AssumeUTXO node and connecting a new node -> test case exercising the disconnection can be found here.

So, based on your comment, instead of implementing something like this, would your suggestion be to signal the AssumeUTXO sync state at the network level in some way? Because that sounds like signaling when (1) the node does not have the background chain and (2) when the node finishes syncing the background chain. Which sounds odd at first.
Probably not a fingerprinting concern because AssumeUTXO nodes should already be distinguishable enough just by observing their block requests. But.. I'm not so sure about it.
High-Bandwidth Peers:
BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.

gmaxwell · 2024-08-17T22:53:25Z

The assumeutxo node would just be a limited node, it'll offer the most recent blocks and none of the history. (it might have more history then it offers, as is the case of any other limited node.) Signaling in a more fine grained manner would have little to no value to the network, but would enable tracking the node around as it moves from IP to IP. In any case, it's no justification for not-found even that aside as we wouldn't want nodes wasting their resources to connect to one only to be told not-found.

If they were used at all, it would be via signaling and so you'd know what they had to offer. The disadvantage of signing is all those tracking problems, but those problems exist for not-found too, so signaling is equivalent for privacy but avoids wasting time connecting to and requesting from things that can't help you.

An obvious way that might evolve is through the definition of a "less limited peer" that also has all the blocks in the current window of X blocks or whatnot. This was evaluated when limited was defined, but measurements showed that all older blocks past a VERY short interval were fetched at equal probability which resulted in the current limited definition. In a world where assumeutxo were in widespread use the fetching pattern would be different but the resulting fetching pattern would depend a lot on how assumeutxo values were chosen (as they need to be chosen in a way that avoids attack), and so at the time limited was defined it was decided to not presume anything about those future fetch patterns.

I believe the security/decenteralization considerations for assumeutxo have still just been ignored for the time being, so I don't think there are any clear answers on what fetching patterns it could expect. But it would be reasonable to assume that whatever they are they could be handled by a single additional node type.

High-Bandwidth Peers:
BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block > turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.

They don't relay just headers, they relay the block, which is why the BIP directs peers to not ban each other for sending an invalid block this way so long as it was POW valid.

Context: AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash. This, in conjunction with the current protocol behavior where nodes request blocks from peers that directly advertise them or their descendants, causes honest peers trying to synchronize historical blocks from AssumeUTXO nodes to forcibly disconnect, after stalling for ~10 minutes, due to the lack of response while the AssumeUTXO node is synchronizing the tip chain or the background chain. This lack of response happens because nodes ignore getdata requests when they don’t have the block data available (further discussion on this can be found in bitcoin#30385). Fix this by refraining from signaling full-node service while the background chain is being synced. During this period, the node will only signal 'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK') support will be re-enabled once the background chain sync is completed.

Context: AssumeUTXO nodes prioritize tip synchronization first, meaning they advertise knowledge of all blocks posterior to the snapshot base block hash. This, in conjunction with the current protocol behavior where nodes request blocks from peers that directly advertise them or their descendants, causes honest peers trying to synchronize historical blocks from AssumeUTXO nodes to forcibly disconnect, after stalling for ~10 minutes, due to a getdata lack of response while the AssumeUTXO node is synchronizing the tip chain or the background chain. This lack of response happens because nodes ignore getdata requests when they don’t have the block data available (further discussion on this can be found in bitcoin#30385). Fix this by refraining from signaling full-node service support while the background chain is being synced. During this period, the node will only signal 'NODE_NETWORK_LIMITED' support. Then, full-node ('NODE_NETWORK') support will be re-enabled once the background chain sync is completed.

furszy

The assumeutxo node would just be a limited node, it'll offer the most recent blocks and none of the history. (it might have more history then it offers, as is the case of any other limited node.)

Yeah, this was the piece I was missing. Thanks!
Funny, I thought about a dynamically adjustable network signal but didn't consider reusing the limited service flag by enabling/disabling full-node service support based on the assumeUTXO background chain-sync status..

Created #30807 implementing it. It fixes the last honest peer disconnection scenario I had in mind.

High-Bandwidth Peers:
BIP152 mentions the possibility of relaying PoW-valid headers before fully validating the block data. So, what if the block > turns out to be invalid? the relayer will ignore any following-up request and lead to disconnection.

They don't relay just headers, they relay the block, which is why the BIP directs peers to not ban each other for sending an invalid block this way so long as it was POW valid.

Yeah ok, great.

…during IBD 992f83b test: add coverage for assumeUTXO honest peers disconnection (furszy) 6d5812e assumeUTXO: fix peers disconnection during sync (furszy) Pull request description: Because AssumeUTXO nodes prioritize tip synchronization, they relay their local address through the network before completing the background chain sync. This, combined with the advertising of full-node service (`NODE_NETWORK`), can result in an honest peer in IBD connecting to the AssumeUTXO node (while syncing) and requesting an historical block the node does not have. This behavior leads to an abrupt disconnection due to perceived unresponsiveness from the AssumeUTXO node. This lack of response occurs because nodes ignore `getdata` requests when they do not have the block data available (further discussion can be found in #30385). Fix this by refraining from signaling full-node service support while the background chain is being synced. During this period, the node will only signal `NODE_NETWORK_LIMITED` support. Then, full-node (`NODE_NETWORK`) support will be re-enabled once the background chain sync is completed. Thanks mzumsande for a post-#30385 convo too. Testing notes: Just cherry-pick the second commit (bb08c22) on master. It will fail there, due to the IBD node requesting historical blocks to the snapshot node - which is bad because the snapshot node will ignore the requests and stall + disconnect after some time. ACKs for top commit: achow101: ACK 992f83b naumenkogs: ACK 992f83b mzumsande: ACK 992f83b Tree-SHA512: fef525d1cf3200c2dd89a346be9c82d77f2e28ddaaea1f490a435e180d1a47a371cadea508349777d740ab56e94be536ad8f7d61cc81f6550c58b609b3779ed3

DrahtBot added the P2P label Jul 3, 2024

furszy changed the title ~~p2p: send not_found msgs for unknown, pruned or unwilling to share blocks~~ [WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks Jul 3, 2024

DrahtBot mentioned this pull request Jul 3, 2024

logging: Replace LogError and LogWarning with LogAlert #30364

Closed

furszy added 9 commits July 4, 2024 19:30

p2p: negotiate NOT_FOUND support for blocks

ac0c285

Introduces a new p2p protocol message 'SENDNOTFOUND' sent during the handshake phase to notify the remote peer we support receiving 'notfound' messages for unknown blocks.

net: move disconnection outside the 'getdata' handler

a5c0fcb

No-behavior change. This will allow us to be consistent in terms of when the node will disconnect and also let us send a 'notfound' for blocks instead of disconnecting in a simpler manner.

p2p: send NOT_FOUND for unknown/pruned blocks

b8e3502

DRAFT: disconnect when a negotiation msg is received prior to the ver…

9b2e70d

…sion msg?

test: validate 'notfound' block responses

cfc3798

net: handle 'notfound' block msgs

9d61bfd

Drop: note for the future

5e1023d

RPC: getpeerinfo, return peer missing blocks

7a354a7

test: coverage for 'notfound' handler

5b6317c

furszy force-pushed the 2024_p2p_notfound_block_v2 branch from 9d924a7 to 5b6317c Compare July 4, 2024 22:31

DrahtBot added the Needs rebase label Jul 9, 2024

furszy commented Aug 14, 2024

View reviewed changes

furszy mentioned this pull request Sep 3, 2024

Fix peers abruptly disconnecting from AssumeUTXO nodes during IBD #30807

Merged

furszy commented Sep 4, 2024

View reviewed changes

furszy closed this Oct 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks #30385

[WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks #30385

Uh oh!

furszy commented Jul 3, 2024

Uh oh!

DrahtBot commented Jul 3, 2024 •

edited

Loading

Uh oh!

dergoegge commented Jul 3, 2024

Uh oh!

furszy commented Jul 3, 2024 •

edited

Loading

Uh oh!

sipa commented Jul 3, 2024

Uh oh!

furszy commented Jul 3, 2024 •

edited

Loading

Uh oh!

DrahtBot commented Jul 9, 2024

Uh oh!

gmaxwell commented Aug 3, 2024 •

edited

Loading

Uh oh!

furszy left a comment •

edited

Loading

Uh oh!

gmaxwell commented Aug 17, 2024 •

edited

Loading

Uh oh!

furszy left a comment

Uh oh!

Uh oh!

[WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks #30385

[WIP] p2p: send not_found msgs for unknown, pruned or unwilling to share blocks #30385

Uh oh!

Conversation

furszy commented Jul 3, 2024

Uh oh!

DrahtBot commented Jul 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage

Reviews

Conflicts

Uh oh!

dergoegge commented Jul 3, 2024

Uh oh!

furszy commented Jul 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sipa commented Jul 3, 2024

Uh oh!

furszy commented Jul 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DrahtBot commented Jul 9, 2024

Uh oh!

gmaxwell commented Aug 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

furszy left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmaxwell commented Aug 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

furszy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

DrahtBot commented Jul 3, 2024 •

edited

Loading

furszy commented Jul 3, 2024 •

edited

Loading

furszy commented Jul 3, 2024 •

edited

Loading

gmaxwell commented Aug 3, 2024 •

edited

Loading

furszy left a comment •

edited

Loading

gmaxwell commented Aug 17, 2024 •

edited

Loading