rpc: Pruning nodes can not fetch blocks before syncing past their height #23927
Concept ACK. I started testing it by creating a functional test. Not sure if the approach is right: I started two nodes, the second one with pruning enabled, and ran

```python
def run_test(self):
    self.log.info("Mine 200 blocks on Node 0")
    self.generate(self.nodes[0], 200, sync_fun=self.no_op)
    assert_equal(self.nodes[0].getblockcount(), 400)

    self.log.info("Connect nodes")
    self.connect_nodes(0, 1)
    peers = self.nodes[1].getpeerinfo()
    assert_equal(len(peers), 1)
    peer_1_peer_0_id = peers[0]["id"]

    best_block_hash_0 = self.nodes[0].getbestblockhash()
    assert_raises_rpc_error(-1, 'In prune mode, only blocks that the node has already synced previously can be fetched from a peer', self.nodes[1].getblockfrompeer, best_block_hash_0, peer_1_peer_0_id)
```

On the master branch this test would fail, because the exception won't be raised.
Hey @brunoerg, thanks for giving it a try, but unfortunately I don't think this approach works reliably. The problem is that the outcome of this test is a race, because node 1 is downloading the blocks from node 0 in the background. It may or may not have the current tip in flight by the time the assert is called. We try to avoid tests where the outcome is not 100% reliable, because we are seeing intermittent test failures quite often already. What would be needed is a reliable way to let node 1 sync the headers but then prevent it from syncing the blocks. It seems we don't have something like this, and I am now thinking about how this could be done and where else it could be useful.
Interesting, this is new for me.

```python
def setup_network(self):
    self.setup_nodes()
```

Even with this config (`setup_network` overridden so the nodes start unconnected), will node1 download the blocks from node0 in the background after node0 mines 200 more blocks?

```python
def run_test(self):
    self.log.info("Mine 200 blocks on Node 0")
    self.generate(self.nodes[0], 200, sync_fun=self.no_op)
    assert_equal(self.nodes[0].getblockcount(), 400)

    self.log.info("Connect nodes")
    self.connect_nodes(0, 1)
    peers = self.nodes[1].getpeerinfo()
    assert_equal(len(peers), 1)
    peer_1_peer_0_id = peers[0]["id"]

    best_block_hash_0 = self.nodes[0].getbestblockhash()
    assert_raises_rpc_error(-1, 'In prune mode, only blocks that the node has already synced previously can be fetched from a peer', self.nodes[1].getblockfrompeer, best_block_hash_0, peer_1_peer_0_id)

    self.sync_blocks()
    self.nodes[1].getblockfrompeer(best_block_hash_0, peer_1_peer_0_id)
```

I thought the first
No, the nodes indeed cannot sync if they are not connected. But you are connecting the nodes in your test here:

What
I think we have a working test now. I remembered that we can submit the header via a
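The header-first strategy discussed above can be sketched outside the test framework with a toy model. Everything here (`MockPrunedNode`, `submit_header`, the peer id) is illustrative and not the actual `test_framework` API; the idea is just that a node which knows a block's *header* beyond its synced tip must refuse `getblockfrompeer` in prune mode:

```python
# Toy model of the test strategy: feed a node only the header of a block
# beyond its synced tip, then check that fetching the block is refused.
class MockPrunedNode:
    def __init__(self, tip_height):
        self.tip_height = tip_height   # highest block the node has synced
        self.known_headers = {}        # block hash -> header height

    def submit_header(self, block_hash, height):
        # The node learns about the block without downloading its data.
        self.known_headers[block_hash] = height

    def getblockfrompeer(self, block_hash, peer_id):
        height = self.known_headers[block_hash]
        if height > self.tip_height:
            raise RuntimeError(
                "In prune mode, only blocks that the node has already "
                "synced previously can be fetched from a peer")
        return {}  # success: request for the block was sent to the peer

node = MockPrunedNode(tip_height=200)
node.submit_header("00" * 32, height=400)

try:
    node.getblockfrompeer("00" * 32, peer_id=0)
    raised = False
except RuntimeError:
    raised = True
assert raised  # refused: the header is beyond the synced tip

node.tip_height = 400  # simulate the node finishing its block sync
node.getblockfrompeer("00" * 32, peer_id=0)  # now succeeds
```

Because no real syncing happens between the header submission and the assertion, there is no race of the kind described above.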
Yes, my bad. That approach wouldn't be reliable: node0 is mining 200 more blocks, so node1 probably didn't get the last one before the assertion, but they are syncing and it could cause intermittent failures.
tACK 105b287

Compiled the branch code on macOS 12 and started bitcoind with an empty datadir:

```shell
./src/bitcoind --prune=550 --daemon
```

Got the hash of a recent block from a block explorer (`00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b`), executed `getpeerinfo` to see the connections and get some peer ids to use, and then executed `getblockfrompeer` with that hash:

```shell
➜  bitcoin git:(fjahr) ✗ ./src/bitcoin-cli getblockfrompeer 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b 11
error code: -1
error message:
In prune mode, only blocks that the node has already synced previously can be fetched from a peer
➜  bitcoin git:(fjahr) ✗ ./src/bitcoin-cli getblockfrompeer 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b 8
error code: -1
error message:
In prune mode, only blocks that the node has already synced previously can be fetched from a peer
➜  bitcoin git:(fjahr) ✗ ./src/bitcoin-cli getblockfrompeer 00000000000000000001fede733d9ad94b9a9cdb07237cba25c556f8f807db4b 5
error code: -1
error message:
In prune mode, only blocks that the node has already synced previously can be fetched from a peer
```
Concept ACK, but light selfish preference for getting #23706 in first, since there's a (small) conflict.
Weak concept NACK. I think it's better to allow it. Pruning is only best-effort anyway, not a guarantee.
Force-pushed from 105b287 to b750720: Rebased
Why do you think it's better to allow it? Do you have any specific use cases in mind? Of course there is no guarantee to stay below the exact number, but for
Force-pushed from b750720 to 3ec2cba
In the context of ForkMonitor I use this feature to fetch blocks that are, by definition, at or below the tip height, i.e. stale blocks. This PR doesn't impact that use case, because these nodes are always up to date. In fact, this PR adds some safety for when a fresh node is added to the site, or an existing node is reinstalled, and it's still catching up (though in practice we don't call

Another use case that seems obvious to me is to fetch a historical block, perhaps because you're rescanning a wallet. In that case I don't see any harm in waiting until the node is synced.
Force-pushed from 3ec2cba to 92f4ca2: Rebased
While a node is still catching up to the tip that it is aware of via the headers, the user can currently use `getblockfrompeer` to fetch blocks close to the tip. These blocks are stored in the current block/rev file, which otherwise contains blocks the node is receiving as part of the syncing process. This creates a problem for pruned nodes: the files containing a fetched block are not pruned during syncing because they contain a block close to the tip. This means the entire file will not be pruned until the tip has moved on far enough from the fetched block. In extreme cases with heavy pruning (550) and multiple blocks being fetched, this could mean that the disk usage far exceeds what the user expects, potentially running out of space.
Force-pushed from 92f4ca2 to 5826bf5: Rebased
utACK 5826bf5. Very nice test.
ACK 5826bf5
```diff
@@ -453,6 +453,12 @@ static RPCHelpMan getblockfrompeer()
         throw JSONRPCError(RPC_MISC_ERROR, "Block header missing");
     }

+    // Fetching blocks before the node has synced past their height can prevent block files from
+    // being pruned, so we avoid it if the node is in prune mode.
+    if (index->nHeight > chainman.ActiveChain().Tip()->nHeight && node::fPruneMode) {
```
In 7fa851f: (non-blocking nit)

Why not use `IsBlockPruned` instead? That should be the same as saying that, in prune mode, one can only fetch blocks that were already downloaded and discarded.
If `IsBlockPruned` was used instead, fetching an older block that isn't pruned (blocks close to the tip, for instance) on a pruned node would result in the error `In prune mode, only blocks that the node has already synced previously can be fetched from a peer` instead of `Block already downloaded`.
Doesn't `chainman.ActiveChain()` need to be called with `cs_main` locked?

```cpp
CChain& ActiveChain() const EXCLUSIVE_LOCKS_REQUIRED(GetMutex()) { return ActiveChainstate().m_chain; }
```
Hmm, indeed it does. I really need to be building with clang...

```
rpc/blockchain.cpp:464:35: warning: calling function 'ActiveChain' requires holding mutex 'cs_main' exclusively [-Wthread-safety-analysis]
    if (index->nHeight > chainman.ActiveChain().Tip()->nHeight && node::fPruneMode) {
                                  ^
1 warning generated.
```
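The contract that warning enforces can be sketched in Python with a plain lock. This is an analogy only, not Bitcoin Core code: the names mirror the C++, but in C++ the requirement is declared via `EXCLUSIVE_LOCKS_REQUIRED(cs_main)` and checked at compile time by clang's thread-safety analysis, whereas here it can only be checked at runtime:

```python
import threading

# Analogy: cs_main guards shared chain state, and any function annotated
# EXCLUSIVE_LOCKS_REQUIRED(cs_main) may only be called while it is held.
cs_main = threading.Lock()
active_tip_height = 453  # arbitrary example value for the chain tip

def tip_height_requires_cs_main():
    # Runtime stand-in for the compile-time annotation: the caller must
    # already hold cs_main before touching chain state.
    assert cs_main.locked(), "caller must hold cs_main"
    return active_tip_height

with cs_main:  # rough equivalent of LOCK(cs_main) at the RPC call site
    height = tip_height_requires_cs_main()
```

Calling `tip_height_requires_cs_main()` outside the `with` block trips the assertion, which is the runtime analogue of the `-Wthread-safety-analysis` warning above.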
tACK 5826bf5

I tested the behavior by invalidating the tip with `invalidateblock` and trying to fetch it again with `getblockfrompeer`, which resulted in `In prune mode, only blocks that the node has already synced previously can be fetched from a peer` as expected.
f5ff3d7 rpc: add missing lock around chainman.ActiveTip() (Andrew Toth) Pull request description: #23927 seems to have missed a lock around `chainman.ActiveChain()`. ACKs for top commit: aureleoules: ACK f5ff3d7 Tree-SHA512: 3f116ca44c1b2bc0c7042698249ea3417dfb7c0bb81158a7ceecd087f1e02baa89948f9bb7924b1757798a1691a7de6e886aa72a0a9e227c13a3f512cc59d6c9
This PR prevents `getblockfrompeer` from being used on blocks that the node has not synced past yet if the node is running in prune mode.

Problem
While a node is still catching up to the tip that it is aware of via the headers, the user can currently use `getblockfrompeer` to fetch blocks close to or at the tip. These blocks are stored in the block/rev file that otherwise contains blocks the node is receiving as part of the syncing process.

This creates a problem for pruned nodes: the files containing a fetched block are not pruned during syncing because they contain a block close to the tip. This means the entire file (~130MB) will not be pruned until the tip has moved on far enough from the fetched block. In extreme cases with heavy pruning (like 550) and multiple blocks being fetched, this could mean that the disk usage far exceeds what the user expects, potentially running out of space.
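To put rough numbers on this, here is a worst-case sketch. The ~130MB file size and 550MB prune target are taken from the description; the assumption that each fetched block lands in a distinct, otherwise-unprunable blk/rev file pair is mine and represents the worst case:

```python
# Worst-case disk usage when fetched blocks "pin" block files on a pruned
# node. Each pinned blk*.dat/rev*.dat pair stays on disk until the tip
# moves far enough past the fetched block.
BLOCK_FILE_MB = 130    # approximate size of a full blk/rev file pair
PRUNE_TARGET_MB = 550  # the -prune=550 target from the example

def worst_case_usage_mb(num_fetched_blocks):
    # Assumes every fetched block pins a distinct block file.
    return PRUNE_TARGET_MB + num_fetched_blocks * BLOCK_FILE_MB

# Fetching just four blocks near the tip can roughly double the footprint:
assert worst_case_usage_mb(4) == 1070
```

So a user who configured 550MB of block storage could see the blocks directory grow past a gigabyte after a handful of fetches, which is the "far exceeds what the user expects" scenario above.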
Approach

There would certainly be other approaches that could fix the problem while still allowing the current behavior, but all of the ideas I came up with seemed like overkill for a niche problem on a new RPC where it's still unclear how, and how much, it will be used.
Testing

So far I did not see a simple enough way to test this. I am still looking into it, and if it's complex I will potentially add a test in a follow-up. What would be needed is a way to have a node fetch headers but not sync the blocks yet; that seems like a pattern that could be generally useful.
To manually reproduce the problematic behavior:

1. Start a node on master with `-prune=550` and an empty/new datadir (Testnet and Mainnet should both work).
2. Use `getblockfrompeer` on the current tip and a few other recent blocks.
3. Observe `blk*.dat` and `rev*.dat` files that are not being pruned. Once you have "pinned" a few of these files, the blocks folder should be significantly above the target size of 550MB.