net: don't lock cs_main while reading blocks in net processing #26326
Conversation
Force-pushed from c037473 to fdd4d20.
Concept ACK. I would suggest that you rebase this PR on top of #26316 (if there is a dependency) and then maybe also mark this PR as a draft until the base is merged.
Force-pushed from fdd4d20 to 8711455.
Force-pushed from 20d5fee to 351e587.
Cool. Can someone add a "needs benchmark" label?
What kind of benchmark would be appropriate here? AFAICT message processing is done on a single thread, so this patch wouldn't speed up handling requests. It would, however, free up other threads that are waiting on the lock while the message-processing thread does lengthy disk IO.
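To make the locking pattern concrete, here is a rough sketch of the change being discussed (simplified and hypothetical: block_hash is a placeholder, the surrounding PeerManagerImpl/ChainstateManager members are assumed from the Bitcoin Core codebase, and the actual diff appears further down the thread):

std::vector<uint8_t> block_data;
FlatFilePos block_pos;
{
    LOCK(cs_main);
    // Previously the disk read below happened inside this critical section,
    // so RPC, REST and validation threads stalled behind the disk IO.
    const CBlockIndex* pindex{m_chainman.m_blockman.LookupBlockIndex(block_hash)};
    if (!pindex) return;
    block_pos = pindex->GetBlockPos();
} // cs_main is released here, before touching the disk
if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
    // failure handling is discussed below (pruning race vs. genuine disk errors)
}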
Approach ACK. Ran all tests, everything runs fine.
cr ACK. I like the idea and I think this change is very useful for performance purposes.
I ran the unit tests and the functional tests with no errors or warnings.
Benchmarking is still pending if I manage to find a way to do it; otherwise, I'm happy to re-evaluate once someone posts results here.
I wrote a tool to allow benchmarking the affected code paths: https://github.com/andrewtoth/spam_block_reqs. I ran it on block 768022 using 5000 requests for each of the four paths, on both this branch and master. For example:
I did not benchmark the … path. Results show no regression:
However, while running the spam-blocks tool with a very high number of requests so it would not complete, I ran a benchmark with ApacheBench, requesting 2000 headers from block 750k 1000 times on a single core using REST.
The mean response times for fetching headers while running the spam-blocks tool, for each request type, are below:
So it seems like this definitely improves responsiveness for any other threads that need to acquire a lock on cs_main.
Force-pushed from 135e01f to cb8925b.
Force-pushed from 1027339 to f9d4152.
Re-ACK f9d4152.
Since my last review, this PR was only rebased and a comment was added.
Concept ACK
I agree with @furszy in #26326 (comment) that we should follow this up with sending NotFounds for all the blocks that we are unable (or unwilling) to provide.
We can't just change the protocol like that; software other than Bitcoin Core might not expect that.
@sr-gi thank you for your review!
After #28120 I don't think it is very likely for this to be triggered by updated nodes. For older nodes and other software that request blocks past the prune threshold, it would require a protocol change, as @sipa mentioned.
A small addition about the notfound response: I already have a draft for it. I'm busy with other stuff this week but will open the PR next week :).
Concept ACK
src/net_processing.cpp (outdated)
-    if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, pindex->GetBlockPos())) {
-        assert(!"cannot load block from disk");
+    if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
+        LogPrint(BCLog::NET, "Cannot load block from disk. It was likely pruned before we could read it.\n");
As written in #26326 (comment), we never prune blocks > 288 blocks from the tip, and if pruning, we should have disconnected any peer requesting older blocks - so it should be impossible to trigger this log even with pruning (unless there are extreme scenarios like major reorgs).
On the other hand, it could still be hit while not pruning due to disk problems - e.g. while helping other peers with IBD. Searching for this assert gives me multiple matches, e.g.
- Assertion `!"cannot load block from disk"' failed #6263
- ERROR: ReadBlockFromDisk : Deserialize or I/O error - ReadCompactSize() : size too large #5668
- Core dump on bitcoind 0.15.1 with -mempoolexpiry=24 #12230
- https://bitcoin.stackexchange.com/questions/113218/how-to-fix-the-bitcoind-service-shutdown-seemingly-by-invalid-block
It apparently also crashed forks like Bitcoin Unlimited back in the day.
I'm not against removing it, but I think that if we do, we should at least have an error message that mentions the possibility of disk failures, and that is also unconditional so it doesn't get hidden in the spammy NET log that very few people enable.
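For concreteness, the distinction being drawn here is roughly the following (a sketch, not the PR's final wording; both helpers are Bitcoin Core's standard logging calls):

// Category-conditional: only written to debug.log when the node runs with
// -debug=net, so most operators would never see it.
LogPrint(BCLog::NET, "Cannot load block from disk.\n");

// Unconditional: always written to debug.log, which is what is being asked
// for when a failed read may indicate a disk problem.
LogPrintf("ERROR: Cannot load block from disk - possible disk failure?\n");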
so it should be impossible to trigger this log even with pruning
Since we have a 2-block buffer, peers can request blocks up to 290 deep. So one way this can be triggered: a peer requests block 290 and we get the file position, then we release the lock; the pruneblockchain RPC is called, the RPC thread acquires the lock, and block 290 is pruned and unlinked before this thread can acquire the fd, so the read now fails.
I don't think this is really likely to occur though.
In that case, maybe have a different and unconditional error message or even keep the assert just for the case where pruning isn't enabled?
Indeed, I think that's the best way. I did that before #26326 (comment), but other reviewers thought removing the assert was better. However, I think it makes this PR easier to merge if there is strictly no behavior change; other patches can then be proposed to modify the assertion behavior or to disconnect before returning early. This PR is just about reducing lock scope when reading blocks.
So, if reading the block fails, we retake cs_main and determine if the block was pruned out from under us. If so, we log and return early. Otherwise, we assert as before.
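A rough sketch of that approach (hypothetical; as discussed below, the PR ultimately disconnects the peer instead of keeping the assert):

if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
    if (WITH_LOCK(m_chainman.GetMutex(), return m_chainman.m_blockman.IsBlockPruned(*pindex))) {
        // The block was pruned between releasing cs_main and reading it:
        // log and give up on serving this request.
        LogPrint(BCLog::NET, "Block was pruned before it could be read, peer=%d\n", pfrom.GetId());
        return;
    }
    // Not pruned, so something is genuinely wrong (e.g. disk corruption):
    // keep the historical crash-on-assert behavior.
    assert(!"cannot load block from disk");
}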
determine if the block was pruned out from under us
That sounds like work, not sure how easy it is - I just meant to check whether we're running in pruning mode, so the block could have been pruned theoretically. I don't have a strong opinion between asserting / logging an error unconditionally, I just think that a conditional NET log is not sufficient in a case where something is seriously wrong on our end.
I just meant to check whether we're running in pruning mode, so the block could have been pruned theoretically.
Just in case, it should also check that the block is further than the pruning threshold.
So, if reading the block fails, we retake cs_main and determine if the block was pruned out from under us. If so, we log and return early. Otherwise, we assert as before.
For now (at least in this PR), we should maintain the current network behavior if the assert is replaced. This means disconnecting the peer immediately when it happens (the same behavior as if the node had crashed on the assertion: the socket gets closed).
Also adds a static assertion that MAX_BLOCKTXN_DEPTH <= MIN_BLOCKS_TO_KEEP
This also changes behavior if ReadBlockFromDisk or ReadRawBlockFromDisk fails. Previously, the node would crash due to an assert. This has been replaced with logging the error, disconnecting the peer, and returning early.
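Roughly, the added guard and the new failure handling look like this (a sketch based on the description above and the diff excerpts below, not a verbatim copy of the PR):

// Compile-time guarantee: blocks served via getblocktxn are never deep enough
// to have been pruned, since at least MIN_BLOCKS_TO_KEEP blocks are retained.
static_assert(MAX_BLOCKTXN_DEPTH <= MIN_BLOCKS_TO_KEEP, "MAX_BLOCKTXN_DEPTH too high");

// On a failed read, instead of crashing on an assert, log the error,
// disconnect the peer (the old crash closed the socket anyway) and return early.
if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
    LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
    pfrom.fDisconnect = true;
    return;
}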
Force-pushed from f9d4152 to 75d27fe.
Rebased and updated. Thank you for your reviews @furszy @sr-gi @mzumsande!
ACK 75d27fe. The only differences are the ones mentioned, which align with the approach after considering #26326 (comment).
ACK 75d27fe with a non-blocking nit.
} else {
    LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
}
pfrom.fDisconnect = true;
nit: It wouldn't hurt to add a comment above this disconnection (just mention how this behaved historically).
I think that people reading the code for the first time will not understand why the peer is being disconnected for a node-local failure.
There is a comment a few lines up that describes the behavior https://github.com/bitcoin/bitcoin/pull/26326/files#diff-6875de769e90cec84d2e8a9c1b962cdbcda44d870d42e4215827e599e11e90e3R2453. But I agree, and will add that same comment here if I need to touch this up again.
Code Review ACK 75d27fe
if (WITH_LOCK(m_chainman.GetMutex(), return m_chainman.m_blockman.IsBlockPruned(*pindex))) {
    LogPrint(BCLog::NET, "Block was pruned before it could be read, disconnect peer=%s\n", pfrom.GetId());
} else {
    LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
nit: maybe mentioning the block hash and/or height in the unconditional logs here and below could be helpful, so that in case of a disk failure users could reproduce the failure more easily (e.g. with the getblock RPC).
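For example, something along these lines (hypothetical wording; pindex is assumed to be the index entry of the requested block):

LogError("Cannot load block %s (height %d) from disk, disconnect peer=%d\n",
         pindex->GetBlockHash().ToString(), pindex->nHeight, pfrom.GetId());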
ACK 75d27fe
ACK 75d27fe
Inspired by #11913 and #26308.
cs_main doesn't need to be locked while reading blocks. This removes the locks in net_processing.