Skip to content

net: don't lock cs_main while reading blocks in net processing #26326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 8, 2024

Conversation

andrewtoth
Copy link
Contributor

@andrewtoth andrewtoth commented Oct 17, 2022

Inspired by #11913 and #26308.

cs_main doesn't need to be locked while reading blocks. This removes the locks in net_processing.

@DrahtBot DrahtBot added the P2P label Oct 17, 2022
@andrewtoth andrewtoth force-pushed the remove-read-lock-in-net branch 4 times, most recently from c037473 to fdd4d20 Compare October 17, 2022 17:03
@DrahtBot
Copy link
Contributor

DrahtBot commented Oct 17, 2022

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK sr-gi, furszy, mzumsande, TheCharlatan, achow101
Concept ACK dergoegge, pablomartin4btc
Approach ACK hernanmarino

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #29641 (scripted-diff: Use LogInfo/LogDebug over LogPrintf/LogPrint by maflcko)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@dergoegge
Copy link
Member

Concept ACK

I would suggest that you rebase this PR on top of #26316 (if there is a dependency) and then maybe also mark this PR as a draft until the base is merged.

@andrewtoth andrewtoth force-pushed the remove-read-lock-in-net branch from fdd4d20 to 8711455 Compare October 19, 2022 00:20
@andrewtoth andrewtoth marked this pull request as draft October 19, 2022 00:20
@andrewtoth andrewtoth marked this pull request as ready for review November 18, 2022 16:10
@andrewtoth
Copy link
Contributor Author

I don't think #26316 is necessary anymore. We can simply remove files that are still on disk like in #26533.

@jamesob
Copy link
Contributor

jamesob commented Dec 8, 2022

Cool. Can someone add a "needs benchmark" label?

@andrewtoth
Copy link
Contributor Author

Can someone add a "needs benchmark" label?

What kind of benchmark would be appropriate here? AFAICT message processing is done on a single thread, so this patch wouldn't speed up handling requests. It would free up other threads that are waiting on the lock as the message processing thread does lengthy disk IO.

Copy link
Contributor

@hernanmarino hernanmarino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach ACK. Run all tests, everything run fine.

Copy link
Member

@pablomartin4btc pablomartin4btc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cr ACK. I like the idea and I think this change is very useful for performance purposes.
I ran unit tests and functional ones with no errors or warnings.
Pending of benchmarking if I manage to find the way, otherwise when someone posts here I'm happy to re-evaluate.

@andrewtoth
Copy link
Contributor Author

andrewtoth commented Dec 19, 2022

I wrote a tool to allow benchmarking the affected code paths https://github.com/andrewtoth/spam_block_reqs. I ran it on block 768022 using 5000 requests for each of the four paths on both this branch and master. For example:

cargo run --release -- -r block-transactions -b 000000000000000000064b1c6e9606714fd4a96d6ff877e4a93b170d86361e28 -n 5000

I did not benchmark the MSG_FILTERED_BLOCK path because it isn't supported in rust-bitcoin, however there isn't anything in that path that requires taking the lock other than the shared code for the other paths.

Results show no regression:

Request Type master (cb32328) branch
witness-block 22.6s 22.3s
compact-block 60.9s 59.1s
block-transactions 58.5s 59.0s
legacy-block 77.7s 77.4s

However, while running the spam-blocks tool using a very high number of requests so it would not complete, I ran a benchmark with ApacheBench requesting 2000 headers from block 750k 1000 times on a single core using REST.

ab -n 1000 -c 1 "http://127.0.0.1:8332/rest/headers/0000000000000000000592a974b1b9f087cb77628bb4a097d5c2c11b3476a58e.bin?count=2000"

The mean response times for fetching headers while running the spam-blocks tool for each request-type is below:

Request Type master (cb32328) branch
witness-block 14ms 1ms
compact-block 19ms 1ms
block-transactions 19ms 1ms
legacy-block 42ms 1ms

So it seems like this definitely improves responsiveness for any other threads that need to acquire a lock to cs_main.

@andrewtoth andrewtoth force-pushed the remove-read-lock-in-net branch 2 times, most recently from 135e01f to cb8925b Compare December 19, 2022 14:27
Copy link
Member

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-ACK f9d4152.
Since my last review, this PR was only rebased and a comment was added.

@DrahtBot DrahtBot requested review from hernanmarino and removed request for hernanmarino January 27, 2024 20:39
@achow101 achow101 requested review from mzumsande and sr-gi and removed request for hernanmarino April 9, 2024 15:34
@DrahtBot DrahtBot requested a review from hernanmarino May 1, 2024 18:06
Copy link
Member

@sr-gi sr-gi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

I agree with @furszy in #26326 (comment) that we should followup this with sending NotFounds for all the blocks that we are unable (or unwilling) to provide

@sipa
Copy link
Member

sipa commented May 1, 2024

we should followup this with sending NotFounds for all the blocks that we are unable (or unwilling) to provide

We can't just change the protocol like that; other software than Bitcoin Core might not expect notfound for block messages. But I do believe it would be useful to have some way of signalling (whether that's notfound or something else), and using the information, that a block is unavailable. It would need a short BIP, and a way to negotiate support for it, but it would mean that over time we may also be able to prefer downloading from peers which commit to supporting such a message, and perhaps have shorted timeouts for non-responses from those.

@andrewtoth
Copy link
Contributor Author

@sr-gi thank you for your review!

that we should followup this with sending NotFounds for all the blocks that we are unable (or unwilling) to provide

After #28120 I don't think it is very likely for this to occur by updated nodes. For older nodes and other software that requests blocks past the prune threshold it would require a protocol change as @sipa mentioned.

@furszy
Copy link
Member

furszy commented May 2, 2024

Little add about the notfound response; I already have a draft for it. I'm busy with other stuff this week but will open the PR next week :).

Copy link
Contributor

@mzumsande mzumsande left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, pindex->GetBlockPos())) {
assert(!"cannot load block from disk");
if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
LogPrint(BCLog::NET, "Cannot load block from disk. It was likely pruned before we could read it.\n");
Copy link
Contributor

@mzumsande mzumsande May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written in #26326 (comment), we never prune blocks > 288 blocks from the tip, and if pruning, we should have disconnected any peer requesting older blocks - so it should be impossible to trigger this log even with pruning (unless there are extreme scenarios like major reorgs).
On the other hand, it could still be hit while not pruning due to disk problems - e.g. while helping other peers with IBD. Searching for this assert gives me multiple matches, e.g.

It apparently also crashed forks like Bitcoin Unlimited back in the days.

I'm not against removing it, but I think that if we do, we should at least have an error message that mentions the possibility of disk failures, and that is also unconditional so it doesn't get hidden in the spammy NET log that very few people enable.

Copy link
Contributor Author

@andrewtoth andrewtoth May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it should be impossible to trigger this log even with pruning

Since we have a 2 block buffer, peers can request up to 290 blocks. So one way this can be triggered - a peer requests block 290 and we get the filepos, we release the lock, pruneblockchain RPC is called, rpc thread acquires the lock and block 290 is pruned and unlinked before this thread can acquire the fd, read now fails.

I don't think this is really likely to occur though.

Copy link
Contributor

@mzumsande mzumsande May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, maybe have a different and unconditional error message or even keep the assert just for the case where pruning isn't enabled?

Copy link
Contributor Author

@andrewtoth andrewtoth May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I think that's the best way. I did that before #26326 (comment) but other reviewers thought removing the assert is better. But, I think it makes it easier to merge this PR if there is strictly no behavior change, and other patches can be proposed to modify the assertion behavior or disconnecting before returning early. This PR is just to reduce lock scope when doing block reading.

So, if reading the block fails, we retake cs_main and determine if the block was pruned out from under us. If so, we log and return early. Otherwise, we assert as before.

@maflcko @furszy what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

determine if the block was pruned out from under us

That sounds like work, not sure how easy it is - I just meant to check whether we're running in pruning mode, so the block could have been pruned theoretically. I don't have a strong opinion between asserting / logging an error unconditionally, I just think that a conditional NET log is not sufficient in a case where something is seriously wrong on our end.

Copy link
Member

@furszy furszy May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just meant to check whether we're running in pruning mode, so the block could have been pruned theoretically.

Just in case, it should also check that the block is further than the pruning threshold.

So, if reading the block fails, we retake cs_main and determine if the block was pruned out from under us. If so, we log and return early. Otherwise, we assert as before.

For now (at least in this PR), we should maintain the current network behavior if the assert is replaced. This means disconnecting the node immediately when it happens (same behavior as if the node crashed for the assertion. The socket gets closed).

andrewtoth added 2 commits May 4, 2024 15:33
Also adds a static assertion that MAX_BLOCKTXN_DEPTH <= MIN_BLOCKS_TO_KEEP
This also changes behavior if ReadBlockFromDisk or
ReadRawBlockFromDisk fails. Previously, the node would crash
due to an assert. This has been replaced with logging the error,
disconnecting the peer, and returning early.
@andrewtoth andrewtoth force-pushed the remove-read-lock-in-net branch from f9d4152 to 75d27fe Compare May 4, 2024 19:42
@andrewtoth
Copy link
Contributor Author

Rebased and updated

  • Initial whitespace only commit has been removed
  • Added static_assert(MAX_BLOCKTXN_DEPTH <= MIN_BLOCKS_TO_KEEP, "MAX_BLOCKTXN_DEPTH too high"); in first commit
  • In the second commit, if reading the block fails, we retake cs_main and check if the block has been pruned. If so, we conditionally log to NET category, otherwise we unconditionally log an error. In either case, we also disconnect the peer before returning.

Thank you for your reviews @furszy @sr-gi @mzumsande !

@sr-gi
Copy link
Member

sr-gi commented May 6, 2024

ACK 75d27fe

The only differences are the ones mentioned, which align with the approach after considering #26326 (comment)

@DrahtBot DrahtBot requested review from furszy and mzumsande May 6, 2024 14:29
Copy link
Member

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 75d27fe with a non-blocking nit.

} else {
LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
}
pfrom.fDisconnect = true;
Copy link
Member

@furszy furszy May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It wouldn't hurt to add a comment above this disconnection (just mention how this behaved historically).
I think that people reading the code for the first time will not understand why the peer is being disconnected for a node local failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a comment a few lines up that describes the behavior https://github.com/bitcoin/bitcoin/pull/26326/files#diff-6875de769e90cec84d2e8a9c1b962cdbcda44d870d42e4215827e599e11e90e3R2453. But I agree, and will add that same comment here if I need to touch this up again.

Copy link
Contributor

@mzumsande mzumsande left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review ACK 75d27fe

if (WITH_LOCK(m_chainman.GetMutex(), return m_chainman.m_blockman.IsBlockPruned(*pindex))) {
LogPrint(BCLog::NET, "Block was pruned before it could be read, disconnect peer=%s\n", pfrom.GetId());
} else {
LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe mentioning the block hash and/or height in the unconditional logs here and below could be helpful so in case of a disk failure, users could reproduce the failure easier (e.g. with getblock RPC).

Copy link
Contributor

@TheCharlatan TheCharlatan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 75d27fe

@achow101
Copy link
Member

achow101 commented May 8, 2024

ACK 75d27fe

@achow101 achow101 merged commit 573f631 into bitcoin:master May 8, 2024
@andrewtoth andrewtoth deleted the remove-read-lock-in-net branch May 8, 2024 22:05
@bitcoin bitcoin locked and limited conversation to collaborators May 8, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.