net: don't lock cs_main while reading blocks in net processing #26326
Conversation
Force-pushed from c037473 to fdd4d20.
Concept ACK. I would suggest that you rebase this PR on top of #26316 (if there is a dependency) and then maybe also mark this PR as a draft until the base is merged.
Force-pushed from fdd4d20 to 8711455.
Force-pushed from 20d5fee to 351e587.
Cool. Can someone add a "needs benchmark" label?
What kind of benchmark would be appropriate here? AFAICT message processing is done on a single thread, so this patch wouldn't speed up handling requests. It would, however, free up other threads that are waiting on the lock while the message-processing thread does lengthy disk IO.
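To make the locking pattern concrete, here is a rough sketch of the change being discussed (simplified and hypothetical: block_hash is a placeholder, the surrounding PeerManagerImpl/ChainstateManager members are assumed from the Bitcoin Core codebase, and the actual diff appears further down the thread):

std::vector<uint8_t> block_data;
FlatFilePos block_pos;
{
    LOCK(cs_main);
    // Previously the disk read below happened inside this critical section,
    // so RPC, REST and validation threads stalled behind the disk IO.
    const CBlockIndex* pindex{m_chainman.m_blockman.LookupBlockIndex(block_hash)};
    if (!pindex) return;
    block_pos = pindex->GetBlockPos();
} // cs_main is released here, before touching the disk
if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
    // failure handling is discussed below (pruning race vs. genuine disk errors)
}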
Approach ACK. Ran all tests, everything runs fine.
cr ACK. I like the idea and I think this change is very useful for performance purposes.
I ran the unit tests and the functional tests with no errors or warnings.
Benchmarking is still pending if I manage to find a way to do it; otherwise, I'm happy to re-evaluate once someone posts results here.
I wrote a tool to allow benchmarking the affected code paths: https://github.com/andrewtoth/spam_block_reqs. I ran it on block 768022 using 5000 requests for each of the four paths, on both this branch and master. For example:
I did not benchmark the … path. Results show no regression:
However, while running the spam-blocks tool with a very high number of requests so it would not complete, I ran a benchmark with ApacheBench, requesting 2000 headers from block 750k 1000 times on a single core using REST.
The mean response times for fetching headers while running the spam-blocks tool, for each request type, are below:
So it seems like this definitely improves responsiveness for any other threads that need to acquire a lock on cs_main.
Force-pushed from 135e01f to cb8925b.
Force-pushed from 1027339 to f9d4152.
Re-ACK f9d4152.
Since my last review, this PR was only rebased and a comment was added.
Concept ACK
I agree with @furszy in #26326 (comment) that we should follow this up with sending NotFounds for all the blocks that we are unable (or unwilling) to provide.
We can't just change the protocol like that; software other than Bitcoin Core might not expect that.
@sr-gi thank you for your review!
After #28120 I don't think it is very likely for this to be triggered by updated nodes. For older nodes and other software that request blocks past the prune threshold, it would require a protocol change, as @sipa mentioned.
A small addition about the notfound response: I already have a draft for it. I'm busy with other stuff this week but will open the PR next week :).
Concept ACK
src/net_processing.cpp (outdated)
-    if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, pindex->GetBlockPos())) {
-        assert(!"cannot load block from disk");
+    if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
+        LogPrint(BCLog::NET, "Cannot load block from disk. It was likely pruned before we could read it.\n");
As written in #26326 (comment), we never prune blocks > 288 blocks from the tip, and if pruning, we should have disconnected any peer requesting older blocks - so it should be impossible to trigger this log even with pruning (unless there are extreme scenarios like major reorgs).
On the other hand, it could still be hit while not pruning due to disk problems - e.g. while helping other peers with IBD. Searching for this assert gives me multiple matches, e.g.
- Assertion `!"cannot load block from disk"' failed #6263
- ERROR: ReadBlockFromDisk : Deserialize or I/O error - ReadCompactSize() : size too large #5668
- Core dump on bitcoind 0.15.1 with -mempoolexpiry=24 #12230
- https://bitcoin.stackexchange.com/questions/113218/how-to-fix-the-bitcoind-service-shutdown-seemingly-by-invalid-block
It apparently also crashed forks like Bitcoin Unlimited back in the day.
I'm not against removing it, but I think that if we do, we should at least have an error message that mentions the possibility of disk failures, and that is also unconditional so it doesn't get hidden in the spammy NET log that very few people enable.
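For concreteness, the distinction being drawn here is roughly the following (a sketch, not the PR's final wording; both helpers are Bitcoin Core's standard logging calls):

// Category-conditional: only written to debug.log when the node runs with
// -debug=net, so most operators would never see it.
LogPrint(BCLog::NET, "Cannot load block from disk.\n");

// Unconditional: always written to debug.log, which is what is being asked
// for when a failed read may indicate a disk problem.
LogPrintf("ERROR: Cannot load block from disk - possible disk failure?\n");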
so it should be impossible to trigger this log even with pruning
Since we have a 2-block buffer, peers can request blocks up to 290 deep. So one way this can be triggered: a peer requests block 290 and we get the file position, then we release the lock; the pruneblockchain RPC is called, the RPC thread acquires the lock, and block 290 is pruned and unlinked before this thread can acquire the fd, so the read now fails.
I don't think this is really likely to occur though.
In that case, maybe have a different and unconditional error message or even keep the assert just for the case where pruning isn't enabled?
Indeed, I think that's the best way. I did that before #26326 (comment), but other reviewers thought removing the assert was better. However, I think it makes this PR easier to merge if there is strictly no behavior change; other patches can then be proposed to modify the assertion behavior or to disconnect before returning early. This PR is just about reducing lock scope when reading blocks.
So, if reading the block fails, we retake cs_main and determine if the block was pruned out from under us. If so, we log and return early. Otherwise, we assert as before.
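A rough sketch of that approach (hypothetical; as discussed below, the PR ultimately disconnects the peer instead of keeping the assert):

if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
    if (WITH_LOCK(m_chainman.GetMutex(), return m_chainman.m_blockman.IsBlockPruned(*pindex))) {
        // The block was pruned between releasing cs_main and reading it:
        // log and give up on serving this request.
        LogPrint(BCLog::NET, "Block was pruned before it could be read, peer=%d\n", pfrom.GetId());
        return;
    }
    // Not pruned, so something is genuinely wrong (e.g. disk corruption):
    // keep the historical crash-on-assert behavior.
    assert(!"cannot load block from disk");
}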
determine if the block was pruned out from under us
That sounds like work, not sure how easy it is - I just meant to check whether we're running in pruning mode, so the block could have been pruned theoretically. I don't have a strong opinion between asserting / logging an error unconditionally, I just think that a conditional NET log is not sufficient in a case where something is seriously wrong on our end.
I just meant to check whether we're running in pruning mode, so the block could have been pruned theoretically.
Just in case, it should also check that the block is further than the pruning threshold.
So, if reading the block fails, we retake cs_main and determine if the block was pruned out from under us. If so, we log and return early. Otherwise, we assert as before.
For now (at least in this PR), we should maintain the current network behavior if the assert is replaced. This means disconnecting the peer immediately when it happens (the same behavior as if the node had crashed on the assertion: the socket gets closed).
Also adds a static assertion that MAX_BLOCKTXN_DEPTH <= MIN_BLOCKS_TO_KEEP
This also changes behavior if ReadBlockFromDisk or ReadRawBlockFromDisk fails. Previously, the node would crash due to an assert. This has been replaced with logging the error, disconnecting the peer, and returning early.
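Roughly, the added guard and the new failure handling look like this (a sketch based on the description above and the diff excerpts below, not a verbatim copy of the PR):

// Compile-time guarantee: blocks served via getblocktxn are never deep enough
// to have been pruned, since at least MIN_BLOCKS_TO_KEEP blocks are retained.
static_assert(MAX_BLOCKTXN_DEPTH <= MIN_BLOCKS_TO_KEEP, "MAX_BLOCKTXN_DEPTH too high");

// On a failed read, instead of crashing on an assert, log the error,
// disconnect the peer (the old crash closed the socket anyway) and return early.
if (!m_chainman.m_blockman.ReadRawBlockFromDisk(block_data, block_pos)) {
    LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
    pfrom.fDisconnect = true;
    return;
}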
Force-pushed from f9d4152 to 75d27fe.
Rebased and updated. Thank you for your reviews @furszy @sr-gi @mzumsande!
ACK 75d27fe. The only differences are the ones mentioned, which align with the approach after considering #26326 (comment).
ACK 75d27fe with a non-blocking nit.
} else {
    LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
}
pfrom.fDisconnect = true;
nit: It wouldn't hurt to add a comment above this disconnection (just mention how this behaved historically).
I think that people reading the code for the first time will not understand why the peer is being disconnected for a node-local failure.
There is a comment a few lines up that describes the behavior https://github.com/bitcoin/bitcoin/pull/26326/files#diff-6875de769e90cec84d2e8a9c1b962cdbcda44d870d42e4215827e599e11e90e3R2453. But I agree, and will add that same comment here if I need to touch this up again.
Code Review ACK 75d27fe
if (WITH_LOCK(m_chainman.GetMutex(), return m_chainman.m_blockman.IsBlockPruned(*pindex))) {
    LogPrint(BCLog::NET, "Block was pruned before it could be read, disconnect peer=%s\n", pfrom.GetId());
} else {
    LogError("Cannot load block from disk, disconnect peer=%d\n", pfrom.GetId());
nit: maybe mentioning the block hash and/or height in the unconditional logs here and below could be helpful, so that in case of a disk failure users could reproduce the failure more easily (e.g. with the getblock RPC).
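For example, something along these lines (hypothetical wording; pindex is assumed to be the index entry of the requested block):

LogError("Cannot load block %s (height %d) from disk, disconnect peer=%d\n",
         pindex->GetBlockHash().ToString(), pindex->nHeight, pfrom.GetId());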
ACK 75d27fe
ACK 75d27fe
Inspired by #11913 and #26308.
cs_main doesn't need to be locked while reading blocks. This removes the locks in net_processing.