Call ProcessNewBlock() asynchronously #16175
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts: Reviewers, this pull request conflicts with the following ones. If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Force-pushed from 3e8ad42 to 5779633
Oops, it turns out it's actually not possible to do #16174 first separately (and without it, this PR is entirely useless). By itself, #16174 ends up hitting the deadly cross-validationinterface-queue deadlock - it takes cs_peerstate first, then (eventually) calls PNB, which may call LimitValidationInterfaceQueue(). Meanwhile, in the validation queue, we may end up waiting on cs_peerstate to update some peer's state. Of course, once we actually move block validation into its own thread, this is no longer (materially) a concern - we trivially guarantee there are no locks held going into LimitValidationInterfaceQueue().
Force-pushed from a36f301 to 4eb8ebf

Force-pushed from 4eb8ebf to d31e3d0
Note that there is still a regression causing some functional tests to time out, as we may end up waiting 100ms to discover that a block has been processed and we can process the next message from that peer (this shouldn't be an issue when downloading from multiple peers, since we'll always wake on a new tip block, and it seems not to be an issue for one peer either, even though it's possible it would be due to a wake race). Not 100% sure what the best solution to that is.
For other reviewers: the particular validationinterface callback at fault here looks to be

Big Concept ACK on the lock-splitting at the very least, though I do wish it could be done separately somehow. I'm a Concept-Concept ACK on this (in concept), but want to make sure I fully understand the implications.
9fdf05d resolved some lock inversion warnings in denialofservice_tests, but left in a number of unnecessary cs_main locks, which would introduce lock inversion warnings in future changes.
This prepares for making ProcessNewBlock processing actually happen in a separate thread from the caller, even though all callsites currently block on the return value immediately. Note that this makes some of the unit tests less restrictive.
CNodeState was added for validation-state-tracking, and thus, logically, was protected by cs_main. However, as it has grown to include non-validation state (taking state from CNode), and as we've reduced cs_main usage for other unrelated things, CNodeState is left with lots of cs_main locking in net_processing. In order to ease the transition to something new, this adds only a dummy CPeerState which is held as a reference for the duration of message processing. Note that moving things is somewhat tricky before the validation thread exists, as a consistent lock order must be kept - we can't take the new cs_peerstate lock in anything that's called directly from validation.
Essentially, our goal is to not process anything for the given peer until the block finishes processing (emulating the previous behavior) without actually blocking the ProcessMessages loops. Obviously, in most cases, we'll just go on to the next peer and immediately hit a cs_main lock, blocking us anyway, but we can slowly improve that state over time by moving things from CNodeState to CPeerState.
Spawn a background thread at startup which validates each block as it comes in from ProcessNewBlock, taking advantage of the new std::future return value to keep tests simple (and the new net_processing handling of such values async already). This makes introducing subtle validationinterface deadlocks much harder as any locks held going into ProcessNewBlock do not interact with (in the form of lockorder restrictions) locks taken in validationinterface callbacks. Note that after this commit, feature_block and feature_assumevalid tests time out due to increased latency between block processing when those blocks do not represent a new best block. This will be resolved in the next commit.
This resolves the performance regression introduced in the previous commit by always waking the message processing thread after each block future resolves. Sadly, this is somewhat awkward - all other validationinterface callbacks represent an actual change to the global validation state, whereas this callback indicates only that a call which one validation "client" made has completed. After going back and forth for some time I didn't see a materially better way to resolve this issue, and luckily it's a rather simple change, but it's far from ideal. Note that because we absolutely do not want to ever block on a ProcessNewBlock-returned-future, the callback approach is critical.
This removes the cs_main lock which is taken on every ProcessMessages call.
Force-pushed from 9727e5e to 2d1e8f6
 *
 * @param[in] pblock The block we want to process.
 * @param[in] fForceProcessing Process this block even if unrequested; used for non-network block sources and whitelisted peers.
 * @param[out] fNewBlock A boolean which is set to indicate if the block was first received via this call
 * @return True if state.IsValid()
 * @return A future which complets with a boolean which is set to indicate if the block was first received via this call
Adding an "e" completes the spelling here :-)
Started review (will update list below with progress).
- 9bc8ca6 Remove unnecessary cs_mains in denialofservice_tests (1/10)
- 79e6e44 Make ProcessNewBlock return a future instead of an immediate bool (2/10)
- e898229 Add a new peer state tracking class to reduce cs_main contention. (3/10)
- 8ddbb12 Move net_processing's ProcessNewBlock calls to resolve async. (4/10)
- 64e74ce Run ProcessNewBlock in a background thread (XXX: p2p_sendheaders broken) (5/10)
- ba810b9 Add a callback to indicate a block has been processed (6/10)
- 3499d98 Move BlockChecked to a background thread (7/10)
- 445bb84 Move nDoS counters to CPeerState (and, thus, out of cs_main) (8/10)
- b6a0552 Move rejects into cs_peerstate. (9/10)
- 2d1e8f6 Actually parallelize validation somewhat (10/10)
src/net_processing.cpp (outdated)

@@ -388,7 +399,20 @@ struct CNodeState {
// Keeps track of the time (in microseconds) when transactions were requested last time
limitedmap<uint256, int64_t> g_already_asked_for GUARDED_BY(cs_main)(MAX_INV_SZ);

/** Note that this must be locked BEFORE cs_main! */
CCriticalSection cs_peerstate;
In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)
Do you want this to be a recursive mutex for some reason? If so, it'd be helpful to say in a comment why recursive locking is helpful here. Otherwise I'd suggest changing this line to Mutex g_peerstate_mutex (also not using the old "critical section" terminology).
src/net_processing.cpp (outdated)

@@ -756,12 +780,22 @@ void PeerLogicValidation::InitializeNode(CNode *pnode) {
    LOCK(cs_main);
    mapNodeState.emplace_hint(mapNodeState.end(), std::piecewise_construct, std::forward_as_tuple(nodeid), std::forward_as_tuple(addr, std::move(addrName), pnode->fInbound, pnode->m_manual_connection));
}
{
    LOCK(cs_peerstate);
    mapPeerState.emplace_hint(mapPeerState.end(), nodeid, CPeerState{});
In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)
Not necessarily suggesting you should change this here, but is the only reason for adding a new mapPeerState map to make the diff smaller? If entries are always added to both mapNodeState and mapPeerState and removed from the maps at the same times with both locks held, it would seem more ideal to have just a single map where each entry contains separate node state and peer state structs (to support separate lock annotations).
 * contention. Thus, new state tracking should go here, and we should eventually
 * move most (non-validation-specific) state here.
 */
struct CPeerState {
In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)
Probably better to follow current coding convention and call it PeerState instead of CPeerState.
src/net_processing.cpp (outdated)

/** Map maintaining per-node state. */
static std::map<NodeId, CPeerState> mapPeerState GUARDED_BY(cs_peerstate);

static CPeerState *PeerState(NodeId pnode) EXCLUSIVE_LOCKS_REQUIRED(cs_peerstate) LOCKS_EXCLUDED(cs_main) {
In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)
Would suggest calling this LookupPeerState, to be consistent with LookupBlockIndex, and because otherwise calls to this function look like they are constructing objects, not looking up existing objects.
/** Map maintaining per-node state. */
static std::map<NodeId, CPeerState> mapPeerState GUARDED_BY(cs_peerstate);
In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)
g_peer_states instead of mapPeerState would be more in line with current coding convention.
src/validation.cpp (outdated)

    }
}

result_promise.set_value(fNewBlock);
    return result;
}

std::future<bool> ProcessNewBlock(const CChainParams& chainparams, const std::shared_ptr<const CBlock> pblock, bool fForceProcessing)
In commit "Run ProcessNewBlock in a background thread (XXX: p2p_sendheaders broken)" (64e74ce)
Documentation for this function in validation.h becomes out of date with this commit and should be updated. It still says "This only returns after the best known valid block is made active."
bool ret = CheckBlock(*pblock, state, chainparams.GetConsensus());
if (ret) {
    // Store to disk
    ret = ::ChainstateActive().AcceptBlock(pblock, state, chainparams, &pindex, fForceProcessing, nullptr, &fNewBlock);
In commit "Run ProcessNewBlock in a background thread (XXX: p2p_sendheaders broken)" (64e74ce)
Throughout this function, shouldn't ::ChainstateActive() be replaced by *this?
if (!new_block && accepted) {
    LOCK(cs_main);
    const CBlockIndex* pindex = LookupBlockIndex(blockptr->GetHash());
    if (!new_block && pindex && pindex->IsValid()) {
In commit "Make ProcessNewBlock return a future instead of an immediate bool" (79e6e44)
This seems fine, but just to make sure I understand correctly, there is a change in behavior here? Previously accepted would only be true if CheckBlock and AcceptBlock and ActivateBestChain all succeeded, but now it can be true even if they fail?
@@ -175,7 +175,7 @@ BOOST_FIXTURE_TEST_CASE(blockfilter_index_initial_sync, TestChain100Setup)
uint256 chainA_last_header = last_header;
for (size_t i = 0; i < 2; i++) {
    const auto& block = chainA[i];
    BOOST_REQUIRE(ProcessNewBlock(Params(), block, true, nullptr));
In commit "Make ProcessNewBlock return a future instead of an immediate bool" (79e6e44)
In the commit message about these tests you wrote "Note that this makes some of the unit tests less restrictive." But I don't think there's a good justification for doing this. If you just declared struct ProcessNewBlockResult { bool error = false; bool already_have = false; }; and returned std::future<ProcessNewBlockResult>, the tests could behave the same and the code would be more readable and less fragile, without the need for people to guess from context whether the bool indicates success or newness.
Think I agree with this suggestion; dropping the fNewBlock result at the same time as turning the return into a future seems unnecessarily complicated.
Just want to note that all my comments above are just suggested cleanups. Feel free to ignore them if they don't make sense or aren't worth the effort to implement. The only thing I'd really like to see are more comments about locking. When a lock is held in a small scope for a specific purpose, I don't think there's a need for a comment, but when a lock is held over a wide scope for no obvious reason it's good to have a comment that suggests why it needs to be acquired there and when it is safe to release.
Working on rewriting this to make it simpler, but it looks like it may end up growing, so dunno about it.
This is essentially a rebase of #12934, though it's actually greenfield to reduce the diff a bit. Based on #16174, as without it the parallelism would be almost entirely useless. Note that this is a draft as I haven't (yet) bothered doing the backgrounding of the block validation; it's currently just a return value change plus the net_processing changes to handle it.