-
Notifications
You must be signed in to change notification settings - Fork 37.7k
init: change shutdown order of load block thread and scheduler #30435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK e427fed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm ACK e427fed
But I think the comment can be improved to clarify that this is done in symmetry with the other threads (rpc, p2p, ...)
src/init.cpp
Outdated
// scheduler and load block thread. | ||
if (node.scheduler) node.scheduler->stop(); | ||
// load block thread and the scheduler. Load block is stopped first, since it can | ||
// call LimitValidationInterfaceQueue() which would block indefinitely without the scheduler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this deadlock is limited to LimitValidationInterfaceQueue
, but affects any thread that may call SyncWithValidationInterfaceQueue
.
So before the scheduler is stopped, all threads in the context of rpc, p2p, indexes must have been stopped. Also chainmans' restart_indexes
and ABC
must not be called afterwards. In the code below this line, chainman should only flush the chainstate, which should be fine, as it does not call SyncWithValidationInterfaceQueue
.
(IPC must also be stopped, e.g. src/node/interfaces.cpp- void waitForNotificationsIfTipChanged
calls SyncWithValidationInterfaceQueue
, but I think IPC stuff is unrelated to the changes here and can be done in a follow-up)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. I changed the wording of the comment / commit message, referring to SyncWithValidationInterfaceQueue
instead of LimitValidationInterfaceQueue
.
I guess it should be backported to 27.x? (My understanding is that this existed "forever", since 0.16.x, because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This avoids situations during a reindex in which shutdown doesn't finish since SyncWithValidationInterfaceQueue is called by the load block thread when the scheduler is already stopped.
e427fed
to
5fd4836
Compare
e427fed to 5fd4836: reworded comment
yes, this has existed for a long time, #23234 was opened in 2021. |
review ACK 5fd4836 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
q: what about disallowing the blocking-wait after stopping the scheduler? maybe only on debug mode. e.g. implementing an isActive
method in the task runner and calling it prior to creating the promise. It would help us catch these type of errors (if we still have them).
Do you mean throwing an assert instead of blocking indefinitely? That might be more convenient than hanging indefinitely, but I'm not sure if this really makes much of a difference in practice, because both ways should be easily recognizable both by users and tests. An alternative approach would be to allow it - just return instead of waiting forever, if we are in Shutdown mode and rely on a later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-ACK 5fd4836.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 5fd4836
Nice work.
Had to use a higher sleep than 50ms to reproduce the error on my local machine (used 150ms).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would help us catch these type of errors (if we still have them).
Do you mean throwing an assert instead of blocking indefinitely? That might be more convenient than hanging indefinitely, but I'm not sure if this really makes much of a difference in practice, because both ways should be easily recognizable both by users and tests.
Yes, check if the scheduler is processing events and if not: log something + throw an error. I'm unsure regular users can detect and report which thread hanged here.
An alternative approach would be to allow it - just return instead of waiting forever, if we are in Shutdown mode and rely on a later
FlushBackgroundCallbacks()
call from the shutdown thread cleaning everything up. Something like this has been proposed in #23234 (comment) . Not sure if I prefer it to this approach - even though it would make the Shutdown code less brittle, it doesn't seem ideal to me if callers cannot be sure thatSyncWithValidationInterfaceQueue
always does what it's supposed to do.
I'm not fan of that approach neither. Would prefer to actually do what the function is suppose to be doing (wait for an empty events queue) and actively process events in the caller thread - if we decide to go down this route -:
void ValidationSignals::SyncWithValidationInterfaceQueue()
{
AssertLockNotHeld(cs_main);
if (m_internals->m_task_runner->CanProcessEvents()) {
// Block until the validation queue drains
std::promise<void> promise;
CallFunctionInValidationInterfaceQueue([&promise] {
promise.set_value();
});
promise.get_future().wait();
} else {
// Process all remaining events in this thread
FlushBackgroundCallbacks();
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review ACK 5fd4836
This avoids situations during a reindex in which shutdown doesn't finish since SyncWithValidationInterfaceQueue is called by the load block thread when the scheduler is already stopped. Github-Pull: bitcoin#30435 Rebased-From: 5fd4836
This avoids situations during a reindex in which shutdown doesn't finish since SyncWithValidationInterfaceQueue is called by the load block thread when the scheduler is already stopped. Github-Pull: bitcoin#30435 Rebased-From: 5fd4836
Backported to 27.x in #30467. |
4f23c86 [WIP] doc: update release notes for 27.x (fanquake) 54bb9b0 test: add test for modififed walletprocesspsbt calls (willcl-ark) f22b9ca wallet: fix FillPSBT errantly showing as complete (willcl-ark) 05192ba init: change shutdown order of load block thread and scheduler (Martin Zumsande) ab42206 Reapply "test: p2p: check that connecting to ourself leads to disconnect" (Sebastian Falbesoner) 064f214 net: prevent sending messages in `NetEventsInterface::InitializeNode` (Sebastian Falbesoner) 0933cf5 net: fix race condition in self-connect detection (Sebastian Falbesoner) fa90989 psbt: Check non witness utxo outpoint early (Ava Chow) Pull request description: Backports: * #29855 * #30357 * #30394 (modified test commit) * #30435 ACKs for top commit: stickies-v: ACK 4f23c86 willcl-ark: ACK 4f23c86 Tree-SHA512: 5c26445f0855f9d14890369ce19873b0686804eeb659e7d6da36a6f404f64d019436e1e6471579eaa60a96ebf8f64311883b4aef9d0ed528a95bd610c101c079
This avoids situations during a reindex in which shutdown doesn't finish since SyncWithValidationInterfaceQueue is called by the load block thread when the scheduler is already stopped. Github-Pull: bitcoin#30435 Rebased-From: 5fd4836
…k thread and scheduler 5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande) Pull request description: This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with ```diff diff --git a/src/validation.cpp b/src/validation.cpp index 74f0e49..be1706fdaf 100644 --- a/src/validation.cpp +++ b/src/validation.cpp @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL AssertLockNotHeld(cs_main); if (signals.CallbacksPending() > 10) { + std::this_thread::sleep_for(std::chrono::milliseconds(50)); signals.SyncWithValidationInterfaceQueue(); } } ``` It has also been reported by users running `reindex-chainstate` (#23234). I thought for a bit about potential downsides of changing this order, but couldn't find any. Fixes #30424 Fixes #23234 ACKs for top commit: maflcko: review ACK 5fd4836 hebasto: re-ACK 5fd4836. tdb3: ACK 5fd4836 BrandonOdiwuor: Code Review ACK 5fd4836 Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
…k thread and scheduler 5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande) Pull request description: This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with ```diff diff --git a/src/validation.cpp b/src/validation.cpp index 74f0e49..be1706fdaf 100644 --- a/src/validation.cpp +++ b/src/validation.cpp @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL AssertLockNotHeld(cs_main); if (signals.CallbacksPending() > 10) { + std::this_thread::sleep_for(std::chrono::milliseconds(50)); signals.SyncWithValidationInterfaceQueue(); } } ``` It has also been reported by users running `reindex-chainstate` (#23234). I thought for a bit about potential downsides of changing this order, but couldn't find any. Fixes #30424 Fixes #23234 ACKs for top commit: maflcko: review ACK 5fd4836 hebasto: re-ACK 5fd4836. tdb3: ACK 5fd4836 BrandonOdiwuor: Code Review ACK 5fd4836 Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
…k thread and scheduler 5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande) Pull request description: This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with ```diff diff --git a/src/validation.cpp b/src/validation.cpp index 74f0e49..be1706fdaf 100644 --- a/src/validation.cpp +++ b/src/validation.cpp @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL AssertLockNotHeld(cs_main); if (signals.CallbacksPending() > 10) { + std::this_thread::sleep_for(std::chrono::milliseconds(50)); signals.SyncWithValidationInterfaceQueue(); } } ``` It has also been reported by users running `reindex-chainstate` (#23234). I thought for a bit about potential downsides of changing this order, but couldn't find any. Fixes #30424 Fixes #23234 ACKs for top commit: maflcko: review ACK 5fd4836 hebasto: re-ACK 5fd4836. tdb3: ACK 5fd4836 BrandonOdiwuor: Code Review ACK 5fd4836 Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
…k thread and scheduler 5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande) Pull request description: This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with ```diff diff --git a/src/validation.cpp b/src/validation.cpp index 74f0e49..be1706fdaf 100644 --- a/src/validation.cpp +++ b/src/validation.cpp @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL AssertLockNotHeld(cs_main); if (signals.CallbacksPending() > 10) { + std::this_thread::sleep_for(std::chrono::milliseconds(50)); signals.SyncWithValidationInterfaceQueue(); } } ``` It has also been reported by users running `reindex-chainstate` (#23234). I thought for a bit about potential downsides of changing this order, but couldn't find any. Fixes #30424 Fixes #23234 ACKs for top commit: maflcko: review ACK 5fd4836 hebasto: re-ACK 5fd4836. tdb3: ACK 5fd4836 BrandonOdiwuor: Code Review ACK 5fd4836 Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
…k thread and scheduler 5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande) Pull request description: This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with ```diff diff --git a/src/validation.cpp b/src/validation.cpp index 74f0e49..be1706fdaf 100644 --- a/src/validation.cpp +++ b/src/validation.cpp @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL AssertLockNotHeld(cs_main); if (signals.CallbacksPending() > 10) { + std::this_thread::sleep_for(std::chrono::milliseconds(50)); signals.SyncWithValidationInterfaceQueue(); } } ``` It has also been reported by users running `reindex-chainstate` (#23234). I thought for a bit about potential downsides of changing this order, but couldn't find any. Fixes #30424 Fixes #23234 ACKs for top commit: maflcko: review ACK 5fd4836 hebasto: re-ACK 5fd4836. tdb3: ACK 5fd4836 BrandonOdiwuor: Code Review ACK 5fd4836 Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
This avoids situations during a reindex, in which the shutdown doesn't finish since
LimitValidationInterfaceQueue()
is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures infeature_reindex.py
(#30424), which I could locally reproduce withIt has also been reported by users running
reindex-chainstate
(#23234).I thought for a bit about potential downsides of changing this order, but couldn't find any.
Fixes #30424
Fixes #23234