init: change shutdown order of load block thread and scheduler #30435

mzumsande · 2024-07-12T05:21:12Z

This avoids situations during a reindex, in which the shutdown doesn't finish since LimitValidationInterfaceQueue() is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in feature_reindex.py (#30424), which I could locally reproduce with

diff --git a/src/validation.cpp b/src/validation.cpp
index 74f0e4975c..be1706fdaf 100644
--- a/src/validation.cpp
+++ b/src/validation.cpp
@@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL
     AssertLockNotHeld(cs_main);
 
     if (signals.CallbacksPending() > 10) {
+        std::this_thread::sleep_for(std::chrono::milliseconds(50));
         signals.SyncWithValidationInterfaceQueue();
     }
 }

It has also been reported by users running reindex-chainstate (#23234).

I thought for a bit about potential downsides of changing this order, but couldn't find any.

Fixes #30424
Fixes #23234
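
For readers who want to see the failure mode in isolation, here is a minimal sketch. It assumes a simplified single-threaded queue; `MiniScheduler`, `SyncWithQueue` and the demo function are hypothetical names, not Bitcoin Core code. A timeout is added only so the demo terminates; the real `SyncWithValidationInterfaceQueue` waits without one.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the scheduler: one service thread draining a queue.
class MiniScheduler
{
public:
    MiniScheduler() : m_thread([this] { Run(); }) {}
    ~MiniScheduler() { Stop(); }

    void Post(std::function<void()> fn)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push_back(std::move(fn));
        }
        m_cv.notify_one();
    }

    // Stop the service thread; callbacks still queued are left unprocessed.
    void Stop()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stopped = true;
        }
        m_cv.notify_all();
        if (m_thread.joinable()) m_thread.join();
    }

private:
    void Run()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (true) {
            m_cv.wait(lock, [this] { return m_stopped || !m_queue.empty(); });
            if (m_stopped) return;
            auto fn = std::move(m_queue.front());
            m_queue.pop_front();
            lock.unlock();
            fn();
            lock.lock();
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::function<void()>> m_queue;
    bool m_stopped{false};
    std::thread m_thread; // declared last so the other members exist before Run() starts
};

// Queue a callback that fulfils a promise, then wait for it.
// Returns true if the callback ran before the timeout expired.
bool SyncWithQueue(MiniScheduler& scheduler, std::chrono::milliseconds timeout)
{
    auto promise = std::make_shared<std::promise<void>>();
    std::future<void> done = promise->get_future();
    scheduler.Post([promise] { promise->set_value(); });
    return done.wait_for(timeout) == std::future_status::ready;
}

// Usage: the same call succeeds while the scheduler runs and times out after Stop().
void DemoStoppedSchedulerNeverAnswers()
{
    using namespace std::chrono_literals;
    MiniScheduler scheduler;
    assert(SyncWithQueue(scheduler, 2s));     // serviced by the running thread
    scheduler.Stop();
    assert(!SyncWithQueue(scheduler, 100ms)); // nobody is left to run the callback
}
```

Without the timeout, the second call would never return, which is the hang this change avoids.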

DrahtBot · 2024-07-12T05:21:15Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	maflcko, hebasto, tdb3, BrandonOdiwuor
Stale ACK	TheCharlatan

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#29432 (Stratum v2 Template Provider (take 3) by Sjors)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

TheCharlatan

ACK e427fed

maflcko

lgtm ACK e427fed

But I think the comment can be improved to clarify that this is done in symmetry with the other threads (rpc, p2p, ...)

maflcko · 2024-07-12T06:59:29Z

src/init.cpp

-    // scheduler and load block thread.
-    if (node.scheduler) node.scheduler->stop();
+    // load block thread and the scheduler. Load block is stopped first, since it can
+    // call LimitValidationInterfaceQueue() which would block indefinitely without the scheduler.


I don't think this deadlock is limited to LimitValidationInterfaceQueue, but affects any thread that may call SyncWithValidationInterfaceQueue.

So before the scheduler is stopped, all threads in the context of rpc, p2p, and indexes must have been stopped. Also, chainman's restart_indexes and ActivateBestChain (ABC) must not be called afterwards. In the code below this line, chainman should only flush the chainstate, which should be fine, as it does not call SyncWithValidationInterfaceQueue.

(IPC must also be stopped, e.g. waitForNotificationsIfTipChanged in src/node/interfaces.cpp calls SyncWithValidationInterfaceQueue, but I think IPC stuff is unrelated to the changes here and can be done in a follow-up)

True. I changed the wording of the comment / commit message, referring to SyncWithValidationInterfaceQueue instead of LimitValidationInterfaceQueue.
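
The ordering rule discussed in this thread can be sketched as follows. This is an illustration under simplifying assumptions, not the code in src/init.cpp: `MiniScheduler`, `SyncWithQueue` and `ShutdownWorkerFirst` are hypothetical names, and the worker thread merely stands in for the load block thread.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the scheduler: one service thread draining a queue.
class MiniScheduler
{
public:
    MiniScheduler() : m_thread([this] { Run(); }) {}
    ~MiniScheduler() { Stop(); }

    void Post(std::function<void()> fn)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push_back(std::move(fn));
        }
        m_cv.notify_one();
    }

    void Stop()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stopped = true;
        }
        m_cv.notify_all();
        if (m_thread.joinable()) m_thread.join();
    }

private:
    void Run()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (true) {
            m_cv.wait(lock, [this] { return m_stopped || !m_queue.empty(); });
            if (m_stopped) return;
            auto fn = std::move(m_queue.front());
            m_queue.pop_front();
            lock.unlock();
            fn();
            lock.lock();
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::function<void()>> m_queue;
    bool m_stopped{false};
    std::thread m_thread; // declared last so the other members exist before Run() starts
};

// Blocks until the scheduler has run the queued callback (no timeout, no interrupt check).
void SyncWithQueue(MiniScheduler& scheduler)
{
    auto promise = std::make_shared<std::promise<void>>();
    std::future<void> done = promise->get_future();
    scheduler.Post([promise] { promise->set_value(); });
    done.wait();
}

// Shut down in the order argued for above: every thread that may block on the
// queue is joined first, and only then is the scheduler stopped.
int ShutdownWorkerFirst()
{
    MiniScheduler scheduler;
    std::atomic<bool> interrupt{false};
    std::atomic<int> rounds{0};
    std::thread load_block([&] {
        do {
            SyncWithQueue(scheduler); // safe: the scheduler is still running
            ++rounds;
        } while (!interrupt);
    });
    interrupt = true;
    load_block.join();  // 1. stop the thread that can call the blocking sync
    scheduler.Stop();   // 2. stop the scheduler once nobody can wait on it
    assert(rounds >= 1);
    return rounds.load();
}
```

Swapping steps 1 and 2 in this sketch would let the worker block forever inside `SyncWithQueue`, because the wait has no interrupt check of its own.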

maflcko · 2024-07-12T07:09:19Z

I guess it should be backported to 27.x? (My understanding is that this existed "forever", since 0.16.x, because SyncWithValidationInterfaceQueue never had a boost interruption point, or other interrupt check)

hebasto

ACK e427fed, the change looks correct and it indeed fixes the issue.

This avoids situations during a reindex in which shutdown doesn't finish since SyncWithValidationInterfaceQueue is called by the load block thread when the scheduler is already stopped.

mzumsande · 2024-07-12T15:53:36Z

e427fed to 5fd4836: reworded comment

> I guess it should be backported to 27.x? (My understanding is that this existed "forever", since 0.16.x, because SyncWithValidationInterfaceQueue never had a boost interruption point, or other interrupt check)

Yes, this has existed for a long time; #23234 was opened in 2021.

maflcko · 2024-07-12T15:53:57Z

review ACK 5fd4836

furszy

q: what about disallowing the blocking wait after stopping the scheduler? Maybe only in debug mode, e.g. by implementing an isActive method in the task runner and calling it prior to creating the promise. It would help us catch these types of errors (if we still have them).

mzumsande · 2024-07-12T20:41:34Z

> It would help us catch these types of errors (if we still have them).

Do you mean throwing an assert instead of blocking indefinitely? That might be more convenient than hanging, but I'm not sure it makes much of a difference in practice, because either failure should be easy to recognize for both users and tests.

An alternative approach would be to allow it: if we are in shutdown mode, just return instead of waiting forever, and rely on a later FlushBackgroundCallbacks() call from the shutdown thread to clean everything up. Something like this has been proposed in #23234 (comment). I'm not sure I prefer it to this approach: even though it would make the shutdown code less brittle, it doesn't seem ideal to me if callers cannot be sure that SyncWithValidationInterfaceQueue always does what it's supposed to do.

hebasto

re-ACK 5fd4836.

tdb3

ACK 5fd4836
Nice work.
Had to use a higher sleep than 50ms to reproduce the error on my local machine (used 150ms).

furszy

> It would help us catch these types of errors (if we still have them).

> Do you mean throwing an assert instead of blocking indefinitely? That might be more convenient than hanging indefinitely, but I'm not sure if this really makes much of a difference in practice, because both ways should be easily recognizable both by users and tests.

Yes, check if the scheduler is processing events and if not: log something + throw an error. I'm unsure whether regular users can detect and report which thread hung here.

> An alternative approach would be to allow it - just return instead of waiting forever, if we are in Shutdown mode and rely on a later FlushBackgroundCallbacks() call from the shutdown thread cleaning everything up. Something like this has been proposed in #23234 (comment). Not sure if I prefer it to this approach - even though it would make the Shutdown code less brittle, it doesn't seem ideal to me if callers cannot be sure that SyncWithValidationInterfaceQueue always does what it's supposed to do.

I'm not a fan of that approach either. I would prefer to actually do what the function is supposed to do (wait for an empty events queue) and actively process events in the caller thread, if we decide to go down this route:

void ValidationSignals::SyncWithValidationInterfaceQueue()
{
    AssertLockNotHeld(cs_main);
    if (m_internals->m_task_runner->CanProcessEvents()) {
        // Block until the validation queue drains
        std::promise<void> promise;
        CallFunctionInValidationInterfaceQueue([&promise] {
            promise.set_value();
        });
        promise.get_future().wait();
    } else {
        // Process all remaining events in this thread
        FlushBackgroundCallbacks();
    }
}
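
A self-contained sketch of the same idea, for experimentation outside the codebase. `MiniTaskRunner`, `SyncOrFlush` and the demo function are hypothetical names; only `CanProcessEvents` echoes the method proposed in the snippet above, and nothing here is Bitcoin Core's actual task runner.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical task runner with one service thread.
class MiniTaskRunner
{
public:
    MiniTaskRunner() : m_thread([this] { Run(); }) {}
    ~MiniTaskRunner() { Stop(); }

    void Post(std::function<void()> fn)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push_back(std::move(fn));
        }
        m_cv.notify_one();
    }
    void Stop()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stopped = true;
        }
        m_cv.notify_all();
        if (m_thread.joinable()) m_thread.join();
    }
    bool CanProcessEvents()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return !m_stopped;
    }
    // Drain the queue in the calling thread.
    void Flush()
    {
        while (true) {
            std::function<void()> fn;
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                if (m_queue.empty()) return;
                fn = std::move(m_queue.front());
                m_queue.pop_front();
            }
            fn();
        }
    }

private:
    void Run()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (true) {
            m_cv.wait(lock, [this] { return m_stopped || !m_queue.empty(); });
            if (m_stopped) return;
            auto fn = std::move(m_queue.front());
            m_queue.pop_front();
            lock.unlock();
            fn();
            lock.lock();
        }
    }
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::function<void()>> m_queue;
    bool m_stopped{false};
    std::thread m_thread; // declared last so the other members exist before Run() starts
};

// Wait for the service thread if it is running, otherwise drain the queue here.
void SyncOrFlush(MiniTaskRunner& runner)
{
    if (runner.CanProcessEvents()) {
        auto promise = std::make_shared<std::promise<void>>();
        std::future<void> done = promise->get_future();
        runner.Post([promise] { promise->set_value(); });
        done.wait();
    } else {
        runner.Flush();
    }
}

void DemoSyncAfterStop()
{
    MiniTaskRunner runner;
    int handled{0};
    runner.Post([&handled] { ++handled; });
    SyncOrFlush(runner); // the service thread runs the callback
    assert(handled == 1);
    runner.Stop();
    runner.Post([&handled] { ++handled; });
    SyncOrFlush(runner); // the calling thread runs the callback
    assert(handled == 2);
}
```

One caveat of this sketch: the check and the wait are separate steps, so if the runner stops after `CanProcessEvents()` returns true but before the promise callback runs, the caller would still block. A real implementation would probably need to close that window.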

BrandonOdiwuor

Code Review ACK 5fd4836

This avoids situations during a reindex in which shutdown doesn't finish since SyncWithValidationInterfaceQueue is called by the load block thread when the scheduler is already stopped.

Github-Pull: bitcoin#30435
Rebased-From: 5fd4836

fanquake · 2024-07-17T10:32:19Z

Backported to 27.x in #30467.

4f23c86 [WIP] doc: update release notes for 27.x (fanquake)
54bb9b0 test: add test for modififed walletprocesspsbt calls (willcl-ark)
f22b9ca wallet: fix FillPSBT errantly showing as complete (willcl-ark)
05192ba init: change shutdown order of load block thread and scheduler (Martin Zumsande)
ab42206 Reapply "test: p2p: check that connecting to ourself leads to disconnect" (Sebastian Falbesoner)
064f214 net: prevent sending messages in `NetEventsInterface::InitializeNode` (Sebastian Falbesoner)
0933cf5 net: fix race condition in self-connect detection (Sebastian Falbesoner)
fa90989 psbt: Check non witness utxo outpoint early (Ava Chow)

Pull request description:

Backports:
* #29855
* #30357
* #30394 (modified test commit)
* #30435

ACKs for top commit:
stickies-v: ACK 4f23c86
willcl-ark: ACK 4f23c86

Tree-SHA512: 5c26445f0855f9d14890369ce19873b0686804eeb659e7d6da36a6f404f64d019436e1e6471579eaa60a96ebf8f64311883b4aef9d0ed528a95bd610c101c079

…k thread and scheduler

5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande)

Pull request description:

This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with

```diff
diff --git a/src/validation.cpp b/src/validation.cpp
index 74f0e49..be1706fdaf 100644
--- a/src/validation.cpp
+++ b/src/validation.cpp
@@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL
     AssertLockNotHeld(cs_main);
 
     if (signals.CallbacksPending() > 10) {
+        std::this_thread::sleep_for(std::chrono::milliseconds(50));
         signals.SyncWithValidationInterfaceQueue();
     }
 }
```

It has also been reported by users running `reindex-chainstate` (#23234).

I thought for a bit about potential downsides of changing this order, but couldn't find any.

Fixes #30424
Fixes #23234

ACKs for top commit:
maflcko: review ACK 5fd4836
hebasto: re-ACK 5fd4836.
tdb3: ACK 5fd4836
BrandonOdiwuor: Code Review ACK 5fd4836

Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36

TheCharlatan approved these changes Jul 12, 2024

maflcko reviewed Jul 12, 2024

hebasto approved these changes Jul 12, 2024

hebasto mentioned this pull request Jul 12, 2024

cmake: Regression in feature_reindex.py when configuring with -DSANITIZERS=integer hebasto/bitcoin#261

Closed

fanquake added the Needs backport (27.x) label Jul 12, 2024

init: change shutdown order of load block thread and scheduler

5fd4836

This avoids situations during a reindex in which shutdown doesn't finish since SyncWithValidationInterfaceQueue is called by the load block thread when the scheduler is already stopped.

mzumsande force-pushed the 202407_shutdown_order branch from e427fed to 5fd4836 Compare July 12, 2024 15:49

DrahtBot requested review from hebasto and TheCharlatan July 12, 2024 15:54

furszy reviewed Jul 12, 2024

hebasto approved these changes Jul 12, 2024

DrahtBot mentioned this pull request Jul 13, 2024

Stratum v2 Template Provider (take 3) #29432

Closed

24 tasks

tdb3 approved these changes Jul 13, 2024

hebasto mentioned this pull request Jul 14, 2024

cmake: Regular rebasing of the cmake-staging branch hebasto/bitcoin#264

Closed

furszy reviewed Jul 15, 2024

BrandonOdiwuor approved these changes Jul 16, 2024

fanquake merged commit 1d24d38 into bitcoin:master Jul 16, 2024

mzumsande deleted the 202407_shutdown_order branch July 16, 2024 17:16

fanquake mentioned this pull request Jul 17, 2024

[27.x] More backports #30467

Merged

fanquake removed the Needs backport (27.x) label Jul 17, 2024

bitcoin locked and limited conversation to collaborators Jul 18, 2025
