Skip to content

Conversation

mzumsande
Copy link
Contributor

This avoids situations during a reindex, in which the shutdown doesn't finish since LimitValidationInterfaceQueue() is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in feature_reindex.py (#30424), which I could locally reproduce with

diff --git a/src/validation.cpp b/src/validation.cpp
index 74f0e4975c..be1706fdaf 100644
--- a/src/validation.cpp
+++ b/src/validation.cpp
@@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL
     AssertLockNotHeld(cs_main);
 
     if (signals.CallbacksPending() > 10) {
+        std::this_thread::sleep_for(std::chrono::milliseconds(50));
         signals.SyncWithValidationInterfaceQueue();
     }
 }

It has also been reported by users running reindex-chainstate (#23234).

I thought for a bit about potential downsides of changing this order, but couldn't find any.

Fixes #30424
Fixes #23234

@DrahtBot
Copy link
Contributor

DrahtBot commented Jul 12, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK maflcko, hebasto, tdb3, BrandonOdiwuor
Stale ACK TheCharlatan

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #29432 (Stratum v2 Template Provider (take 3) by Sjors)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Contributor

@TheCharlatan TheCharlatan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK e427fed

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ACK e427fed

But I think the comment can be improved to clarify that this is done in symmetry with the other threads (rpc, p2p, ...)

src/init.cpp Outdated
// scheduler and load block thread.
if (node.scheduler) node.scheduler->stop();
// load block thread and the scheduler. Load block is stopped first, since it can
// call LimitValidationInterfaceQueue() which would block indefinitely without the scheduler.
Copy link
Member

@maflcko maflcko Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this deadlock is limited to LimitValidationInterfaceQueue, but affects any thread that may call SyncWithValidationInterfaceQueue.

So before the scheduler is stopped, all threads in the context of rpc, p2p, indexes must have been stopped. Also chainmans' restart_indexes and ABC must not be called afterwards. In the code below this line, chainman should only flush the chainstate, which should be fine, as it does not call SyncWithValidationInterfaceQueue.

(IPC must also be stopped, e.g. src/node/interfaces.cpp- void waitForNotificationsIfTipChanged calls SyncWithValidationInterfaceQueue, but I think IPC stuff is unrelated to the changes here and can be done in a follow-up)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I changed the wording of the comment / commit message, referring to SyncWithValidationInterfaceQueue instead of LimitValidationInterfaceQueue.

@maflcko
Copy link
Member

maflcko commented Jul 12, 2024

I guess it should be backported to 27.x? (My understanding is that this existed "forever", since 0.16.x, because SyncWithValidationInterfaceQueue never had a boost interruption point, or other interrupt check)

Copy link
Member

@hebasto hebasto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK e427fed, the change looks correct and it indeed fixes the issue.

This avoids situations during a reindex in which shutdown
doesn't finish since SyncWithValidationInterfaceQueue is
called by the load block thread when the scheduler is already stopped.
@mzumsande mzumsande force-pushed the 202407_shutdown_order branch from e427fed to 5fd4836 Compare July 12, 2024 15:49
@mzumsande
Copy link
Contributor Author

e427fed to 5fd4836: reworded comment

I guess it should be backported to 27.x? (My understanding is that this existed "forever", since 0.16.x, because SyncWithValidationInterfaceQueue never had a boost interruption point, or other interrupt check)

yes, this has existed for a long time, #23234 was opened in 2021.

@maflcko
Copy link
Member

maflcko commented Jul 12, 2024

review ACK 5fd4836

@DrahtBot DrahtBot requested review from hebasto and TheCharlatan July 12, 2024 15:54
Copy link
Member

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: what about disallowing the blocking-wait after stopping the scheduler? maybe only on debug mode. e.g. implementing an isActive method in the task runner and calling it prior to creating the promise. It would help us catch these type of errors (if we still have them).

@mzumsande
Copy link
Contributor Author

It would help us catch these type of errors (if we still have them).

Do you mean throwing an assert instead of blocking indefinitely? That might be more convenient than hanging indefinitely, but I'm not sure if this really makes much of a difference in practice, because both ways should be easily recognizable both by users and tests.

An alternative approach would be to allow it - just return instead of waiting forever, if we are in Shutdown mode and rely on a later FlushBackgroundCallbacks() call from the shutdown thread cleaning everything up. Something like this has been proposed in #23234 (comment) . Not sure if I prefer it to this approach - even though it would make the Shutdown code less brittle, it doesn't seem ideal to me if callers cannot be sure that SyncWithValidationInterfaceQueue always does what it's supposed to do.

Copy link
Member

@hebasto hebasto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-ACK 5fd4836.

Copy link
Contributor

@tdb3 tdb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 5fd4836
Nice work.
Had to use a higher sleep than 50ms to reproduce the error on my local machine (used 150ms).

Copy link
Member

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help us catch these type of errors (if we still have them).

Do you mean throwing an assert instead of blocking indefinitely? That might be more convenient than hanging indefinitely, but I'm not sure if this really makes much of a difference in practice, because both ways should be easily recognizable both by users and tests.

Yes, check if the scheduler is processing events and if not: log something + throw an error. I'm unsure regular users can detect and report which thread hanged here.

An alternative approach would be to allow it - just return instead of waiting forever, if we are in Shutdown mode and rely on a later FlushBackgroundCallbacks() call from the shutdown thread cleaning everything up. Something like this has been proposed in #23234 (comment) . Not sure if I prefer it to this approach - even though it would make the Shutdown code less brittle, it doesn't seem ideal to me if callers cannot be sure that SyncWithValidationInterfaceQueue always does what it's supposed to do.

I'm not fan of that approach neither. Would prefer to actually do what the function is suppose to be doing (wait for an empty events queue) and actively process events in the caller thread - if we decide to go down this route -:

void ValidationSignals::SyncWithValidationInterfaceQueue()
{
    AssertLockNotHeld(cs_main);
    if (m_internals->m_task_runner->CanProcessEvents()) {
        // Block until the validation queue drains
        std::promise<void> promise;
        CallFunctionInValidationInterfaceQueue([&promise] {
            promise.set_value();
        });
        promise.get_future().wait();
    } else {
        // Process all remaining events in this thread
        FlushBackgroundCallbacks();
    }
}

Copy link
Contributor

@BrandonOdiwuor BrandonOdiwuor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review ACK 5fd4836

@fanquake fanquake merged commit 1d24d38 into bitcoin:master Jul 16, 2024
@mzumsande mzumsande deleted the 202407_shutdown_order branch July 16, 2024 17:16
fanquake pushed a commit to fanquake/bitcoin that referenced this pull request Jul 17, 2024
This avoids situations during a reindex in which shutdown
doesn't finish since SyncWithValidationInterfaceQueue is
called by the load block thread when the scheduler is already stopped.

Github-Pull: bitcoin#30435
Rebased-From: 5fd4836
fanquake pushed a commit to fanquake/bitcoin that referenced this pull request Jul 17, 2024
This avoids situations during a reindex in which shutdown
doesn't finish since SyncWithValidationInterfaceQueue is
called by the load block thread when the scheduler is already stopped.

Github-Pull: bitcoin#30435
Rebased-From: 5fd4836
@fanquake fanquake mentioned this pull request Jul 17, 2024
@fanquake
Copy link
Member

Backported to 27.x in #30467.

fanquake added a commit that referenced this pull request Jul 24, 2024
4f23c86 [WIP] doc: update release notes for 27.x (fanquake)
54bb9b0 test: add test for modififed walletprocesspsbt calls (willcl-ark)
f22b9ca wallet: fix FillPSBT errantly showing as complete (willcl-ark)
05192ba init: change shutdown order of load block thread and scheduler (Martin Zumsande)
ab42206 Reapply "test: p2p: check that connecting to ourself leads to disconnect" (Sebastian Falbesoner)
064f214 net: prevent sending messages in `NetEventsInterface::InitializeNode` (Sebastian Falbesoner)
0933cf5 net: fix race condition in self-connect detection (Sebastian Falbesoner)
fa90989 psbt: Check non witness utxo outpoint early (Ava Chow)

Pull request description:

  Backports:
  * #29855
  * #30357
  * #30394 (modified test commit)
  * #30435

ACKs for top commit:
  stickies-v:
    ACK 4f23c86
  willcl-ark:
    ACK 4f23c86

Tree-SHA512: 5c26445f0855f9d14890369ce19873b0686804eeb659e7d6da36a6f404f64d019436e1e6471579eaa60a96ebf8f64311883b4aef9d0ed528a95bd610c101c079
luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this pull request Aug 1, 2024
This avoids situations during a reindex in which shutdown
doesn't finish since SyncWithValidationInterfaceQueue is
called by the load block thread when the scheduler is already stopped.

Github-Pull: bitcoin#30435
Rebased-From: 5fd4836
tomt1664 pushed a commit to tomt1664/elements that referenced this pull request Apr 1, 2025
…k thread and scheduler

5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande)

Pull request description:

  This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with
  ```diff
  diff --git a/src/validation.cpp b/src/validation.cpp
  index 74f0e49..be1706fdaf 100644
  --- a/src/validation.cpp
  +++ b/src/validation.cpp
  @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL
       AssertLockNotHeld(cs_main);

       if (signals.CallbacksPending() > 10) {
  +        std::this_thread::sleep_for(std::chrono::milliseconds(50));
           signals.SyncWithValidationInterfaceQueue();
       }
   }
  ```
  It has also been reported by users running `reindex-chainstate` (#23234).

  I thought for a bit about potential downsides of changing this order, but couldn't find any.

  Fixes #30424
  Fixes #23234

ACKs for top commit:
  maflcko:
    review ACK 5fd4836
  hebasto:
    re-ACK 5fd4836.
  tdb3:
    ACK 5fd4836
  BrandonOdiwuor:
    Code Review ACK 5fd4836

Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
tomt1664 pushed a commit to tomt1664/elements that referenced this pull request Apr 21, 2025
…k thread and scheduler

5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande)

Pull request description:

  This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with
  ```diff
  diff --git a/src/validation.cpp b/src/validation.cpp
  index 74f0e49..be1706fdaf 100644
  --- a/src/validation.cpp
  +++ b/src/validation.cpp
  @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL
       AssertLockNotHeld(cs_main);

       if (signals.CallbacksPending() > 10) {
  +        std::this_thread::sleep_for(std::chrono::milliseconds(50));
           signals.SyncWithValidationInterfaceQueue();
       }
   }
  ```
  It has also been reported by users running `reindex-chainstate` (#23234).

  I thought for a bit about potential downsides of changing this order, but couldn't find any.

  Fixes #30424
  Fixes #23234

ACKs for top commit:
  maflcko:
    review ACK 5fd4836
  hebasto:
    re-ACK 5fd4836.
  tdb3:
    ACK 5fd4836
  BrandonOdiwuor:
    Code Review ACK 5fd4836

Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
tomt1664 pushed a commit to tomt1664/elements that referenced this pull request Apr 29, 2025
…k thread and scheduler

5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande)

Pull request description:

  This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with
  ```diff
  diff --git a/src/validation.cpp b/src/validation.cpp
  index 74f0e49..be1706fdaf 100644
  --- a/src/validation.cpp
  +++ b/src/validation.cpp
  @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL
       AssertLockNotHeld(cs_main);

       if (signals.CallbacksPending() > 10) {
  +        std::this_thread::sleep_for(std::chrono::milliseconds(50));
           signals.SyncWithValidationInterfaceQueue();
       }
   }
  ```
  It has also been reported by users running `reindex-chainstate` (#23234).

  I thought for a bit about potential downsides of changing this order, but couldn't find any.

  Fixes #30424
  Fixes #23234

ACKs for top commit:
  maflcko:
    review ACK 5fd4836
  hebasto:
    re-ACK 5fd4836.
  tdb3:
    ACK 5fd4836
  BrandonOdiwuor:
    Code Review ACK 5fd4836

Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
tomt1664 pushed a commit to tomt1664/elements that referenced this pull request Jun 16, 2025
…k thread and scheduler

5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande)

Pull request description:

  This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with
  ```diff
  diff --git a/src/validation.cpp b/src/validation.cpp
  index 74f0e49..be1706fdaf 100644
  --- a/src/validation.cpp
  +++ b/src/validation.cpp
  @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL
       AssertLockNotHeld(cs_main);

       if (signals.CallbacksPending() > 10) {
  +        std::this_thread::sleep_for(std::chrono::milliseconds(50));
           signals.SyncWithValidationInterfaceQueue();
       }
   }
  ```
  It has also been reported by users running `reindex-chainstate` (#23234).

  I thought for a bit about potential downsides of changing this order, but couldn't find any.

  Fixes #30424
  Fixes #23234

ACKs for top commit:
  maflcko:
    review ACK 5fd4836
  hebasto:
    re-ACK 5fd4836.
  tdb3:
    ACK 5fd4836
  BrandonOdiwuor:
    Code Review ACK 5fd4836

Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
tomt1664 pushed a commit to tomt1664/elements that referenced this pull request Jun 24, 2025
…k thread and scheduler

5fd4836 init: change shutdown order of load block thread and scheduler (Martin Zumsande)

Pull request description:

  This avoids situations during a reindex, in which the shutdown doesn't finish since `LimitValidationInterfaceQueue()` is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in `feature_reindex.py` (#30424), which I could locally reproduce with
  ```diff
  diff --git a/src/validation.cpp b/src/validation.cpp
  index 74f0e49..be1706fdaf 100644
  --- a/src/validation.cpp
  +++ b/src/validation.cpp
  @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL
       AssertLockNotHeld(cs_main);

       if (signals.CallbacksPending() > 10) {
  +        std::this_thread::sleep_for(std::chrono::milliseconds(50));
           signals.SyncWithValidationInterfaceQueue();
       }
   }
  ```
  It has also been reported by users running `reindex-chainstate` (#23234).

  I thought for a bit about potential downsides of changing this order, but couldn't find any.

  Fixes #30424
  Fixes #23234

ACKs for top commit:
  maflcko:
    review ACK 5fd4836
  hebasto:
    re-ACK 5fd4836.
  tdb3:
    ACK 5fd4836
  BrandonOdiwuor:
    Code Review ACK 5fd4836

Tree-SHA512: 3b8894e99551c5d4392b55eaa718eee05841a7287aeef2978699e1d633d5234399fa2f5a3e71eac1508d97845906bd33e0e63e5351855139e7be04c421359b36
@bitcoin bitcoin locked and limited conversation to collaborators Jul 18, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
9 participants