Skip to content

Conversation

promag
Copy link
Contributor

@promag promag commented Mar 6, 2020

From boost signal documentation at https://www.boost.org/doc/libs/1_72_0/doc/html/signals2/thread-safety.html:

When a signal is invoked by calling signal::operator(), the invocation first acquires a lock on the signal's mutex. Then it obtains a handle to the signal's slot list and combiner. Next it releases the signal's mutex, before invoking the combiner to iterate through the slot list.

This means that UnregisterValidationInterface doesn't prevent more calls to that interface.

This PR fixes the assumption that no validation calls happen after UnregisterValidationInterface in the following code

wallet->m_chain_notifications_handler.reset();
delete wallet;

The actual bug is that delete wallet races with the validation notifications.

The fix consists in synchronizing with the validation queue (which happens in BlockUntilSyncedToCurrentChain) after UnregisterValidationInterface.

Alternative to #18279. Fixes #16307.

@promag promag force-pushed the 2020-03-sync-unregistervalidationinterface branch 2 times, most recently from 8e8695d to 12374e6 Compare March 6, 2020 15:38
@promag promag force-pushed the 2020-03-sync-unregistervalidationinterface branch 3 times, most recently from c1ee308 to afe8d39 Compare March 7, 2020 02:35
@DrahtBot
Copy link
Contributor

DrahtBot commented Mar 7, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@ryanofsky
Copy link
Contributor

Alternative to #18279. Similar to #15700. More details in #18065. Possibly fixes #16307.

About to look into this but this PR description seems like it is going to take a lot of digging to understand.

Is it possible to summarize in a paragraph what this change is doing, what problem it fixes, and how it compares to the alternative? Thank you! 🙏

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to help other reviewers, this seems to make three changes:

  1. Instead of RegisterValidationInterface being non-blocking and beginning to send notifications right away, even if they preceded the RegisterValidationInterface call, it will now block waiting for any previously queued notifications to be drained, then attach the notification handler, then finally return.

  2. Instead of UnregisterValidationInterface being non-blocking and turning off notifications right away, it waits for any notifications that were queued prior to the UnregisterValidationInterface to be handled and blocks. Then it disconnects and returns after they were handled.

  3. Stops holding cs_main while sending the loadwallet notification.

Change 3 seems like a good change just because it reduces unnecessary locking, but it's not clear what motivated the change. Maybe it would fix the deadlock @fanquake alluded to in #16307 (comment), but it's not possible to to tell without a backtrace.

Change 1 at first glance seems like it more likely lead to missing notifications on startup, but maybe it's intended to prevent the wallet redundantly processing mempool transactions blocks that have just been rescanned? Unclear

Change 2 seems like it could fix missing notifications on shutdown, but I'm not sure where that's been reported. It also seems like it could slow down shutdown unnecessarily.

Note: I have a change which handles attaching notifications in a more deliberate nonracy way in #15719. Not sure if it overlaps with bugs that are being seen now, though. And it's nice to have more minimal bugfixes anyway.

@@ -3975,14 +3985,6 @@ std::shared_ptr<CWallet> CWallet::CreateWalletFromFile(interfaces::Chain& chain,
// Register with the validation interface. It's ok to do this after rescan since we're still holding locked_chain.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is no longer true, and seems like it means there can now be missing notifications. I wonder if a different approach would be to send the loadwallet notification before locking the chain & scanning instead of after?

@promag promag changed the title fix: Disconnect validation interface on the queue dispacher wip: fix: Disconnect validation interface on the queue dispacher Mar 10, 2020
@promag
Copy link
Contributor Author

promag commented Mar 10, 2020

Sorry @ryanofsky, should have tagged wip. Will address your comments and address each required change once this is good. Thanks for reviewing and your comments - will also revisit #15719.

@promag promag force-pushed the 2020-03-sync-unregistervalidationinterface branch from afe8d39 to 477a4b7 Compare March 10, 2020 16:22
@promag promag force-pushed the 2020-03-sync-unregistervalidationinterface branch from 477a4b7 to bba30cb Compare March 10, 2020 18:15
@promag promag changed the title wip: fix: Disconnect validation interface on the queue dispacher wallet: UnregisterValidationInterface before SyncWithValidationInterfaceQueue Mar 10, 2020
@promag
Copy link
Contributor Author

promag commented Mar 10, 2020

Updated OP. Reduced the scope to just fix #16307. Will submit other PRs with relevant changes from previous commits.

@promag promag force-pushed the 2020-03-sync-unregistervalidationinterface branch 5 times, most recently from 56db247 to 2a42fc1 Compare March 11, 2020 14:14
@promag
Copy link
Contributor Author

promag commented Mar 11, 2020

Updated to include dc62f3c which ensures that SyncWithValidationInterfaceQueue doesn't hang due to !AreThreadsServicingQueue.

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 2a42fc1. I left one comment below that would be nice to address. Also dc62f3c could use a better commit description. It's unclear when SyncWithValidationInterfaceQueue would hang now. Is there a shutdown race where wallets are outliving the scheduler? Or is this just a test workaround for cases where scheduler is not running?

Comment on lines 150 to 154
void SingleThreadedSchedulerClient::MaybeScheduleProcessQueue() {
if (!m_pscheduler->AreThreadsServicingQueue()) {
EmptyQueue();
return;
}
Copy link
Contributor

@ryanofsky ryanofsky Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "Handle no threads serving in MaybeScheduleProcessQueue" (dc62f3c)

I think it would make more sense to add this logic at the bottom of AddToProcessQueue (and avoid calling MaybeScheduleProcessQueue in no thread case) than having it at the top of MaybeScheduleProcessQueue and returning early.

Few reasons:

  • Having this logic doesn't seem useful in ProcessQueue, the other place where MaybeScheduleProcessQueue is called
  • Adding this code here creates recursion ProcessQueue -> MaybeScheduleProcessQueue -> EmptyQueue -> ProcessQueue
  • If code is added here the name MaybeScheduleProcessQueue is now misleading because this is running callbacks immediately, not scheduling them for later

@promag promag force-pushed the 2020-03-sync-unregistervalidationinterface branch from 2a42fc1 to c9ca6c4 Compare March 11, 2020 17:12
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK c9ca6c4. Just suggested scheduler changes since last review

@@ -196,6 +196,9 @@ void SingleThreadedSchedulerClient::AddToProcessQueue(std::function<void ()> fun
m_callbacks_pending.emplace_back(std::move(func));
}
MaybeScheduleProcessQueue();
if (!m_pscheduler->AreThreadsServicingQueue()) {
EmptyQueue();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "Handle no threads serving in MaybeScheduleProcessQueue" (201eebf)

Commit message subject is out of date since MaybeScheduleProcessQueue is no longer changing.

Also I'm surprised by "SyncWithValidationInterfaceQueue is called after
scheduler thread exists, which happen at shutdown" since I would hope wallets are getting unloaded before the scheduler stops, but maybe this is not the case currently. Probably beyond the scope of this PR to address, though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since I would hope wallets are getting unloaded before the scheduler stops

Not currently the case.

Probably beyond the scope of this PR to address, though

Agree, I've tried and it caused some other failures.

Commit message subject is out of date since MaybeScheduleProcessQueue is no longer changing.

Ops need to update.

@@ -196,6 +196,9 @@ void SingleThreadedSchedulerClient::AddToProcessQueue(std::function<void ()> fun
m_callbacks_pending.emplace_back(std::move(func));
}
MaybeScheduleProcessQueue();
if (!m_pscheduler->AreThreadsServicingQueue()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "Handle no threads serving in MaybeScheduleProcessQueue" (201eebf)

If we can make this code:

if (m_pscheduler->AreThreadsServicingQueue()) [
    MaybeScheduleProcessQueue();
} else {
    EmptyQueue();
}

it would be more obvious what's going on here.

Otherwise if the unconditional MaybeScheduleProcessQueue call is needed it would be good have a comment saying what it's needed for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be inside EmptyQueue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be inside EmptyQueue

I wouldn't change emptyqueue because it's called other places which don't need this logic, and I can't see how the behavior would fit in there (how you could coherently describe it or change the emptyqueue name to reflect it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. However if the scheduler thread is stopped after AreThreadsServicingQueue then the function won't be called right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test wallet_multiwallet fails due to this code path, see my changes in EmptyQueue, you should schedule process queue when event queue is not empty, the logic is reversed there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes much more sense compare to current version

void SingleThreadedSchedulerClient::EmptyQueue() {
    if (!m_pscheduler->AreThreadsServicingQueue())
        return;
    auto pendingCallbacks = [this]() -> bool {
        LOCK(m_cs_callbacks_pending);
        return !m_callbacks_pending.empty();
    };
    bool should_continue = pendingCallbacks();
    while (should_continue) {
        ProcessQueue();
        should_continue = pendingCallbacks();
    }
}
  1. You don't need to do stupid things, like here one,
    if (!m_pscheduler->AreThreadsServicingQueue())
  2. You don't need to check against is there pending callbacks it's done there, so if you don't have pending ones you don't need to enter eventually endless waiting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iv'e been trying to reproduce appveyor error wihtout any luck.

Copy link
Contributor

@ryanofsky ryanofsky Mar 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. You don't need to do stupid things, like here one,
    if (!m_pscheduler->AreThreadsServicingQueue())
  2. You don't need to check against is there pending callbacks it's done there, so if you don't have pending ones you don't need to enter eventually endless waiting.

@bvbfan, I'm confused by what you are saying here. You seem to be saying you don't need two things, but your code sample is doing both those things. Are you saying that for some reason it is better to do (1) in EmptyQueue instead of AddToProcessQueue? If so, what is the reason you think this? To me it seems better to do (1) in AddToProcessQueue instead of EmptyQueue, because AddToProcessQueue has to be aware of thread state while EmptyQueue does not.

I'm even more confused what you are saying about (2) because you're saying we don't need a check for pending callbacks but your code is actually adding an extra check (before the first loop iteration).

More importantly, I can't figure out what this suggestion has to do with wallet_multiwallet.py. If you can clarify, it would be much appreciated. I've also been trying to reproduce the failure there and look for possible causes and haven't been able to figure it out.

Copy link
Contributor

@bvbfan bvbfan Mar 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by what you are saying here. You seem to be saying you don't need two things, but your code sample is doing both those things.

I mean you don't need to do these things outside else, i was not clear. These 2 things should be done here in EmptyQueue

More importantly, I can't figure out what this suggestion has to do with wallet_multiwallet.py

if (m_pscheduler->AreThreadsServicingQueue()) [
    MaybeScheduleProcessQueue();
} else {
    EmptyQueue();
}

This code is wrong, you try to process / flush events but it does not do that. Let analyse current code

void SingleThreadedSchedulerClient::EmptyQueue() {
assert(!m_pscheduler->AreThreadsServicingQueue()); // <- assert (you can avoid that)
   bool should_continue = true;
   while (should_continue) {
       ProcessQueue(); // <--- here is the bigger mistake, it tries to process events before it's checked that has pending callbacks, you can enter this and wait to infinity (since app is shutting down) it should be checked for pendings before process queue
       LOCK(m_cs_callbacks_pending);
       should_continue = !m_callbacks_pending.empty();
   }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

       ProcessQueue(); // <--- here is the bigger mistake, it tries to process events before it's checked that has pending callbacks, you can enter this and wait to infinity (since app is shutting down) it should be checked for pendings before process queue

Explain this more? ProcessQueue doesn't wait for callbacks if there aren't any, it just runs a callbacks if there is one.

Iv'e been trying to reproduce appveyor error wihtout any luck.

I finally reproduced the problem cutting down the test and running in a loop with --usecli. Looking at the stack trace I see unloadwallet RPC thread waiting for scheduler thread MaybeCompactWalletDB call, which is stuck in ReleaseWallet calling SyncWithValidationInterfaceQueue, which is stuck because SyncWithValidationInterfaceQueue will always hang if called from the scheduler thread. This is basically the same deadlock that was in fanquake's backtrace from #16307 (comment), which suggests that 2a42fc1 and 2097e35 aren't going to work as fixes for this problem in their current form. Maybe more can be done to get them working but an alternate approach like 06d2b53 (branch, comment) might be easier at this point

promag added 2 commits March 11, 2020 19:06
This is necessary if SyncWithValidationInterfaceQueue is called after
scheduler thread exists, which happen at shutdown.
@promag promag force-pushed the 2020-03-sync-unregistervalidationinterface branch from c9ca6c4 to 2097e35 Compare March 11, 2020 19:07
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 2097e35. Just suggested change clarifying AddToProcessQueue / MaybeScheduleProcessQueue logic since last review

@@ -195,7 +195,11 @@ void SingleThreadedSchedulerClient::AddToProcessQueue(std::function<void ()> fun
LOCK(m_cs_callbacks_pending);
m_callbacks_pending.emplace_back(std::move(func));
}
MaybeScheduleProcessQueue();
if (m_pscheduler->AreThreadsServicingQueue()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "Handle no threads serving in MaybeScheduleProcessQueue" (03cd591)

Was experimenting and wrote a small test for this change:

// Ensure scheduled tasks run even when there are no scheduler threads
BOOST_AUTO_TEST_CASE(singlethreadedscheduler_nothreads)
{
    CScheduler scheduler;
    SingleThreadedSchedulerClient client(&scheduler);
    bool task_ran = false;
    client.AddToProcessQueue([&] { task_ran = true; });
    BOOST_CHECK(task_ran);
}

Could add this to scheduler_tests.cpp. It fails without this commit and passes with it.

@promag
Copy link
Contributor Author

promag commented Mar 13, 2020

Replaced by #18338.

@promag promag closed this Mar 13, 2020
@promag promag deleted the 2020-03-sync-unregistervalidationinterface branch March 13, 2020 08:46
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scheduler: crash after releasing wallet
4 participants