Skip to content

Conversation

bvbfan
Copy link
Contributor

@bvbfan bvbfan commented Mar 6, 2020

Fixes #16307

Signed-off-by: Anthony Fieroni bvbfan@abv.bg

@fanquake fanquake added the Wallet label Mar 6, 2020
bool tipIsSame = false;
do {
uint256 last_block_hash = WITH_LOCK(cs_wallet, return m_last_block_processed);
tipIsSame = chain().waitForNotificationsIfTipIsNotSame(last_block_hash);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this whole scope happen under the cs_wallet lock?

Also, shouldn't m_chain_notifications_handler.reset() happen inside here under the wallet lock?

Otherwise a new block could come in and put new notifications in the handler while we reset it.

Copy link
Contributor Author

@bvbfan bvbfan Mar 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this whole scope happen under the cs_wallet lock?

No, we leave mutex for process thread to finish.

Also, shouldn't m_chain_notifications_handler.reset() happen inside here under the wallet lock?

No.

Otherwise a new block could come in and put new notifications in the handler while we reset it.

I'm aware of that, we can have a function that takes a mutex and reset notifications

@promag
Copy link
Contributor

promag commented Mar 6, 2020

I think this is the wrong approach.

@bvbfan
Copy link
Contributor Author

bvbfan commented Mar 6, 2020

I think this is the wrong approach.

That's at least ensure sync between wallet and chain tips otherwise all calls to BlockUntilSyncedToCurrentChain are just pointless, they don't do what is supposed to do. Your patch looks fine but not what we want when calls a function like BlockUntilSyncedToCurrentChain

@DrahtBot
Copy link
Contributor

DrahtBot commented Mar 7, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot DrahtBot mentioned this pull request Mar 7, 2020
@bvbfan
Copy link
Contributor Author

bvbfan commented Mar 7, 2020

@promag the problem is not the queued connection / disconnection, actually

void UnregisterValidationInterface(CValidationInterface* pwalletIn) {
    if (g_signals.m_internals) {
        g_signals.m_internals->m_connMainSignals.erase(pwalletIn);
    }
}

signal is received before disconnect is called by erasing the caller.

Fix unit test and rebase.

@bvbfan
Copy link
Contributor Author

bvbfan commented Mar 7, 2020

Let's clarify a bit more.

void disconnect() override
{
    if (m_notifications) {
        m_notifications = nullptr; // <---------------- it makes no sense to be before unregister
        UnregisterValidationInterface(this);
    }
}

Since ValidationInterfaceConnections calls destructors in reverse order, for a bad luck UpdatedBlockTip is disconnected last which is not held the mutex but update time by atomic variable.
Running in testnet for at least an hour in parallel with

#!/bin/bash
while [ 1 ]
do
    src/bitcoin-cli -testnet loadwallet test
    src/bitcoin-cli -testnet unloadwallet test
done

no crashes anymore.

Copy link

@ariard ariard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've hit into #16307 issue while doing refactoring on this part of the code. Current refactoring direction is to move towards the wallet and chain being independent tips, not more synchronized so another way to solve this would be to reset handler correctly while holding wallet lock and then test handler still exists in BlockConnected/BlockDisconnected/..., flushing correctly and committing right locator means worst-case we redo a rescan at wallet loading.

Doing it race-condition free likely need a handler lock though (or `NotificationsHandlerImpl locking wallet...)

@bvbfan
Copy link
Contributor Author

bvbfan commented Mar 10, 2020

while holding wallet lock and then test handler still exists in BlockConnected/BlockDisconnected

i don't think that's right approach to me.

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #18279 (comment)

That's at least ensure sync between wallet and chain tips otherwise all calls to BlockUntilSyncedToCurrentChain are just pointless, they don't do what is supposed to do. Your patch looks fine but not what we want when calls a function like BlockUntilSyncedToCurrentChain

I think there is a misunderstanding here (maybe my own), but my understanding is that the point of BlockUntilSyncedToCurrentChain is just to ensure that if you make two RPC calls in a row the second RPC call is always aware of the result of first call. Ensuring that notification in the queue from the time of the call are processed is sufficient to do this, without trying to guarantee that the wallet tip and node tips are equal.

Checking tips for equality is just an optimization to avoid waiting for the queue, and guaranteeing that the tips are equal isn't actually possible (despite the attempt here) because BlockUntilSyncedToCurrentChain isn't called with any locks held.

@@ -160,8 +160,8 @@ class NotificationsHandlerImpl : public Handler, CValidationInterface
void disconnect() override
{
if (m_notifications) {
m_notifications = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems clearly correct. Good catch! Maybe m_notifications should be an atomic pointer too since m_notifications is get and set from different threads, though probably the synchronization done by boost signals & the scheduler prevents bugs from it not being atomic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe m_notifications should be an atomic pointer too since m_notifications is get and set from different threads, though probably the synchronization done by boost signals & the scheduler prevents bugs from it not being atomic

You're right

@@ -346,13 +346,14 @@ class ChainImpl : public Chain
{
return MakeUnique<NotificationsHandlerImpl>(*this, notifications);
}
void waitForNotificationsIfTipChanged(const uint256& old_tip) override
bool waitForNotificationsIfTipIsNotSame(const uint256& tip) override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the name change, and I can see how the having the bool return value should make races and the deadlock reported #16307 (comment) less likely, though I suspect there is also probably a more direct and reliable fix we can find

wallet->m_chain_notifications_handler.reset();
WITH_LOCK(wallet->cs_wallet, wallet->Flush());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acquiring the cs_wallet lock here is unexpected. If this is really neccessary, I think the lock should be acquired closer to where it's used inside the Flush implementation, probably with a comment if the purpose of the lock isn't more obvious there

Copy link
Contributor Author

@bvbfan bvbfan Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a misunderstanding here (maybe my own), but my understanding is that the point of BlockUntilSyncedToCurrentChain is just to ensure that if you make two RPC calls in a row the second RPC call is always aware of the result of first call. Ensuring that notification in the queue from the time of the call are processed is sufficient to do this, without trying to guarantee that the wallet tip and node tips are equal.

Yes, but mostly after BlockUntilSyncedToCurrentChain lock is acquired and there is no guarantee that notification is faster than this held, i.e. you can leave BlockUntilSyncedToCurrentChain and don't do what you want, you gain mutex before notification thread and you're not sync anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here after BlockUntilSyncedToCurrentChain is ensuring it's on current tip, the lock tries to guarantee that notification thread is finished before we flush and release memory.

@bvbfan
Copy link
Contributor Author

bvbfan commented Mar 11, 2020

@promag, @ryanofsky i have another, looking better to me, solution ensuring there is no background tasks, but i've not push since it's not well tested

diff --git a/src/scheduler.cpp b/src/scheduler.cpp
index 7cb7754fd..17c26be74 100644
--- a/src/scheduler.cpp
+++ b/src/scheduler.cpp
@@ -199,12 +199,16 @@ void SingleThreadedSchedulerClient::AddToProcessQueue(std::function<void ()> fun
 }
 
 void SingleThreadedSchedulerClient::EmptyQueue() {
-    assert(!m_pscheduler->AreThreadsServicingQueue());
-    bool should_continue = true;
+    if (!m_pscheduler->AreThreadsServicingQueue())
+        return;
+    auto hasCallbacks = [this]() -> bool {
+        LOCK(m_cs_callbacks_pending);
+        return !m_callbacks_pending.empty();
+    };
+    auto should_continue = hasCallbacks();
     while (should_continue) {
         ProcessQueue();
-        LOCK(m_cs_callbacks_pending);
-        should_continue = !m_callbacks_pending.empty();
+        should_continue = hasCallbacks();
     }
 }
 
diff --git a/src/validationinterface.cpp b/src/validationinterface.cpp
index 1deb93c97..87ade954c 100644
--- a/src/validationinterface.cpp
+++ b/src/validationinterface.cpp
@@ -90,6 +90,7 @@ void RegisterValidationInterface(CValidationInterface* pwalletIn) {
 void UnregisterValidationInterface(CValidationInterface* pwalletIn) {
     if (g_signals.m_internals) {
         g_signals.m_internals->m_connMainSignals.erase(pwalletIn);
+        g_signals.FlushBackgroundCallbacks();
     }
 }
 
diff --git a/src/wallet/wallet.cpp b/src/wallet/wallet.cpp
index 4c36caed8..51732c0ae 100644
--- a/src/wallet/wallet.cpp
+++ b/src/wallet/wallet.cpp
@@ -111,7 +111,7 @@ static void ReleaseWallet(CWallet* wallet)
     wallet->WalletLogPrintf("Releasing wallet\n");
     wallet->BlockUntilSyncedToCurrentChain();
     wallet->m_chain_notifications_handler.reset();
-    WITH_LOCK(wallet->cs_wallet, wallet->Flush());
+    wallet->Flush();
     delete wallet;
     // Wallet is now released, notify UnloadWallet, if any.
     {

Flushing callback right after disconnecting signals should effectively ensure we don't have pending and it will not need atomic notification pointer or so.

@ryanofsky
Copy link
Contributor

re: #18279 (comment)

@promag, @ryanofsky i have another, looking better to me, solution ensuring there is no background tasks, but i've not push since it's not well tested

This fix is very similar to c372219 from #18280, and probably we should review and merge #18280 when it's ready and close this PR.

In my opinion though, waiting for notifications is a fragile fix that will slow down wallet unloading and node shutdown without good reason. A better fix would be to switch to shared pointers instead of raw pointers and fix the use-after-delete bug more directly: 06d2b53 (branch, comment)

@bvbfan
Copy link
Contributor Author

bvbfan commented Mar 11, 2020

This fix is very similar to c372219 from #18280, and probably we should review and merge #18280 when it's ready and close this PR.

Still no to me, changes in BlockUntilSyncedToCurrentChain are significant to me.

@bvbfan
Copy link
Contributor Author

bvbfan commented Mar 11, 2020

Also sync should be in UnregisterValidationInterface to be applied in all cases.

@ryanofsky
Copy link
Contributor

ryanofsky commented Mar 11, 2020

Still no to me, changes in BlockUntilSyncedToCurrentChain are significant to me.

Of course no need to close this PR if you want to make other improvements, but I hope you can review #18280 when it's ready as a fix for the original issue. Also would encourage closing this PR and opening a new one if the new version will differ substantially, so the review comment history will be intelligible.

I would also caution that while BlockUntilSyncedToCurrentChain can definitely be cleaned up, its purpose is ensuring consistent results between RPC calls and it shouldn't be changed to block longer than actually necessary to fix real bugs.

Also sync should be in UnregisterValidationInterface to be applied in all cases.

This is only true when unregistering wallets, as far as I know. Not things like indexes. Also if we fix shared_ptr reference counting with 06d2b53 (branch, comment) it should not even be necessary for wallets. No reason conceptually wallets should have to wait for events they don't care about just to fix a pointer lifetime issue

@bvbfan
Copy link
Contributor Author

bvbfan commented Mar 11, 2020

This is only true when unregistering wallets, as far as I know. Not things like indexes.

No, that's not correct to me,

Maybe m_notifications should be an atomic pointer too since m_notifications is get and set from different threads, though probably the synchronization done by boost signals & the scheduler prevents bugs from it not being atomic

doing in UnregisterValidationInterface it ensures notification is not accessed in race condition.

@promag
Copy link
Contributor

promag commented Mar 11, 2020

No reason conceptually wallets should have to wait for events they don't care about just to fix a pointer lifetime issue

Agree with this.

From #18280:

From boost signal documentation at https://www.boost.org/doc/libs/1_72_0/doc/html/signals2/thread-safety.html:

When a signal is invoked by calling signal::operator(), the invocation first acquires a lock on the signal's mutex. Then it obtains a handle to the signal's slot list and combiner. Next it releases the signal's mutex, before invoking the combiner to iterate through the slot list.

This means that UnregisterValidationInterface doesn't prevent more calls to that interface.

But just calling SyncWithValidationInterfaceQueue (which BlockUntilSyncedToCurrentChain might call) doesn't fully fix the issue because on shutdown the scheduler thread is stopped earlier.

Obviously there are various possible fixes, like #18280 or 06d2b53. But this approach is clearly not good.

@bvbfan bvbfan closed this Mar 15, 2020
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scheduler: crash after releasing wallet
7 participants