Skip to content

Conversation

promag
Copy link
Contributor

@promag promag commented Mar 13, 2020

This PR consists in two fixes. The first fixes a concurrency issues with boost::signals2. The second fixes a wallet model destruction while it's being used.

From boost signal documentation at https://www.boost.org/doc/libs/1_72_0/doc/html/signals2/thread-safety.html:

When a signal is invoked by calling signal::operator(), the invocation first acquires a lock on the signal's mutex. Then it obtains a handle to the signal's slot list and combiner. Next it releases the signal's mutex, before invoking the combiner to iterate through the slot list.

This means that UnregisterValidationInterface doesn't prevent more calls to that interface. The fix consists in capturing the shared_ptr<CValidationInterface> in each internal slot.

The GUI bug is fixed by using a Qt::QueuedConnection in the WalletModel::unload connection.

@DrahtBot
Copy link
Contributor

DrahtBot commented Mar 13, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@ryanofsky
Copy link
Contributor

I wrote a test for new register/unregister functions that could be added here: d65b487 (branch)

@bvbfan
Copy link
Contributor

bvbfan commented Mar 13, 2020

unregister from notifications earlier in UnloadWallet instead of
ReleaseWallet, and use a new RegisterSharedValidationInterface function to
prevent the CValidationInterface shared_ptr from being deleted until the last
notification is actually finished.

Does not fix the root issue, which is notification thread hold mutex (no matter when notification is disconnected if it's the last instance of the wallet) then goes in Release wallet, flush, then deleting mutex while it's hold is SIGSEGV. Unit test does not do anything to ensure that.

@ryanofsky
Copy link
Contributor

Does not fix the root issue, which is notification thread hold mutex (no matter when notification is disconnected if it's the last instance of the wallet) then goes in Release wallet, flush, then deleting mutex while it's hold is SIGSEGV. Unit test does not do anything to ensure that.

Can you give instructions to reproduce the issue or post a backtrace of the SIGSEGV? ReleaseWallet should not be called during a notification because ReleaseWallet is only called when the shared pointer reference count reaches 0. The unit test verifies that the reference count doesn't reach 0 during a notification until after the notification callback method returns, by checking that the destructor is called only after the callback returns.

@bvbfan
Copy link
Contributor

bvbfan commented Mar 14, 2020

#!/bin/bash
while [ 1 ]
do
    src/bitcoin-cli -testnet loadwallet test
    src/bitcoin-cli -testnet unloadwallet test
done

Running on testnet from scratch in parallel with this script it should not crash. Whole point of EmptyQueue was to not have notification holding wallet mutex while we delete it.

@promag
Copy link
Contributor Author

promag commented Mar 16, 2020

@bvbfan that script crashes bitcoin-qt -regtest -server -printtoconsole.

Still investigating but looking at call stack it's caused by queued connections "in flight" (like WalletController::walletAdded(WalletModel*)) while the wallet model is deleted in WalletController::removeAndDeleteWallet.

@bvbfan
Copy link
Contributor

bvbfan commented Mar 16, 2020

It looks like another issue, wallet model holds a unique_ptr to interface wallet which in other hand have a shared_ptr to actual wallet. Since UnloadWallet actually deletes wallet (it gets it from wallet map) unless wallet model outlive wallet. First make sure it works like that
bitcoind -testnet (it's testnet because every few seconds new block is connected and notification is activated)

@ryanofsky
Copy link
Contributor

First make sure it works like that
bitcoind -testnet (it's testnet because every few seconds new block is connected and notification is activated)

Thanks for clarifying this. I had tried a bunch of ways to reproduce the issue with -regtest and could never get it to happen. But with cf4cb28 and -testnet I was able to get the deadlock to happen in just a few minutes.

I still haven't been able to reproduce a SIGSEGV or any problem with this current PR 73bd962, though. So again if you have a stack trace it would help. Or it would help if you could describe in a more step-by-step way how you see the SIGSEGV happening

@bvbfan
Copy link
Contributor

bvbfan commented Mar 16, 2020

I still haven't been able to reproduce a SIGSEGV or any problem with this current PR 73bd962, though. So again if you have a stack trace it would help. Or it would help if you could describe in a more step-by-step way how you see the SIGSEGV happening

Since UnloadWallet actively deletes wallet that's may not happen since g_wallet_init_interface is global and it's call UnloadWallet in its destructor.

@ryanofsky
Copy link
Contributor

I still haven't been able to reproduce a SIGSEGV or any problem with this current PR 73bd962, though. So again if you have a stack trace it would help. Or it would help if you could describe in a more step-by-step way how you see the SIGSEGV happening

Since UnloadWallet actively deletes wallet that's may not happen since g_wallet_init_interface is global and it's call UnloadWallet in its destructor.

What may not happen?

@ryanofsky
Copy link
Contributor

What may not happen?

Oh, maybe you are referring to the bitcoin-qt bug. I haven't tried to reproduce that yet.

@ryanofsky
Copy link
Contributor

re: #18338 (comment)

@bvbfan that script crashes bitcoin-qt -regtest -server -printtoconsole.

Still investigating but looking at call stack it's caused by queued connections "in flight" (like WalletController::walletAdded(WalletModel*)) while the wallet model is deleted in WalletController::removeAndDeleteWallet.

@promag, I was able to reproduce this and I think it's basically a separate issue from the deadlock that happens with bitcoind. I created a bug report #18362. Would move any discussion of that issue there

@@ -137,6 +133,9 @@ void UnloadWallet(std::shared_ptr<CWallet>&& wallet)
// Notify the unload intent so that all remaining shared pointers are
// released.
wallet->NotifyUnload();

wallet->m_chain_notifications_handler.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BlockUntilSyncedToCurrentChain is still needed, unless we can expect data loss

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #18338 (comment)

BlockUntilSyncedToCurrentChain is still needed, unless we can expect data loss

BlockUntilSyncedToCurrentChain can't be called with locks held and it doesn't make sense here. It makes sense at beginning of RPC calls to ensure consistency between calls. It doesn't directly relate to writing or consolidating data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where lock is held? If here it's, that's nasty bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where lock is held? If here it's, that's nasty bug.

Right, I don't think there is a nasty bug now. I think it is good to avoid nasty bugs by not calling BlockUntilSyncedToCurrentChain when it is not necessary.

@@ -151,7 +159,7 @@ class CMainSignals {
private:
std::unique_ptr<MainSignalsInstance> m_internals;

friend void ::RegisterValidationInterface(CValidationInterface*);
friend void ::RegisterSharedValidationInterface(std::shared_ptr<CValidationInterface>);
friend void ::UnregisterValidationInterface(CValidationInterface*);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not shared_ptr here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #18338 (comment)

Why not shared_ptr here?

This is an existing function the PR isn't changing (and there's no reason to change it)

@bvbfan
Copy link
Contributor

bvbfan commented Mar 16, 2020

In #18362 the bug is mostly horizontalHeader is nullptr (invalid pointer).
Basically i did not like transfer ownership of wallet to the notification that's not expected to me.

@ryanofsky
Copy link
Contributor

]> Basically i did not like transfer ownership of wallet to the notification that's not expected to me.

Sure, but if you are already using a shared_ptr to refer to an object some places, it makes sense to use it consistently. Combining shared pointers and raw pointers and asynchronous notifications is a recipe for bugs.

@ryanofsky
Copy link
Contributor

Some notes for future cleanup.

I think we can clean up shutdown sequence after this PR, since right now it's confusing what responsibilities of different unloading functions are. I think I'd:

  • Move log print and flush out of ReleaseWallet into CWallet destructor
  • Rename ReleaseWallet to DeleteWallet since only thing it's doing is deleting the wallet pointer
  • Move handler.reset and NotifyUnload calls to new CWallet::Unload method, and call that method from RemoveWallet, so GUI is notified as early as possible and CWallet data members aren't being accessed from outside functions
  • Rename UnloadWallet to WaitForDeleteWallet since all it will be doing at that point is resetting the pointer and waiting for the gui and validation interfaces to let go their wallet references

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 41b0baf. Only change is moving assert as suggested

{
RegisterSharedValidationInterface(m_proxy);
}
~NotificationsHandlerImpl() override { disconnect(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The destructor of the derived class should not be declared with override.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #18338 (comment)

nit: The destructor of the derived class should not be declared with override.

This is a strange suggestion and I would suggest reverting 17316b3 to 41b0baf to undo it. The point of override keyword is to get the compiler to enforce assumptions the derived class is making about the base class to avoid terrible bugs. In this case, override keyword makes it an error for the base class not to have a virtual destructor. This is important because if base destructor was made not virtual there would be a difficult to trace memory leak here with NotificationsHandlerImpl object getting freed but destructor never running and m_proxy member never being freed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryanofsky

In this case, override keyword makes it an error for the base class not to have a virtual destructor.

I've verified this statement with the following diff:

--- a/src/interfaces/handler.h
+++ b/src/interfaces/handler.h
@@ -22,7 +22,7 @@ namespace interfaces {
 class Handler
 {
 public:
-    virtual ~Handler() {}
+    ~Handler() {}
 
     //! Disconnect the handler.
     virtual void disconnect() = 0;

My compiler raises an error. You are correct.
TIL. Thank you.

}
explicit NotificationsProxy(std::shared_ptr<Chain::Notifications> notifications)
: m_notifications(std::move(notifications)) {}
virtual ~NotificationsProxy() = default;
Copy link
Member

@hebasto hebasto Mar 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is the NotificationsProxy class intended for subclassing? If not, why its destructor is declared with virtual?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #18338 (comment)

nit: Is the NotificationsProxy class intended for subclassing? If not, why its destructor is declared with virtual?

Would again suggest reverting 17316b3 to 41b0baf to undo this change. The point of declaring this virtual is to prevent memory leaks and incomplete destruction in case this is subclassed (whether intended or not). From https://isocpp.org/wiki/faq/virtual-functions#virtual-dtors:

Here’s a simplified rule of thumb that usually protects you and usually doesn’t cost you anything: make your destructor virtual if your class has any virtual functions.

This is an important rule to follow here and other places. Other alternatives would be to declare the class final or give it a protected nonvirtual destructor, but these are optimizations that are more fragile. Just having a virtual destructor when there are virtual functions is usually the simplest safe approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for noise. I forgot that member functions defined with override are still virtual.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryanofsky ibid:

Note: in a derived class, if your base class has a virtual destructor, your own destructor is automatically virtual.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryanofsky ibid:

Note: in a derived class, if your base class has a virtual destructor, your own destructor is automatically virtual.

Yes, the problems happen when code changes over time. Without the override keyword somebody could make a change to the base class removing virtual and unknowingly cause memory leaks in the derived class. But with the override keyword, the compiler will show a "does not override" error building the derived class and not allow this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that consistent usage of override in derived classes protects from bugs caused by changes in base classes over time.

But how virtual could help for a desctructor in a derived class, if:

in a derived class, if your base class has a virtual destructor, your own destructor is automatically virtual. You might need an explicitly defined destructor for other reasons, but there’s no need to redeclare a destructor simply to make sure it is virtual. No matter whether you declare it with the virtual keyword, declare it without the virtual keyword, or don’t declare it at all, it’s still virtual.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how virtual could help for a desctructor in a derived class, if:

in a derived class, if your base class has a virtual destructor, your own destructor is automatically virtual. You might need an explicitly defined destructor for other reasons, but there’s no need to redeclare a destructor simply to make sure it is virtual. No matter whether you declare it with the virtual keyword, declare it without the virtual keyword, or don’t declare it at all, it’s still virtual.

?

None of that is applicable here. NotificationsProxy base class is CValidationInterface which does not have a virtual destructor. If it did have a virtual destructor, this could be declared with override instead of virtual, and registration would no longer require shared pointers with custom deleters. But these changes would increase size of the bugfix and are better saved for a followup PR. This PR is just trying to do the simple thing and

  • Declare virtual destructors where there are virtual functions
  • Use override instead of virtual where it prevents derived class code from being silently broken by future changes to the base class

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed that:

class CValidationInterface {
protected:
/**
* Protected destructor so that instances can only be deleted by derived classes.
* If that restriction is no longer desired, this should be made public and virtual.
*/
~CValidationInterface() = default;

Copy link
Member

@hebasto hebasto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 41b0baf, tested on Linux Mint 19.3.

@hebasto
Copy link
Member

hebasto commented Mar 29, 2020

re-ACK 17316b3, only suggested changes about class destructors.

@promag
Copy link
Contributor Author

promag commented Mar 30, 2020

Reverted.

@ryanofsky
Copy link
Contributor

Code review ACK 17316b3, but I did prefer 41b0baf with safe, intact destructors better. Either one is fine to merge though.

Besides the destructors other things on the future cleanup list are:

@hebasto
Copy link
Member

hebasto commented Mar 30, 2020

Is my previous ACK valid if a commit is reverted to the ACKed one?

@hebasto
Copy link
Member

hebasto commented Mar 30, 2020

@promag

@promag Why tests (d8102ce) are dropped?

Two travis jobs are getting stuck, I'm trying to understand the problem.

Any update about tests?

@ryanofsky
Copy link
Contributor

Code review ACK 41b0baf (previously acked but just repeating for clarity).

This might be ready for merge. Lot of tweaking at the end but changes: f80e957 -> e66da18
41b0baf -> 17316b3 -> 41b0baf were all trivial.

This is a minimal bugfix and good candidate for merging before branching or for backporting.

@laanwj laanwj merged commit 9a2b5f2 into bitcoin:master Mar 31, 2020
@promag promag deleted the notify-shared branch March 31, 2020 13:08
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Mar 31, 2020
41b0baf gui: Handle WalletModel::unload asynchronous (João Barbosa)
ab31b9d Fix wallet unload race condition (Russell Yanofsky)

Pull request description:

  This PR consists in two fixes. The first fixes a concurrency issues with `boost::signals2`. The second fixes a wallet model destruction while it's being used.

  From boost signal documentation at https://www.boost.org/doc/libs/1_72_0/doc/html/signals2/thread-safety.html:

  > When a signal is invoked by calling signal::operator(), the invocation first acquires a lock on the signal's mutex. Then it obtains a handle to the signal's slot list and combiner. Next it releases the signal's mutex, before invoking the combiner to iterate through the slot list.

  This means that `UnregisterValidationInterface` doesn't prevent more calls to that interface. The fix consists in capturing the `shared_ptr<CValidationInterface>` in each internal slot.

  The GUI bug is fixed by using a `Qt::QueuedConnection` in the `WalletModel::unload` connection.

ACKs for top commit:
  ryanofsky:
    Code review ACK 41b0baf. Only change is moving assert as suggested
  hebasto:
    ACK 41b0baf, tested on Linux Mint 19.3.

Tree-SHA512: 4f712d8de65bc1214411831250de5dc0a9fd505fb84da5baf9f2cc4d551bc3abffc061616f00afe43dba7525af2cd96c9b54aeead9383145e3b8801f25d85f50
MarkLTZ added a commit to litecoinz-core/litecoinz that referenced this pull request Apr 9, 2020
- rpc/wallet: initialize nFeeRequired to avoid using garbage value on failure bitcoin#18274
- httpserver: use own HTTP status codes bitcoin#18168
- tests: Add fuzzing harnesses for various Base{32,58,64} and hex related functions bitcoin#17229
- util: Don't allow Base32/64-decoding or ParseMoney(…) on strings with embedded NUL characters. Add tests. bitcoin#17753
- util: Fail to parse empty string in ParseMoney bitcoin#18225
- util: Fail to parse whitespace-only strings in ParseMoney(...) (instead of parsing as zero) bitcoin#18270
- Replace the LogPrint function with a macro bitcoin#17218
- Fix wallet unload race condition bitcoin#18338
- qt: Fix Window -> Minimize menu item bitcoin#18549
- windows: Enable heap terminate-on-corruption bitcoin#17916
ryanofsky added a commit to bitcoin-core/libmultiprocess that referenced this pull request Apr 10, 2020
9ce0335 Add comment saying how to fix clientInvoke missing Proxy.Context assert (Russell Yanofsky)
31b4f1b Add shared_ptr ownership and lifetime support (Russell Yanofsky)
27f8a35 Add saveCallback / callbackSaved test setup (Russell Yanofsky)
5390a1b Add support for passing shared_ptrs without extending lifetime (Russell Yanofsky)
39bbf74 Add CustomReadField priority param for more flexibility and consistency with CustomBuildField (Russell Yanofsky)
da489be Fix bugs in PassField overload for callback objects passed by reference (Russell Yanofsky)
ab4568b Add test coverage for thread map and callbacks (Russell Yanofsky)

Pull request description:

  This is needed after bitcoin/bitcoin#18338 which changed `handleNotifications()` `Notifications&` callback argument to `std::shared_ptr<Notifications>`.

  Easiest way to support this was to change `ProxyServerBase` reference from a raw pointer to shared pointer and make necessary BuildField changes to pass along the `shared_ptr`. Alternative might have been to add more generic cleanup support to `ProxyServerBase` instead of hardcoding `shared_ptr`.

  The change also required making the ReadField callback overload more generic, which was a straightforward but kind of big change that touched a lot of code.

  There weren't any unit tests for callback support previously, so a lot of new test coverage was added. It includes coverage for `shared_ptr` lifetime correctness, making sure there's an IPC call decrementing server `shared_ptr` reference count when client `shared_ptr` proxy is reset.

Top commit has no ACKs.

Tree-SHA512: 96607bb339a5184ab34f9c133f7d5a9a6ac37a7a71187bdd6bd15235b44690cf19d8c09ad6e6966e51886cb00036cecf268406af7afd749b659970e9b00179ee
deadalnix pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Jun 18, 2020
Summary:
41b0baf43c243b64b394e774e336475a489cca2b gui: Handle WalletModel::unload asynchronous (João Barbosa)
ab31b9d6fe7b39713682e3f52d11238dbe042c16 Fix wallet unload race condition (Russell Yanofsky)

Pull request description:

  This PR consists in two fixes. The first fixes a concurrency issues with `boost::signals2`. The second fixes a wallet model destruction while it's being used.

  From boost signal documentation at https://www.boost.org/doc/libs/1_72_0/doc/html/signals2/thread-safety.html:

  > When a signal is invoked by calling signal::operator(), the invocation first acquires a lock on the signal's mutex. Then it obtains a handle to the signal's slot list and combiner. Next it releases the signal's mutex, before invoking the combiner to iterate through the slot list.

  This means that `UnregisterValidationInterface` doesn't prevent more calls to that interface. The fix consists in capturing the `shared_ptr<CValidationInterface>` in each internal slot.

  The GUI bug is fixed by using a `Qt::QueuedConnection` in the `WalletModel::unload` connection.

ACKs for top commit:
  ryanofsky:
    Code review ACK 41b0baf43c243b64b394e774e336475a489cca2b. Only change is moving assert as suggested
  hebasto:
    ACK 41b0baf43c243b64b394e774e336475a489cca2b, tested on Linux Mint 19.3.

Tree-SHA512: 4f712d8de65bc1214411831250de5dc0a9fd505fb84da5baf9f2cc4d551bc3abffc061616f00afe43dba7525af2cd96c9b54aeead9383145e3b8801f25d85f50

Backport of Core [[bitcoin/bitcoin#18338 | PR18338]]

Test Plan: `ninja check check-functional` for sanity

Reviewers: #bitcoin_abc, deadalnix

Reviewed By: #bitcoin_abc, deadalnix

Differential Revision: https://reviews.bitcoinabc.org/D6644
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants