-
Notifications
You must be signed in to change notification settings - Fork 37.7k
Fix wallet unload race condition #18338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
Does not fix the root issue, which is notification thread hold mutex (no matter when notification is disconnected if it's the last instance of the wallet) then goes in Release wallet, flush, then deleting mutex while it's hold is SIGSEGV. Unit test does not do anything to ensure that. |
Can you give instructions to reproduce the issue or post a backtrace of the SIGSEGV? ReleaseWallet should not be called during a notification because ReleaseWallet is only called when the shared pointer reference count reaches 0. The unit test verifies that the reference count doesn't reach 0 during a notification until after the notification callback method returns, by checking that the destructor is called only after the callback returns. |
Running on testnet from scratch in parallel with this script it should not crash. Whole point of |
@bvbfan that script crashes Still investigating but looking at call stack it's caused by queued connections "in flight" (like |
It looks like another issue, wallet model holds a unique_ptr to interface wallet which in other hand have a shared_ptr to actual wallet. Since |
Thanks for clarifying this. I had tried a bunch of ways to reproduce the issue with -regtest and could never get it to happen. But with cf4cb28 and -testnet I was able to get the deadlock to happen in just a few minutes. I still haven't been able to reproduce a SIGSEGV or any problem with this current PR 73bd962, though. So again if you have a stack trace it would help. Or it would help if you could describe in a more step-by-step way how you see the SIGSEGV happening |
Since |
What may not happen? |
Oh, maybe you are referring to the bitcoin-qt bug. I haven't tried to reproduce that yet. |
re: #18338 (comment)
@promag, I was able to reproduce this and I think it's basically a separate issue from the deadlock that happens with bitcoind. I created a bug report #18362. Would move any discussion of that issue there |
src/wallet/wallet.cpp
Outdated
@@ -137,6 +133,9 @@ void UnloadWallet(std::shared_ptr<CWallet>&& wallet) | |||
// Notify the unload intent so that all remaining shared pointers are | |||
// released. | |||
wallet->NotifyUnload(); | |||
|
|||
wallet->m_chain_notifications_handler.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BlockUntilSyncedToCurrentChain
is still needed, unless we can expect data loss
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #18338 (comment)
BlockUntilSyncedToCurrentChain
is still needed, unless we can expect data loss
BlockUntilSyncedToCurrentChain can't be called with locks held and it doesn't make sense here. It makes sense at beginning of RPC calls to ensure consistency between calls. It doesn't directly relate to writing or consolidating data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where lock is held? If here it's, that's nasty bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where lock is held? If here it's, that's nasty bug.
Right, I don't think there is a nasty bug now. I think it is good to avoid nasty bugs by not calling BlockUntilSyncedToCurrentChain when it is not necessary.
@@ -151,7 +159,7 @@ class CMainSignals { | |||
private: | |||
std::unique_ptr<MainSignalsInstance> m_internals; | |||
|
|||
friend void ::RegisterValidationInterface(CValidationInterface*); | |||
friend void ::RegisterSharedValidationInterface(std::shared_ptr<CValidationInterface>); | |||
friend void ::UnregisterValidationInterface(CValidationInterface*); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not shared_ptr here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #18338 (comment)
Why not shared_ptr here?
This is an existing function the PR isn't changing (and there's no reason to change it)
In #18362 the bug is mostly |
]> Basically i did not like transfer ownership of wallet to the notification that's not expected to me. Sure, but if you are already using a shared_ptr to refer to an object some places, it makes sense to use it consistently. Combining shared pointers and raw pointers and asynchronous notifications is a recipe for bugs. |
d385f98
to
838ea49
Compare
Some notes for future cleanup. I think we can clean up shutdown sequence after this PR, since right now it's confusing what responsibilities of different unloading functions are. I think I'd:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 41b0baf. Only change is moving assert as suggested
{ | ||
RegisterSharedValidationInterface(m_proxy); | ||
} | ||
~NotificationsHandlerImpl() override { disconnect(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: The destructor of the derived class should not be declared with override
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #18338 (comment)
nit: The destructor of the derived class should not be declared with
override
.
This is a strange suggestion and I would suggest reverting 17316b3 to 41b0baf to undo it. The point of override keyword is to get the compiler to enforce assumptions the derived class is making about the base class to avoid terrible bugs. In this case, override keyword makes it an error for the base class not to have a virtual destructor. This is important because if base destructor was made not virtual there would be a difficult to trace memory leak here with NotificationsHandlerImpl object getting freed but destructor never running and m_proxy member never being freed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A desctructor is not a regular member function. So I was confused with the following opinions:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, override keyword makes it an error for the base class not to have a virtual destructor.
I've verified this statement with the following diff:
--- a/src/interfaces/handler.h
+++ b/src/interfaces/handler.h
@@ -22,7 +22,7 @@ namespace interfaces {
class Handler
{
public:
- virtual ~Handler() {}
+ ~Handler() {}
//! Disconnect the handler.
virtual void disconnect() = 0;
My compiler raises an error. You are correct.
TIL. Thank you.
} | ||
explicit NotificationsProxy(std::shared_ptr<Chain::Notifications> notifications) | ||
: m_notifications(std::move(notifications)) {} | ||
virtual ~NotificationsProxy() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is the NotificationsProxy
class intended for subclassing? If not, why its destructor is declared with virtual
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #18338 (comment)
nit: Is the
NotificationsProxy
class intended for subclassing? If not, why its destructor is declared withvirtual
?
Would again suggest reverting 17316b3 to 41b0baf to undo this change. The point of declaring this virtual is to prevent memory leaks and incomplete destruction in case this is subclassed (whether intended or not). From https://isocpp.org/wiki/faq/virtual-functions#virtual-dtors:
Here’s a simplified rule of thumb that usually protects you and usually doesn’t cost you anything: make your destructor virtual if your class has any virtual functions.
This is an important rule to follow here and other places. Other alternatives would be to declare the class final or give it a protected nonvirtual destructor, but these are optimizations that are more fragile. Just having a virtual destructor when there are virtual functions is usually the simplest safe approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for noise. I forgot that member functions defined with override
are still virtual
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryanofsky ibid:
Note: in a derived class, if your base class has a virtual destructor, your own destructor is automatically virtual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryanofsky ibid:
Note: in a derived class, if your base class has a virtual destructor, your own destructor is automatically virtual.
Yes, the problems happen when code changes over time. Without the override keyword somebody could make a change to the base class removing virtual
and unknowingly cause memory leaks in the derived class. But with the override keyword, the compiler will show a "does not override" error building the derived class and not allow this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand that consistent usage of override
in derived classes protects from bugs caused by changes in base classes over time.
But how virtual
could help for a desctructor in a derived class, if:
in a derived class, if your base class has a
virtual
destructor, your own destructor is automaticallyvirtual
. You might need an explicitly defined destructor for other reasons, but there’s no need to redeclare a destructor simply to make sure it isvirtual
. No matter whether you declare it with thevirtual
keyword, declare it without thevirtual
keyword, or don’t declare it at all, it’s stillvirtual
.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how
virtual
could help for a desctructor in a derived class, if:in a derived class, if your base class has a
virtual
destructor, your own destructor is automaticallyvirtual
. You might need an explicitly defined destructor for other reasons, but there’s no need to redeclare a destructor simply to make sure it isvirtual
. No matter whether you declare it with thevirtual
keyword, declare it without thevirtual
keyword, or don’t declare it at all, it’s stillvirtual
.?
None of that is applicable here. NotificationsProxy base class is CValidationInterface which does not have a virtual destructor. If it did have a virtual destructor, this could be declared with override
instead of virtual
, and registration would no longer require shared pointers with custom deleters. But these changes would increase size of the bugfix and are better saved for a followup PR. This PR is just trying to do the simple thing and
- Declare virtual destructors where there are virtual functions
- Use
override
instead ofvirtual
where it prevents derived class code from being silently broken by future changes to the base class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I missed that:
bitcoin/src/validationinterface.h
Lines 77 to 83 in 41b0baf
class CValidationInterface { | |
protected: | |
/** | |
* Protected destructor so that instances can only be deleted by derived classes. | |
* If that restriction is no longer desired, this should be made public and virtual. | |
*/ | |
~CValidationInterface() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 41b0baf, tested on Linux Mint 19.3.
re-ACK 17316b3, only suggested changes about class destructors. |
Reverted. |
Is my previous ACK valid if a commit is reverted to the ACKed one? |
Code review ACK 41b0baf (previously acked but just repeating for clarity). This might be ready for merge. Lot of tweaking at the end but changes: f80e957 -> e66da18 This is a minimal bugfix and good candidate for merging before branching or for backporting. |
41b0baf gui: Handle WalletModel::unload asynchronous (João Barbosa) ab31b9d Fix wallet unload race condition (Russell Yanofsky) Pull request description: This PR consists in two fixes. The first fixes a concurrency issues with `boost::signals2`. The second fixes a wallet model destruction while it's being used. From boost signal documentation at https://www.boost.org/doc/libs/1_72_0/doc/html/signals2/thread-safety.html: > When a signal is invoked by calling signal::operator(), the invocation first acquires a lock on the signal's mutex. Then it obtains a handle to the signal's slot list and combiner. Next it releases the signal's mutex, before invoking the combiner to iterate through the slot list. This means that `UnregisterValidationInterface` doesn't prevent more calls to that interface. The fix consists in capturing the `shared_ptr<CValidationInterface>` in each internal slot. The GUI bug is fixed by using a `Qt::QueuedConnection` in the `WalletModel::unload` connection. ACKs for top commit: ryanofsky: Code review ACK 41b0baf. Only change is moving assert as suggested hebasto: ACK 41b0baf, tested on Linux Mint 19.3. Tree-SHA512: 4f712d8de65bc1214411831250de5dc0a9fd505fb84da5baf9f2cc4d551bc3abffc061616f00afe43dba7525af2cd96c9b54aeead9383145e3b8801f25d85f50
- rpc/wallet: initialize nFeeRequired to avoid using garbage value on failure bitcoin#18274 - httpserver: use own HTTP status codes bitcoin#18168 - tests: Add fuzzing harnesses for various Base{32,58,64} and hex related functions bitcoin#17229 - util: Don't allow Base32/64-decoding or ParseMoney(…) on strings with embedded NUL characters. Add tests. bitcoin#17753 - util: Fail to parse empty string in ParseMoney bitcoin#18225 - util: Fail to parse whitespace-only strings in ParseMoney(...) (instead of parsing as zero) bitcoin#18270 - Replace the LogPrint function with a macro bitcoin#17218 - Fix wallet unload race condition bitcoin#18338 - qt: Fix Window -> Minimize menu item bitcoin#18549 - windows: Enable heap terminate-on-corruption bitcoin#17916
9ce0335 Add comment saying how to fix clientInvoke missing Proxy.Context assert (Russell Yanofsky) 31b4f1b Add shared_ptr ownership and lifetime support (Russell Yanofsky) 27f8a35 Add saveCallback / callbackSaved test setup (Russell Yanofsky) 5390a1b Add support for passing shared_ptrs without extending lifetime (Russell Yanofsky) 39bbf74 Add CustomReadField priority param for more flexibility and consistency with CustomBuildField (Russell Yanofsky) da489be Fix bugs in PassField overload for callback objects passed by reference (Russell Yanofsky) ab4568b Add test coverage for thread map and callbacks (Russell Yanofsky) Pull request description: This is needed after bitcoin/bitcoin#18338 which changed `handleNotifications()` `Notifications&` callback argument to `std::shared_ptr<Notifications>`. Easiest way to support this was to change `ProxyServerBase` reference from a raw pointer to shared pointer and make necessary BuildField changes to pass along the `shared_ptr`. Alternative might have been to add more generic cleanup support to `ProxyServerBase` instead of hardcoding `shared_ptr`. The change also required making the ReadField callback overload more generic, which was a straightforward but kind of big change that touched a lot of code. There weren't any unit tests for callback support previously, so a lot of new test coverage was added. It includes coverage for `shared_ptr` lifetime correctness, making sure there's an IPC call decrementing server `shared_ptr` reference count when client `shared_ptr` proxy is reset. Top commit has no ACKs. Tree-SHA512: 96607bb339a5184ab34f9c133f7d5a9a6ac37a7a71187bdd6bd15235b44690cf19d8c09ad6e6966e51886cb00036cecf268406af7afd749b659970e9b00179ee
Summary: 41b0baf43c243b64b394e774e336475a489cca2b gui: Handle WalletModel::unload asynchronous (João Barbosa) ab31b9d6fe7b39713682e3f52d11238dbe042c16 Fix wallet unload race condition (Russell Yanofsky) Pull request description: This PR consists in two fixes. The first fixes a concurrency issues with `boost::signals2`. The second fixes a wallet model destruction while it's being used. From boost signal documentation at https://www.boost.org/doc/libs/1_72_0/doc/html/signals2/thread-safety.html: > When a signal is invoked by calling signal::operator(), the invocation first acquires a lock on the signal's mutex. Then it obtains a handle to the signal's slot list and combiner. Next it releases the signal's mutex, before invoking the combiner to iterate through the slot list. This means that `UnregisterValidationInterface` doesn't prevent more calls to that interface. The fix consists in capturing the `shared_ptr<CValidationInterface>` in each internal slot. The GUI bug is fixed by using a `Qt::QueuedConnection` in the `WalletModel::unload` connection. ACKs for top commit: ryanofsky: Code review ACK 41b0baf43c243b64b394e774e336475a489cca2b. Only change is moving assert as suggested hebasto: ACK 41b0baf43c243b64b394e774e336475a489cca2b, tested on Linux Mint 19.3. Tree-SHA512: 4f712d8de65bc1214411831250de5dc0a9fd505fb84da5baf9f2cc4d551bc3abffc061616f00afe43dba7525af2cd96c9b54aeead9383145e3b8801f25d85f50 Backport of Core [[bitcoin/bitcoin#18338 | PR18338]] Test Plan: `ninja check check-functional` for sanity Reviewers: #bitcoin_abc, deadalnix Reviewed By: #bitcoin_abc, deadalnix Differential Revision: https://reviews.bitcoinabc.org/D6644
This PR consists in two fixes. The first fixes a concurrency issues with
boost::signals2
. The second fixes a wallet model destruction while it's being used.From boost signal documentation at https://www.boost.org/doc/libs/1_72_0/doc/html/signals2/thread-safety.html:
This means that
UnregisterValidationInterface
doesn't prevent more calls to that interface. The fix consists in capturing theshared_ptr<CValidationInterface>
in each internal slot.The GUI bug is fixed by using a
Qt::QueuedConnection
in theWalletModel::unload
connection.