-
Notifications
You must be signed in to change notification settings - Fork 37.7k
wallet: Handle concurrent wallet loading #19300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Does the crash also happen in a python functional test? If yes, mind to add one? |
@MarcoFalke haven't tried but I'd say it should - but it will be one of those loop-until-it-happens test. I'll give it a try. |
does this need backport to 0.19? |
@MarcoFalke yes, and probably is a clean backport. |
Hi, just tested the running of several background processes of the following script :
Crash is immediate. Here's valgrind output:
|
@MarcoFalke pushed 50b7e82, I welcome any python ninja to make it more cool. |
for t in threads: | ||
t.join() | ||
global got_loading_error | ||
assert_equal(got_loading_error, True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can assert that a race happened. Races are only intermittent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one is pretty much guaranteed, loading takes a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "qa: Test concurrent wallet loading" (9b009fa)
I don't think we can assert that a race happened. Races are only intermittent.
I would agree with Marco that in general tests like this are fragile and bad and slow and more trouble than they are worth, but also agree with promag that in practice this test is unlikely to fail.
To avoid potential problems from this, I'd suggest adding a comment here like "got_loading_error condition is not guaranteed to be true but very unlikely to be false based on how long it takes to load a wallet", so in case this does fail, future developers don't waste time trying to figure it out and can ignore or remove the test
In general it's easier to write tests for race conditions by making them invasive and implementing them in c++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The race does indeed happen: https://cirrus-ci.com/task/5075565516423168?command=ci#L5121
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it was slower to spawn those threads than to load the wallet. Not sure about c++ test because it would need to force block inside loading. Maybe we could improve this first by waiting for all threads to be ready to load?
50b7e82
to
ca10f4e
Compare
@fscemama if you can please repeat the test but using this branch. |
Sure. Can you confirm I must compile this and re-test ? |
Successfull. Will send these kind of normal messages:
No crash. Hourra! |
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK.
btw, couldn't get an error with the testing bash script on master (343c0bf) without code patching.
After #19249 the LoadWallet()
function could be annotated with EXCLUSIVE_LOCKS_REQUIRED(!g_loading_wallet_mutex)
:
diff --git a/src/wallet/wallet.cpp b/src/wallet/wallet.cpp
index c93c79116..7e73a2bac 100644
--- a/src/wallet/wallet.cpp
+++ b/src/wallet/wallet.cpp
@@ -99,7 +99,7 @@ std::unique_ptr<interfaces::Handler> HandleLoadWallet(LoadWalletFn load_wallet)
return interfaces::MakeHandler([it] { LOCK(cs_wallets); g_load_wallet_fns.erase(it); });
}
-static Mutex g_loading_wallet_mutex;
+Mutex g_loading_wallet_mutex;
static Mutex g_wallet_release_mutex;
static std::condition_variable g_wallet_release_cv;
static std::set<std::string> g_loading_wallet_set GUARDED_BY(g_loading_wallet_mutex);
diff --git a/src/wallet/wallet.h b/src/wallet/wallet.h
index cf000b0b7..3013fb360 100644
--- a/src/wallet/wallet.h
+++ b/src/wallet/wallet.h
@@ -49,11 +49,13 @@ struct bilingual_str;
//! by the shared pointer deleter.
void UnloadWallet(std::shared_ptr<CWallet>&& wallet);
+extern Mutex g_loading_wallet_mutex;
+
bool AddWallet(const std::shared_ptr<CWallet>& wallet);
bool RemoveWallet(const std::shared_ptr<CWallet>& wallet);
std::vector<std::shared_ptr<CWallet>> GetWallets();
std::shared_ptr<CWallet> GetWallet(const std::string& name);
-std::shared_ptr<CWallet> LoadWallet(interfaces::Chain& chain, const WalletLocation& location, bilingual_str& error, std::vector<bilingual_str>& warnings);
+std::shared_ptr<CWallet> LoadWallet(interfaces::Chain& chain, const WalletLocation& location, bilingual_str& error, std::vector<bilingual_str>& warnings) EXCLUSIVE_LOCKS_REQUIRED(!g_loading_wallet_mutex);
std::unique_ptr<interfaces::Handler> HandleLoadWallet(LoadWalletFn load_wallet);
enum class WalletCreationStatus {
|
Indeed :) On master:
|
Testing this PR with two parallel testing scripts and got different error messages:
|
@hebasto yes that's expected - one is from concurrent load and the other from concurrent unload. |
ca10f4e
to
9fc44e8
Compare
9fc44e8
to
9b009fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 9b009fa, tested on Linux Mint 20 (x86_64):
$ cat ~/test.sh
for i in {1..2}
do
bash -c "echo $1; src/bitcoin-cli -regtest loadwallet foo"
bash -c "echo $1; src/bitcoin-cli -regtest unloadwallet foo"
done
sleep 1s
$ ~/test.sh "A:" & ~/test.sh "B:"
[1] 46177
A:
B:
error code: -4
error message:
Wallet already being loading.
B:
error code: -18
error message:
Requested wallet does not exist or is not loaded
B:
error code: -4
error message:
Wallet already being loading.
B:
error code: -18
error message:
Requested wallet does not exist or is not loaded
{
"name": "foo",
"warning": ""
}
A:
A:
{
"name": "foo",
"warning": ""
}
A:
Also verified:
$ ./test/functional/wallet_multiwallet.py
$ ./test/functional/wallet_multiwallet.py --usecli
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review good-but-not-ideal ACK 9b009fa
This is an improvement but I don't think it's the ideal fix. A friendlier API would just require callers to write something like:
status = loadwallet("mywallet")
if status not in (success, wallet_already_loaded): raise Error
listtransactions("mywallet")
instead of:
while (status = loadwallet("mywallet") == wallet_already_loading:
sleep()
if status not in (success, wallet_already_loaded): raise Error
listtransactions("mywallet")
This could be implemented by making the second loadwallet call block and wait for the first loadwallet call to finish, and then return the existing "duplicate wallet" error, instead of introducing a new error string and requiring the caller to poll before using the wallet. This change would also make simultaneous wallet load calls work more similar to the the way simultaneous wallet unload calls already work. I don't think it would be too much more complex: you'd have to add a condition variable wait and notify but could remove the new error handling.
But anyway, current fix is reasonable, fine to merge, and good to see implemented
for t in threads: | ||
t.join() | ||
global got_loading_error | ||
assert_equal(got_loading_error, True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "qa: Test concurrent wallet loading" (9b009fa)
I don't think we can assert that a race happened. Races are only intermittent.
I would agree with Marco that in general tests like this are fragile and bad and slow and more trouble than they are worth, but also agree with promag that in practice this test is unlikely to fail.
To avoid potential problems from this, I'd suggest adding a comment here like "got_loading_error condition is not guaranteed to be true but very unlikely to be false based on how long it takes to load a wallet", so in case this does fail, future developers don't waste time trying to figure it out and can ignore or remove the test
In general it's easier to write tests for race conditions by making them invasive and implementing them in c++
If the 2nd waits then why return an error? Edit: actually, I wonder if loading an already loaded wallet should be an error in the first place.
The 2nd concurrent unload doesn't wait. I'm happy to make changes but I think current API is not ideal for concurrent clients. I had the idea of counting loads (incr load and decr unload) and then effective unload would happen when count is zero. This would simplify concurrent clients using the same wallet - each client must load->use->unload. But this is breaking change and so I think a new pair of RPC methods would be better like |
Because I think the point of changing this is to fix a race condition, not add a new one. If you load the same wallet twice one load should succeed, and one load should return error duplicate wallet. There shouldn't be an unpredictable length of time where both calls mysteriously succeed. Also, I presume code would be simpler not having a special case to suppress the error.
This would be a reasonable design, but I don't see a reason not to return the existing error. At this point all it would be doing is hiding information we know and breaking backwards compatibility.
Oh, I figured it did, but didn't look closely. Again I think it'd be better if it waited, to avoid any need to poll when you need a wallet unloaded. And I think it'd simplify both server & client implementations to drop the extra error condition and just return "succeeded", "failed", or "already not loaded", never "currently unloading".
Yes, I wouldn't suggest this. I think the API I'm suggesting is pretty ideal for concurrent clients because it would let them unload & unload wallets reliably and never poll. The only place where reference counting would be useful would be for concurrent clients that have no way of communicating with each other. But even in this case reference counting would be unreliable and leaky if something ever went wrong like a connection getting dropped or a client process being killed. Overall, I think the current fix in 9b009fa is good, and definitely better than having the bug. I think it would be simpler on both server and client sides if load and unload RPCs never returned "currently loading" or "currently unloading" and just waited to return definitive statuses. But this isn't necessary and I wouldn't want it to hold things up. I appreciate you fixing this and considering my suggestions! |
This PR might be close to being mergeable. It has acks from me and hebasto and some review comments from Marco. Suggestions I made to improve this aren't blocking and could be followed up later. |
I disagree, current API (with or without your suggestion) doesn't make it easy to concurrently use a wallet (load+use+unload) - client A unloads and messes client B. And in the case some coordination exists on the client side, those improvements aren't really necessary.
Right, For this I thought about a new RPC endpoint where the wallet counting would be tied to the connection and load/unload would happen indirectly. |
I don't think we disagree about anything here. This is irrelevant. Neither your fix, nor the one I'm suggesting tries to do this, because it would require storing client state in the server, which would be an awkward and unreliable way to design things.
Here is the use case: I want to write little shell script that unloads a wallet whether or not is it already loaded, maybe so I can back it up. Or I want to write a little shell script that loads a wallet, whether or not it is already loaded, maybe so I can check my balance or perform a transaction. With the current API writing either of these scripts requires checking 4 for different error states (success/failure/in-progress/already-loaded-or-unloaded) and writing a loop to poll the server. With the API I am proposing, no polling is required and there are only 3 states to check instead of 4. Also, the server-side implementation should be simpler. I am happy to implement it as a followup if the current PR is merged. |
Concept ACK 9b009fa I have not reviewed the code |
@ryanofsky Agree in making those improvements for future releases? IMO this is enough as a fix for latest version and to backport. |
Yes I acked this PR twice and commented repeatedly it would be fine to improve this later. |
Github-Pull: bitcoin#19300 Rebased-From: b9971ae
Github-Pull: bitcoin#19300 Rebased-From: 9b009fa
2b79ac7 Clean up separated ban/discourage interface (Pieter Wuille) 0477348 Replace automatic bans with discouragement filter (Pieter Wuille) e7f06f9 test: remove Cirrus CI FreeBSD job (fanquake) eb6b82a qa: Test concurrent wallet loading (João Barbosa) c9b49d2 wallet: Handle concurrent wallet loading (João Barbosa) cf0b5a9 tests: Check that segwit inputs in psbt have both UTXO types (Andrew Chow) 3228b59 psbt: always put a non_witness_utxo and don't remove it (Andrew Chow) ed5ec30 psbt: Allow both non_witness_utxo and witness_utxo (Andrew Chow) 68e0e6f rpc: show both UTXOs in decodepsbt (Andrew Chow) 27786d0 trivial: Suggested cleanups to surrounding code (Russell Yanofsky) 654420d wallet: Minimal fix to restore conflicted transaction notifications (Russell Yanofsky) febebc4 Fix WSL file locking by using flock instead of fcntl (Samuel Dobson) 5c7151a gui: update Qt base translations for macOS release (fanquake) c219d21 build: improved output of configure for build OS (sachinkm77) 0596a6e util: Don't reference errno when pthread fails. (MIZUTA Takeshi) Pull request description: Currently backports the following to the 0.20 branch: * #18700 - Fix locking on WSL using flock instead of fcntl * #18982 - wallet: Minimal fix to restore conflicted transaction notifications * #19059 - gui: update Qt base translations for macOS release * #19152 - build: improve build OS configure output * #19194 - util: Don't reference errno when pthread fails. * #19215 - psbt: Include and allow both non_witness_utxo and witness_utxo for segwit inputs * #19219 - Replace automatic bans with discouragement filter * #19300 - wallet: Handle concurrent wallet loading ACKs for top commit: amitiuttarwar: ACK 0477348 2b79ac7 by comparing to original changes, double checking the diff sipa: utACK 2b79ac7 laanwj: ACK 2b79ac7 Tree-SHA512: e5eb31d08288fa4a6e8aba08e60b16ce1988f14f249238b1cfd18ab2c8f6fcd9f07e3c0c491d32d2361fca26e3037fb0374f37464bddcabecea29069d6737539
Github-Pull: #19300 Rebased-From: b9971ae
Github-Pull: #19300 Rebased-From: 9b009fa
Github-Pull: bitcoin#19300 Rebased-From: b9971ae
Github-Pull: bitcoin#19300 Rebased-From: 9b009fa
…t_loading_error fab48da test: Fix intermittent wallet_multiwallet issue with got_loading_error (MarcoFalke) fa8e15f test: pep8 wallet_multiwallet.py (MarcoFalke) Pull request description: Failing the test after 10 iterations without a loading error is problematic because it may take 11 iterations to get a loading error. Fix that by running until a loading error occurs, which should happen in almost all runs within the first 10 iterations. ACKs for top commit: ryanofsky: Code review ACK fab48da. This seems like a good workaround. I think more ideally think load and unload RPCs would not have racy status reporting (suggested previously bitcoin/bitcoin#19300 (review) and Tree-SHA512: 6b80b26d916276efe2a01af93bca7dbf71a3e67db9d3deb15175070719bf7d1325a1410d93e74c0316942e388faa2ba185dc9d3759c82d1c73c3c509b9997f05
…with got_loading_error fab48da test: Fix intermittent wallet_multiwallet issue with got_loading_error (MarcoFalke) fa8e15f test: pep8 wallet_multiwallet.py (MarcoFalke) Pull request description: Failing the test after 10 iterations without a loading error is problematic because it may take 11 iterations to get a loading error. Fix that by running until a loading error occurs, which should happen in almost all runs within the first 10 iterations. ACKs for top commit: ryanofsky: Code review ACK fab48da. This seems like a good workaround. I think more ideally think load and unload RPCs would not have racy status reporting (suggested previously bitcoin#19300 (review) and Tree-SHA512: 6b80b26d916276efe2a01af93bca7dbf71a3e67db9d3deb15175070719bf7d1325a1410d93e74c0316942e388faa2ba185dc9d3759c82d1c73c3c509b9997f05
Summary: This is a backport of [[bitcoin/bitcoin#19300 | core#19300]] Test Plan: `ninja all check-all` Reviewers: #bitcoin_abc, majcosta Reviewed By: #bitcoin_abc, majcosta Differential Revision: https://reviews.bitcoinabc.org/D9967
This PR handles concurrent wallet loading.
This can be tested by running in parallel the following script a couple of times:
Eventually the error occurs:
For reference, loading and already loaded wallet gives:
Fixes #19232.