-
Notifications
You must be signed in to change notification settings - Fork 37.7k
multiprocess: Lock CapnpProtocol::m_loop with mutex #31815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This change is intended to fix occasional ipc_tests unit tests and rpc_misc.py functional tests errors that happen on mac os, which sometimes fail with error "terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument" as reported bitcoin-core/libmultiprocess#154 This error could happen when `m_loop.reset();` in `CapnpProtocol::startLoop` executes before the `m_loop->removeClient(lock);` call in the `~CapnpProtocol` destructor returns, causing an failure to reacquire the `EventLoop::m_mutex` inside the `removeClient` method. Fix this error by adding a new mutex to guard access to the `CapnpProtocol::m_loop` variable and making sure `m_loop.reset();` cannot be called until after `m_loop->removeClient(lock);` returns.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31815. ReviewsSee the guideline for information on the review process. |
So far as of baea320 it seems like this change does not solve the problem it was supposed to fix. There is a new "libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument" failure https://github.com/bitcoin/bitcoin/actions/runs/13206947916/job/36872233303?pr=30975#step:7:2433 in 553a7e2 which includes a cherry-picked version of this fix. |
This might have a bug somewhere, in #30975:
https://github.com/bitcoin/bitcoin/actions/runs/13206947916/job/36872233303?pr=30975 |
Currently, there are two cases where code may attempt to lock `EventLoop::mutex` after the `EventLoop::loop()` method has finished running and the `EventLoop` object is potentially destroyed. Both cases happen due to uses of the `Unlock()` utility function which unlocks a mutex temporarily, runs a callback function and relocks it. The first case happens in `EventLoop::removeClient` when the `EventLoop::done()` condition is reach and it calls `Unlock()` in order to be able to call `write(post_fd, ...)` without blocking and wake the EventLoop thread. In this case, since `EventLoop` execution is done there is not really any point to using `Unlock()` and relocking the mutex after calling `write()` so the code is updated to just use a simple `lock.unlock()` call and permanently let go of the lock and try to reacquire it. The second case happens in `EventLoop::post` where `Unlock()` is also used in a similar way, and depending on thread scheduling (if the post thread is delayed for a long time before relocking) can result in relocking `EventLoop::m_mutex` after calling `write()` to fail. In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new `addClient`/`removeClient` calls are just added to this code, so the `EventLoop::loop()` function will just not exit while the `post()` function is waiting. These two changes are being labeled as a bugfix even though there is not technically a bug here in libmultiprocess code or API. The `EventLoop` object itself was perfectly safe before these changes as long as outside code was waited for `EventLoop` methods to finish executing before deleting the `EventLoop` object. Originally the multiprocess code added in bitcoin/bitcoin#10102 did this, but later as more features were added to binding and connecting to unix sockets, and unit tests were added which would immediately delete the `EventLoop` object after `EventLoop::loop()` returned it meant these two methods could start failing due to their uses of `Unlock()` depending on thread scheduling. A previous attempt was made to fix this bug in bitcoin/bitcoin#31815 outside of libmultiprocess code. But it only addressed the first case of a failing `Unlock()` in `EventLoop::removeClient`, not the second case in `EventLoop::post`. This first case described above was not actually observed but is just theoretically possible. The second case happens intermittently on macos and was reported bitcoin-core#154
Currently, there are two cases where code may attempt to lock `EventLoop::mutex` after the `EventLoop::loop()` method has finished running. This is not a problem by itself, but is a problem if there is external code which deletes the `EventLoop` object immediately after the method returns, which is the case in Bitcoin Core. Both cases of locking after the loop method returns happen due to uses of the `Unlock()` utility function which unlocks a mutex temporarily, runs a callback function and relocks it. The first case happens in `EventLoop::removeClient` when the `EventLoop::done()` condition is reached and it calls `Unlock()` in order to be able to call `write(post_fd, ...)` without blocking and wake the EventLoop thread. In this case, since `EventLoop` execution is done there is not really any point to using `Unlock()` and relocking, so the code is changed to use a simple `lock.unlock()` call and not try to relock. The second case happens in `EventLoop::post` where `Unlock()` is used in a similar way, and depending on thread scheduling (if the thread is delayed for a long time before relocking) can result in failing to relock `EventLoop::m_mutex` after calling `write()`. In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new `addClient`/`removeClient` calls are added to this code, so the `EventLoop::loop()` function will never exit while the `post()` function is waiting. These two changes are being labeled as a bugfix even though there is not technically a bug here in the libmultiprocess code or API. The `EventLoop` object itself was safe before these changes as long as outside code waited for `EventLoop` methods to finish executing before deleting the `EventLoop` object. Originally the multiprocess code added in bitcoin/bitcoin#10102 did this, but later as more features were added for binding and connecting to unix sockets, and unit tests were added which would immediately delete the `EventLoop` object after `EventLoop::loop()` returned, it meant this code could now start failing to relock after unlocking. A previous attempt was made to fix this bug in bitcoin/bitcoin#31815 outside of libmultiprocess code. But it only addressed the first case of a failing `Unlock()` in `EventLoop::removeClient()`, not the second case in `EventLoop::post()`. This first case described above was not observed but is theoretically possible. The second case happens intermittently on macos and was reported bitcoin-core#154
f8cbd0d bugfix: Do not lock EventLoop::mutex after EventLoop is done (user) Pull request description: Currently, there are two cases where code may attempt to lock `EventLoop::mutex` after the `EventLoop::loop()` method has finished running. This is not a problem by itself, but is a problem if there is external code which deletes the `EventLoop` object immediately after the method returns, which is the case in Bitcoin Core. Both cases of locking after the loop method returns happen due to uses of the `Unlock()` utility function which unlocks a mutex temporarily, runs a callback function and relocks it. The first case happens in `EventLoop::removeClient` when the `EventLoop::done()` condition is reached and it calls `Unlock()` in order to be able to call `write(post_fd, ...)` without blocking and wake the EventLoop thread. In this case, since `EventLoop` execution is done there is not really any point to using `Unlock()` and relocking, so the code is changed to use a simple `lock.unlock()` call and not try to relock. The second case happens in `EventLoop::post` where `Unlock()` is used in a similar way, and depending on thread scheduling (if the thread is delayed for a long time before relocking) can result in failing to relock `EventLoop::m_mutex` after calling `write()`. In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new `addClient`/`removeClient` calls are added to this code, so the `EventLoop::loop()` function will never exit while the `post()` function is waiting. These two changes are being labeled as a bugfix even though there is not technically a bug here in the libmultiprocess code or API. The `EventLoop` object itself was safe before these changes as long as outside code waited for `EventLoop` methods to finish executing before deleting the `EventLoop` object. Originally the multiprocess code added in bitcoin/bitcoin#10102 did this, but later as more features were added for binding and connecting to unix sockets, and unit tests were added which would immediately delete the `EventLoop` object after `EventLoop::loop()` returned, it meant this code could now start failing to relock after unlocking. A previous attempt was made to fix this bug in bitcoin/bitcoin#31815 outside of libmultiprocess code. But it only addressed the first case of a failing `Unlock()` in `EventLoop::removeClient()`, not the second case in `EventLoop::post()`. This first case described above was not observed but is theoretically possible. The second case happens intermittently on macos and was reported #154 Top commit has no ACKs. Tree-SHA512: 378d360de84b40f0ddcf2ebc5e11fb74adf6d133e94f148ebb7e0c9926836f276ac79ed1d8cb1a559a7aa756939071e8d252d4f03fe6816760807857fba646cb
Currently, there are two cases where code may attempt to lock `EventLoop::mutex` after the `EventLoop::loop()` method has finished running. This is not a problem by itself, but is a problem if there is external code which deletes the `EventLoop` object immediately after the method returns, which is the case in Bitcoin Core. Both cases of locking after the loop method returns happen due to uses of the `Unlock()` utility function which unlocks a mutex temporarily, runs a callback function and relocks it. The first case happens in `EventLoop::removeClient` when the `EventLoop::done()` condition is reached and it calls `Unlock()` in order to be able to call `write(post_fd, ...)` without blocking and wake the EventLoop thread. In this case, since `EventLoop` execution is done there is not really any point to using `Unlock()` and relocking, so the code is changed to use a simple `lock.unlock()` call and not try to relock. The second case happens in `EventLoop::post` where `Unlock()` is used in a similar way, and depending on thread scheduling (if the thread is delayed for a long time before relocking) can result in failing to relock `EventLoop::m_mutex` after calling `write()`. In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new `addClient`/`removeClient` calls are added to this code, so the `EventLoop::loop()` function will never exit while the `post()` function is waiting. These two changes are being labeled as a bugfix even though there is not technically a bug here in the libmultiprocess code or API. The `EventLoop` object itself was safe before these changes as long as outside code waited for `EventLoop` methods to finish executing before deleting the `EventLoop` object. Originally the multiprocess code added in bitcoin/bitcoin#10102 did this, but later as more features were added for binding and connecting to unix sockets, and unit tests were added which would immediately delete the `EventLoop` object after `EventLoop::loop()` returned, it meant this code could now start failing to relock after unlocking. A previous attempt was made to fix this bug in bitcoin/bitcoin#31815 outside of libmultiprocess code. But it only addressed the first case of a failing `Unlock()` in `EventLoop::removeClient()`, not the second case in `EventLoop::post()`. This first case described above was not observed but is theoretically possible. The second case happens intermittently on macos and was reported bitcoin-core#154
48d01bc bugfix: Do not lock EventLoop::mutex after EventLoop is done (Ryan Ofsky) Pull request description: Currently, there are two cases where code may attempt to lock `EventLoop::mutex` after the `EventLoop::loop()` method has finished running. This is not a problem by itself, but is a problem if there is external code which deletes the `EventLoop` object immediately after the method returns, which is the case in Bitcoin Core. Both cases of locking after the loop method returns happen due to uses of the `Unlock()` utility function which unlocks a mutex temporarily, runs a callback function and relocks it. The first case happens in `EventLoop::removeClient` when the `EventLoop::done()` condition is reached and it calls `Unlock()` in order to be able to call `write(post_fd, ...)` without blocking and wake the EventLoop thread. In this case, since `EventLoop` execution is done there is not really any point to using `Unlock()` and relocking, so the code is changed to use a simple `lock.unlock()` call and not try to relock. The second case happens in `EventLoop::post` where `Unlock()` is used in a similar way, and depending on thread scheduling (if the thread is delayed for a long time before relocking) can result in failing to relock `EventLoop::m_mutex` after calling `write()`. In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new `addClient`/`removeClient` calls are added to this code, so the `EventLoop::loop()` function will never exit while the `post()` function is waiting. These two changes are being labeled as a bugfix even though there is not technically a bug here in the libmultiprocess code or API. The `EventLoop` object itself was safe before these changes as long as outside code waited for `EventLoop` methods to finish executing before deleting the `EventLoop` object. Originally the multiprocess code added in bitcoin/bitcoin#10102 did this, but later as more features were added for binding and connecting to unix sockets, and unit tests were added which would immediately delete the `EventLoop` object after `EventLoop::loop()` returned, it meant this code could now start failing to relock after unlocking. A previous attempt was made to fix this bug in bitcoin/bitcoin#31815 outside of libmultiprocess code. But it only addressed the first case of a failing `Unlock()` in `EventLoop::removeClient()`, not the second case in `EventLoop::post()`. This first case described above was not observed but is theoretically possible. The second case happens intermittently on macos and was reported #154 Top commit has no ACKs. Tree-SHA512: 378d360de84b40f0ddcf2ebc5e11fb74adf6d133e94f148ebb7e0c9926836f276ac79ed1d8cb1a559a7aa756939071e8d252d4f03fe6816760807857fba646cb
I reproduced the bug locally (see bitcoin-core/libmultiprocess#154 (comment)) and implemented a different fix in bitcoin-core/libmultiprocess#159, which should make these changes unnecessary, because instead of trying to delay eventloop destruction in Bitcoin code until all locks are released, it just avoids reacquiring a lock unnecessary and pretents eventloop::loop() from returning until all locks are released so it won't be destroyed. Will close this PR since these changes only incompletely fixed the bug and should no longer be needed. |
Thanks. I dropped this from #30975 and rebased on your latest #31741 which contains bitcoin-core/libmultiprocess#159. |
This change is intended to fix intermittent failures in
ipc_tests
unit test andrpc_misc.py
functional tests that happen on mac os, which sometimes fail with errorterminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
as reported bitcoin-core/libmultiprocess#154This error could happen when
m_loop.reset();
inCapnpProtocol::startLoop
executes before them_loop->removeClient(lock);
method returns in the~CapnpProtocol
destructor , causing an failure to reacquire theEventLoop::m_mutex
inside theEventLoop::removeClient
method, and the "mutex lock failed" exception to be thrown.Fix this error by adding a new mutex to guard the
CapnpProtocol::m_loop
variable and making surem_loop.reset();
cannot be called until afterm_loop->removeClient(lock);
returns.This PR is part of the process separation project.