Conversation

@ryanofsky ryanofsky commented Feb 7, 2025

This change is intended to fix intermittent failures in the ipc_tests unit test and the rpc_misc.py functional test that happen on macOS, which sometimes fail with the error "terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument", as reported in bitcoin-core/libmultiprocess#154.

This error could happen when `m_loop.reset();` in `CapnpProtocol::startLoop` executes before the `m_loop->removeClient(lock);` call in the `~CapnpProtocol` destructor returns, causing a failure to reacquire `EventLoop::m_mutex` inside the `EventLoop::removeClient` method and the "mutex lock failed" exception to be thrown.

Fix this error by adding a new mutex to guard the `CapnpProtocol::m_loop` variable and making sure `m_loop.reset();` cannot be called until after `m_loop->removeClient(lock);` returns.
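A minimal C++ sketch of the ordering guarantee described above, under stated assumptions: `EventLoopStub`, `client_removed`, and `run_once` are hypothetical stand-ins, not the actual Bitcoin Core types. The new mutex and a condition variable prevent the loop thread from calling `m_loop.reset()` until the destructor's `removeClient` stand-in has completed.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

// Stand-in for multiprocess::EventLoop; the flag models "removeClient returned".
struct EventLoopStub {
    bool client_removed{false};
};

struct CapnpProtocolSketch {
    std::unique_ptr<EventLoopStub> m_loop{std::make_unique<EventLoopStub>()};
    std::mutex m_loop_mutex;            // new mutex guarding m_loop
    std::condition_variable m_loop_cv;
    std::thread m_thread;

    CapnpProtocolSketch()
    {
        // Stand-in for the tail of CapnpProtocol::startLoop: block until the
        // destructor is finished with m_loop, then destroy it.
        m_thread = std::thread([this] {
            std::unique_lock lock{m_loop_mutex};
            m_loop_cv.wait(lock, [this] { return m_loop->client_removed; });
            m_loop.reset(); // safe now: the destructor no longer touches m_loop
        });
    }

    ~CapnpProtocolSketch()
    {
        {
            std::lock_guard lock{m_loop_mutex};
            m_loop->client_removed = true; // stand-in for m_loop->removeClient(lock)
        }
        m_loop_cv.notify_one();
        m_thread.join(); // loop thread exits before members are destroyed
    }
};

bool run_once()
{
    CapnpProtocolSketch protocol;
    return true; // destruction at scope exit does not race the loop thread
}
```

In the actual PR the wait happens on the capnp event-loop thread at the end of `startLoop`; this sketch only models the ordering guarantee, not the real API.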


This PR is part of the process separation project.

@DrahtBot DrahtBot commented Feb 7, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31815.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

@ryanofsky ryanofsky (Author) commented

So far, as of baea320, it seems like this change does not solve the problem it was supposed to fix.

There is a new "libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument" failure (https://github.com/bitcoin/bitcoin/actions/runs/13206947916/job/36872233303?pr=30975#step:7:2433) in 553a7e2, which includes a cherry-picked version of this fix.

@Sjors Sjors commented Feb 7, 2025

This might have a bug somewhere, in #30975:

test/ipc_tests.cpp:11: Entering test suite "ipc_tests"
test/ipc_tests.cpp:12: Entering test case "ipc_tests"
libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
2025-02-07T19:59:17.387846Z [unknown] [test/util/random.cpp:48] [SeedRandomStateForTest] Setting random seed for current tests to RANDOM_CTX_SEED=32db8d22914609b5d53f87f6947ca780fea50790b141c9ccc134e7a13f971121
2025-02-07T19:59:17.388318Z [test] [init/common.cpp:153] [LogPackageVersion] Bitcoin Core version v28.99.0-553a7e2f6664-dirty (release build)
2025-02-07T19:59:17.388468Z [test] [kernel/context.cpp:20] [operator()] Using the 'arm_shani(1way,2way)' SHA256 implementation
2025-02-07T19:59:17.397870Z [test] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client first request from current thread, constructing waiter
2025-02-07T19:59:17.398037Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client send FooInterface.add$Params (a = 1, b = 2)
2025-02-07T19:59:17.398114Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server recv request  #1 FooInterface.add$Params (a = 1, b = 2)
2025-02-07T19:59:17.398144Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server send response #1 FooInterface.add$Results (result = 3)
2025-02-07T19:59:17.398166Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client recv FooInterface.add$Results (result = 3)
2025-02-07T19:59:17.398206Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client send FooInterface.passOutPoint$Params (arg = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-07T19:59:17.398231Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server recv request  #2 FooInterface.passOutPoint$Params (arg = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-07T19:59:17.398244Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server send response #2 FooInterface.passOutPoint$Results (result = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-07T19:59:17.398262Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client recv FooInterface.passOutPoint$Results (result = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-07T19:59:17.398306Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client send FooInterface.passUniValue$Params (arg = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-07T19:59:17.398324Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server recv request  #3 FooInterface.passUniValue$Params (arg = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-07T19:59:17.398338Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server send response #3 FooInterface.passUniValue$Results (result = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-07T19:59:17.398354Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client recv FooInterface.passUniValue$Results (result = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-07T19:59:17.398384Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client send FooInterface.passTransaction$Params (arg = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-07T19:59:17.398404Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server recv request  #4 FooInterface.passTransaction$Params (arg = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-07T19:59:17.398448Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server send response #4 FooInterface.passTransaction$Results (result = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-07T19:59:17.398467Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client recv FooInterface.passTransaction$Results (result = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-07T19:59:17.398501Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client send FooInterface.passVectorChar$Params (arg = "Hello")
2025-02-07T19:59:17.398516Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server recv request  #5 FooInterface.passVectorChar$Params (arg = "Hello")
2025-02-07T19:59:17.398525Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server send response #5 FooInterface.passVectorChar$Results (result = "Hello")
2025-02-07T19:59:17.398538Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client recv FooInterface.passVectorChar$Results (result = "Hello")
2025-02-07T19:59:17.398663Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client send FooInterface.passBlockState$Params (arg = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-07T19:59:17.398704Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server recv request  #6 FooInterface.passBlockState$Params (arg = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-07T19:59:17.398725Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server send response #6 FooInterface.passBlockState$Results (result = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-07T19:59:17.398749Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client recv FooInterface.passBlockState$Results (result = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-07T19:59:17.398772Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client send FooInterface.passBlockState$Params (arg = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-07T19:59:17.398790Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server recv request  #7 FooInterface.passBlockState$Params (arg = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-07T19:59:17.398801Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server send response #7 FooInterface.passBlockState$Results (result = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-07T19:59:17.398819Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client recv FooInterface.passBlockState$Results (result = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-07T19:59:17.398843Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client send FooInterface.passScript$Params (arg = "[")
2025-02-07T19:59:17.398858Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server recv request  #8 FooInterface.passScript$Params (arg = "[")
2025-02-07T19:59:17.398868Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server send response #8 FooInterface.passScript$Results (result = "[")
2025-02-07T19:59:17.398883Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/b-test-58682} IPC client recv FooInterface.passScript$Results (result = "[")
2025-02-07T19:59:17.399029Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} IPC server destroy N2mp11ProxyServerIN3gen12FooInterfaceEEE
2025-02-07T19:59:17.399089Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} EventLoop::loop done, cancelling event listeners.
2025-02-07T19:59:17.399098Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-24701/58689} EventLoop::loop bye.

https://github.com/bitcoin/bitcoin/actions/runs/13206947916/job/36872233303?pr=30975

ryanofsky pushed a commit to ryanofsky/libmultiprocess that referenced this pull request Feb 10, 2025
Currently, there are two cases where code may attempt to lock `EventLoop::mutex`
after the `EventLoop::loop()` method has finished running and the `EventLoop` object
is potentially destroyed. Both cases happen due to uses of the `Unlock()` utility
function which unlocks a mutex temporarily, runs a callback function and relocks
it.

The first case happens in `EventLoop::removeClient` when the `EventLoop::done()`
condition is reached and it calls `Unlock()` in order to be able to call
`write(post_fd, ...)` without blocking and wake the EventLoop thread. In this
case, since `EventLoop` execution is done, there is not really any point in
using `Unlock()` and relocking the mutex after calling `write()`, so the code
is updated to use a simple `lock.unlock()` call, permanently letting go of the
lock without trying to reacquire it.

The second case happens in `EventLoop::post` where `Unlock()` is also used in a
similar way, and depending on thread scheduling (if the post thread is delayed
for a long time before relocking) the attempt to relock `EventLoop::m_mutex`
after calling `write()` can fail. In this case, since relocking the mutex is
actually necessary for the code that follows, the fix is different: new
`addClient`/`removeClient` calls are added to this code, so the
`EventLoop::loop()` function will not exit while the `post()` function is
waiting.

These two changes are being labeled as a bugfix even though there is not
technically a bug here in the libmultiprocess code or API. The `EventLoop`
object itself was perfectly safe before these changes as long as outside code
waited for `EventLoop` methods to finish executing before deleting the
`EventLoop` object. Originally the multiprocess code added in
bitcoin/bitcoin#10102 did this, but later, as more features were added for
binding and connecting to unix sockets and unit tests were added which would
immediately delete the `EventLoop` object after `EventLoop::loop()` returned,
these two methods could start failing due to their uses of `Unlock()`,
depending on thread scheduling.

A previous attempt was made to fix this bug in
bitcoin/bitcoin#31815 outside of libmultiprocess code.
But it only addressed the first case of a failing `Unlock()` in
`EventLoop::removeClient`, not the second case in `EventLoop::post`.

The first case described above was not actually observed but is theoretically
possible. The second case happens intermittently on macOS and was reported in
bitcoin-core#154
ryanofsky pushed a commit to ryanofsky/libmultiprocess that referenced this pull request Feb 10, 2025
Currently, there are two cases where code may attempt to lock
`EventLoop::mutex` after the `EventLoop::loop()` method has finished running.
This is not a problem by itself, but is a problem if there is external code
which deletes the `EventLoop` object immediately after the method returns,
which is the case in Bitcoin Core. Both cases of locking after the loop method
returns happen due to uses of the `Unlock()` utility function which unlocks a
mutex temporarily, runs a callback function and relocks it.

The first case happens in `EventLoop::removeClient` when the
`EventLoop::done()` condition is reached and it calls `Unlock()` in order to be
able to call `write(post_fd, ...)` without blocking and wake the EventLoop
thread. In this case, since `EventLoop` execution is done there is not really
any point to using `Unlock()` and relocking, so the code is changed to use a
simple `lock.unlock()` call and not try to relock.

The second case happens in `EventLoop::post` where `Unlock()` is used in a
similar way, and depending on thread scheduling (if the thread is delayed for a
long time before relocking) can result in failing to relock
`EventLoop::m_mutex` after calling `write()`. In this case, since relocking the
mutex is actually necessary for the code that follows, the fix is different:
new `addClient`/`removeClient` calls are added to this code, so the
`EventLoop::loop()` function will never exit while the `post()` function is
waiting.

These two changes are being labeled as a bugfix even though there is not
technically a bug here in the libmultiprocess code or API. The `EventLoop`
object itself was safe before these changes as long as outside code waited for
`EventLoop` methods to finish executing before deleting the `EventLoop` object.
Originally the multiprocess code added in
bitcoin/bitcoin#10102 did this, but later as more
features were added for binding and connecting to unix sockets, and unit tests
were added which would immediately delete the `EventLoop` object after
`EventLoop::loop()` returned, it meant this code could now start failing to
relock after unlocking.

A previous attempt was made to fix this bug in
bitcoin/bitcoin#31815 outside of libmultiprocess code.
But it only addressed the first case of a failing `Unlock()` in
`EventLoop::removeClient()`, not the second case in `EventLoop::post()`.

This first case described above was not observed but is theoretically possible.
The second case happens intermittently on macos and was reported
bitcoin-core#154
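The `addClient`/`removeClient` approach for the second case can be sketched as a simple client count that keeps `done()` false while `post()` is unlocked, so `loop()` cannot exit (and the object cannot be deleted) before `post()` relocks. All names here are simplified stand-ins for the real libmultiprocess code.

```cpp
#include <cassert>
#include <mutex>

struct EventLoopSketch {
    std::mutex m_mutex;
    int m_num_clients{0};

    // Callers hold m_mutex when adjusting the client count.
    void addClient() { ++m_num_clients; }
    void removeClient() { --m_num_clients; }
    bool done() const { return m_num_clients == 0; } // loop() exits when true

    int post()
    {
        std::unique_lock lock{m_mutex};
        addClient();  // keeps done() false while we are unlocked below
        lock.unlock();
        // ... write(post_fd, ...) would happen here to wake the loop thread ...
        lock.lock();  // safe to relock: loop() cannot have exited meanwhile
        removeClient();
        return m_num_clients;
    }
};
```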
ryanofsky added a commit to bitcoin-core/libmultiprocess that referenced this pull request Feb 10, 2025
f8cbd0d bugfix: Do not lock EventLoop::mutex after EventLoop is done (user)

Pull request description:

  Currently, there are two cases where code may attempt to lock `EventLoop::mutex` after the `EventLoop::loop()` method has finished running. This is not a problem by itself, but is a problem if there is external code which deletes the `EventLoop` object immediately after the method returns, which is the case in Bitcoin Core. Both cases of locking after the loop method returns happen due to uses of the `Unlock()` utility function which unlocks a mutex temporarily, runs a callback function and relocks it.

  The first case happens in `EventLoop::removeClient` when the `EventLoop::done()` condition is reached and it calls `Unlock()` in order to be able to call `write(post_fd, ...)` without blocking and wake the EventLoop thread. In this case, since `EventLoop` execution is done there is not really any point to using `Unlock()` and relocking, so the code is changed to use a simple `lock.unlock()` call and not try to relock.

  The second case happens in `EventLoop::post` where `Unlock()` is used in a similar way, and depending on thread scheduling (if the thread is delayed for a long time before relocking) can result in failing to relock `EventLoop::m_mutex` after calling `write()`. In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new `addClient`/`removeClient` calls are added to this code, so the `EventLoop::loop()` function will never exit while the `post()` function is waiting.

  These two changes are being labeled as a bugfix even though there is not technically a bug here in the libmultiprocess code or API. The `EventLoop` object itself was safe before these changes as long as outside code waited for `EventLoop` methods to finish executing before deleting the `EventLoop` object. Originally the multiprocess code added in
  bitcoin/bitcoin#10102 did this, but later as more features were added for binding and connecting to unix sockets, and unit tests were added which would immediately delete the `EventLoop` object after `EventLoop::loop()` returned, it meant this code could now start failing to relock after unlocking.

  A previous attempt was made to fix this bug in
  bitcoin/bitcoin#31815 outside of libmultiprocess code. But it only addressed the first case of a failing `Unlock()` in `EventLoop::removeClient()`, not the second case in `EventLoop::post()`.

  This first case described above was not observed but is theoretically possible. The second case happens intermittently on macos and was reported #154

Top commit has no ACKs.

Tree-SHA512: 378d360de84b40f0ddcf2ebc5e11fb74adf6d133e94f148ebb7e0c9926836f276ac79ed1d8cb1a559a7aa756939071e8d252d4f03fe6816760807857fba646cb
ryanofsky added a commit to ryanofsky/libmultiprocess that referenced this pull request Feb 10, 2025
ryanofsky added a commit to bitcoin-core/libmultiprocess that referenced this pull request Feb 10, 2025
48d01bc bugfix: Do not lock EventLoop::mutex after EventLoop is done (Ryan Ofsky)

@ryanofsky ryanofsky (Author) commented

I reproduced the bug locally (see bitcoin-core/libmultiprocess#154 (comment)) and implemented a different fix in bitcoin-core/libmultiprocess#159, which should make these changes unnecessary: instead of trying to delay event loop destruction in Bitcoin Core code until all locks are released, it avoids reacquiring a lock unnecessarily and prevents `EventLoop::loop()` from returning until all locks are released, so the object won't be destroyed prematurely. Will close this PR since these changes only incompletely fixed the bug and should no longer be needed.

@ryanofsky ryanofsky closed this Feb 10, 2025
@Sjors Sjors commented Feb 10, 2025

Thanks. I dropped this from #30975 and rebased on your latest #31741 which contains bitcoin-core/libmultiprocess#159.
