Skip to content

Conversation

laanwj
Copy link
Member

@laanwj laanwj commented May 16, 2025

Alternate implementation of fd iteration for Linux.

It looks like std::filesystem::directory_iterator, introduced in #32343, has an intermittent issue that causes a hang of the CI (#32524). i have not been able to reproduce this locally, and the CI situation makes it difficult to get more details about the problem.

So to work around that, this PR replaces the use of directory_iterator with the equivalent POSIX opendir/readdir//closedir usage. It is only slightly more verbose. In all testing up to now this appears to make the problem go away.

@DrahtBot
Copy link
Contributor

DrahtBot commented May 16, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32529.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

No conflicts as of last run.

@laanwj laanwj force-pushed the 2025-05-fdclose-debug-ci-hang branch from 20db17c to 8097d29 Compare May 16, 2025 12:57
@laanwj
Copy link
Member Author

laanwj commented May 16, 2025

ARM run failure is interesting:

fatal error: in "system_tests/run_command": subprocess::CalledProcessError: /proc/<pid>/fd iteration failed: No such file or directory (2)

(proabably /proc isn't mounted there, but why wasn't it erroring with the previous implementation)

Edit: i think this was an implementation problem, according to the documentation on the horrors of POSIX, errno needs to be set to zero explicitly to be able to detect errors during readdir.

@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Task no wallet, libbitcoinkernel: https://github.com/bitcoin/bitcoin/runs/42360298658
LLM reason (✨ experimental): The CI failure is due to the "system_tests" failing.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@laanwj laanwj force-pushed the 2025-05-fdclose-debug-ci-hang branch 2 times, most recently from ee39b69 to 21c9fae Compare May 16, 2025 21:26
@maflcko
Copy link
Member

maflcko commented May 17, 2025

I haven't looked at the code, but the CI passed 50 times: https://cirrus-ci.com/task/6662743735926784

@laanwj
Copy link
Member Author

laanwj commented May 17, 2025

I haven't looked at the code, but the CI passed 50 times: https://cirrus-ci.com/task/6662743735926784

Okay. So unless the original code is using directory_iterator wrong (which i don't think so), we can be reasonably sure there's an intermittent bug in the implementation of it on at least some libc++ versions.

i'll clean up the code ~monday and open it for review.

@laanwj laanwj force-pushed the 2025-05-fdclose-debug-ci-hang branch from 21c9fae to 7177fbe Compare May 18, 2025 06:31
@laanwj laanwj changed the title [DONOTMERGE] subprocess: replace fs::directory_iterator with readdir subprocess: replace fs::directory_iterator with readdir May 19, 2025
@laanwj laanwj closed this May 19, 2025
@laanwj
Copy link
Member Author

laanwj commented May 19, 2025

This isn't necessary if it's worked around some other way #32524 (comment)

fanquake added a commit that referenced this pull request Jul 26, 2025
…RunCommandJSON"

faa1c3e Revert "Merge #32343: common: Close non-std fds before exec in RunCommandJSON" (MarcoFalke)

Pull request description:

  After a fork() in a multithreaded program, the child can safely
  call only async-signal-safe functions (see [signal-safety(7)](https://www.man7.org/linux/man-pages/man7/signal-safety.7.html))
  until such time as it calls execv.

  The standard library (`std` namespace) is not async-signal-safe. Also, `throw`, isn't.

  There was an alternative implementation using `readdir` (#32529), but that isn't async-signal-safe either, and that implementation was still using `throw`.

  So temporarily revert this feature.

  A follow-up in the future can add it back, using only async-signal-safe functions, or by using a different approach.

  Fixes #32524
  Fixes #33015
  Fixes #32855

  For reference, a failure can manifest in the GCC debug mode:

  * While `fork`ing, a debug mode mutex is held (by any other thread).
  * The `fork`ed child tries to use the stdard libary before `execv` and deadlocks.

  This may look like the following:

  ```
  (gdb) thread apply all bt

  Thread 1 (Thread 0xf58f4b40 (LWP 774911) "b-httpworker.2"):
  #0  0xf7f4f589 in __kernel_vsyscall ()
  #1  0xf79e467e in ?? () from /lib32/libc.so.6
  #2  0xf79eb582 in pthread_mutex_lock () from /lib32/libc.so.6
  #3  0xf7d93bf2 in ?? () from /lib32/libstdc++.so.6
  #4  0xf7d93f36 in __gnu_debug::_Safe_iterator_base::_M_attach(__gnu_debug::_Safe_sequence_base*, bool) () from /lib32/libstdc++.so.6
  #5  0x5668810a in __gnu_debug::_Safe_iterator_base::_Safe_iterator_base (this=0xf58f13ac, __seq=0xf58f13f8, __constant=false) at /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/debug/safe_base.h:91
  #6  0x56ddfb50 in __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<int*, std::__cxx1998::vector<int, std::allocator<int> > >, std::__debug::vector<int, std::allocator<int> >, std::forward_iterator_tag>::_Safe_iterator (this=0xf58f13a8, __i=3, __seq=0xf58f13f8) at /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/debug/safe_iterator.h:162
  #7  0x56ddfacb in __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<int*, std::__cxx1998::vector<int, std::allocator<int> > >, std::__debug::vector<int, std::allocator<int> >, std::bidirectional_iterator_tag>::_Safe_iterator (this=0xf58f13a8, __i=3, __seq=0xf58f13f8) at /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/debug/safe_iterator.h:539
  #8  0x56ddfa5b in __gnu_debug::_Safe_iterator<__gnu_cxx::__normal_iterator<int*, std::__cxx1998::vector<int, std::allocator<int> > >, std::__debug::vector<int, std::allocator<int> >, std::random_access_iterator_tag>::_Safe_iterator (this=0xf58f13a8, __i=3, __seq=0xf58f13f8) at /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/debug/safe_iterator.h:687
  #9  0x56ddd3f6 in std::__debug::vector<int, std::allocator<int> >::begin (this=0xf58f13f8) at /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/debug/vector:300
  #10 0x57d83701 in subprocess::detail::Child::execute_child (this=0xf58f156c) at ./util/subprocess.h:1372
  #11 0x57d80a7c in subprocess::Popen::execute_process (this=0xf58f1cd8) at ./util/subprocess.h:1231
  #12 0x57d6d2b4 in subprocess::Popen::Popen<subprocess::input, subprocess::output, subprocess::error, subprocess::close_fds> (this=0xf58f1cd8, cmd_args="fake.py enumerate", args=..., args=..., args=..., args=...) at ./util/subprocess.h:964
  #13 0x57d6b597 in RunCommandParseJSON (str_command="fake.py enumerate", str_std_in="") at ./common/run_command.cpp:27
  #14 0x57a90547 in ExternalSigner::Enumerate (command="fake.py", signers=std::__debug::vector of length 0, capacity 0, chain="regtest") at ./external_signer.cpp:28
  #15 0x56defdab in enumeratesigners()::$_0::operator()(RPCHelpMan const&, JSONRPCRequest const&) const (this=0xf58f2ba0, self=..., request=...) at ./rpc/external_signer.cpp:51
  ...
  (truncated, only one thread exists)
  ```

ACKs for top commit:
  fanquake:
    ACK faa1c3e
  darosior:
    ACK faa1c3e

Tree-SHA512: 602da5f2eba08d7fe01ba19baf411e287ae27fe2d4b82f41734e05b7b1d938ce94cc0041e86ba677284fa92838e96ebee687023ff28047e2b036fd9a53567e0a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants