-
Notifications
You must be signed in to change notification settings - Fork 37.7k
Windows bitcoind stall debugging [NOMERGE, DRAFT] #30956
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
d4d31d4
to
8059665
Compare
https://github.com/bitcoin/bitcoin/actions/runs/11014291937/job/30584604403?pr=30956#step:10:930 @maflcko continuing discussion after your comment in the related issue:
Feel free to loop in others to answer if you think it's more their area of ownership. |
I think you can add something like: - uses: actions/upload-artifact@v4
with:
name: dump
path: %RUNNER_TEMP%/**/bitcoind.dmp
if-no-files-found: ignore after the https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow |
Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto |
4d4281c
to
4406f9c
Compare
https://github.com/bitcoin/bitcoin/actions/runs/11030752196 gives me: actions/upload-artifact@v4 is not allowed to be used in bitcoin/bitcoin. Actions in this workflow must be: within a repository owned by bitcoin or matching the following: actions/cache@, actions/cache/restore@, actions/cache/save@, actions/checkout@, ilammy/msvc-dev-cmd@*. Maybe one could (ab)use actions/cache/save but I'll pause here for more input. |
Maybe you can leave it in your repo and just trigger the task every 3 hours for two weeks to get the dump eventually? |
d759010
to
2fa2e45
Compare
Here is some historical context:
At this point, I don't see any drawbacks to switching the CI job's build configuration from "Release" to "RelWithDebInfo". |
0789cfa
to
dea921e
Compare
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
dea921e
to
636fa1b
Compare
https://github.com/hodlinator/bitcoin/actions/runs/11069726468/job/30757882453#step:15:114 : The Procdump.exe process itself times out when trying to produce the dump. 🤦 |
b302f28
to
3ef1236
Compare
Partial victory! - Latest run https://github.com/hodlinator/bitcoin/actions/runs/11087284207 actually uploaded a .DMP file (but it wasn't a real stall, just forced it to happen). Still remains to see if it loads nicely in the debugger. Other remaining issue is that I used a hardcoded absolute path for the upload-artifact step since I'm still struggling to find the correct glob-expression. Can't assume it will be node0 that stalls, and unsure if the path will be the same when removing --nocleanup. (Come to think of it, --nocleanup shouldn't be necessary now that the test fails). - name: Upload .DMP file
uses: actions/upload-artifact@v4
id: dmp-upload
if: ${{ failure() && steps.functional-tests.conclusion == 'failure' }}
with:
name: bitcoind-windows-dmp-file
path: D:/a/bitcoin/bitcoin/build/test/cache/node0/bitcoind.dmp I'm a bit mystified by RUNNER_TEMP passed as |
c35e307
to
7bc12f5
Compare
We have a winner! https://github.com/hodlinator/bitcoin/actions/runs/11311779692/job/31458399560 |
Had a few other cases. First 2 are confirmed to be the same issue - seeding randomness entropy from Windows performance data obtained through the Registry API during startup hangs when attempting to release a semaphore deep in Microsoft code. Turned off the once-per-hour CI schedule on my master-branch for now. More work to follow next week. |
The call to Lines 80 to 91 in e8f72ae
|
good find ! i couldn't find anything wrong with that code; however i looked into the Microsoft documentation for this: https://learn.microsoft.com/en-us/windows/win32/perfctrs/using-the-registry-functions-to-consume-counter-data apparently, querying HKEY_PERFORMANCE_DATA can potentially load all kinds of DLLs for performance counter data, depending on drivers installed on the system, as well as collects performance information for every process and thread on the system (could be a lot on busy systems) it's very possible that one of these has a race condition, or other crash bug, in the configuration used in the CI if we have enough other random sources in Windows (i haven't checked this) then we could consider disabling this one, it was more of a workaround when there were no official OS entropy sources does disabling it fix the stalls? |
I think you can open a PR to just remove this entirely. |
On it. Will explore other possible fixes after publishing that PR. |
PR: #31124 |
Debugged a 3rd dump: Main thread
PerfdiskPnpNotification thread
This was the only thread not stuck in NtReleaseSemaphore/NtWaitFor*. Another project also encountered a hang with
Linked post uses the PDH API, which internally queries the registry (seen in linked posts' callstack). Seems like their case was solved by moving the calling code out of Remote Access Manager thread
Possibly a side effect of running ProcDump to capture the .DMP file from the sound of Rem0o/FanControl.Releases#101 (comment). Other threads
2 each of
This might indicate there are lingering threads from a previous loop iteration that were not cleaned up yet. Despite the official example not calling Alternative Performance Counter APIshttps://learn.microsoft.com/en-us/windows/win32/perfctrs/about-performance-counters#performance-api-architecture <- Has a handy diagram of the 3 consumer-facing apis. It states:
The main callstack above with our hang contains Log performance counter data?https://learn.microsoft.com/en-us/windows/win32/api/winreg/nf-winreg-regqueryvalueexa:
It is unclear whether we get anything useful back in the failure case, but we could add code to see if
RegCloseKey investigation
It is not entirely clear whether it should be called after every https://learn.microsoft.com/en-us/windows/win32/sysinfo/predefined-keys also mentions:
and
So there's some mixed messaging around closing.
https://learn.microsoft.com/en-us/windows/win32/api/winreg/nf-winreg-regopenkeyexw explicitly says:
Closing after the loop happens after the issue covered here, so seems to be not too often unless there's some inter-CI-run interference. Or maybe inter-process interference is possible and another unknown process on CI is closing the handle? Could write a minimal test case of two processes querying the registry and closing handles on each other forever. :\ As mentioned above with the case of duplicate thread callstacks, it might be fruitful to attempt closing in every iteration of the loop. |
🐙 This pull request conflicts with the target branch and needs rebase. |
FWIW the idea that i got from reading that was that you should close the key when done with the API. Not at any specific point. This will unload the performance collection DLLs. The next time you query the key again they will be loaded again. i concluded that we do the right thing (never close it) because, we're never done with it, we keep queryiing them time after time after time, so might as well keep the DLLs in memory. But yes it's ambiguous. |
9bb92c0 util: Remove RandAddSeedPerfmon (Hodlinator) Pull request description: `RegQueryValueExA(HKEY_PERFORMANCE_DATA, ...)` sometimes hangs *bitcoind.exe* on Windows during startup, at least on CI. We have other sources of entropy to seed randomness with on Windows, so should be alright removing this. Might resurrect if less drastic fix is found. Hopefully sufficient to fix #30390. CI debugged with temporary Windows stack trace dumping + Symbols in #30956. ACKs for top commit: achow101: ACK 9bb92c0 fanquake: ACK 9bb92c0 hebasto: ACK 9bb92c0, I have reviewed the code and it looks OK. laanwj: Code review ACK 9bb92c0 Tree-SHA512: d3f26b4dd0519ef957f23abaffc6be1fed339eae756aed18042422fc6f0bba4e8fa9a44bf903e54f72747e2d0108146c18fd80576d95fc20690a2daf9c83689d
Turned off hourly CI for now, since fix removing the offending function was merged into bitcoin:master. Intend to experiment with more surgical solutions and document that here in the future. |
Would it make sense to leave it running for a little longer to validate that the fix worked? |
My reasoning was that there should be enough CI runs on master and all runs on PRs based on top of the merged fix. If it re-occurs hopefully one of us will reopen #30390. But I don't have a well-developed sense for how much strain is reasonable to put on the CI resources. |
Oh, didn't think of that, makes sense to me. |
Can this be closed, or is it waiting on something? |
Goal is to verify more surgical fix to issue with stalled bitcoind within 1-2.
7bc12f5
to
f69a238
Compare
I'd like to keep it open as I explore less nuclear solutions. Pushed an initial attempt to here and my personal master: hodlinator/bitcoin@84cd647...f69a238 |
In my opinion your (already merged) change to drop the offending code from the RNG is fine, and isn't really something we need to investigate further / work on reintroducing. |
Agree, it's fine, it's good to get rid of the perfmon query. |
Alright, our time is finite and hopefully |
This PR is meant to help investigate #30390 (#30390 (comment)).
Switches Windows CI to produce RelWithDebInfo binaries and adds code that uses ProcDump to generate a dump file of the bitcoind process which may help explain where it's gotten stuck.
Unresolved:
Whether the CI change is remotely correct.How to get access to the generated .EXE, .DMP and .PDB files.Obtaining a .DMP artifact(s).Verify debugging with downloaded artifacts works.Making the GitHub workflow begin execution based on schedule rather than based on git push.Undo most temporary changes and let CI run every 3 hours on this branchon personal master branch for 2 weeks.