Skip to content

Conversation

hodlinator
Copy link
Contributor

@hodlinator hodlinator commented Sep 24, 2024

This PR is meant to help investigate #30390 (#30390 (comment)).

Switches Windows CI to produce RelWithDebInfo binaries and adds code that uses ProcDump to generate a dump file of the bitcoind process which may help explain where it's gotten stuck.

Unresolved:

  • Whether the CI change is remotely correct.
  • How to get access to the generated .EXE, .DMP and .PDB files.
  • Obtaining a .DMP artifact(s).
  • Verify debugging with downloaded artifacts works.
  • Making the GitHub workflow begin execution based on schedule rather than based on git push.
  • Undo most temporary changes and let CI run every 3 hours on this branch on personal master branch for 2 weeks.
  • Compare execution time of release vs RelWithDebInfo to see if defaults could be changed on master branch.

@DrahtBot
Copy link
Contributor

DrahtBot commented Sep 24, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #30660 (test: Shut down framework cleanly on RPC connection failure by hodlinator)
  • #30487 (ci: skip Github CI on branch pushes for forks by Sjors)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@hodlinator
Copy link
Contributor Author

https://github.com/bitcoin/bitcoin/actions/runs/11014291937/job/30584604403?pr=30956#step:10:930
contains:
bitcoind.vcxproj -> D:\a\bitcoin\bitcoin\build\src\RelWithDebInfo\bitcoind.exe
...notice the "RelWithDebInfo", so it should be generating .PDB files like it does for me locally.

@maflcko continuing discussion after your comment in the related issue:

Seems fine to integrate in a pull request and then keep the pull request unmerged, and only merge the pull request in the GHA CI, as that is the only place where it reproduces. Adding temporary test test-only code for everyone else may be a bit too much.

  1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

  2. Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI? I'm totally unfamiliar with GHA so RTFM-links are welcome too. An alternative approach could be to make this temporary PR upload the relevant files to somewhere, only if the issue occurs.

Feel free to loop in others to answer if you think it's more their area of ownership.

@davidgumberg
Copy link
Contributor

davidgumberg commented Sep 24, 2024

Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI?

I think you can add something like:

- uses: actions/upload-artifact@v4
  with:
    name: dump
    path: %RUNNER_TEMP%/**/bitcoind.dmp
    if-no-files-found: ignore

after the run: part of each job.

https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow
https://github.com/actions/upload-artifact#examples

@maflcko
Copy link
Member

maflcko commented Sep 25, 2024

  1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto

@hodlinator
Copy link
Contributor Author

https://github.com/bitcoin/bitcoin/actions/runs/11030752196 gives me:

actions/upload-artifact@v4 is not allowed to be used in bitcoin/bitcoin. Actions in this workflow must be: within a repository owned by bitcoin or matching the following: actions/cache@, actions/cache/restore@, actions/cache/save@, actions/checkout@, ilammy/msvc-dev-cmd@*.

Maybe one could (ab)use actions/cache/save but I'll pause here for more input.

@maflcko
Copy link
Member

maflcko commented Sep 25, 2024

Maybe you can leave it in your repo and just trigger the task every 3 hours for two weeks to get the dump eventually?

@hodlinator hodlinator force-pushed the 30390_windows_debug branch 3 times, most recently from d759010 to 2fa2e45 Compare September 25, 2024 20:09
@hebasto
Copy link
Member

hebasto commented Sep 26, 2024

  1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto

Here is some historical context:

  1. In pre-Cmake times, manually crafted project files had two configurations: "Release" and "Debug".
  2. While working on the CMake staging branch, the CI build configuration remained "Release" to allow the use of Ccache. This functionality was later dropped.

At this point, I don't see any drawbacks to switching the CI job's build configuration from "Release" to "RelWithDebInfo".

@hodlinator hodlinator force-pushed the 30390_windows_debug branch 5 times, most recently from 0789cfa to dea921e Compare September 27, 2024 11:04
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/30753975513

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@hodlinator
Copy link
Contributor Author

https://github.com/hodlinator/bitcoin/actions/runs/11069726468/job/30757882453#step:15:114 :
subprocess.TimeoutExpired: Command '['D:\\a\\bitcoin\\bitcoin\\build\\test\\cache\\node0\\Procdump.exe', '-mm', '7904', 'D:\\a\\bitcoin\\bitcoin\\build\\test\\cache\\node0\\bitcoind.dmp']' timed out after 4800 seconds

The Procdump.exe process itself times out when trying to produce the dump. 🤦
Could be that it's popping up a GUI dialogue on first run, will investigate further.

@hodlinator hodlinator force-pushed the 30390_windows_debug branch 5 times, most recently from b302f28 to 3ef1236 Compare September 28, 2024 21:55
@hodlinator
Copy link
Contributor Author

Partial victory! - Latest run https://github.com/hodlinator/bitcoin/actions/runs/11087284207 actually uploaded a .DMP file (but it wasn't a real stall, just forced it to happen).

Still remains to see if it loads nicely in the debugger.

Other remaining issue is that I used a hardcoded absolute path for the upload-artifact step since I'm still struggling to find the correct glob-expression. Can't assume it will be node0 that stalls, and unsure if the path will be the same when removing --nocleanup. (Come to think of it, --nocleanup shouldn't be necessary now that the test fails).

      - name: Upload .DMP file
        uses: actions/upload-artifact@v4
        id: dmp-upload
        if: ${{ failure() && steps.functional-tests.conclusion == 'failure' }}
        with:
          name: bitcoind-windows-dmp-file
          path: D:/a/bitcoin/bitcoin/build/test/cache/node0/bitcoind.dmp

I'm a bit mystified by RUNNER_TEMP passed as --tmpdirprefix to the functional test runner having the value D:/a/_temp/, while the test data seems to end up in D:/a/bitcoin/bitcoin/build/test/cache/, but maybe the cache part is a clue.

@hodlinator
Copy link
Contributor Author

Still remains to see if it loads nicely in the debugger.

It debugs!

image

Requires .DMP, .PDB and .EXE (last one currently luckily added just as an extra test).

@hodlinator
Copy link
Contributor Author

We have a winner! https://github.com/hodlinator/bitcoin/actions/runs/11311779692/job/31458399560
Downloaded the artifacts now just to be sure, but I won't be able to look at them until early next week.
Might keep the CI schedule running for one more dump before turning it off unless anyone disagrees.

@hodlinator
Copy link
Contributor Author

@hodlinator
Copy link
Contributor Author

Had a few other cases. First 2 are confirmed to be the same issue - seeding randomness entropy from Windows performance data obtained through the Registry API during startup hangs when attempting to release a semaphore deep in Microsoft code. Turned off the once-per-hour CI schedule on my master-branch for now. More work to follow next week.

@davidgumberg
Copy link
Contributor

seeding randomness entropy from Windows performance data obtained through the Registry API during startup hangs when attempting to release a semaphore deep in Microsoft code.

The call to RegQueryValueExA in RandAddSeedPerfmon:

std::vector<unsigned char> vData(250000, 0);
long ret = 0;
unsigned long nSize = 0;
const size_t nMaxSize = 10000000; // Bail out at more than 10MB of performance data
while (true) {
nSize = vData.size();
ret = RegQueryValueExA(HKEY_PERFORMANCE_DATA, "Global", nullptr, nullptr, vData.data(), &nSize);
if (ret != ERROR_MORE_DATA || vData.size() >= nMaxSize)
break;
vData.resize(std::min((vData.size() * 3) / 2, nMaxSize)); // Grow size of buffer exponentially
}
RegCloseKey(HKEY_PERFORMANCE_DATA);

@laanwj
Copy link
Member

laanwj commented Oct 20, 2024

The call to RegQueryValueExA in RandAddSeedPerfmon:

good find !

i couldn't find anything wrong with that code; however i looked into the Microsoft documentation for this: https://learn.microsoft.com/en-us/windows/win32/perfctrs/using-the-registry-functions-to-consume-counter-data

apparently, querying HKEY_PERFORMANCE_DATA can potentially load all kinds of DLLs for performance counter data, depending on drivers installed on the system, as well as collects performance information for every process and thread on the system (could be a lot on busy systems)

it's very possible that one of these has a race condition, or other crash bug, in the configuration used in the CI

if we have enough other random sources in Windows (i haven't checked this) then we could consider disabling this one, it was more of a workaround when there were no official OS entropy sources

does disabling it fix the stalls?

@fanquake
Copy link
Member

The call to RegQueryValueExA in RandAddSeedPerfmon:

I think you can open a PR to just remove this entirely.

@hodlinator
Copy link
Contributor Author

I think you can open a PR to just remove this entirely.

On it. Will explore other possible fixes after publishing that PR.

@hodlinator
Copy link
Contributor Author

PR: #31124

@hodlinator
Copy link
Contributor Author

Debugged a 3rd dump:
https://github.com/hodlinator/bitcoin/actions/runs/11414623090
Same stacks in threads and local function state in RandAddSeedPerfmon() was similar/same as observed in 2 previous dump files - nSize = 843750 (250000 × (1.5^3)).

Main thread

 	ntdll.dll!NtReleaseSemaphore()	Unknown
 	KERNELBASE.dll!ReleaseSemaphore()	Unknown
 	WmiApRpl.dll!WmiReverseGuard::LeaveRead(long)	Unknown
 	WmiApRpl.dll!WmiReverseMemoryExt<class WmiReverseGuard>::Read(void *,unsigned long,unsigned long *,unsigned long)	Unknown
 	WmiApRpl.dll!WmiAdapterWrapper::GetValidity(unsigned long)	Unknown
 	WmiApRpl.dll!WmiAdapterWrapper::CollectObjects(unsigned short const *,void * *,unsigned long *,unsigned long *)	Unknown
 	WmiApRpl.dll!WmiAdapterWrapper::Collect(unsigned short const *,void * *,unsigned long *,unsigned long *)	Unknown
 	WmiApRpl.dll!WmiCollectPerfData(unsigned short const *,void * *,unsigned long *,unsigned long *)	Unknown
 	advapi32.dll!CallExtObj()	Unknown
 	advapi32.dll!QueryV1Provider()	Unknown
 	advapi32.dll!QueryExtensibleData()	Unknown
 	advapi32.dll!PerfRegQueryValueEx()	Unknown
 	advapi32.dll!PerfRegQueryValue()	Unknown
 	KERNELBASE.dll!BaseRegQueryValueInternal()	Unknown
>	KERNELBASE.dll!RegQueryValueExA()	Unknown
 	bitcoind.exe!`anonymous namespace'::RandAddSeedPerfmon(CSHA512 & hasher) Line 86	C++
 	bitcoind.exe!RandAddDynamicEnv(CSHA512 & hasher) Line 235	C++
 	bitcoind.exe!`anonymous namespace'::SeedStartup(CSHA512 & hasher, `anonymous-namespace'::RNGState & rng) Line 624	C++
 	bitcoind.exe!`anonymous namespace'::ProcRand(unsigned char * out, int num, `anonymous-namespace'::RNGLevel level, bool always_use_real_rng) Line 662	C++
 	[Inline Frame] bitcoind.exe!kernel::Context::{ctor}::__l2::<lambda_1>::operator()() Line 21	C++
 	[Inline Frame] bitcoind.exe!std::invoke(kernel::Context::{ctor}::__l2::<lambda_1> &&) Line 1704	C++
 	bitcoind.exe!std::call_once<`kernel::Context::Context'::`2'::<lambda_1>>(std::once_flag & _Once, kernel::Context::{ctor}::__l2::<lambda_1> && _Fx) Line 103	C++
 	bitcoind.exe!kernel::Context::Context() Line 23	C++
 	[Inline Frame] bitcoind.exe!std::make_unique() Line 3597	C++
 	bitcoind.exe!AppInit(node::NodeContext & node) Line 188	C++
 	bitcoind.exe!main(int argc, char * * argv) Line 277	C++
 	[Inline Frame] bitcoind.exe!invoke_main() Line 78	C++
 	bitcoind.exe!__scrt_common_main_seh() Line 288	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

PerfdiskPnpNotification thread

 	win32u.dll!NtUserGetMessage()	Unknown
 	user32.dll!GetMessageW()	Unknown
>	perfdisk.dll!PerfdiskPnpNotification(void *)	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

This was the only thread not stuck in NtReleaseSemaphore/NtWaitFor*.

Another project also encountered a hang with [_]PerfdiskPnpNotification in another thread.
https://stackoverflow.com/questions/15730185/performance-data-handler-hangs-for-physicaldisk

Googling for _PerfdiskPnpNotification finds it sometimes tries to open a pop up window.

Linked post uses the PDH API, which internally queries the registry (seen in linked posts' callstack). Seems like their case was solved by moving the calling code out of DllMain. Would be weird if us calling Perf Counter functions from inside our main() was unsupported though.

Remote Access Manager thread

 	ntdll.dll!NtWaitForSingleObject()	Unknown
 	KERNELBASE.dll!WaitForSingleObjectEx()	Unknown
>	rasctrs.dll!RasmanServiceMonitor()	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

Possibly a side effect of running ProcDump to capture the .DMP file from the sound of Rem0o/FanControl.Releases#101 (comment).

Other threads

 	ntdll.dll!NtWaitForSingleObject()	Unknown
 	KERNELBASE.dll!WaitForSingleObjectEx()	Unknown
>	aspnet_perf.dll!CPerfCounterClient::GatherPerfData(void)	Unknown
 	aspnet_perf.dll!PerfDataGatherThreadStart(void *)	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown
 	ntdll.dll!NtWaitForMultipleObjects()	Unknown
 	sechost.dll!EtwpProcessRealTimeTraces()	Unknown
 	sechost.dll!ProcessTrace()	Unknown
>	perfts.dll!BlockOnProcessTrace(void *)	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown
 	ntdll.dll!NtWaitForMultipleObjects()	Unknown
 	KERNELBASE.dll!WaitForMultipleObjectsEx()	Unknown
 	KERNELBASE.dll!WaitForMultipleObjects()	Unknown
>	perfos.dll!StandbyMonitorThreadProc(void *)	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown
 	ntdll.dll!NtWaitForWorkViaWorkerFactory()	Unknown
>	ntdll.dll!TppWorkerThread()	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

2 each of

 	ntdll.dll!NtWaitForMultipleObjects()	Unknown
 	KERNELBASE.dll!WaitForMultipleObjectsEx()	Unknown
 	KERNELBASE.dll!WaitForMultipleObjects()	Unknown
>	CorperfmonExt.dll!HandlerAuxThreadProc(void *)	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown
 	ntdll.dll!NtWaitForSingleObject()	Unknown
 	KERNELBASE.dll!WaitForSingleObjectEx()	Unknown
>	aspnet_perf.dll!CPerfCounterClient::MonitorPerfPipeNames(void)	Unknown
 	aspnet_perf.dll!RegistryMonitorThreadStart(void *)	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

This might indicate there are lingering threads from a previous loop iteration that were not cleaned up yet. Despite the official example not calling RegCloseKey at all, maybe we should try doing it for every loop iteration to see if that fixes it.

Alternative Performance Counter APIs

https://learn.microsoft.com/en-us/windows/win32/perfctrs/about-performance-counters#performance-api-architecture <- Has a handy diagram of the 3 consumer-facing apis. It states:

Some older performance counter consumers use the registry APIs to collect performance data from the special HKEY_PERFORMANCE_DATA registry key. This is not recommended for new code because processing the data from the registry is complex and error-prone. The registry API implementation directly supports collecting data from V1 providers. It indirectly supports collecting data from V2 providers through a translation layer that uses the V2 consumer APIs.

Some performance counter consumers use the PerfLib V2 Consumer functions to directly access data from V2 providers. This is more complex than consuming data using the PDH APIs, but this approach can be useful if PDH APIs cannot be used due to performance or dependency concerns. The PerfLib V2 implementation directly supports collecting data from V2 providers. It does not support collecting data from V1 providers.

The main callstack above with our hang contains QueryV1Provider. Makes me think maybe we should migrate to the more modern V2 API that does not go through the Registry layer.

Log performance counter data?

https://learn.microsoft.com/en-us/windows/win32/api/winreg/nf-winreg-regqueryvalueexa:

If the buffer specified by lpData parameter is not large enough to hold the data, the function returns ERROR_MORE_DATA and stores the required buffer size in the variable pointed to by lpcbData. In this case, the contents of the lpData buffer are undefined.

It is unclear whether we get anything useful back in the failure case, but we could add code to see if lpData (PPERF_DATA_BLOCK) has anything usable. Maybe it would provide hints at what is different from successful runs and causing the failures.

RegCloseKey investigation

https://learn.microsoft.com/en-us/windows/win32/perfctrs/using-the-registry-functions-to-consume-counter-data
States:

Be sure to use the RegCloseKey function to close the handle to the key when you are finished obtaining the performance data. This is important for both the local and remote cases:

  • RegCloseKey(HKEY_PERFORMANCE_DATA) does not actually close a registry handle, but it clears all cached data and frees the loaded performance DLLs.
  • RegCloseKey(hkeyRemotePerformanceData) closes the handle to the remote machine's registry.

It is not entirely clear whether it should be called after every RegQueryValueExA or if it's fine as-is, after the loop.

https://learn.microsoft.com/en-us/windows/win32/sysinfo/predefined-keys also mentions:

The system defines predefined keys that are always open.

and

The predefined handles are not thread safe. Closing a predefined handle in one thread affects any other threads that are using the handle.

So there's some mixed messaging around closing.

RandAddPeriodic (documented by us to be thread-safe) is called with one minute between every call, so shouldn't require a mutex even on sluggish machines. (We seed startup randomness before scheduling the periodic re-seeding).

https://learn.microsoft.com/en-us/windows/win32/api/winreg/nf-winreg-regopenkeyexw explicitly says:

If the key is not one of the predefined registry keys, call the RegCloseKey function after you have finished using the handle.

Closing after the loop happens after the issue covered here, so seems to be not too often unless there's some inter-CI-run interference. Or maybe inter-process interference is possible and another unknown process on CI is closing the handle? Could write a minimal test case of two processes querying the registry and closing handles on each other forever. :\

As mentioned above with the case of duplicate thread callstacks, it might be fruitful to attempt closing in every iteration of the loop.

@DrahtBot
Copy link
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

@laanwj
Copy link
Member

laanwj commented Oct 24, 2024

It is not entirely clear whether it should be called after every RegQueryValueExA or if it's fine as-is, after the loop.

FWIW the idea that i got from reading that was that you should close the key when done with the API. Not at any specific point. This will unload the performance collection DLLs. The next time you query the key again they will be loaded again.

i concluded that we do the right thing (never close it) because, we're never done with it, we keep queryiing them time after time after time, so might as well keep the DLLs in memory.

But yes it's ambiguous.

achow101 added a commit that referenced this pull request Oct 24, 2024
9bb92c0 util: Remove RandAddSeedPerfmon (Hodlinator)

Pull request description:

  `RegQueryValueExA(HKEY_PERFORMANCE_DATA, ...)` sometimes hangs *bitcoind.exe* on Windows during startup, at least on CI.

  We have other sources of entropy to seed randomness with on Windows, so should be alright removing this. Might resurrect if less drastic fix is found.

  Hopefully sufficient to fix #30390.

  CI debugged with temporary Windows stack trace dumping + Symbols in #30956.

ACKs for top commit:
  achow101:
    ACK 9bb92c0
  fanquake:
    ACK 9bb92c0
  hebasto:
    ACK 9bb92c0, I have reviewed the code and it looks OK.
  laanwj:
    Code review ACK  9bb92c0

Tree-SHA512: d3f26b4dd0519ef957f23abaffc6be1fed339eae756aed18042422fc6f0bba4e8fa9a44bf903e54f72747e2d0108146c18fd80576d95fc20690a2daf9c83689d
@hodlinator
Copy link
Contributor Author

hodlinator commented Oct 24, 2024

Turned off hourly CI for now, since fix removing the offending function was merged into bitcoin:master.

Intend to experiment with more surgical solutions and document that here in the future.

@davidgumberg
Copy link
Contributor

Turned off hourly CI for now, since fix removing the offending function was merged into bitcoin:master.

Would it make sense to leave it running for a little longer to validate that the fix worked?

@hodlinator
Copy link
Contributor Author

Would it make sense to leave it running for a little longer to validate that the fix worked?

My reasoning was that there should be enough CI runs on master and all runs on PRs based on top of the merged fix. If it re-occurs hopefully one of us will reopen #30390. But I don't have a well-developed sense for how much strain is reasonable to put on the CI resources.

@davidgumberg
Copy link
Contributor

My reasoning was that there should be enough CI runs on master and all runs on PRs based on top of the merged fix.

Oh, didn't think of that, makes sense to me.

@maflcko
Copy link
Member

maflcko commented Oct 28, 2024

Can this be closed, or is it waiting on something?

@hodlinator
Copy link
Contributor Author

Can this be closed, or is it waiting on something?

I'd like to keep it open as I explore less nuclear solutions. Pushed an initial attempt to here and my personal master: hodlinator/bitcoin@84cd647...f69a238

@fanquake
Copy link
Member

I'd like to keep it open as I explore less nuclear solutions. Pushed an initial attempt to here and my personal master: hodlinator/bitcoin@84cd647...f69a238

In my opinion your (already merged) change to drop the offending code from the RNG is fine, and isn't really something we need to investigate further / work on reintroducing.

@laanwj
Copy link
Member

laanwj commented Oct 29, 2024

isn't really something we need to investigate further / work on reintroducing.

Agree, it's fine, it's good to get rid of the perfmon query.

@hodlinator
Copy link
Contributor Author

Alright, our time is finite and hopefully CryptGenRandom should be enough for now. Closing.

@hodlinator hodlinator closed this Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants