Skip to content

Conversation

hebasto
Copy link
Member

@hebasto hebasto commented Sep 14, 2023

On the master branch, when compiling without external signer support, the bench_bitcoin.exe does not initialize Winsock DLL that is required, for example, here:

const CNetAddr source{LookupHost("252.2.2.2", false).value()};

Moreover, Windows docs explicitly state that WSAStartup and WSACleanup must be balanced:

There must be a call to WSACleanup for each successful call to WSAStartup. Only the final WSACleanup function call performs the actual cleanup. The preceding calls simply decrement an internal reference count in the WS2_32.DLL.

That is not the case for our unit tests because the SetupNetworking() call is a part of the BasicTestingSetup fixture and is invoked multiple times, while ~CNetCleanup() is invoked once only, at the end of the test binary execution.

This PR fixes Winsock DLL initialization and termination.

More docs:

Fix #28940.

@DrahtBot
Copy link
Contributor

DrahtBot commented Sep 14, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK maflcko

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

@DrahtBot DrahtBot added the Tests label Sep 14, 2023
@hebasto
Copy link
Member Author

hebasto commented Sep 14, 2023

For some reasons (I failed to figure them out, unfortunately) the code on the master branch works.

@sipsorcery Maybe you have any insights about that?

@theuni
Copy link
Member

theuni commented Sep 19, 2023

I'm having trouble understanding the problem here. Could you please explain what the resulting error is?

I can understand possibly needing to move SetupNetworking() to be more correct. But I don't understand what CMake has anything to do with this. IMO we need to figure out why before moving anything around.

@theuni
Copy link
Member

theuni commented Sep 19, 2023

Looking at your commit that purposefully introduces a bug, it removes a bunch of bench files. I'm guessing one of those happens to call WSAStartup somehow?

@hebasto
Copy link
Member Author

hebasto commented Sep 19, 2023

Looking at your commit that purposefully introduces a bug, it removes a bunch of bench files. I'm guessing one of those happens to call WSAStartup somehow?

That was my initial guess. But it is not the case. I mean, there are no explicit WSAStartup invocations.

Yes, in the MakeNoLogFileContext template function where the compiler decided to instantiate the BasicTestingSetup struct.

@theuni
Copy link
Member

theuni commented Sep 19, 2023

The error in that github action is:

Error: Bad optional access
Error: Process completed with exit code 1.

Which implies to me that something is throwing std::bad_optional_access. That seems to be something different from what you're talking about?

@hebasto
Copy link
Member Author

hebasto commented Sep 19, 2023

The error in that github action is:

Error: Bad optional access
Error: Process completed with exit code 1.

Which implies to me that something is throwing std::bad_optional_access. That seems to be something different from what you're talking about?

The call chain from here

const CNetAddr source{LookupHost("252.2.2.2", false).value()};
is as follows:

  • std::optional<CNetAddr> LookupHost(const std::string&, bool, DNSLookupFn)
  • std::vector<CNetAddr> LookupHost(const std::string&, unsigned int, bool, DNSLookupFn)
  • std::vector<CNetAddr> LookupIntern(const std::string&, unsigned int, bool, DNSLookupFn)
  • std::vector<CNetAddr> WrappedGetAddrInfo(const std::string&, bool)
  • getaddrinfo

If the latter fails, for example, due to uninitialized Winsock, the WrappedGetAddrInfo returns empty vector, and the first LookupHost does std::nullopt, which in turn causes Error: Bad optional access.

@hebasto
Copy link
Member Author

hebasto commented Sep 23, 2023

In this branch, I've minimized the problem code to a single line of code, which is:

    // Comment the next line out to trigger the WSANOTINITIALISED=10093 error.
    const auto test_setup = MakeNoLogFileContext<const TestingSetup>();

@fanquake
Copy link
Member

I still don't understand why our CI isn't failing, in the way you claim it should be, if our binaries are meant to be broken.

Nothing should be done here, until the root cause of the issue is actually understood. As this either points to some brokenness in master (build system/CI), or your CMake/MSVC branch producing binaries differently to master.

@hebasto
Copy link
Member Author

hebasto commented Sep 24, 2023

I still don't understand why our CI isn't failing, in the way you claim it should be, if our binaries are meant to be broken.

Our code is incorrect according to Microsoft's docs because the SetupNetworking (with WSAStartup in it) is never called in bench_bitcoin.exe binary.

@hebasto
Copy link
Member Author

hebasto commented Nov 25, 2023

I still don't understand why our CI isn't failing, in the way you claim it should be, if our binaries are meant to be broken.

Apparently, the underlying problem is not MSVC or CMake specific. I've opened a dedicated issue with steps to reproduce the bug.

We might adjust our CI to catch the problem easily.


The PR description has been updated.

@hebasto
Copy link
Member Author

hebasto commented Nov 28, 2023

I'm going to address @maflcko's and @fanquake's recent comments shortly.

In the meantime, I've modified the "Win64, VS 2022" GHA CI job to make it run with both enabled and disabled external wallet support. This aims to prevent any similar regressions in the future.

@fanquake
Copy link
Member

In the meantime, I've modified the "Win64, VS 2022" GHA CI job to make it run with both enabled and disabled external wallet support.

We shouldn't (have to) do this going forward, because the solution is either to remove/disable Boost Process for Windows, or replace it with something that doesn't undermine our own code.

@hebasto hebasto changed the title bench: Initialize and terminate use of Winsock properly test, bench: Initialize and terminate use of Winsock properly Nov 28, 2023
@hebasto
Copy link
Member Author

hebasto commented Nov 28, 2023

The PR has been reworked and its description has been updated.

In the meantime, I've modified the "Win64, VS 2022" GHA CI job to make it run with both enabled and disabled external wallet support.

We shouldn't (have to) do this going forward, because the solution is either to remove/disable Boost Process for Windows, or replace it with something that doesn't undermine our own code.

I agree with the replacement of the Boost.Process and going to pick up @theStack's branch.

Until then, it seems appropriate to have a regression test after fixing our own code. And it comes almost for free for us.

@fanquake
Copy link
Member

Until then, it seems appropriate to have a regression test after fixing our own code.

Thanks, but we aren't going to change our code in the way you are suggesting in this PR. External signer should just be disabled for Windows, until there is a replacement that works properly.

@hebasto
Copy link
Member Author

hebasto commented Nov 28, 2023

Until then, it seems appropriate to have a regression test after fixing our own code.

Thanks, but we aren't going to change our code in the way you are suggesting in this PR. External signer should just be disabled for Windows, until there is a replacement that works properly.

Disabling external signer support is exactly the thing which exposes the fact that our code is broken. See #28940.

@fanquake
Copy link
Member

fanquake commented Nov 28, 2023

Yes, and we should keep the almost dangerous, poorly maintained code disabled until we replace it with something that has actually been reviewed, and works as we expect it to. Clearly that hasn't been the case previously. Obviously we also need to fix our networking code (as all our code should function without external signer), but I'm not sure the approach here, is a good one, and further complicates test/bench code, while diverging it from how bitcoind functions. Ideally things become more encapsulated in some lower-level networking/init code, that can be called from bitcoind/test/bench/whereever, we shouldn't have to introduce new special code into the test/bench frameworks to initialize sockets on windows.

it would also help to elaborate in the PR description what the underlying issue is (Boost ASIO/Process hijacking socket/network setup out from underneath us on Windows).

@hebasto
Copy link
Member Author

hebasto commented Nov 28, 2023

What exactly do you mean by "diverging it from how bitcoind functions"?

it would also help to elaborate in the PR description what the underlying issue is (Boost ASIO/Process hijacking socket/network setup out from underneath us on Windows).

"Boost ASIO/Process hijacking socket/network setup" is not a cause of the bug, rather a reason why that bug was hidden from us.

@hebasto hebasto force-pushed the 230914-winsock branch 2 times, most recently from eb572fb to 89a52fd Compare November 28, 2023 18:54
@hebasto
Copy link
Member Author

hebasto commented Nov 28, 2023

Addressed @maflcko's comment.

@hebasto hebasto force-pushed the 230914-winsock branch 2 times, most recently from 56da7e0 to 7560aa5 Compare November 28, 2023 18:59
@maflcko
Copy link
Member

maflcko commented Nov 28, 2023

lgtm ACK b22c3ac, didn't review the GHA diff

@hebasto
Copy link
Member Author

hebasto commented Nov 28, 2023

... didn't review the GHA diff

Just in case, here is a branch with the reverted first commit. Here are its CI results:

@fanquake
Copy link
Member

fanquake commented Nov 29, 2023

I don't understand how this solves the fact that the Boost ASIO code is still in bitcoind, running before main, and calling WSAStartup, with a version (2.0) that is different from what we want (2.2). There should not be random Boost code, initializing our networking stack, at all, let alone before we've even run any of our sanity checks/startup code.

@hebasto
Copy link
Member Author

hebasto commented Nov 29, 2023

I don't understand how this solves the fact that the Boost ASIO code is still in bitcoind, running before main/any of our init checks, and calling WSAStartup, with a version (2.0) that is different from what we want (2.2).

This PR does nothing about the fact you mentioned. It resolves another bug which is explicit when bitcoind does not contain Boost ASIO code, i.e. when building without external wallet support.

There should not be random Boost code, initializing our networking stack, at all, let alone before we've even run any of our sanity checks/startup code.

I agree with you.


In the added "Win64, VS 2022 (ext.signer=OFF)" CI job, there is no Boost ASIO code in bitcoind.exe. The benchmark fails without this PR fix, and it passes with one.

@fanquake
Copy link
Member

I agree with you.

So then why isn't this PR disabling external signer for Windows?

@maflcko
Copy link
Member

maflcko commented Nov 29, 2023

So then why isn't this PR disabling external signer for Windows?

I think this is a test-only bugfix change. No one is holding anyone back from disabling external signer on Windows.

@fanquake
Copy link
Member

I think this is a test-only bugfix change. No one is holding anyone back from disabling external signer on Windows.

Given what we know, I don't see why we wouldn't just disable it now. If it's going to be disabled, there's also no point in adding the additional CI checks in this PR.

@hebasto
Copy link
Member Author

hebasto commented Nov 29, 2023

I agree with you.

So then why isn't this PR disabling external signer for Windows?

Because this PR is focused on fixing a bug in our code, not on changing user-available functionality. The question you raised should be discussed in a dedicated issue/PR.

@hebasto
Copy link
Member Author

hebasto commented Nov 29, 2023

I think this is a test-only bugfix change. No one is holding anyone back from disabling external signer on Windows.

Given what we know, I don't see why we wouldn't just disable it now. If it's going to be disabled, there's also no point in adding the additional CI checks in this PR.

The additional CI check proves that everything is working with external signer disabled.

@maflcko
Copy link
Member

maflcko commented Nov 29, 2023

there's also no point in adding the additional CI checks in this PR.

It the second commit is controversial, I guess it can be dropped. However, my understanding is that the first commit is correct and will be needed in any case.

@hebasto
Copy link
Member Author

hebasto commented Nov 29, 2023

It the second commit is controversial, I guess it can be dropped. However, my understanding is that the first commit is correct and will be needed in any case.

The second commit has been dropped.

@maflcko
Copy link
Member

maflcko commented Nov 29, 2023

lgtm ACK fd4c6a1

Copy link
Member

@fanquake fanquake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left two non-blocking comments. I think we could improve SetupNetworking() and CNetCleanup further, and constrain the code to being Windows only, rather than having it exist for all platforms, but be a no-op on non-windows.

Will merge this shortly, and open a new PR to disable external signer for Windows, if nobody else does.

@@ -88,6 +89,15 @@ std::ostream& operator<<(std::ostream& os, const uint256& num)
return os;
}

struct NetworkSetup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably use at least one line of doc explaining it's existence? Given it's only here to facilitate Windows sockets initialization, and doesn't do anything for non-windows platforms.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably use at least one line of doc explaining it's existence?

Mind suggesting one?

Given it's only here to facilitate Windows sockets initialization, and doesn't do anything for non-windows platforms.

With the current design of the SetupNetworking(), the code difference between platforms looks rather like implementation details.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code difference between platforms looks rather like implementation details.

The only code in SetupNetworking is Windows code to call WSAStartup()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct.

@@ -1218,6 +1218,9 @@ BOOST_AUTO_TEST_CASE(test_LockDirectory)
// has released the lock as we would expect by probing it.
int processstatus;
BOOST_CHECK_EQUAL(write(fd[1], &LockCommand, 1), 1);
// The following line invokes the ~CNetCleanup dtor without
// a paired SetupNetworking call. This is acceptable as long as
// ~CNetCleanup is a no-op for non-Windows platforms.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this now only because of the change above? It's not obvious why calling write() would invoke a NetCleanup destructor, or why we end up in a state where we are trying to tear down networking, if it's never even been setup. In general it seems fragile to rely on ifdefs and code being a no-op to avoid additional issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this now only because of the change above?

This PR diff does not change behaviour of the commented LOC. However, it seems worth mentioning that currently the SetupNetworking and ~CNetCleanup calls might be unbalanced. Perhaps, we might consider moving network initialization after daemonizing? But on platforms that support daemonizing, both calls are no-op...

It's not obvious why calling write() would invoke a NetCleanup destructor...

Actually, it terminates the process with all consequences, including ~CNetCleanup call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, it seems worth mentioning that currently the SetupNetworking and ~CNetCleanup calls might be unbalanced.

Maybe, but SetupNetworking and CNetCleanup only do anything on Windows, and the code here only runs for non-windows, where "unbalanced" calls to either doesn't matter, because there is no code being executed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... SetupNetworking and CNetCleanup only do anything on Windows, and the code here only runs for non-windows, where "unbalanced" calls to either doesn't matter, because there is no code being executed.

That is similar to what the comment says, right?

Copy link
Member

@fanquake fanquake Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my point is that we shouldn't need a comment like this, or it's not clear what value it provides. I would think it makes more sense to try and fix things so that we just don't have situations where we try and call network teardown code if we haven't setup any networking, rather than adding comments about Windows for code that only runs on Linux.

fanquake added a commit to fanquake/bitcoin that referenced this pull request Nov 29, 2023
It's come to light that Boost ASIO (a Boost Process sub dep) has in some
instances, been queitly initialising our network stack on Windows (see
PR bitcoin#28486 and discussion in bitcoin#28940).

This has been shielding a bug in our own code, but the larger issue
is that Boost Process/ASIO is running code before main, and doing things
like setting up networking. This undermines our own assumptions about
how our binary works, happens before we get to run any sanity checks,
and also runs before we call our own code to setup networking.

It's also not clear why a feature like external signer would have a
dependency that would be doing anything network/socket related, given it
only exists to spawn a local process.
fanquake added a commit to fanquake/bitcoin that referenced this pull request Nov 29, 2023
It's come to light that Boost ASIO (a Boost Process sub dep) has in some
instances, been queitly initialising our network stack on Windows (see
PR bitcoin#28486 and discussion in bitcoin#28940).

This has been shielding a bug in our own code, but the larger issue
is that Boost Process/ASIO is running code before main, and doing things
like setting up networking. This undermines our own assumptions about
how our binary works, happens before we get to run any sanity checks,
and also runs before we call our own code to setup networking.

It's also not clear why a feature like external signer would have a
dependency that would be doing anything network/socket related, given it
only exists to spawn a local process.
fanquake added a commit to fanquake/bitcoin that referenced this pull request Nov 29, 2023
It's come to light that Boost ASIO (a Boost Process sub dep) has in some
instances, been queitly initialising our network stack on Windows (see
PR bitcoin#28486 and discussion in bitcoin#28940).

This has been shielding a bug in our own code, but the larger issue
is that Boost Process/ASIO is running code before main, and doing things
like setting up networking. This undermines our own assumptions about
how our binary works, happens before we get to run any sanity checks,
and also runs before we call our own code to setup networking.

It's also not clear why a feature like external signer would have a
dependency that would be doing anything network/socket related, given it
only exists to spawn a local process.
@fanquake fanquake merged commit dd73c22 into bitcoin:master Nov 29, 2023
@hebasto hebasto deleted the 230914-winsock branch November 29, 2023 17:17
fanquake added a commit to fanquake/bitcoin that referenced this pull request Nov 29, 2023
It's come to light that Boost ASIO (a Boost Process sub dep) has in some
instances, been queitly initialising our network stack on Windows (see
PR bitcoin#28486 and discussion in bitcoin#28940).

This has been shielding a bug in our own code, but the larger issue
is that Boost Process/ASIO is running code before main, and doing things
like setting up networking. This undermines our own assumptions about
how our binary works, happens before we get to run any sanity checks,
and also runs before we call our own code to setup networking.

It's also not clear why a feature like external signer would have a
dependency that would be doing anything network/socket related, given it
only exists to spawn a local process.
fanquake added a commit to fanquake/bitcoin that referenced this pull request Dec 1, 2023
It's come to light that Boost ASIO (a Boost Process sub dep) has in some
instances, been queitly initialising our network stack on Windows (see
PR bitcoin#28486 and discussion in bitcoin#28940).

This has been shielding a bug in our own code, but the larger issue
is that Boost Process/ASIO is running code before main, and doing things
like setting up networking. This undermines our own assumptions about
how our binary works, happens before we get to run any sanity checks,
and also runs before we call our own code to setup networking.

It's also not clear why a feature like external signer would have a
dependency that would be doing anything network/socket related, given it
only exists to spawn a local process.
fanquake added a commit that referenced this pull request Dec 12, 2023
…ter"

7b22cd8 Revert "ci: Only run functional tests on windows in master" (Hennadii Stepanov)

Pull request description:

  This PR reverts commit aba4a58 from #28567.

  The Windows-specific code received [quality](#28486) and [performance](#29045) improvements recently. So there are no reasons to skip functional tests in PRs anymore.

  In my own repo, I've run the GHA Windows job more than 100 times with no failure.

ACKs for top commit:
  maflcko:
    lgtm ACK 7b22cd8
  TheCharlatan:
    ACK 7b22cd8

Tree-SHA512: 1e8687e8efe12db506e7cd2d5df9e48b5acb98a339f84684dd0fd67280e22227e2a5a206f1108b09e49038fab7a3ca2ffbd985677f00048c0962b39b2b9a2ba5
fanquake added a commit to bitcoin-core/gui that referenced this pull request Dec 13, 2023
308aec3 build: disable external-signer for Windows (fanquake)
3553731 ci: remove --enable-external-signer from win64 job (fanquake)

Pull request description:

  It's come to light that Boost ASIO (a Boost Process sub dep) has in some
  instances, been quietly  initialising our network stack on Windows (see
  PR bitcoin/bitcoin#28486 and discussion in bitcoin/bitcoin#28940).

  This has been shielding a bug in our own code, but the larger issue
  is that Boost Process/ASIO is running code before main, and doing things
  like setting up networking. This undermines our own assumptions about
  how our binary works, happens before we run any sanity checks,
  and before we call our own code to setup networking. Note that ASIO also
  calls WSAStartup with version `2.0`, whereas we call with `2.2`.

  It's also not clear why a feature like external signer would have a
  dependency that would be doing anything network/socket related,
  given it only exists to spawn a local process.

  See also the discussion in bitcoin/bitcoin#24907. Note that the maintaince of Boost Process in general,
  has not really improved. For example, rather than fixing bugs like boostorg/process#111,
  i.e, boostorg/process#317, the maintainer chooses to just wrap exception causing overflows
  in try-catch blocks: boostorg/process@0c42a58. These changes get merged in large,
  unreviewed PRs, i.e boostorg/process#319.

  This PR disables external-signer on Windows for now. If, in future, someone
  changes how Boost Process works, or replaces it entirely with some
  properly reviewed and maintained code, we could reenable this feature on
  Windows.

ACKs for top commit:
  hebasto:
    re-ACK 308aec3.
  TheCharlatan:
    ACK 308aec3

Tree-SHA512: 7405f7fc9833eeaacd6836c4e5b1c1a7845a40c1fdd55c1060152f8d8189e4777464fde650e11eb1539556a75dddf49667105987078b1457493ee772945da66e
luke-jr pushed a commit to luke-jr/bitcoin that referenced this pull request Jan 30, 2024
@bitcoin bitcoin locked and limited conversation to collaborators Nov 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AddrManSelectByNetwork benchmark fails on Windows
5 participants