Skip to content

Conversation

pinheadmz
Copy link
Member

@pinheadmz pinheadmz commented May 2, 2023

Closes #16778
Replaces #27505

The library call getaddrinfo() has no internal timeout and depending on the user's system may wait indefinitely for a response when looking up a hostname in the DNS. By making that call in a separate thread, we can abandon it completely after some timeout (currently in this PR, 2 seconds).

TODO:

  • We could make the polling loop interruptible but I'm not sure the best approach to that, since these functions don't live in a class with a flag like interruptNet

@DrahtBot
Copy link
Contributor

DrahtBot commented May 2, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

No conflicts as of last run.

@fanquake
Copy link
Member

fanquake commented May 3, 2023

The unit test I added makes a live DNS request for nic.com that test will fail the platform has no DNS

Or just anytime you don't have an internet connection? We're not going to add "internet connectivity" as a requirement to run the unit tests (putting aside all other potential issues with doing this).

src/netbase.cpp Outdated
Comment on lines 106 to 117
std::thread thread(AsyncGetAddrInfo, &req);
int checks{0};
while (true)
{
{
LOCK(gai_mutex);
if (req.complete) break;
}
// Check every 100ms for 2 seconds
if (++checks > 20) break;
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
thread.detach();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/netbase.cpp Outdated
Comment on lines 102 to 117
if (allow_lookup) {
// Execute the DNS lookup in another thread and check it periodically for completion.
// After sufficiently waiting, abandon the thread entirely. The user's system
// may wait forever for a DNS response that never returns.
std::thread thread(AsyncGetAddrInfo, &req);
int checks{0};
while (true)
{
{
LOCK(gai_mutex);
if (req.complete) break;
}
// Check every 100ms for 2 seconds
if (++checks > 20) break;
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
thread.detach();
Copy link
Member

@furszy furszy May 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance wise, create a new thread per lookup is quite bad. Should use a workers pool instead.

I have implemented a generic one for #26966 (f448668) that could also use here. It also support callback futures so you can remove the costly active wait as well.

Give it a look and if you like it, could decouple it into a standalone PR so we could start using it in both PRs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afaict this won't work here, because the thread might not finish and then the pool would just get clogged (see comment about abandoning the thread).

Performance should also not really matter here since this doesn't happen that often and startup time for a thread vs. time of a DNS lookup is probably negligible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afaict this won't work here, because the thread might not finish and then the pool would just get clogged (see comment about abandoning the thread).

k, getaddrinfo doesn't have a timeout and getaddrinfo_a seems to have a segfault. Cool..

Performance should also not really matter here since this doesn't happen that often and startup time for a thread vs. time of a DNS lookup is probably negligible.

I was thinking more about context switching. Isn't this used for OpenNetworkConnection too? So, it would create a new thread for every connection attempt? (unless the lookups are numeric, which guess that is the rule here..).

And, following the worst case scenario, what if the user call addnode a good number of times? we would get a lot of detached threads? (unless we set a limit for the amount of lookups that can be resolved in parallel).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k, getaddrinfo doesn't have a timeout and getaddrinfo_a seems to have a segfault. Cool..

Yea one would think that making DNS requests in c++ would be a solved problem.

Anyway, I'm also not sure about the approach here. The original issue was that we would block indefinitely while joining the DNS thread on shutdown (only happens if the user's system is broken I think?). I suggested to detach the dns thread if it doesn't join in a reasonable amount of time (would be fine, the OS will clean the thread up since we are about to exit anyway).

The approach here risks having threads hang around for the lifetime of a node, which seems worse than the occasional shutdown hang on broken systems.

@pinheadmz pinheadmz force-pushed the async-getaddrinfo branch 6 times, most recently from 56a3cbd to 4150942 Compare May 3, 2023 17:57
@pinheadmz
Copy link
Member Author

pinheadmz commented May 3, 2023

Still waiting for CI to finish but I removed the need for internet access by looking up the name "localhost" instead. That is a non-numeric lookup that getaddrinfo() can do without the actual DNS. I think I also fixed the ipv6 test issue:

(run in docker with no ipv6)

test/netbase_tests.cpp(161): Entering test case "dns_lookup"
test/netbase_tests.cpp(165): info: check LookupHost("localhost", addr, true) has passed
test/netbase_tests.cpp(169): info: check Lookup("127.0.0.1:12345", serv, 9050, true) has passed
test/netbase_tests.cpp(176): warning: in "netbase_tests/dns_lookup": Skipping IPv6 check
test/netbase_tests.cpp(161): Leaving test case "dns_lookup"; testing time: 217012us

I'm interested in following the worker pool idea to make these requests async as opposed to new threads.

@DrahtBot DrahtBot removed the CI failed label May 3, 2023
@pinheadmz pinheadmz force-pushed the async-getaddrinfo branch from 4150942 to 1d57811 Compare May 5, 2023 17:14
@pinheadmz
Copy link
Member Author

added call to thread.join() in the "happy path" so the only threads we should be abandoning are those that time out. Also setup the unit test to stub in a black hole for getaddressinfo to ensure the 2-second timeout is enforced. On my machine at least with thread and memory sanitizers I'm not getting any errors for the abandoned thread. I'm going to try some more experiments with a mainnet node as well.

@pinheadmz pinheadmz force-pushed the async-getaddrinfo branch from 1d57811 to c292288 Compare May 5, 2023 18:34
@pinheadmz
Copy link
Member Author

...got thread sanitizer errors running locally on main when my local DNS is black-holed. Fixed that with a shared pointer.

@pinheadmz
Copy link
Member Author

Needs rebase?

ty. 🤞

if (allow_lookup) {
// Execute the DNS lookup in another thread and check it periodically for completion.
// After sufficiently waiting, abandon the thread entirely. The user's system
// may wait forever for a DNS response that never returns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not good to ignore it never returning... it's leaking resources. Maybe kill it and/or tell the user somehow? :/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a log message but not sure what else to do. The problem is that getaddrinfo() is blocking and getaddrinfo_a() segfaults 😭 As long as the DNS request resolves eventually the thread will close, even if bitcoind has given up and detached the thread. If getaddrinfo() hangs forever then yes you're right that is a leak, but the user also has a serious networking issue.

{
// Non-numeric lookup that does not require actual DNS service
BOOST_CHECK(LookupHost("localhost", /*fAllowLookup=*/true).has_value());
BOOST_CHECK(!LookupHost("fail.doesnotexist", /*fAllowLookup=*/true).has_value());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this still make an externally-visible DNS request?

Copy link
Member Author

@pinheadmz pinheadmz Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep you are absoutley right, I'll remove fail.doesnotexist (localhost will NOT make an external request)

@pinheadmz pinheadmz force-pushed the async-getaddrinfo branch from 2910e59 to be52064 Compare July 5, 2023 19:10
@achow101 achow101 requested review from theuni and fanquake September 20, 2023 16:59
@achow101 achow101 closed this Apr 9, 2024
@bitcoin bitcoin locked and limited conversation to collaborators Apr 9, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ThreadDNSAddressSeed hangs on sk_wait_data and doesn't stop on exit
7 participants