-
Notifications
You must be signed in to change notification settings - Fork 37.7k
net: call getaddrinfo() in detachable thread to prevent stalling #27557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process. ConflictsNo conflicts as of last run. |
Or just anytime you don't have an internet connection? We're not going to add "internet connectivity" as a requirement to run the unit tests (putting aside all other potential issues with doing this). |
src/netbase.cpp
Outdated
std::thread thread(AsyncGetAddrInfo, &req); | ||
int checks{0}; | ||
while (true) | ||
{ | ||
{ | ||
LOCK(gai_mutex); | ||
if (req.complete) break; | ||
} | ||
// Check every 100ms for 2 seconds | ||
if (++checks > 20) break; | ||
std::this_thread::sleep_for(std::chrono::milliseconds(100)); | ||
} | ||
thread.detach(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe use https://en.cppreference.com/w/cpp/thread/future/wait_until instead?
src/netbase.cpp
Outdated
if (allow_lookup) { | ||
// Execute the DNS lookup in another thread and check it periodically for completion. | ||
// After sufficiently waiting, abandon the thread entirely. The user's system | ||
// may wait forever for a DNS response that never returns. | ||
std::thread thread(AsyncGetAddrInfo, &req); | ||
int checks{0}; | ||
while (true) | ||
{ | ||
{ | ||
LOCK(gai_mutex); | ||
if (req.complete) break; | ||
} | ||
// Check every 100ms for 2 seconds | ||
if (++checks > 20) break; | ||
std::this_thread::sleep_for(std::chrono::milliseconds(100)); | ||
} | ||
thread.detach(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performance wise, create a new thread per lookup is quite bad. Should use a workers pool instead.
I have implemented a generic one for #26966 (f448668) that could also use here. It also support callback futures so you can remove the costly active wait as well.
Give it a look and if you like it, could decouple it into a standalone PR so we could start using it in both PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afaict this won't work here, because the thread might not finish and then the pool would just get clogged (see comment about abandoning the thread).
Performance should also not really matter here since this doesn't happen that often and startup time for a thread vs. time of a DNS lookup is probably negligible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afaict this won't work here, because the thread might not finish and then the pool would just get clogged (see comment about abandoning the thread).
k, getaddrinfo
doesn't have a timeout and getaddrinfo_a
seems to have a segfault. Cool..
Performance should also not really matter here since this doesn't happen that often and startup time for a thread vs. time of a DNS lookup is probably negligible.
I was thinking more about context switching. Isn't this used for OpenNetworkConnection
too? So, it would create a new thread for every connection attempt? (unless the lookups are numeric, which guess that is the rule here..).
And, following the worst case scenario, what if the user call addnode
a good number of times? we would get a lot of detached threads? (unless we set a limit for the amount of lookups that can be resolved in parallel).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k, getaddrinfo doesn't have a timeout and getaddrinfo_a seems to have a segfault. Cool..
Yea one would think that making DNS requests in c++ would be a solved problem.
Anyway, I'm also not sure about the approach here. The original issue was that we would block indefinitely while joining the DNS thread on shutdown (only happens if the user's system is broken I think?). I suggested to detach the dns thread if it doesn't join in a reasonable amount of time (would be fine, the OS will clean the thread up since we are about to exit anyway).
The approach here risks having threads hang around for the lifetime of a node, which seems worse than the occasional shutdown hang on broken systems.
56a3cbd
to
4150942
Compare
Still waiting for CI to finish but I removed the need for internet access by looking up the name (run in docker with no ipv6)
I'm interested in following the worker pool idea to make these requests async as opposed to new threads. |
4150942
to
1d57811
Compare
added call to |
1d57811
to
c292288
Compare
...got thread sanitizer errors running locally on main when my local DNS is black-holed. Fixed that with a shared pointer. |
c292288
to
3db04cc
Compare
3db04cc
to
c3ccbb2
Compare
ty. 🤞 |
c3ccbb2
to
2910e59
Compare
if (allow_lookup) { | ||
// Execute the DNS lookup in another thread and check it periodically for completion. | ||
// After sufficiently waiting, abandon the thread entirely. The user's system | ||
// may wait forever for a DNS response that never returns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not good to ignore it never returning... it's leaking resources. Maybe kill it and/or tell the user somehow? :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a log message but not sure what else to do. The problem is that getaddrinfo()
is blocking and getaddrinfo_a()
segfaults 😭 As long as the DNS request resolves eventually the thread will close, even if bitcoind has given up and detached the thread. If getaddrinfo()
hangs forever then yes you're right that is a leak, but the user also has a serious networking issue.
src/test/netbase_tests.cpp
Outdated
{ | ||
// Non-numeric lookup that does not require actual DNS service | ||
BOOST_CHECK(LookupHost("localhost", /*fAllowLookup=*/true).has_value()); | ||
BOOST_CHECK(!LookupHost("fail.doesnotexist", /*fAllowLookup=*/true).has_value()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this still make an externally-visible DNS request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep you are absoutley right, I'll remove fail.doesnotexist
(localhost will NOT make an external request)
2910e59
to
be52064
Compare
Closes #16778
Replaces #27505
The library call
getaddrinfo()
has no internal timeout and depending on the user's system may wait indefinitely for a response when looking up a hostname in the DNS. By making that call in a separate thread, we can abandon it completely after some timeout (currently in this PR, 2 seconds).TODO: