Skip to content

Conversation

4tar
Copy link
Contributor

@4tar 4tar commented May 22, 2014

...stage quickly

The dnsseed thread is starting by default and takes quick some time, if the user wants to stop the bitcoind by Ctrl-C in command line or a kill in daemon mode when the dnsseed is still working, s/he would have to wait for it to finish all the job, in such situation, an extra 'kill -9' would be expected.

By adding two interruption points just before the time-consuming operations we give the dnsseed thread a chance to realize the user's intention and quit its job much quickly, usually in seconds instead of several minutes or more depends on network situation.

Signed-off-by: Huang Le 4tarhl@gmail.com

…ly stage quickly

Signed-off-by: Huang Le <4tarhl@gmail.com>
@laanwj
Copy link
Member

laanwj commented May 23, 2014

untested ACK, code changes OK

@sipa
Copy link
Member

sipa commented May 23, 2014

untested ACK

@4tar
Copy link
Contributor Author

4tar commented May 23, 2014

I did the testing locally, any thing I need to do for pushing the commit?

… responsiveness of net threads, typically the dnsseed thread

Signed-off-by: Huang Le <4tarhl@gmail.com>
@4tar
Copy link
Contributor Author

4tar commented May 24, 2014

By using async getaddrinfo_a() instead of getaddrinfo() in netbase.cpp::LookupIntern(), the responsiveness issue of dnsseed thread is finally resolved.

And to note: With the 2nd commit in this pull request, the 1st one is not necessary now, although leaving it there would cause no problem.

@fanquake
Copy link
Member

@laanwj @sipa You'll need to re-review/re-ACK this pull.

…ript, and modify corresponding conditional directive in netbase.cpp to ensure we can use the lib properly.

Signed-off-by: Huang Le <4tarhl@gmail.com>
@4tar
Copy link
Contributor Author

4tar commented May 24, 2014

Sorry, the previous building error was caused by missing the libanl, added it in last commit.

Have done a clean rebuilding and testing locally, please kindly have a review. Thanks.

@@ -367,6 +367,7 @@ if test x$TARGET_OS = xdarwin; then
fi

AC_CHECK_HEADERS([stdio.h stdlib.h unistd.h strings.h sys/types.h sys/stat.h])
AC_CHECK_LIB(anl, getaddrinfo_a)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please tie this check to the function and not glibc's usage, in case other libc's (uclibc, musl, etc) implement it as part of libc itself:

AC_SEARCH_LIBS([getaddrinfo_a], [anl], [AC_DEFINE(HAVE_GETADDRINFO_A, 1,[Define this symbol if you have getaddrinfo_a])])

Then in code, use: #if defined(HAVE_GETADDRINFO_A)

…anl, for better compatibility with other libc implementations different from glibc

Signed-off-by: Huang Le <4tarhl@gmail.com>
@4tar
Copy link
Contributor Author

4tar commented May 24, 2014

@theuni Thanks, done as you suggested in last commit.

struct addrinfo *aiRes = NULL;
#if defined(_GNU_SOURCE) && defined(HAVE_GETADDRINFO_A)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the defined(_GNU_SOURCE) necessary here? (i.e. isn't HAVE_GETADDRINFO_A in itself enough to know that the function exists?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it's unnecessary in my dev environment, but I just added it for if someone was building with an older toolchain, although such a toolchain might fail to get HAVE_GETADDRINFO_A defined by configure script at the first place, but just in case.

Sure it should be pretty safe to remove the macro definition check, I'd like to do that if you think that is desired, but it could be even safer (ok, quite little) to leave it there:)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laanwj Yes, I consider the situation for a while and come to agree on your point: we should remove the _GNU_SOURCE macro definition checking because someone might build it with another libc implementation, and in such a case, the _GNU_SOURCE was unlikely defined by the compiler but the getaddrinfo_a() could have been implemented in the c library itself.

Will submit a commit immediately. Thanks for your kind review!

1. The _GNU_SOURCE macro definition checking in netbase.cpp, the HAVE_GETADDRINFO_A is enough for using getaddrinfo_a();
2. The interruption points added in net.cpp::ThreadDNSAddressSeed(), using getaddrinfo_a() instead of getaddrinfo() is enough to make the dnsseed thread responsive.

Signed-off-by: Huang Le <4tarhl@gmail.com>
@4tar
Copy link
Contributor Author

4tar commented May 25, 2014

@laanwj Done. Hope it is the last commit for this pull request. :)

@theuni
Copy link
Member

theuni commented May 26, 2014

Does this launch a new thread for each resolve? If so, I think it would be necessary to do a quick probe at random times to see what the thread count looks like before/after this change. If it's a drastic change, this will need some sort of rate-limiting.

Also (not necessary for this PR), it might be beneficial to maintain a global resolve list in some class so that a single tight loop can be checking the status of all resolves simultaneously, rather than each resolve getting its own loop. From the examples of getaddrinfo_a in the man pages, it seems that's how it is meant to be used.

@4tar
Copy link
Contributor Author

4tar commented May 27, 2014

@theuni Yes, the getaddrinfo_a() call does kick off a new thread for resolving the name and poll on it in its own thread. So when there is one resolving job, there would be one more thread than before this change.

However, the real name resolving job happens mostly, if not only, in dnsseed thread. To all other calls to the LookupIntern() function, they are generally kicked off by the constructor of CNetAddr which actually just want to convert a string address into a network address structure, and this kind of job, which is not a real name resolving one, can be done without getaddrinfo(), but a lighter inet_aton(). By detecting this case, we can dramatically reduce the overhead of getaddrinfo(), and in general there should be only one more thread and only when dnsseed thread is running.

Will submit a patch for this PR according to the above soon.

For the global resolver idea, one thing need to pointed out is that it doesn't reduce the thread count but add one (the global resolver itself) instead, while it is extremely helpful for the throughput if there were lots of real async name resolving jobs kicked off simultaneously and the kickers don't need the result immediately. But as mentioned above, the dnsseed seems the only one needs the result of a real name resolving, and it works in a sync manner, so the global resolver doesn't help here. Actually why we switch to a async getaddrinfo_a() is just for responsiveness purpose but not throughput. We can absolutely rewrite the dnsseed thread to let it work in an async manner and improve its throughput, but given it's only running for several minutes, or tens of minutes if the network situation is really bad, before exiting, it doesn't worth the effort I guess. Or maybe later if we figure out the effort is really needed.

4tar added 2 commits May 27, 2014 13:23
In LookupIntern(), the getaddrinfo_a() call would be issued if it's avaiable, which in turn would kick off a new thread for async name resolving job.  In our code, most calls go to LookupIntern() are just trying to convert a text ip address into a struct but not a real name resolving request, so we should do that first to avoid extra thread overhead.

Signed-off-by: Huang Le <4tarhl@gmail.com>
Signed-off-by: Huang Le <4tarhl@gmail.com>
@laanwj
Copy link
Member

laanwj commented May 27, 2014

@theuni, @4tar This is starting to be too complicated. Don't get too deep in this rabbit hole. I would accept this simple patch, but I don't want to merge a whole thread/job management system just so that interruption can happen sooner.

We already have boost::asio for that, if really necessary. No need to replicate its functionality.

But is the number of threads created by the async resolve really that bad that we need some complex system to work around it? Are we using the resolver this intensively at all? Does it create DoS risk?

@4tar
Copy link
Contributor Author

4tar commented May 27, 2014

Interesting, looks like the i586-mingw32msvc-g++ building env doesn't have inet_pton() defined. Will submit a patch to AC_SEARCH_LIB for it.

Looks like mingw env doesn't define this function yet, so we have to search it first to avoid breaking building in mingw env.

Signed-off-by: Huang Le <4tarhl@gmail.com>
@4tar
Copy link
Contributor Author

4tar commented May 27, 2014

@laanwj I agree with you that we don't really need an async resolver now, so I just add a inet_pton()/inet_addr() call in the LookupIntern() before issuing the getaddrinfo_a(), so that we can avoid kicking off a new resolving thread inside the latter, when the caller just want to convert an ip addr text to an ip addr structure, which, as we have discussed above, is the most cases that the LookupIntern() get called. By doing this, we have already removed the possible thread count problems.

Hope I have cleared mingw env building error in last commit...

@theuni
Copy link
Member

theuni commented May 27, 2014

@4tar: I'm not very familiar with the networking code, so I'll give your analysis the benefit of the doubt. Thanks for addressing my thread-count concern.

As for inet_pton, please add "nsl resolv" to the list of libs where it may be found (the empty 2nd param).

@4tar
Copy link
Contributor Author

4tar commented May 27, 2014

@theuni Thanks for your review! Setting up a correct lib list is always a mystery to me...

Signed-off-by: Huang Le <4tarhl@gmail.com>
@BitcoinPullTester
Copy link

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/978d2881b00fbc98789b64e8ed2ec4bd2e8fadc5 for binaries and test log.
This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/
Contact BlueMatt on freenode if something looks broken.

@4tar
Copy link
Contributor Author

4tar commented May 29, 2014

@laanwj Anything I can do to make this PR be merged a little bit soon? It passed the test and I believe it is ready.

The reason I'm asking is that I made a mistake to commit directly on my master branch and send the PR, so the pending causes big issue on my branch management...

@4tar
Copy link
Contributor Author

4tar commented May 30, 2014

OK, I'm closing this RP and will re-submit it from another branch...

@4tar 4tar closed this May 30, 2014
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants