-
Notifications
You must be signed in to change notification settings - Fork 37.7k
http: Make server shutdown more robust #31929
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There is a race between ThreadHTTP exiting and our queuing of the freeing-event (event_base_once). Additional free was useful in 709 of 962 runs.
Could happen when connection doesn't complete for some reason. Output early warning to give clue about what's up. Prior check+log message before even starting to wait was a bit too eager to hint at something being wrong.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31929. ReviewsSee the guideline for information on the review process. ConflictsNo conflicts as of last run. |
Assertion failed: (evb), function WriteReply, file httpserver.cpp, line 682.
Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/http_request/83149a7e3abd937c7bc67fa55b9a4fc9338e3d8c"
⚠️ Failure generated from target with exit code 1: ['/Users/runner/work/bitcoin/bitcoin/ci/scratch/build-aarch64-apple-darwin23.6.0/src/test/fuzz/fuzz', PosixPath('/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/http_request')] |
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
Could aid debugging when an HTTP request gets stuck at shutdown.
ef290c0
to
24558c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fuzz-failure was me underestimating a default-initialized argument. Fixed in 2nd push.
// Schedule a callback to call evhttp_free in the event base thread, as | ||
// otherwise eventHTTP often keeps internal events alive, meaning | ||
// aforementioned thread would never run out of events and exit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To verify the improved accuracy of the documentation here one can apply:
diff --git a/src/httpserver.cpp b/src/httpserver.cpp
index 52fa8463d2..a3efaa2579 100644
--- a/src/httpserver.cpp
+++ b/src/httpserver.cpp
@@ -557,6 +557,10 @@ void StopHTTPServer()
// Schedule a callback to call evhttp_free in the event base thread, as
// otherwise eventHTTP often keeps internal events alive, meaning
// aforementioned thread would never run out of events and exit.
+ while (true) {
+ event_base_dump_events(eventBase, stderr);
+ std::this_thread::sleep_for(5s);
+ }
if (event_base_once(eventBase, -1, EV_TIMEOUT, [](evutil_socket_t, short, void*) {
evhttp_free(eventHTTP);
eventHTTP = nullptr;
Run bitcoind, and then send the stop-RPC from bitcoin-cli. It gave me the following log:
2025-02-22T15:18:19Z [http] Stopping HTTP server
2025-02-22T15:18:19Z [http] Waiting for HTTP worker threads to exit
Inserted events:
2025-02-22T15:18:19Z [http] Exited http event loop
0x555556947a50 [fd 11] Read Persist Internal
Active events:
2025-02-22T15:18:19Z net thread exit
2025-02-22T15:18:20Z msghand thread exit
Inserted events:
0x555556947a50 [fd 11] Read Persist Internal
Active events:
Inserted events:
0x555556947a50 [fd 11] Read Persist Internal
Active events:
Inserted events:
0x555556947a50 [fd 11] Read Persist Internal
Active events:
<repeated until killing the process>
From what i remember this was extrememly hard to get right with libevent (or at least our use of it), without leaking anything or running into race conditions. There have been a few tries to fix the behavior over the years, but they all were different compromises, adding ever more compexity, and often introducing new problems. As it seems to be where things are heading anyway (#32061) maybe it's better to focus on replacing libevent and its webserver framework wholesale. |
Thank you for providing your perspective @laanwj! In working on this PR, I've also gained an appreciation for getting rid of libevent. However, I'm not sure when that will happen, maybe not until post-30.0. In the meantime we are seeing issues on CI such as #31894 (analysis). What this PR is doing is adding our own request ids to track which HTTP request never finishes. When we have stuck HTTP requests, it also makes us abort the process with a clear error after 10min30s instead of waiting for the test framework to time out after 40min. On top of that it also fixes an intermittent leak, right after we've joined on the thread that is the only one to sometimes perform deletion. Have come across #26742 and #19420 while working on this, so have seen some of the struggle. Don't have an overall grasp on how frequent issues like 31894 are on CI (or in the wild), it's fair that that frequency should influence the priority of this compared to other work. |
Should help debug #31894.