Skip to content

Conversation

fjahr
Copy link
Contributor

@fjahr fjahr commented Dec 22, 2022

This revives #19420. Since promag is not so active at the moment, I can support this to finally get it merged.

The PR is rebased and comments by jonatack have been addressed.

Once this is merged, I will also reopen #19434.

@DrahtBot
Copy link
Contributor

DrahtBot commented Dec 22, 2022

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK stickies-v, hebasto
Concept ACK brunoerg, promag
Stale ACK kevkevinpal

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

@kevkevinpal
Copy link
Contributor

Running test/functional/feature_abortnode.py before and after change we get a significant performance gain

running before changes f3bc1a7

test/functional/feature_abortnode.py
2022-12-23T04:47:19.826000Z TestFramework (INFO): Initializing test directory /var/folders/9g/cvx014yx4dq5lwl_x5zvn_j80000gn/T/bitcoin_func_test_k3_6xric
2022-12-23T04:47:21.229000Z TestFramework (INFO): Waiting for crash
2022-12-23T04:47:51.429000Z TestFramework (INFO): Node crashed - now verifying restart fails
2022-12-23T04:47:52.086000Z TestFramework (INFO): Stopping nodes
2022-12-23T04:47:52.300000Z TestFramework (INFO): Cleaning up /var/folders/9g/cvx014yx4dq5lwl_x5zvn_j80000gn/T/bitcoin_func_test_k3_6xric on exit
2022-12-23T04:47:52.300000Z TestFramework (INFO): Tests successful

32.50 seconds to run test


running test with these changes applied 40528e7

test/functional/feature_abortnode.py
2022-12-23T04:44:30.075000Z TestFramework (INFO): Initializing test directory /var/folders/9g/cvx014yx4dq5lwl_x5zvn_j80000gn/T/bitcoin_func_test_0h3w_rbx
2022-12-23T04:44:31.516000Z TestFramework (INFO): Waiting for crash
2022-12-23T04:44:31.837000Z TestFramework (INFO): Node crashed - now verifying restart fails
2022-12-23T04:44:32.452000Z TestFramework (INFO): Stopping nodes
2022-12-23T04:44:32.727000Z TestFramework (INFO): Cleaning up /var/folders/9g/cvx014yx4dq5lwl_x5zvn_j80000gn/T/bitcoin_func_test_0h3w_rbx on exit
2022-12-23T04:44:32.727000Z TestFramework (INFO): Tests successful

1.2 seconds to run test

Copy link
Contributor

@stickies-v stickies-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK - thanks for reviving this!

Copy link
Contributor

@brunoerg brunoerg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

@promag
Copy link
Contributor

promag commented Dec 23, 2022

Concept ACK :trollface:

@fjahr fjahr force-pushed the 202212-pr19420 branch 2 times, most recently from 77a6ea8 to f95063e Compare December 25, 2022 18:56
@kevkevinpal
Copy link
Contributor

retACK f95063e with test/functional/feature_abortnode.py

@hebasto
Copy link
Member

hebasto commented Dec 27, 2022

From #15363 (comment):

The libevent HTTP interface is kind of crazy, so I don't think it's surprising that fixing this requires some trial and error. But it would be good to have tests for each issue that gets fixed to avoid going in circles.

Suggesting to add a test which will prevent any regressions in the future.

@promag
Copy link
Contributor

promag commented Jan 3, 2023

From #15363 (comment):

The libevent HTTP interface is kind of crazy, so I don't think it's surprising that fixing this requires some trial and error. But it would be good to have tests for each issue that gets fixed to avoid going in circles.

Suggesting to add a test which will prevent any regressions in the future.

@hebasto like

self.nodes[0].wait_until_stopped(timeout=200)
should take less than 5s or so?

@fjahr
Copy link
Contributor Author

fjahr commented Jan 5, 2023

From #15363 (comment):

The libevent HTTP interface is kind of crazy, so I don't think it's surprising that fixing this requires some trial and error. But it would be good to have tests for each issue that gets fixed to avoid going in circles.

Suggesting to add a test which will prevent any regressions in the future.

@hebasto like

self.nodes[0].wait_until_stopped(timeout=200)

should take less than 5s or so?

I have updated the test and it seems to be reliably failing on master, so I think this is a good enough regression test @hebasto ?

Also addressed other comments by @hebasto .

Copy link
Member

@hebasto hebasto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach ACK 6d3a848.

WAIT_LOCK(g_requests_mutex, lock);
g_requests_cv.wait(lock, []() EXCLUSIVE_LOCKS_REQUIRED(g_requests_mutex) {
if (!g_requests.empty()) {
LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests.size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having side effects in the predicate seems a bad idea due to spurious wakeups. It should be a pure function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, then I think this isn't possible anymore so I am reverting back to how I did it before. I think that's fine though.

Copy link
Contributor

@stickies-v stickies-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach ACK 6d3a848

Can anyone please confirm or correct my understanding of why this so much faster?

My current understanding is that even when all of the workers in g_thread_http_workers have terminated (i.e. the entire queue is processed and running==false), the http server thread g_thread_http does not return (when joining) immediately because there are still some idle connections (with whom?) open, preventing the server from shutting down. By first (instead of afterwards) calling evhttp_free(eventHTTP), we can safely destroy the server even if connections are still open (but all requests are handled), causing g_thread_http.join() to return immediately in StopHTTPServer() as opposed to waiting for a timeout.

I think the commit and PR message would benefit from a more clear reason why this works.

// Track requests and notify when a request is completed.
{
WITH_LOCK(g_requests_mutex, g_requests.insert(req));
g_requests_cv.notify_all();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this notification? I think we (currently) only need to monitor for completed requests, not new ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see #26742 (comment) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't seen that, thanks for linking.

It's a minor concern, but I'm a bit worried that we're creating a very general contract ("g_requests_cv will always notify for any changes to the g_requests") on a global variable when it's enforced outside of g_requests, when in the future we might easily introduce mutations on g_requests without notifying the cv. A more scoped cv, such as g_requests_cv_handled largely mitigates that.

But, again, no big concern, and it's not likely to cause any major problems should we miss something there. As you see fit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I think one approach to improve on this contract (and as a nice benefit, also encapsulate code a bit better - I think this improves readability) in the below diff.

(WaitUntilEmpty() needs improvement because it shouldn't directly be logging about the HTTP server like that. I guess we also don't need a GlobalMutex here.)

git diff on 55bbf04
diff --git a/src/httpserver.cpp b/src/httpserver.cpp
index b481ef093..90ef37bd5 100644
--- a/src/httpserver.cpp
+++ b/src/httpserver.cpp
@@ -134,6 +134,40 @@ struct HTTPPathHandler
     HTTPRequestHandler handler;
 };
 
+class RequestTracker
+{
+public:
+    typedef std::unordered_set<evhttp_request*> requests_t;
+private:
+    GlobalMutex m_requests_mutex;
+    std::condition_variable m_requests_cv;
+    requests_t m_requests GUARDED_BY(m_requests_mutex);
+public:
+    void Register(struct evhttp_request* req)
+    {
+        WITH_LOCK(m_requests_mutex, m_requests.insert(req));
+        m_requests_cv.notify_all();
+    }
+
+    size_t Remove(struct evhttp_request* req)
+    {
+        auto n{WITH_LOCK(m_requests_mutex, return m_requests.erase(req))};
+        m_requests_cv.notify_all();
+        return n;
+    }
+
+    void WaitUntilEmpty()
+    {
+        WAIT_LOCK(m_requests_mutex, lock);
+        if (!m_requests.empty()) {
+            LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", m_requests.size());
+        }
+        m_requests_cv.wait(lock, [this]() EXCLUSIVE_LOCKS_REQUIRED(m_requests_mutex) {
+            return m_requests.empty();
+        });
+    }
+};
+
 /** HTTP module state */
 
 //! libevent event loop
@@ -149,10 +183,8 @@ static GlobalMutex g_httppathhandlers_mutex;
 static std::vector<HTTPPathHandler> pathHandlers GUARDED_BY(g_httppathhandlers_mutex);
 //! Bound listening sockets
 static std::vector<evhttp_bound_socket *> boundSockets;
-//! Track active requests
-static GlobalMutex g_requests_mutex;
-static std::condition_variable g_requests_cv;
-static std::unordered_set<evhttp_request*> g_requests GUARDED_BY(g_requests_mutex);
+static RequestTracker g_request_tracker;
+
 
 /** Check if a network address is allowed to access the HTTP server */
 static bool ClientAllowed(const CNetAddr& netaddr)
@@ -216,12 +248,9 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
 {
     // Track requests and notify when a request is completed.
     {
-        WITH_LOCK(g_requests_mutex, g_requests.insert(req));
-        g_requests_cv.notify_all();
+        g_request_tracker.Register(req);
         evhttp_request_set_on_complete_cb(req, [](struct evhttp_request* req, void*) {
-            auto n = WITH_LOCK(g_requests_mutex, return g_requests.erase(req));
-            assert(n == 1);
-            g_requests_cv.notify_all();
+            assert(g_request_tracker.Remove(req) == 1);
         }, nullptr);
     }
 
@@ -477,15 +506,7 @@ void StopHTTPServer()
         evhttp_del_accept_socket(eventHTTP, socket);
     }
     boundSockets.clear();
-    {
-        WAIT_LOCK(g_requests_mutex, lock);
-        if (!g_requests.empty()) {
-            LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests.size());
-        }
-        g_requests_cv.wait(lock, []() EXCLUSIVE_LOCKS_REQUIRED(g_requests_mutex) {
-            return g_requests.empty();
-        });
-    }
+    g_request_tracker.WaitUntilEmpty();
     if (eventHTTP) {
         // Schedule a callback to call evhttp_free in the event base thread, so
         // that evhttp_free does not need to be called again after the handling

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but it's a significant change and this PR already has a pretty long history and several ACKs for the current approach. I would suggest you open a follow-up PR with this as an additional improvement.

@fjahr
Copy link
Contributor Author

fjahr commented Jan 9, 2023

Can anyone please confirm or correct my understanding of why this so much faster?

My current understanding is that even when all of the workers in g_thread_http_workers have terminated (i.e. the entire queue is processed and running==false), the http server thread g_thread_http does not return (when joining) immediately because there are still some idle connections (with whom?) open, preventing the server from shutting down. By first (instead of afterwards) calling evhttp_free(eventHTTP), we can safely destroy the server even if connections are still open (but all requests are handled), causing g_thread_http.join() to return immediately in StopHTTPServer() as opposed to waiting for a timeout.

Sounds good to me. I guess libevent doesn't close the idle connection as an optimization (less overhead if the client wants to send more requests) and if the client doesn't close them then we are seeing this issue. I would add that the overhead that we see in the functional test is the timeout we have configured on the http server ourselves here.

@promag
Copy link
Contributor

promag commented Jan 9, 2023

I guess libevent doesn't close the idle connection as an optimization (less overhead if the client wants to send more requests) and if the client doesn't close them then we are seeing this issue.

Not really an optimization of libevent, AuthServiceProxy used in the test framework creates a persistent connection (to reuse the same connection for consecutive HTTP requests, as @fjahr says).

Copy link
Contributor

@promag promag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK. With this change evhttp_free is called after all requests are completed. Calling evhttp_free at this point will cause all active connections to close and nothing remains on the event base, and all RPC threads are ready to be joined. As mentioned above, this is useful to avoid hitting the idle timeout.

Note that shutdown still hangs while a request is in-flight.

Copy link
Contributor

@stickies-v stickies-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 55bbf04

I have some open suggestions, but I'm happy with the PR as it is.

// Track requests and notify when a request is completed.
{
WITH_LOCK(g_requests_mutex, g_requests.insert(req));
g_requests_cv.notify_all();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I think one approach to improve on this contract (and as a nice benefit, also encapsulate code a bit better - I think this improves readability) in the below diff.

(WaitUntilEmpty() needs improvement because it shouldn't directly be logging about the HTTP server like that. I guess we also don't need a GlobalMutex here.)

git diff on 55bbf04
diff --git a/src/httpserver.cpp b/src/httpserver.cpp
index b481ef093..90ef37bd5 100644
--- a/src/httpserver.cpp
+++ b/src/httpserver.cpp
@@ -134,6 +134,40 @@ struct HTTPPathHandler
     HTTPRequestHandler handler;
 };
 
+class RequestTracker
+{
+public:
+    typedef std::unordered_set<evhttp_request*> requests_t;
+private:
+    GlobalMutex m_requests_mutex;
+    std::condition_variable m_requests_cv;
+    requests_t m_requests GUARDED_BY(m_requests_mutex);
+public:
+    void Register(struct evhttp_request* req)
+    {
+        WITH_LOCK(m_requests_mutex, m_requests.insert(req));
+        m_requests_cv.notify_all();
+    }
+
+    size_t Remove(struct evhttp_request* req)
+    {
+        auto n{WITH_LOCK(m_requests_mutex, return m_requests.erase(req))};
+        m_requests_cv.notify_all();
+        return n;
+    }
+
+    void WaitUntilEmpty()
+    {
+        WAIT_LOCK(m_requests_mutex, lock);
+        if (!m_requests.empty()) {
+            LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", m_requests.size());
+        }
+        m_requests_cv.wait(lock, [this]() EXCLUSIVE_LOCKS_REQUIRED(m_requests_mutex) {
+            return m_requests.empty();
+        });
+    }
+};
+
 /** HTTP module state */
 
 //! libevent event loop
@@ -149,10 +183,8 @@ static GlobalMutex g_httppathhandlers_mutex;
 static std::vector<HTTPPathHandler> pathHandlers GUARDED_BY(g_httppathhandlers_mutex);
 //! Bound listening sockets
 static std::vector<evhttp_bound_socket *> boundSockets;
-//! Track active requests
-static GlobalMutex g_requests_mutex;
-static std::condition_variable g_requests_cv;
-static std::unordered_set<evhttp_request*> g_requests GUARDED_BY(g_requests_mutex);
+static RequestTracker g_request_tracker;
+
 
 /** Check if a network address is allowed to access the HTTP server */
 static bool ClientAllowed(const CNetAddr& netaddr)
@@ -216,12 +248,9 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
 {
     // Track requests and notify when a request is completed.
     {
-        WITH_LOCK(g_requests_mutex, g_requests.insert(req));
-        g_requests_cv.notify_all();
+        g_request_tracker.Register(req);
         evhttp_request_set_on_complete_cb(req, [](struct evhttp_request* req, void*) {
-            auto n = WITH_LOCK(g_requests_mutex, return g_requests.erase(req));
-            assert(n == 1);
-            g_requests_cv.notify_all();
+            assert(g_request_tracker.Remove(req) == 1);
         }, nullptr);
     }
 
@@ -477,15 +506,7 @@ void StopHTTPServer()
         evhttp_del_accept_socket(eventHTTP, socket);
     }
     boundSockets.clear();
-    {
-        WAIT_LOCK(g_requests_mutex, lock);
-        if (!g_requests.empty()) {
-            LogPrint(BCLog::HTTP, "Waiting for %d requests to stop HTTP server\n", g_requests.size());
-        }
-        g_requests_cv.wait(lock, []() EXCLUSIVE_LOCKS_REQUIRED(g_requests_mutex) {
-            return g_requests.empty();
-        });
-    }
+    g_request_tracker.WaitUntilEmpty();
     if (eventHTTP) {
         // Schedule a callback to call evhttp_free in the event base thread, so
         // that evhttp_free does not need to be called again after the handling

promag and others added 2 commits February 10, 2023 20:35
This was made obsolete by tracking the active requests and explicitly waiting for them to finish before shutdown.
@fjahr
Copy link
Contributor Author

fjahr commented Feb 10, 2023

Addressed comments by @promag and @stickies-v

Copy link
Contributor

@stickies-v stickies-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-ACK 60978c8

@DrahtBot DrahtBot requested a review from kevkevinpal February 11, 2023 12:36
Copy link
Member

@hebasto hebasto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 60978c8


#include <sys/types.h>
#include <sys/stat.h>

#include <event2/buffer.h>
#include <event2/bufferevent.h>
#include <event2/http.h>
#include <event2/http_struct.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: IWYU suggests to s/#include <event2/http_struct.h>/#include <event2/event.h>/.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I will fix this in the follow-up that will pick up #19434.

@bitcoin bitcoin deleted a comment Mar 4, 2023
@fanquake fanquake requested review from john-moffett and removed request for kevkevinpal March 5, 2023 09:45
@DrahtBot DrahtBot requested a review from kevkevinpal March 6, 2023 12:49
@achow101
Copy link
Member

achow101 commented Mar 7, 2023

ACK 60978c8

@achow101 achow101 merged commit 86bacd7 into bitcoin:master Mar 7, 2023
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Mar 7, 2023
… finish - 2nd attempt

60978c8 test: Reduce extended timeout on abortnode test (Fabian Jahr)
660bdbf http: Release server before waiting for event base loop exit (João Barbosa)
8c6d007 http: Track active requests and wait for last to finish (João Barbosa)

Pull request description:

  This revives bitcoin#19420. Since promag is not so active at the moment, I can support this to finally get it merged.

  The PR is rebased and comments by jonatack have been addressed.

  Once this is merged, I will also reopen bitcoin#19434.

ACKs for top commit:
  achow101:
    ACK 60978c8
  stickies-v:
    re-ACK [60978c8](bitcoin@60978c8)
  hebasto:
    ACK 60978c8

Tree-SHA512: eef0fe1081e9331b95cfafc71d82f2398abd1d3439dac5b2fa5c6d9c0a3f63ef19adde1c38c88d3b4e7fb41ce7c097943f1815c10e33d165918ccbdec512fe1c
fanquake added a commit that referenced this pull request Oct 4, 2023
…lient disconnection

68f23f5 http: bugfix: track closed connection (stickies-v)
084d037 http: log connection instead of request count (stickies-v)
41f9027 http: refactor: use encapsulated HTTPRequestTracker (stickies-v)

Pull request description:

  #26742 significantly increased the http server shutdown speed, but also introduced a bug (#27722 - see #27722 (comment) for steps to reproduce on master) that causes http server shutdown to halt in case of a remote client disconnection. This happens because `evhttp_request_set_on_complete_cb` is never called and thus the request is never removed from `g_requests`.

  This PR fixes that bug, and improves robustness of the code by encapsulating the request tracking logic. Earlier approaches (#27909, #27245, #19434) attempted to resolve this but [imo are fundamentally unsafe](#27909 (comment)) because of differences in lifetime between an `evhttp_request` and `evhttp_connection`.

  We don't need to keep track of open requests or connections, we just [need to ensure](#19420 (comment)) that there are no active requests on server shutdown. Because a connection can have multiple requests, and a request can be completed in various ways (the request actually being handled, or the client performing a remote disconnect), keeping a counter per connection seems like the approach with the least overhead to me.

  Fixes #27722

ACKs for top commit:
  vasild:
    ACK 68f23f5
  theStack:
    ACK 68f23f5

Tree-SHA512: dfa711ff55ec75ba44d73e9e6fac16b0be25cf3c20868c2145a844a7878ad9fc6998d9ff62d72f3a210bfa411ef03d3757b73d68a7c22926e874c421e51444d6
Frank-GER pushed a commit to syscoin/syscoin that referenced this pull request Oct 13, 2023
…emote client disconnection

68f23f5 http: bugfix: track closed connection (stickies-v)
084d037 http: log connection instead of request count (stickies-v)
41f9027 http: refactor: use encapsulated HTTPRequestTracker (stickies-v)

Pull request description:

  bitcoin#26742 significantly increased the http server shutdown speed, but also introduced a bug (bitcoin#27722 - see bitcoin#27722 (comment) for steps to reproduce on master) that causes http server shutdown to halt in case of a remote client disconnection. This happens because `evhttp_request_set_on_complete_cb` is never called and thus the request is never removed from `g_requests`.

  This PR fixes that bug, and improves robustness of the code by encapsulating the request tracking logic. Earlier approaches (bitcoin#27909, bitcoin#27245, bitcoin#19434) attempted to resolve this but [imo are fundamentally unsafe](bitcoin#27909 (comment)) because of differences in lifetime between an `evhttp_request` and `evhttp_connection`.

  We don't need to keep track of open requests or connections, we just [need to ensure](bitcoin#19420 (comment)) that there are no active requests on server shutdown. Because a connection can have multiple requests, and a request can be completed in various ways (the request actually being handled, or the client performing a remote disconnect), keeping a counter per connection seems like the approach with the least overhead to me.

  Fixes bitcoin#27722

ACKs for top commit:
  vasild:
    ACK 68f23f5
  theStack:
    ACK 68f23f5

Tree-SHA512: dfa711ff55ec75ba44d73e9e6fac16b0be25cf3c20868c2145a844a7878ad9fc6998d9ff62d72f3a210bfa411ef03d3757b73d68a7c22926e874c421e51444d6
@bitcoin bitcoin locked and limited conversation to collaborators Mar 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants