Add a -rpckeepalive and disable RPC use of HTTP persistent connections. #5655

gmaxwell · 2015-01-14T01:54:15Z

It turns out that some miners have been staying with old versions of
Bitcoin Core because their software behaves poorly with persistent
connections and the Bitcoin Core thread and connection limits.

What happens is that underlying HTTP libraries leave connections open
invisibly to their users and then the user runs into the default four
thread limit. This looks like Bitcoin Core is unresponsive to RPC.

There are many things that should be improved in Bitcoin Core's behavior
here, e.g. supporting more concurrent connections, not tying up threads
for idle connections, disconnecting kept-alive connections when limits
are reached, etc. All are fairly big, risky changes.

Disabling keep-alive is a simple workaround. It's often not easy to turn
off the keep-alive support in the client where it may be buried in some
platform library.

If you are one of the few who really needs persistent connections you
probably know that you want them and can find a switch; while if you
don't and the misbehavior is hitting you it is hard to discover the
source of your problems is keepalive related. Given that it is best
to default to off until they're handled better.

sipa · 2015-01-14T02:06:11Z

Nitty suggestion: call it -rpckeepalive with default value 1.

gmaxwell · 2015-01-14T02:18:07Z

Fair enough.

jonasschnelli · 2015-01-14T04:47:55Z

Somehow related to #5526.

laanwj · 2015-01-14T07:24:07Z

Is there any practical advantage to having http keep-alive support in Bitcoin Core?
e.g. the "pipeline a lot of requests" case we already support RPC batching.
RPC is meant as a local mechanism, so any latency for setting up a new TCP connection will be dwarfed by, say, parsing overhead.

Looks like something that is reasonably hard to handle right in client libraries, although OTOH they could just send a Connection: close header.

On the server side, a well-behaved implementation would disconnect if no new request comes in after ~15 seconds. Which we don't.

Also taking #5526 into account: +1 for nuking this functionality (or at least disabling by default) for now.

jonasschnelli · 2015-01-14T08:04:23Z

Because i recently analyzed the http behavior i allow myself to comment on this:
IMO the problem of blocking or say open connections from/to clients is the missing timeout implementation. Disabling http keep-alive option alone will not prevent from dead-and-open connection because the implementation currently uses the blocking getline(). If a malfunction client is not sending a valid http request, it can end up with a never closed connection until a restart of bitcoind.

But right, the keep-alive option is questionable. I assume most RPC connection are coming from localhost or LAN. IMO there is no big benefit of saving http overhead with keep-alive. Therefor i can ACK disabling keep-alive in general or over the flag.

But still there is a need for implementing a non-blocking way of handling connection with a proper http timeout according to #5526 (see @jgarzik info there).

gmaxwell · 2015-01-14T08:35:58Z

Based on conversation with @laanwj I've switched this off by default. This is actually my preference; but I expected to be alone in the opinion that we should just shut it off until it works better, at least for 0.10.

jonasschnelli · 2015-01-14T09:43:49Z

ACK
nit: you need to remove or rewrite the recently added rcp-test (httpbasics.py)

laanwj · 2015-01-14T11:54:26Z

@jonaschnelli Looks like the test framework has no easy way to pass extra arguments to bitcoind :(
Added it in https://github.com/laanwj/bitcoin/tree/2015_01_fix_tests_for_5655

@gmaxwell you can cherry-pick laanwj@4c65e99 to make this pass travis again

jonasschnelli · 2015-01-14T12:16:11Z

@laanwj: test looks good but I would recommend to also add a tiny check if keep-alive is disabled when starting a node without the new arg.

I can also add a commit on top of yours if you like to have this additional test but lack of time to do it.

laanwj · 2015-01-14T12:19:02Z

@jonasschnelli more tests is always good, you could e.g. start one of the four nodes without the option (it passes extraargs for every node individually) and then try that out. I don't have time right now.

Diapolo · 2015-01-14T14:25:08Z

src/init.cpp

@@ -380,6 +380,7 @@ std::string HelpMessage(HelpMessageMode mode)
    strUsage += "  -rpcport=<port>        " + strprintf(_("Listen for JSON-RPC connections on <port> (default: %u or testnet: %u)"), 8332, 18332) + "\n";
    strUsage += "  -rpcallowip=<ip>       " + _("Allow JSON-RPC connections from specified source. Valid for <ip> are a single IP (e.g. 1.2.3.4), a network/netmask (e.g. 1.2.3.4/255.255.255.0) or a network/CIDR (e.g. 1.2.3.4/24). This option can be specified multiple times") + "\n";
    strUsage += "  -rpcthreads=<n>        " + strprintf(_("Set the number of threads to service RPC calls (default: %d)"), 4) + "\n";
+    strUsage += "  -rpckeepalive          " + _("RPC support for HTTP persistent connections (default: 0)") + "\n";


Can you please use the format from rpcthreads, which allows changing defaults without the need for re-translating the help message (use strprintf)?

jonasschnelli · 2015-01-14T15:37:52Z

@gmaxwell you might also pull in jonasschnelli@c9792d1
This will extend the tests.

It turns out that some miners have been staying with old versions of Bitcoin Core because their software behaves poorly with persistent connections and the Bitcoin Core thread and connection limits. What happens is that underlying HTTP libraries leave connections open invisibly to their users and then the user runs into the default four thread limit. This looks like Bitcoin Core is unresponsive to RPC. There are many things that should be improved in Bitcoin Core's behavior here, e.g. supporting more concurrent connections, not tying up threads for idle connections, disconnecting kept-alive connections when limits are reached, etc. All are fairly big, risky changes. Disabling keep-alive is a simple workaround. It's often not easy to turn off the keep-alive support in the client where it may be buried in some platform library. If you are one of the few who really needs persistent connections you probably know that you want them and can find a switch; while if you don't and the misbehavior is hitting you it is hard to discover the source of your problems is keepalive related. Given that it is best to default to off until they're handled better.

gmaxwell · 2015-01-14T18:06:25Z

@Diapolo Thanks for the air-cover on the translation strings.

jonasschnelli · 2015-01-15T08:09:22Z

Tested again ACK.

1dd8ee7 improve tests for #5655 (Jonas Schnelli) 56c1093 fix tests for #5655 (Wladimir J. van der Laan) 16a5c18 Add a -rpckeepalive and disable RPC use of HTTP persistent connections. (Gregory Maxwell)

It turns out that some miners have been staying with old versions of Bitcoin Core because their software behaves poorly with persistent connections and the Bitcoin Core thread and connection limits. What happens is that underlying HTTP libraries leave connections open invisibly to their users and then the user runs into the default four thread limit. This looks like Bitcoin Core is unresponsive to RPC. There are many things that should be improved in Bitcoin Core's behavior here, e.g. supporting more concurrent connections, not tying up threads for idle connections, disconnecting kept-alive connections when limits are reached, etc. All are fairly big, risky changes. Disabling keep-alive is a simple workaround. It's often not easy to turn off the keep-alive support in the client where it may be buried in some platform library. If you are one of the few who really needs persistent connections you probably know that you want them and can find a switch; while if you don't and the misbehavior is hitting you it is hard to discover the source of your problems is keepalive related. Given that it is best to default to off until they're handled better. Github-Merge: #5655 Rebased-From: 16a5c18 56c1093 1dd8ee7

laanwj · 2015-01-15T09:19:00Z

Cherry-picked to 0.10 branch via aaf55d2

jgarzik · 2015-01-15T13:39:02Z

Did anyone research the reason for adding keep-alive support?

Turning this option on by default will now break those clients that were fixed, yes?

jonasschnelli · 2015-01-15T13:47:39Z

#1101 is related, was basically merged via luke-jr@7ee8f04

See also: #214 (comment)

Once there was a open PR of a asio implementation of the RPC server (#214 wonder why this never made it to the master).

laanwj · 2015-01-15T14:14:58Z

@jgarzik The reason clients break is a combination of a bug in the client - not effectively reusing open connections - and misbehavior at our side - not properly timing out unused connections.

Disabling keep-alive is always safe: clients know that they have to reconnect if the server slams the connection (especially as the Connection: Close header is sent).

The option can be safely reenabled by default again as soon as the problems on our side are fixed. Hopefully the client issues are also solved by that time, but it's not necessary as we'll be robust against them.

@jonasschnelli Probably because there was no testing suite for RPC back then. Swapping the RPC server implementation is extremely risky if you are working blindly.

jgarzik · 2015-01-15T19:00:23Z

@laanwj "Disabling keep-alive is always safe" Wrong. This demonstrates the need to research why a code change was added before disabling it!

The keep-alive feature was added [in part] because the repeated open/close cycles of many clients would eventually cause a hang due to so many local sockets in TIME_WAIT.

Disabling keep-alive will resurrect that problem, introducing hangs into working client scenarios.

Telling users with working setups to rewrite their working code to use batches is a big request.

jonasschnelli · 2015-01-15T19:33:09Z

@jgarzik's concerns sounds realistic to me.
What if we just switch the default for -rpckeepalive from the current false to true?
Then users with client behaviors after @gmaxwell can still disable it.

But the long term goal are unclear to me now.
Would it make sense to rewrite the RPC server into the direction of https://github.com/jgarzik/rpcsrv?
Gavin also wrote today on IRC:

If we did RPC again, I think a binary zeromq channel would be the way to go. Maybe with a little python listener that did JSON, if you want JSON.....

Maybe there should be a consensus direction to where the RPC implementations should go to so somebody could start working on it.

jgarzik · 2015-01-15T20:42:56Z

This is an example of a bug that this change re-introduces to bitcoind: #344

Or just google for bitcoind TIME_WAIT.

Merging this into 0.10 creates obvious regressions.

gmaxwell · 2015-01-15T21:14:48Z

Jeff, I believe I read every commit and PR related to the introduction of persistent connections. Not a single one of them mentioned this. I also initially created this a a default no change in behavior in the interest of being conservative, but unfortunately; you can't tell if it's the broken persistent connection support thats breaking you: it just looks like Bitcoin core is unresponsive.

In the future if a commit fixes a problem it would be helpful if it were mentioned in the commit message.

The last round of changes that impacted our persistent connections was a massive regression which caused complaints from many users-- many more than the time_wait complaint. Apparently after our failure to do anything about it many people just stuck on 0.8.1. This is particularly problematic because some of the most frequently impacted parties were miners. I've also been told at conferences by various service providers that they'd switched to using hosted APIs due to Bitcoin core being unresponsive, which made no sense to me at the time. In the new light of being reminded why miners were staying stuck on 0.8.1, it finally does make sense.

I'm open to having persist default to on (after all, that was my original PR), but I think that doing so may be a mistake unless we really have reason to believe the timewait problems are more frequent than the others. I also think persistent connections are not the right solution for timewait in any case: isn't the correct solution SO_REUSEADDR?

(I'm travelling today and won't be submitting any commits, so don't feel like waiting for me is required)

jgarzik · 2015-01-15T21:29:38Z

The original genesis of keep-alive support is that it was originally written (by JoelKatz?) to work around problems that miners were seeing with the original open/close/open/close behavior that leaves behind reams of TIME_WAIT etc. sockets, eventually leading to a non-responsive bitcoind.

RE SO_REUSEADDR, that is for servers restarting, not clients reconnecting. It does not help this situation, where the server (pre-KA & post-KA) is not closing+reopening the server socket.

Defaulting KA-off in 0.10 with little testing and thought to user impact is not a good choice so late in the release cycle.

For users whom this will actually break, the only salve we have offered is
e.g. the "pipeline a lot of requests" case we already support RPC batching.

That is not what we should be telling production users with code that works today.

gmaxwell · 2015-01-16T01:45:02Z

@jgarzik "the only salve we have offered is" not so: "turn keep alive back on"-- it's a flag, it's documented in the help output. I'm not making an argument if this is sufficient or not; just pointing it out: I wouldn't introduce a change in behavior like this without at least a switch to get the old behavior back.

We have multiple reports that the software is unusable since 0.8.1. We got a rash of complaints even from the few who figured out what was up, to the point where people stopped complaining but stayed with old versions. Doing nothing at all isn't acceptable either. It's not acceptable for us to do nothing. Again.

Sorry that I'm not maximally up to speed on the issues people saw with sockets in time_wait; it's not super well documented (even on that complaint) what the issue is: My current understanding is that clients are making connections and closing them. The client's outbound socket sticks around in TIME_WAIT for two minutes (with common TCP settings). If the client makes more than about 275 connections (32768 high ports vs 120 seconds) per second, they risk running out of ephemeral sockets toward the particular server address:port.. If so, I can see that behavior, but my quick test while on an airplane wasn't able to make enough requests per second to actually observe it running out.

Do we have a reproduction? The issue this change addressed is easily reproducible and has been reported many times now and considering how tricky it is to figure out the cause (instead of mistaking it for 'bitcoind hags') that is quite concerning.

jgarzik · 2015-01-16T01:57:51Z

The reproducer was typically high speed 'getwork' traffic from generation-1 pool server proxies, but non-miners were also able to make it fail with high speed wallet RPCs. 275/second would be a low number versus historical examples.

RE flag: It's still a behavior change that forces clients to fix their software -- software that has been working for years -- with very little warning.

This behavior change was merged at the last minute during an -rc cycle. It is simply too risky to turn keep-alive off by default in 0.10. For 0.11, with plenty of user warning ("you might need to change your code"), it can be considered, sure.

Strong NAK for 0.10 behavior change. The same default should be kept for 0.10.

It is a strawman to argue against "doing nothing" Ship the feature in 0.10 if you like, but keep the current default to prevent breaking even more users. Miners can disable -rpckeepalive just as easily, to prove the new code works better than the old.

laanwj · 2015-01-16T08:40:58Z

@jgarzik I don't feel like arguing this anymore. Damned if we do, damned if we don't. I get tired of this. Let's keep the option but feel free to change the default to 1 again. I'll ACK it.

jgarzik · 2015-01-16T09:14:48Z

If people want the feature, a properly release-managed plan would not introduce this regression in 0.10 -- particularly so late in the 0.10 release cycle:

0.10 defaults to 0.9.x behavior to prevent further regressions (keep-alive on)
Communicate to the users that you are introducing a change that might break their software in 0.11
Then turn keep-alive off in master if that's what people want.

Fixing/preventing regressions should trump introducing new, speculative fixes based on conference anecdotes. Lack of Keep-Alive caused problems in 2011, and KA-off clearly reverts back to pre-fix behavior.

Should we address these new concerns? Absolutely, but not by rushing KA-off into release at the last moment. Sane release management policy implies KA-on for 0.10.

laanwj · 2015-01-16T09:21:54Z

My last words on this: Disabling it was meant as a quick workaround for the problems reported by @gmaxwell.

We perceived that in its current implementation the keep-alive does more damage than it's worth. I was not aware that the other way around also could break clients. Thanks for pointing to the issue.

Hence, I'm fine with changing the default back to 1 for 0.10, and for master too. If this goes into the long haul, might as well fix keep-alive instead instead of having a phase where it's disabled by default.
People experiencing issues can use -rpckeepalive=0.

sipa · 2015-01-16T10:38:54Z

ACK on adding the option, bu lt leaving it enabled by default.

It turns out that some miners have been staying with old versions of Bitcoin Core because their software behaves poorly with persistent connections and the Bitcoin Core thread and connection limits. What happens is that underlying HTTP libraries leave connections open invisibly to their users and then the user runs into the default four thread limit. This looks like Bitcoin Core is unresponsive to RPC. There are many things that should be improved in Bitcoin Core's behavior here, e.g. supporting more concurrent connections, not tying up threads for idle connections, disconnecting kept-alive connections when limits are reached, etc. All are fairly big, risky changes. Disabling keep-alive is a simple workaround. It's often not easy to turn off the keep-alive support in the client where it may be buried in some platform library. If you are one of the few who really needs persistent connections you probably know that you want them and can find a switch; while if you don't and the misbehavior is hitting you it is hard to discover the source of your problems is keepalive related. Given that it is best to default to off until they're handled better. Github-Merge: bitcoin#5655 Rebased-From: 16a5c18 56c1093 1dd8ee7

- *Replace usage of boost::asio with [libevent2](http://libevent.org/)*. boost::asio is not part of C++11, so unlike other boost there is no forwards-compatibility reason to stick with it. Together with bitcoin#4738 (convert json_spirit to UniValue), this rids Bitcoin Core of the worst offenders with regard to compile-time slowness. - *Replace spit-and-duct-tape http server with evhttp*. Front-end http handling is handled by libevent, a work queue (with configurable depth and parallelism) is used to handle application requests. - *Wrap HTTP request in C++ class*; this makes the application code mostly HTTP-server-neutral - *Refactor RPC to move all http-specific code to a separate file*. Theoreticaly this can allow building without HTTP server but with another RPC backend, e.g. Qt's debug console (currently not implemented) or future RPC mechanisms people may want to use. - *HTTP dispatch mechanism*; services (e.g., RPC, REST) register which URL paths they want to handle. By using a proven, high-performance asynchronous networking library (also used by Tor) and HTTP server, problems such as bitcoin#5674, bitcoin#5655, bitcoin#344 should be avoided. What works? bitcoind, bitcoin-cli, bitcoin-qt. Unit tests and RPC/REST tests pass. The aim for now is everything but SSL support. Configuration options: - `-rpcthreads`: repurposed as "number of work handler threads". Still defaults to 4. - `-rpcworkqueue`: maximum depth of work queue. When this is reached, new requests will return a 500 Internal Error. - `-rpctimeout`: inactivity time, in seconds, after which to disconnect a client. - `-debug=http`: low-level http activity logging

It turns out that some miners have been staying with old versions of Bitcoin Core because their software behaves poorly with persistent connections and the Bitcoin Core thread and connection limits. What happens is that underlying HTTP libraries leave connections open invisibly to their users and then the user runs into the default four thread limit. This looks like Bitcoin Core is unresponsive to RPC. There are many things that should be improved in Bitcoin Core's behavior here, e.g. supporting more concurrent connections, not tying up threads for idle connections, disconnecting kept-alive connections when limits are reached, etc. All are fairly big, risky changes. Disabling keep-alive is a simple workaround. It's often not easy to turn off the keep-alive support in the client where it may be buried in some platform library. If you are one of the few who really needs persistent connections you probably know that you want them and can find a switch; while if you don't and the misbehavior is hitting you it is hard to discover the source of your problems is keepalive related. Given that it is best to default to off until they're handled better. Github-Merge: bitcoin#5655 Rebased-From: 16a5c18 56c1093 1dd8ee7 (cherry picked from commit aaf55d2)

gmaxwell added this to the 0.10.0 milestone Jan 14, 2015

gmaxwell changed the title ~~Add a -disablerpckeepalive to disable RPC HTTP persistent connections.~~ Add a -rpckeepalive to control RPC use of HTTP persistent connections. Jan 14, 2015

gmaxwell changed the title ~~Add a -rpckeepalive to control RPC use of HTTP persistent connections.~~ Add a -rpckeepalive and disable RPC use of HTTP persistent connections. Jan 14, 2015

laanwj added a commit to laanwj/bitcoin that referenced this pull request Jan 14, 2015

fix tests for bitcoin#5655

4c65e99

Diapolo reviewed Jan 14, 2015
View reviewed changes

jonasschnelli added a commit to jonasschnelli/bitcoin that referenced this pull request Jan 14, 2015

improve tests for bitcoin#5655

c9792d1

gmaxwell and others added 3 commits January 14, 2015 09:49

fix tests for #5655

56c1093

improve tests for #5655

1dd8ee7

laanwj merged commit 1dd8ee7 into bitcoin:master Jan 15, 2015

laanwj mentioned this pull request Jan 18, 2015

libevent-based http server #5677

Merged

5 tasks

kanzure mentioned this pull request Jul 17, 2015

Unresponsive RPC server and RPC threads hanging #6454

Closed

Ban44n mentioned this pull request Oct 21, 2016

Merge libevent-based http server from Bitcoin for better stability in RPC zcash/zcash#1593

Closed

str4d mentioned this pull request Mar 26, 2017

Re-enable HTTP persistent connections (keep-alive) for RPC zcash/zcash#1856

Closed

bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021

Add a -rpckeepalive and disable RPC use of HTTP persistent connections. #5655

Add a -rpckeepalive and disable RPC use of HTTP persistent connections. #5655

Uh oh!

Conversation

gmaxwell commented Jan 14, 2015

Uh oh!

sipa commented Jan 14, 2015

Uh oh!

gmaxwell commented Jan 14, 2015

Uh oh!

jonasschnelli commented Jan 14, 2015

Uh oh!

laanwj commented Jan 14, 2015

Uh oh!

jonasschnelli commented Jan 14, 2015

Uh oh!

gmaxwell commented Jan 14, 2015

Uh oh!

jonasschnelli commented Jan 14, 2015

Uh oh!

laanwj commented Jan 14, 2015

Uh oh!

jonasschnelli commented Jan 14, 2015

Uh oh!

laanwj commented Jan 14, 2015

Uh oh!

Diapolo Jan 14, 2015

Choose a reason for hiding this comment

Uh oh!

jonasschnelli commented Jan 14, 2015

Uh oh!

gmaxwell commented Jan 14, 2015

Uh oh!

jonasschnelli commented Jan 15, 2015

Uh oh!

laanwj commented Jan 15, 2015

Uh oh!

jgarzik commented Jan 15, 2015

Uh oh!

jonasschnelli commented Jan 15, 2015

Uh oh!

laanwj commented Jan 15, 2015

Uh oh!

jgarzik commented Jan 15, 2015

Uh oh!

jonasschnelli commented Jan 15, 2015

Uh oh!

jgarzik commented Jan 15, 2015

Uh oh!

gmaxwell commented Jan 15, 2015

Uh oh!

jgarzik commented Jan 15, 2015

Uh oh!

gmaxwell commented Jan 16, 2015 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgarzik commented Jan 16, 2015

Uh oh!

laanwj commented Jan 16, 2015

Uh oh!

jgarzik commented Jan 16, 2015

Uh oh!

laanwj commented Jan 16, 2015

Uh oh!

sipa commented Jan 16, 2015

Uh oh!

Uh oh!

gmaxwell commented Jan 16, 2015 •

edited

Loading