Skip to content

Conversation

gmaxwell
Copy link
Contributor

It turns out that some miners have been staying with old versions of
Bitcoin Core because their software behaves poorly with persistent
connections and the Bitcoin Core thread and connection limits.

What happens is that underlying HTTP libraries leave connections open
invisibly to their users and then the user runs into the default four
thread limit. This looks like Bitcoin Core is unresponsive to RPC.

There are many things that should be improved in Bitcoin Core's behavior
here, e.g. supporting more concurrent connections, not tying up threads
for idle connections, disconnecting kept-alive connections when limits
are reached, etc. All are fairly big, risky changes.

Disabling keep-alive is a simple workaround. It's often not easy to turn
off the keep-alive support in the client where it may be buried in some
platform library.

If you are one of the few who really needs persistent connections you
probably know that you want them and can find a switch; while if you
don't and the misbehavior is hitting you it is hard to discover the
source of your problems is keepalive related. Given that it is best
to default to off until they're handled better.

@gmaxwell gmaxwell added this to the 0.10.0 milestone Jan 14, 2015
@sipa
Copy link
Member

sipa commented Jan 14, 2015

Nitty suggestion: call it -rpckeepalive with default value 1.

@gmaxwell gmaxwell changed the title Add a -disablerpckeepalive to disable RPC HTTP persistent connections. Add a -rpckeepalive to control RPC use of HTTP persistent connections. Jan 14, 2015
@gmaxwell
Copy link
Contributor Author

Fair enough.

@jonasschnelli
Copy link
Contributor

Somehow related to #5526.

@laanwj
Copy link
Member

laanwj commented Jan 14, 2015

Is there any practical advantage to having http keep-alive support in Bitcoin Core?
e.g. the "pipeline a lot of requests" case we already support RPC batching.
RPC is meant as a local mechanism, so any latency for setting up a new TCP connection will be dwarfed by, say, parsing overhead.

Looks like something that is reasonably hard to handle right in client libraries, although OTOH they could just send a Connection: close header.

On the server side, a well-behaved implementation would disconnect if no new request comes in after ~15 seconds. Which we don't.

Also taking #5526 into account: +1 for nuking this functionality (or at least disabling by default) for now.

@jonasschnelli
Copy link
Contributor

Because i recently analyzed the http behavior i allow myself to comment on this:
IMO the problem of blocking or say open connections from/to clients is the missing timeout implementation. Disabling http keep-alive option alone will not prevent from dead-and-open connection because the implementation currently uses the blocking getline(). If a malfunction client is not sending a valid http request, it can end up with a never closed connection until a restart of bitcoind.

But right, the keep-alive option is questionable. I assume most RPC connection are coming from localhost or LAN. IMO there is no big benefit of saving http overhead with keep-alive. Therefor i can ACK disabling keep-alive in general or over the flag.

But still there is a need for implementing a non-blocking way of handling connection with a proper http timeout according to #5526 (see @jgarzik info there).

@gmaxwell gmaxwell changed the title Add a -rpckeepalive to control RPC use of HTTP persistent connections. Add a -rpckeepalive and disable RPC use of HTTP persistent connections. Jan 14, 2015
@gmaxwell
Copy link
Contributor Author

Based on conversation with @laanwj I've switched this off by default. This is actually my preference; but I expected to be alone in the opinion that we should just shut it off until it works better, at least for 0.10.

@jonasschnelli
Copy link
Contributor

ACK
nit: you need to remove or rewrite the recently added rcp-test (httpbasics.py)

laanwj added a commit to laanwj/bitcoin that referenced this pull request Jan 14, 2015
@laanwj
Copy link
Member

laanwj commented Jan 14, 2015

@jonaschnelli Looks like the test framework has no easy way to pass extra arguments to bitcoind :(
Added it in https://github.com/laanwj/bitcoin/tree/2015_01_fix_tests_for_5655

@gmaxwell you can cherry-pick laanwj@4c65e99 to make this pass travis again

@jonasschnelli
Copy link
Contributor

@laanwj: test looks good but I would recommend to also add a tiny check if keep-alive is disabled when starting a node without the new arg.

I can also add a commit on top of yours if you like to have this additional test but lack of time to do it.

@laanwj
Copy link
Member

laanwj commented Jan 14, 2015

@jonasschnelli more tests is always good, you could e.g. start one of the four nodes without the option (it passes extraargs for every node individually) and then try that out. I don't have time right now.

@@ -380,6 +380,7 @@ std::string HelpMessage(HelpMessageMode mode)
strUsage += " -rpcport=<port> " + strprintf(_("Listen for JSON-RPC connections on <port> (default: %u or testnet: %u)"), 8332, 18332) + "\n";
strUsage += " -rpcallowip=<ip> " + _("Allow JSON-RPC connections from specified source. Valid for <ip> are a single IP (e.g. 1.2.3.4), a network/netmask (e.g. 1.2.3.4/255.255.255.0) or a network/CIDR (e.g. 1.2.3.4/24). This option can be specified multiple times") + "\n";
strUsage += " -rpcthreads=<n> " + strprintf(_("Set the number of threads to service RPC calls (default: %d)"), 4) + "\n";
strUsage += " -rpckeepalive " + _("RPC support for HTTP persistent connections (default: 0)") + "\n";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please use the format from rpcthreads, which allows changing defaults without the need for re-translating the help message (use strprintf)?

jonasschnelli added a commit to jonasschnelli/bitcoin that referenced this pull request Jan 14, 2015
@jonasschnelli
Copy link
Contributor

@gmaxwell you might also pull in jonasschnelli@c9792d1
This will extend the tests.

gmaxwell and others added 3 commits January 14, 2015 09:49
It turns out that some miners have been staying with old versions of
 Bitcoin Core because their software  behaves poorly with persistent
 connections and the Bitcoin Core thread and connection limits.

What happens is that underlying  HTTP libraries leave connections open
 invisibly to their users and then the user runs into the default four
 thread limit.  This looks like Bitcoin Core is unresponsive to RPC.

There are many things that should be improved in Bitcoin Core's behavior
 here, e.g. supporting more concurrent connections, not tying up threads
 for idle connections, disconnecting kept-alive  connections when limits
 are reached, etc. All are fairly big, risky changes.

Disabling keep-alive is a simple workaround. It's often not easy to turn
 off the keep-alive support in the client where it may be buried in some
 platform library.

If you are one of the few who really needs persistent connections you
 probably know that you want them and can find a switch; while if you
 don't and the misbehavior is hitting you it is hard to discover the
 source of your problems is keepalive related.  Given that it is best
 to default to off until they're handled better.
@gmaxwell
Copy link
Contributor Author

@Diapolo Thanks for the air-cover on the translation strings.

@jonasschnelli
Copy link
Contributor

Tested again ACK.

@laanwj laanwj merged commit 1dd8ee7 into bitcoin:master Jan 15, 2015
laanwj added a commit that referenced this pull request Jan 15, 2015
1dd8ee7 improve tests for #5655 (Jonas Schnelli)
56c1093 fix tests for #5655 (Wladimir J. van der Laan)
16a5c18 Add a -rpckeepalive and disable RPC use of HTTP persistent connections. (Gregory Maxwell)
gmaxwell added a commit that referenced this pull request Jan 15, 2015
It turns out that some miners have been staying with old versions of
 Bitcoin Core because their software  behaves poorly with persistent
 connections and the Bitcoin Core thread and connection limits.

What happens is that underlying  HTTP libraries leave connections open
 invisibly to their users and then the user runs into the default four
 thread limit.  This looks like Bitcoin Core is unresponsive to RPC.

There are many things that should be improved in Bitcoin Core's behavior
 here, e.g. supporting more concurrent connections, not tying up threads
 for idle connections, disconnecting kept-alive  connections when limits
 are reached, etc. All are fairly big, risky changes.

Disabling keep-alive is a simple workaround. It's often not easy to turn
 off the keep-alive support in the client where it may be buried in some
 platform library.

If you are one of the few who really needs persistent connections you
 probably know that you want them and can find a switch; while if you
 don't and the misbehavior is hitting you it is hard to discover the
 source of your problems is keepalive related.  Given that it is best
 to default to off until they're handled better.

Github-Merge: #5655
Rebased-From: 16a5c18 56c1093 1dd8ee7
@laanwj
Copy link
Member

laanwj commented Jan 15, 2015

Cherry-picked to 0.10 branch via aaf55d2

@jgarzik
Copy link
Contributor

jgarzik commented Jan 15, 2015

Did anyone research the reason for adding keep-alive support?

Turning this option on by default will now break those clients that were fixed, yes?

@jonasschnelli
Copy link
Contributor

#1101 is related, was basically merged via luke-jr@7ee8f04

See also: #214 (comment)

Once there was a open PR of a asio implementation of the RPC server (#214 wonder why this never made it to the master).

@laanwj
Copy link
Member

laanwj commented Jan 15, 2015

@jgarzik The reason clients break is a combination of a bug in the client - not effectively reusing open connections - and misbehavior at our side - not properly timing out unused connections.

Disabling keep-alive is always safe: clients know that they have to reconnect if the server slams the connection (especially as the Connection: Close header is sent).

The option can be safely reenabled by default again as soon as the problems on our side are fixed. Hopefully the client issues are also solved by that time, but it's not necessary as we'll be robust against them.

@jonasschnelli Probably because there was no testing suite for RPC back then. Swapping the RPC server implementation is extremely risky if you are working blindly.

@jgarzik
Copy link
Contributor

jgarzik commented Jan 15, 2015

@laanwj "Disabling keep-alive is always safe" Wrong. This demonstrates the need to research why a code change was added before disabling it!

The keep-alive feature was added [in part] because the repeated open/close cycles of many clients would eventually cause a hang due to so many local sockets in TIME_WAIT.

Disabling keep-alive will resurrect that problem, introducing hangs into working client scenarios.

Telling users with working setups to rewrite their working code to use batches is a big request.

@jonasschnelli
Copy link
Contributor

@jgarzik's concerns sounds realistic to me.
What if we just switch the default for -rpckeepalive from the current false to true?
Then users with client behaviors after @gmaxwell can still disable it.

But the long term goal are unclear to me now.
Would it make sense to rewrite the RPC server into the direction of https://github.com/jgarzik/rpcsrv?
Gavin also wrote today on IRC:

If we did RPC again, I think a binary zeromq channel would be the way to go. Maybe with a little python listener that did JSON, if you want JSON.....

Maybe there should be a consensus direction to where the RPC implementations should go to so somebody could start working on it.

@jgarzik
Copy link
Contributor

jgarzik commented Jan 15, 2015

This is an example of a bug that this change re-introduces to bitcoind: #344

Or just google for bitcoind TIME_WAIT.

Merging this into 0.10 creates obvious regressions.

@gmaxwell
Copy link
Contributor Author

Jeff, I believe I read every commit and PR related to the introduction of persistent connections. Not a single one of them mentioned this. I also initially created this a a default no change in behavior in the interest of being conservative, but unfortunately; you can't tell if it's the broken persistent connection support thats breaking you: it just looks like Bitcoin core is unresponsive.

In the future if a commit fixes a problem it would be helpful if it were mentioned in the commit message.

The last round of changes that impacted our persistent connections was a massive regression which caused complaints from many users-- many more than the time_wait complaint. Apparently after our failure to do anything about it many people just stuck on 0.8.1. This is particularly problematic because some of the most frequently impacted parties were miners. I've also been told at conferences by various service providers that they'd switched to using hosted APIs due to Bitcoin core being unresponsive, which made no sense to me at the time. In the new light of being reminded why miners were staying stuck on 0.8.1, it finally does make sense.

I'm open to having persist default to on (after all, that was my original PR), but I think that doing so may be a mistake unless we really have reason to believe the timewait problems are more frequent than the others. I also think persistent connections are not the right solution for timewait in any case: isn't the correct solution SO_REUSEADDR?

(I'm travelling today and won't be submitting any commits, so don't feel like waiting for me is required)

@jgarzik
Copy link
Contributor

jgarzik commented Jan 15, 2015

The original genesis of keep-alive support is that it was originally written (by JoelKatz?) to work around problems that miners were seeing with the original open/close/open/close behavior that leaves behind reams of TIME_WAIT etc. sockets, eventually leading to a non-responsive bitcoind.

RE SO_REUSEADDR, that is for servers restarting, not clients reconnecting. It does not help this situation, where the server (pre-KA & post-KA) is not closing+reopening the server socket.

Defaulting KA-off in 0.10 with little testing and thought to user impact is not a good choice so late in the release cycle.

For users whom this will actually break, the only salve we have offered is
e.g. the "pipeline a lot of requests" case we already support RPC batching.

That is not what we should be telling production users with code that works today.

@gmaxwell
Copy link
Contributor Author

gmaxwell commented Jan 16, 2015

@jgarzik "the only salve we have offered is" not so: "turn keep alive back on"-- it's a flag, it's documented in the help output. I'm not making an argument if this is sufficient or not; just pointing it out: I wouldn't introduce a change in behavior like this without at least a switch to get the old behavior back.

We have multiple reports that the software is unusable since 0.8.1. We got a rash of complaints even from the few who figured out what was up, to the point where people stopped complaining but stayed with old versions. Doing nothing at all isn't acceptable either. It's not acceptable for us to do nothing. Again.

Sorry that I'm not maximally up to speed on the issues people saw with sockets in time_wait; it's not super well documented (even on that complaint) what the issue is: My current understanding is that clients are making connections and closing them. The client's outbound socket sticks around in TIME_WAIT for two minutes (with common TCP settings). If the client makes more than about 275 connections (32768 high ports vs 120 seconds) per second, they risk running out of ephemeral sockets toward the particular server address:port.. If so, I can see that behavior, but my quick test while on an airplane wasn't able to make enough requests per second to actually observe it running out.

Do we have a reproduction? The issue this change addressed is easily reproducible and has been reported many times now and considering how tricky it is to figure out the cause (instead of mistaking it for 'bitcoind hags') that is quite concerning.

@jgarzik
Copy link
Contributor

jgarzik commented Jan 16, 2015

The reproducer was typically high speed 'getwork' traffic from generation-1 pool server proxies, but non-miners were also able to make it fail with high speed wallet RPCs. 275/second would be a low number versus historical examples.

RE flag: It's still a behavior change that forces clients to fix their software -- software that has been working for years -- with very little warning.

This behavior change was merged at the last minute during an -rc cycle. It is simply too risky to turn keep-alive off by default in 0.10. For 0.11, with plenty of user warning ("you might need to change your code"), it can be considered, sure.

Strong NAK for 0.10 behavior change. The same default should be kept for 0.10.

It is a strawman to argue against "doing nothing" Ship the feature in 0.10 if you like, but keep the current default to prevent breaking even more users. Miners can disable -rpckeepalive just as easily, to prove the new code works better than the old.

@laanwj
Copy link
Member

laanwj commented Jan 16, 2015

@jgarzik I don't feel like arguing this anymore. Damned if we do, damned if we don't. I get tired of this. Let's keep the option but feel free to change the default to 1 again. I'll ACK it.

@jgarzik
Copy link
Contributor

jgarzik commented Jan 16, 2015

If people want the feature, a properly release-managed plan would not introduce this regression in 0.10 -- particularly so late in the 0.10 release cycle:

  • 0.10 defaults to 0.9.x behavior to prevent further regressions (keep-alive on)
  • Communicate to the users that you are introducing a change that might break their software in 0.11
  • Then turn keep-alive off in master if that's what people want.

Fixing/preventing regressions should trump introducing new, speculative fixes based on conference anecdotes. Lack of Keep-Alive caused problems in 2011, and KA-off clearly reverts back to pre-fix behavior.

Should we address these new concerns? Absolutely, but not by rushing KA-off into release at the last moment. Sane release management policy implies KA-on for 0.10.

@laanwj
Copy link
Member

laanwj commented Jan 16, 2015

My last words on this: Disabling it was meant as a quick workaround for the problems reported by @gmaxwell.

We perceived that in its current implementation the keep-alive does more damage than it's worth. I was not aware that the other way around also could break clients. Thanks for pointing to the issue.

Hence, I'm fine with changing the default back to 1 for 0.10, and for master too. If this goes into the long haul, might as well fix keep-alive instead instead of having a phase where it's disabled by default.
People experiencing issues can use -rpckeepalive=0.

@sipa
Copy link
Member

sipa commented Jan 16, 2015

ACK on adding the option, bu lt leaving it enabled by default.

@laanwj laanwj mentioned this pull request Jan 18, 2015
5 tasks
wtogami pushed a commit to litecoin-project/litecoin that referenced this pull request Jan 23, 2015
It turns out that some miners have been staying with old versions of
 Bitcoin Core because their software  behaves poorly with persistent
 connections and the Bitcoin Core thread and connection limits.

What happens is that underlying  HTTP libraries leave connections open
 invisibly to their users and then the user runs into the default four
 thread limit.  This looks like Bitcoin Core is unresponsive to RPC.

There are many things that should be improved in Bitcoin Core's behavior
 here, e.g. supporting more concurrent connections, not tying up threads
 for idle connections, disconnecting kept-alive  connections when limits
 are reached, etc. All are fairly big, risky changes.

Disabling keep-alive is a simple workaround. It's often not easy to turn
 off the keep-alive support in the client where it may be buried in some
 platform library.

If you are one of the few who really needs persistent connections you
 probably know that you want them and can find a switch; while if you
 don't and the misbehavior is hitting you it is hard to discover the
 source of your problems is keepalive related.  Given that it is best
 to default to off until they're handled better.

Github-Merge: bitcoin#5655
Rebased-From: 16a5c18 56c1093 1dd8ee7
laanwj added a commit to laanwj/bitcoin that referenced this pull request Sep 3, 2015
- *Replace usage of boost::asio with [libevent2](http://libevent.org/)*.
boost::asio is not part of C++11, so unlike other boost there is no
forwards-compatibility reason to stick with it. Together with bitcoin#4738 (convert
json_spirit to UniValue), this rids Bitcoin Core of the worst offenders with
regard to compile-time slowness.

- *Replace spit-and-duct-tape http server with evhttp*. Front-end http handling
is handled by libevent, a work queue (with configurable depth and parallelism)
is used to handle application requests.

- *Wrap HTTP request in C++ class*; this makes the application code mostly
HTTP-server-neutral

- *Refactor RPC to move all http-specific code to a separate file*.
Theoreticaly this can allow building without HTTP server but with another RPC
backend, e.g. Qt's debug console (currently not implemented) or future RPC
mechanisms people may want to use.

- *HTTP dispatch mechanism*; services (e.g., RPC, REST) register which URL
paths they want to handle.

By using a proven, high-performance asynchronous networking library (also used
by Tor) and HTTP server, problems such as bitcoin#5674, bitcoin#5655, bitcoin#344 should be avoided.

What works? bitcoind, bitcoin-cli, bitcoin-qt. Unit tests and RPC/REST tests
pass. The aim for now is everything but SSL support.

Configuration options:

- `-rpcthreads`: repurposed as "number of  work handler threads". Still
defaults to 4.

- `-rpcworkqueue`: maximum depth of work queue. When this is reached, new
requests will return a 500 Internal Error.

- `-rpctimeout`: inactivity time, in seconds, after which to disconnect a
client.

- `-debug=http`: low-level http activity logging
reddink pushed a commit to reddcoin-project/reddcoin-3.10 that referenced this pull request May 27, 2020
It turns out that some miners have been staying with old versions of
 Bitcoin Core because their software  behaves poorly with persistent
 connections and the Bitcoin Core thread and connection limits.

What happens is that underlying  HTTP libraries leave connections open
 invisibly to their users and then the user runs into the default four
 thread limit.  This looks like Bitcoin Core is unresponsive to RPC.

There are many things that should be improved in Bitcoin Core's behavior
 here, e.g. supporting more concurrent connections, not tying up threads
 for idle connections, disconnecting kept-alive  connections when limits
 are reached, etc. All are fairly big, risky changes.

Disabling keep-alive is a simple workaround. It's often not easy to turn
 off the keep-alive support in the client where it may be buried in some
 platform library.

If you are one of the few who really needs persistent connections you
 probably know that you want them and can find a switch; while if you
 don't and the misbehavior is hitting you it is hard to discover the
 source of your problems is keepalive related.  Given that it is best
 to default to off until they're handled better.

Github-Merge: bitcoin#5655
Rebased-From: 16a5c18 56c1093 1dd8ee7
(cherry picked from commit aaf55d2)
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants