-
Notifications
You must be signed in to change notification settings - Fork 37.8k
Add a -rpckeepalive and disable RPC use of HTTP persistent connections. #5655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Nitty suggestion: call it -rpckeepalive with default value 1. |
Fair enough. |
Somehow related to #5526. |
Is there any practical advantage to having http keep-alive support in Bitcoin Core? Looks like something that is reasonably hard to handle right in client libraries, although OTOH they could just send a On the server side, a well-behaved implementation would disconnect if no new request comes in after ~15 seconds. Which we don't. Also taking #5526 into account: +1 for nuking this functionality (or at least disabling by default) for now. |
Because i recently analyzed the http behavior i allow myself to comment on this: But right, the keep-alive option is questionable. I assume most RPC connection are coming from localhost or LAN. IMO there is no big benefit of saving http overhead with keep-alive. Therefor i can ACK disabling keep-alive in general or over the flag. But still there is a need for implementing a non-blocking way of handling connection with a proper http timeout according to #5526 (see @jgarzik info there). |
Based on conversation with @laanwj I've switched this off by default. This is actually my preference; but I expected to be alone in the opinion that we should just shut it off until it works better, at least for 0.10. |
ACK |
@jonaschnelli Looks like the test framework has no easy way to pass extra arguments to bitcoind :( @gmaxwell you can cherry-pick laanwj@4c65e99 to make this pass travis again |
@laanwj: test looks good but I would recommend to also add a tiny check if keep-alive is disabled when starting a node without the new arg. I can also add a commit on top of yours if you like to have this additional test but lack of time to do it. |
@jonasschnelli more tests is always good, you could e.g. start one of the four nodes without the option (it passes extraargs for every node individually) and then try that out. I don't have time right now. |
@@ -380,6 +380,7 @@ std::string HelpMessage(HelpMessageMode mode) | |||
strUsage += " -rpcport=<port> " + strprintf(_("Listen for JSON-RPC connections on <port> (default: %u or testnet: %u)"), 8332, 18332) + "\n"; | |||
strUsage += " -rpcallowip=<ip> " + _("Allow JSON-RPC connections from specified source. Valid for <ip> are a single IP (e.g. 1.2.3.4), a network/netmask (e.g. 1.2.3.4/255.255.255.0) or a network/CIDR (e.g. 1.2.3.4/24). This option can be specified multiple times") + "\n"; | |||
strUsage += " -rpcthreads=<n> " + strprintf(_("Set the number of threads to service RPC calls (default: %d)"), 4) + "\n"; | |||
strUsage += " -rpckeepalive " + _("RPC support for HTTP persistent connections (default: 0)") + "\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please use the format from rpcthreads, which allows changing defaults without the need for re-translating the help message (use strprintf
)?
@gmaxwell you might also pull in jonasschnelli@c9792d1 |
It turns out that some miners have been staying with old versions of Bitcoin Core because their software behaves poorly with persistent connections and the Bitcoin Core thread and connection limits. What happens is that underlying HTTP libraries leave connections open invisibly to their users and then the user runs into the default four thread limit. This looks like Bitcoin Core is unresponsive to RPC. There are many things that should be improved in Bitcoin Core's behavior here, e.g. supporting more concurrent connections, not tying up threads for idle connections, disconnecting kept-alive connections when limits are reached, etc. All are fairly big, risky changes. Disabling keep-alive is a simple workaround. It's often not easy to turn off the keep-alive support in the client where it may be buried in some platform library. If you are one of the few who really needs persistent connections you probably know that you want them and can find a switch; while if you don't and the misbehavior is hitting you it is hard to discover the source of your problems is keepalive related. Given that it is best to default to off until they're handled better.
@Diapolo Thanks for the air-cover on the translation strings. |
Tested again ACK. |
It turns out that some miners have been staying with old versions of Bitcoin Core because their software behaves poorly with persistent connections and the Bitcoin Core thread and connection limits. What happens is that underlying HTTP libraries leave connections open invisibly to their users and then the user runs into the default four thread limit. This looks like Bitcoin Core is unresponsive to RPC. There are many things that should be improved in Bitcoin Core's behavior here, e.g. supporting more concurrent connections, not tying up threads for idle connections, disconnecting kept-alive connections when limits are reached, etc. All are fairly big, risky changes. Disabling keep-alive is a simple workaround. It's often not easy to turn off the keep-alive support in the client where it may be buried in some platform library. If you are one of the few who really needs persistent connections you probably know that you want them and can find a switch; while if you don't and the misbehavior is hitting you it is hard to discover the source of your problems is keepalive related. Given that it is best to default to off until they're handled better. Github-Merge: #5655 Rebased-From: 16a5c18 56c1093 1dd8ee7
Cherry-picked to 0.10 branch via aaf55d2 |
Did anyone research the reason for adding keep-alive support? Turning this option on by default will now break those clients that were fixed, yes? |
#1101 is related, was basically merged via luke-jr@7ee8f04 See also: #214 (comment) Once there was a open PR of a asio implementation of the RPC server (#214 wonder why this never made it to the master). |
@jgarzik The reason clients break is a combination of a bug in the client - not effectively reusing open connections - and misbehavior at our side - not properly timing out unused connections. Disabling keep-alive is always safe: clients know that they have to reconnect if the server slams the connection (especially as the Connection: Close header is sent). The option can be safely reenabled by default again as soon as the problems on our side are fixed. Hopefully the client issues are also solved by that time, but it's not necessary as we'll be robust against them. @jonasschnelli Probably because there was no testing suite for RPC back then. Swapping the RPC server implementation is extremely risky if you are working blindly. |
@laanwj "Disabling keep-alive is always safe" Wrong. This demonstrates the need to research why a code change was added before disabling it! The keep-alive feature was added [in part] because the repeated open/close cycles of many clients would eventually cause a hang due to so many local sockets in TIME_WAIT. Disabling keep-alive will resurrect that problem, introducing hangs into working client scenarios. Telling users with working setups to rewrite their working code to use batches is a big request. |
@jgarzik's concerns sounds realistic to me. But the long term goal are unclear to me now.
Maybe there should be a consensus direction to where the RPC implementations should go to so somebody could start working on it. |
This is an example of a bug that this change re-introduces to bitcoind: #344 Or just google for bitcoind TIME_WAIT. Merging this into 0.10 creates obvious regressions. |
Jeff, I believe I read every commit and PR related to the introduction of persistent connections. Not a single one of them mentioned this. I also initially created this a a default no change in behavior in the interest of being conservative, but unfortunately; you can't tell if it's the broken persistent connection support thats breaking you: it just looks like Bitcoin core is unresponsive. In the future if a commit fixes a problem it would be helpful if it were mentioned in the commit message. The last round of changes that impacted our persistent connections was a massive regression which caused complaints from many users-- many more than the time_wait complaint. Apparently after our failure to do anything about it many people just stuck on 0.8.1. This is particularly problematic because some of the most frequently impacted parties were miners. I've also been told at conferences by various service providers that they'd switched to using hosted APIs due to Bitcoin core being unresponsive, which made no sense to me at the time. In the new light of being reminded why miners were staying stuck on 0.8.1, it finally does make sense. I'm open to having persist default to on (after all, that was my original PR), but I think that doing so may be a mistake unless we really have reason to believe the timewait problems are more frequent than the others. I also think persistent connections are not the right solution for timewait in any case: isn't the correct solution SO_REUSEADDR? (I'm travelling today and won't be submitting any commits, so don't feel like waiting for me is required) |
The original genesis of keep-alive support is that it was originally written (by JoelKatz?) to work around problems that miners were seeing with the original open/close/open/close behavior that leaves behind reams of TIME_WAIT etc. sockets, eventually leading to a non-responsive bitcoind. RE Defaulting KA-off in 0.10 with little testing and thought to user impact is not a good choice so late in the release cycle. For users whom this will actually break, the only salve we have offered is That is not what we should be telling production users with code that works today. |
@jgarzik "the only salve we have offered is" not so: "turn keep alive back on"-- it's a flag, it's documented in the help output. I'm not making an argument if this is sufficient or not; just pointing it out: I wouldn't introduce a change in behavior like this without at least a switch to get the old behavior back. We have multiple reports that the software is unusable since 0.8.1. We got a rash of complaints even from the few who figured out what was up, to the point where people stopped complaining but stayed with old versions. Doing nothing at all isn't acceptable either. It's not acceptable for us to do nothing. Again. Sorry that I'm not maximally up to speed on the issues people saw with sockets in time_wait; it's not super well documented (even on that complaint) what the issue is: My current understanding is that clients are making connections and closing them. The client's outbound socket sticks around in TIME_WAIT for two minutes (with common TCP settings). If the client makes more than about 275 connections (32768 high ports vs 120 seconds) per second, they risk running out of ephemeral sockets toward the particular server address:port.. If so, I can see that behavior, but my quick test while on an airplane wasn't able to make enough requests per second to actually observe it running out. Do we have a reproduction? The issue this change addressed is easily reproducible and has been reported many times now and considering how tricky it is to figure out the cause (instead of mistaking it for 'bitcoind hags') that is quite concerning. |
The reproducer was typically high speed 'getwork' traffic from generation-1 pool server proxies, but non-miners were also able to make it fail with high speed wallet RPCs. 275/second would be a low number versus historical examples. RE flag: It's still a behavior change that forces clients to fix their software -- software that has been working for years -- with very little warning. This behavior change was merged at the last minute during an -rc cycle. It is simply too risky to turn keep-alive off by default in 0.10. For 0.11, with plenty of user warning ("you might need to change your code"), it can be considered, sure. Strong NAK for 0.10 behavior change. The same default should be kept for 0.10. It is a strawman to argue against "doing nothing" Ship the feature in 0.10 if you like, but keep the current default to prevent breaking even more users. Miners can disable -rpckeepalive just as easily, to prove the new code works better than the old. |
@jgarzik I don't feel like arguing this anymore. Damned if we do, damned if we don't. I get tired of this. Let's keep the option but feel free to change the default to 1 again. I'll ACK it. |
If people want the feature, a properly release-managed plan would not introduce this regression in 0.10 -- particularly so late in the 0.10 release cycle:
Fixing/preventing regressions should trump introducing new, speculative fixes based on conference anecdotes. Lack of Keep-Alive caused problems in 2011, and KA-off clearly reverts back to pre-fix behavior. Should we address these new concerns? Absolutely, but not by rushing KA-off into release at the last moment. Sane release management policy implies KA-on for 0.10. |
My last words on this: Disabling it was meant as a quick workaround for the problems reported by @gmaxwell. We perceived that in its current implementation the keep-alive does more damage than it's worth. I was not aware that the other way around also could break clients. Thanks for pointing to the issue. Hence, I'm fine with changing the default back to 1 for 0.10, and for master too. If this goes into the long haul, might as well fix keep-alive instead instead of having a phase where it's disabled by default. |
ACK on adding the option, bu lt leaving it enabled by default. |
It turns out that some miners have been staying with old versions of Bitcoin Core because their software behaves poorly with persistent connections and the Bitcoin Core thread and connection limits. What happens is that underlying HTTP libraries leave connections open invisibly to their users and then the user runs into the default four thread limit. This looks like Bitcoin Core is unresponsive to RPC. There are many things that should be improved in Bitcoin Core's behavior here, e.g. supporting more concurrent connections, not tying up threads for idle connections, disconnecting kept-alive connections when limits are reached, etc. All are fairly big, risky changes. Disabling keep-alive is a simple workaround. It's often not easy to turn off the keep-alive support in the client where it may be buried in some platform library. If you are one of the few who really needs persistent connections you probably know that you want them and can find a switch; while if you don't and the misbehavior is hitting you it is hard to discover the source of your problems is keepalive related. Given that it is best to default to off until they're handled better. Github-Merge: bitcoin#5655 Rebased-From: 16a5c18 56c1093 1dd8ee7
- *Replace usage of boost::asio with [libevent2](http://libevent.org/)*. boost::asio is not part of C++11, so unlike other boost there is no forwards-compatibility reason to stick with it. Together with bitcoin#4738 (convert json_spirit to UniValue), this rids Bitcoin Core of the worst offenders with regard to compile-time slowness. - *Replace spit-and-duct-tape http server with evhttp*. Front-end http handling is handled by libevent, a work queue (with configurable depth and parallelism) is used to handle application requests. - *Wrap HTTP request in C++ class*; this makes the application code mostly HTTP-server-neutral - *Refactor RPC to move all http-specific code to a separate file*. Theoreticaly this can allow building without HTTP server but with another RPC backend, e.g. Qt's debug console (currently not implemented) or future RPC mechanisms people may want to use. - *HTTP dispatch mechanism*; services (e.g., RPC, REST) register which URL paths they want to handle. By using a proven, high-performance asynchronous networking library (also used by Tor) and HTTP server, problems such as bitcoin#5674, bitcoin#5655, bitcoin#344 should be avoided. What works? bitcoind, bitcoin-cli, bitcoin-qt. Unit tests and RPC/REST tests pass. The aim for now is everything but SSL support. Configuration options: - `-rpcthreads`: repurposed as "number of work handler threads". Still defaults to 4. - `-rpcworkqueue`: maximum depth of work queue. When this is reached, new requests will return a 500 Internal Error. - `-rpctimeout`: inactivity time, in seconds, after which to disconnect a client. - `-debug=http`: low-level http activity logging
It turns out that some miners have been staying with old versions of Bitcoin Core because their software behaves poorly with persistent connections and the Bitcoin Core thread and connection limits. What happens is that underlying HTTP libraries leave connections open invisibly to their users and then the user runs into the default four thread limit. This looks like Bitcoin Core is unresponsive to RPC. There are many things that should be improved in Bitcoin Core's behavior here, e.g. supporting more concurrent connections, not tying up threads for idle connections, disconnecting kept-alive connections when limits are reached, etc. All are fairly big, risky changes. Disabling keep-alive is a simple workaround. It's often not easy to turn off the keep-alive support in the client where it may be buried in some platform library. If you are one of the few who really needs persistent connections you probably know that you want them and can find a switch; while if you don't and the misbehavior is hitting you it is hard to discover the source of your problems is keepalive related. Given that it is best to default to off until they're handled better. Github-Merge: bitcoin#5655 Rebased-From: 16a5c18 56c1093 1dd8ee7 (cherry picked from commit aaf55d2)
It turns out that some miners have been staying with old versions of
Bitcoin Core because their software behaves poorly with persistent
connections and the Bitcoin Core thread and connection limits.
What happens is that underlying HTTP libraries leave connections open
invisibly to their users and then the user runs into the default four
thread limit. This looks like Bitcoin Core is unresponsive to RPC.
There are many things that should be improved in Bitcoin Core's behavior
here, e.g. supporting more concurrent connections, not tying up threads
for idle connections, disconnecting kept-alive connections when limits
are reached, etc. All are fairly big, risky changes.
Disabling keep-alive is a simple workaround. It's often not easy to turn
off the keep-alive support in the client where it may be buried in some
platform library.
If you are one of the few who really needs persistent connections you
probably know that you want them and can find a switch; while if you
don't and the misbehavior is hitting you it is hard to discover the
source of your problems is keepalive related. Given that it is best
to default to off until they're handled better.