Fix long tail response problem - revised #3678
Conversation
I've created two gists with test/benchmark data. One gist contains summaries generated from the 'raw data' gist. Example below. The 'Mult' column is the number of concurrent requests sent to the server. For the below, we have 40 'bins' (8 workers * 5 threads), so each row shows behavior with various request overloads. The 'rps %' (requests per second) column is normalized to 100%, and the latency columns are normalized to 1. '00-long-tail-pr-2' is the fork branch; '74b7f010d793' is Puma master with the changes needed to run the benchmarks. Runs were done for both keep-alive and non-keep-alive connections. Each file contains runs for 8, 4, 2, and 0 workers. The files/docs are named by the response time, which is varied from 3 to 600 mS. The below shows current master latency with a 3 mS response time. Note the increase in times for 95% thru 100% when Puma is overloaded (mults 1.50 thru 4.00).
The other gist contains the raw logs from the benchmarks. They also have sections with info about the max size of the Reactor and ThreadPool, along with details of the request spread across workers.
I've asked for some modifications, nothing truly blocking, mostly additions and cleanups. I want to release what's on main before merging this, though.
I'm also proposing a clean release of 6.6 and we can target main to be v7 with this as the headline change (#3679). I would like to release this as a pre-release and fast follow with a real release if no major issues. If you wouldn't mind, please start thinking of a new codename.
lib/puma/cluster/worker_handle.rb
Outdated
# @todo remove each once Ruby 2.5 is no longer supported
status.tr('}{"', '').strip.split(", ").each do |kv|
  cntr = 0
  ary = kv.split(':').each do |t|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assigned but unused: `ary`
- ary = kv.split(':').each do |t|
+ kv.split(':').each do |t|
'max_threads' => max_threads,
'requests_count' => 0,
'busy_threads' => 0,
'reactor_max' => 0,
Request: Add `reactor_max` and `backlog_max` to `stats.md`. Based on your PR, I believe this description matches current behavior:
* reactor_max: the maximum observed number of requests held in Puma's "reactor", which is used for asynchronously buffering request bodies. This stat is reset on every call, so it's the maximum value observed since the last stat call.
* backlog_max: the maximum number of requests that have been fully buffered by the reactor and placed in a ready queue, but have not yet been picked up by a server thread. This stat is reset on every call, so it's the maximum value observed since the last stat call.
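For illustration, a minimal sketch (not part of the PR) of how these two values might be read in cluster mode, assuming they land in each worker's `last_status` next to the existing `backlog`/`running` fields:

```ruby
# Sketch only: assumes `reactor_max` and `backlog_max` appear in each worker's
# `last_status`. `Puma.stats_hash` must be called inside the Puma master
# process, e.g. from a plugin or the control app.
stats = Puma.stats_hash

Array(stats[:worker_status]).each do |w|
  last = w[:last_status] || {}
  puts "worker #{w[:index]}: reactor_max=#{last[:reactor_max]} " \
       "backlog_max=#{last[:backlog_max]}"
end
```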
Optional: We use the language "backlog" in the context of stats here to mean "backlog for threads", but we also use it in the DSL to mean "number of allowed connections that can queue on the socket":
# * Set the socket backlog depth with +backlog+, default is 1024.
Since it's harder to change the language in the stats (that we're already using) we could update/change the usage in the DSL to "socket queue depth" to minimize confusion in the future.
See 233fc496f97f
I added the lines to docs/stats.md.
As to the DSL text, maybe we should mention that the value is used with `TCPServer#listen` and controls the OS's socket backlog/queue? Both terms seem to be used in the Ruby docs, and a lot of OS docs use "backlog". I think it can be queried on some OSs.
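For context, a small sketch of the two places the term shows up (the bind query-parameter syntax is an assumption; check the DSL docs for your Puma version):

```ruby
# config/puma.rb - OS-level socket queue ("backlog") depth for the listener,
# which Puma passes through when it sets up the socket:
bind 'tcp://0.0.0.0:9292?backlog=2048'

# Plain Ruby equivalent of what that option controls:
require 'socket'
server = TCPServer.new '0.0.0.0', 9292
server.listen 2048   # kernel accept-queue depth, cf. TCPServer#listen
```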
test/test_puma_server.rb
Outdated
body = nil
content_length = nil
Assigned but not read/used
- body = nil
- content_length = nil
See 233fc496f97f
#
# The default is 10.
def max_fast_inline(num_of_requests)
Add a deprecation warning to stderr here please
def max_fast_inline(num_of_requests)
  warn "[WARNING] `max_fast_inline` is deprecated use `max_keep_alive` instead"
See 233fc496f97f
Good day. I think I may pull two of the last commits ('client.rb - continue reading request body if readable' and 'CI: test_puma_server.rb - add tests for proper request body reading') from this. I noticed the issue while testing: currently, if a request body is longer than 64 kB, Puma will push the request into the reactor. I thought that was odd behavior. If the socket is readable, Puma should continue to read the socket and see if the complete request body has been sent. I fixed the issue, but the test often fails. It passes on retry on Ubuntu, but often fails completely on macOS. I've seen the same locally when running tests in parallel. A few other items:
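For anyone following along, a rough, hypothetical illustration (not the PR's actual test) of the scenario being exercised: send a request whose body is larger than the 64 kB first read and check that the server still answers:

```ruby
require 'socket'

body = 'x' * (70 * 1024)                  # just over Puma's 64 kB read size
sock = TCPSocket.new 'localhost', 9292    # assumes a Puma server is listening

sock.write "PUT / HTTP/1.1\r\nHost: localhost\r\n" \
           "Content-Length: #{body.bytesize}\r\nConnection: close\r\n\r\n"
sock.write body

puts sock.gets   # status line; what the app returns depends on the app itself
sock.close
```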
Seems good. We can tune this in the future. My approach would be looking at what other tools have done and seeing if there's some discussion around pros/cons. I don't think this process should be blocking, though. Nginx uses 1000 (https://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests). The docs mention memory being a tradeoff. I'm not sure what the memory overhead of a keep-alive connection is in Puma. I would also think the bigger memory concern is the total number of retained requests, and closing after N requests is only a proxy for reducing the number of retained requests.
I'm good with your best judgement here. IMHO the old values were tuned for the old behavior. I think you've done a lot with benchmarks and friends. My preference is to get this into the real world and get feedback.
Agreed
Good to note. Since you're seeing it in older versions, I wouldn't consider this a regression, so it's non-blocking. Might be worth opening an issue to track, and ideally isolating it to some specific conditions.
If anyone is interested in trying this, I believe the following change to the Gemfile will work: `gem "puma", github: "puma/puma", ref: "pull/3678/head"`
This is included in 7.0.0.pre1, which is available as a gem.
Heroku will be updating to their Router 2.0 with persistent HTTP connections to the app. This fix is needed to avoid a performance regression when they do so. puma/puma#3678 Normally I wouldn't update to a pre-release Puma version, but it seems the best option here.
Is it possible that this fix results in much lower puma resource consumption/higher performance even if you aren't getting traffic using keep-alive? I have upgraded my app to puma 7.0.0.pre1. It is still running on Heroku, and still running on "router 1.0". Nonetheless, it looks to me like for the same volume of traffic we are using many fewer max/peak puma workers. We were usually using 4-5 before, with fairly regular higher spikes to 7-15. Now it's usually just 1-2 puma workers for the same traffic, with seldom any spikes, and they are seldom higher than 5. I don't have quite enough data to be sure of this -- this is just using the Heroku "metrics" dashboard, assuming it is working properly (including for monitoring the number of puma workers), and without having captured actual data from before we deployed puma 7.0.0.pre1 (but I know we used to have more than 1-2 threads in use). But I thought I'd put this out there in case anyone else is seeing anything. Is it even plausible that "long tail" improvements could apply to more than just keep-alive? Whatever is going on, it is certainly very welcome that my puma thread usage has gone down so much. (Or has puma 7.0.0 broken Heroku's 'puma pool usage' monitor, which would be another possibility!)
# only use delay when clustered and busy
if pool.busy_threads >= @max_threads
  if @clustered
    delay = 0.0001 * ((@reactor&.reactor_size || 0) + pool.busy_threads * 1.5)/max_flt
    sleep delay
  else
    # use small sleep for busy single worker
    sleep 0.0001
  end
end
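To give a rough feel for the magnitude of that delay (assuming `max_flt` is the worker's maximum thread count as a float, which is a guess from context):

```ruby
# Illustration only (not from the PR): a fully busy worker, 5 of 5 threads,
# with an empty reactor.
reactor_size, busy_threads, max_flt = 0, 5, 5.0
delay = 0.0001 * (reactor_size + busy_threads * 1.5) / max_flt
# => 0.00015, i.e. about 0.15 ms of sleep per pass through the accept loop
```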
This seems to replace efficient condition variable waits with sleep-based delays. Before this change, `pool.wait_until_not_full` and `pool.wait_for_less_busy_worker` used `@not_full.wait @mutex`, which could wake up immediately when threads became available. Now we have calculated sleep delays that always wait the full duration even if threads free up right away.
In cluster mode this might be particularly bad because the delay calculation becomes more complex (factoring in reactor size and weighted busy threads), potentially creating longer delays that can't be interrupted when worker threads finish. The old condition variable approach might be more responsive under load.
One other issue I noticed with that delay calculation: it includes `@reactor&.reactor_size`, which represents clients waiting to be buffered, but those could be stuck due to slow clients or network issues rather than actual system load. So you might get artificially high delays when dealing with slow clients, even if the thread pool isn't actually overloaded.
FWIW, I noticed that my memory went down when I deployed this, but I made the mistake of changing several things at a time, so it's not conclusive. For feedback or general discussion please post at #3697. Only people subscribed to this thread will get comments on it, and I like to focus comments on the act of approving or closing the PR. Now that it's merged, it would be better if we can move comments somewhere else.
Could you open an issue or PR for discussion? I think that we always want to wait a tiny bit, which triggers the GitLab patch behavior (to better balance out worker scheduling by the OS). We could replace it with a condition variable plus always waiting a set amount, or some other kind of delay. I think perhaps there's also a bit of tuning of the accept rate going on here based on load (so workers with more current capacity will call accept slightly faster), but it's all tangled. Either way, please move the conversation elsewhere, and it would be good to talk about "when someone wants to modify or optimize this code, what should they be looking at?" There's also a bunch of benchmarks checked in now; perhaps one of them shows some differences.
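As a starting point for that discussion, a hypothetical sketch (all names made up) of the "condition variable plus a bounded wait" idea: keep a minimum delay so the scheduling-balance behavior is preserved, but let a finishing worker thread cut the rest of the wait short:

```ruby
class BoundedWait
  def initialize
    @mutex    = Mutex.new
    @not_full = ConditionVariable.new
  end

  # Worker threads call this when they finish a request.
  def signal
    @mutex.synchronize { @not_full.broadcast }
  end

  # The accept loop calls this; max_delay would come from the load calculation.
  def wait(min_delay, max_delay)
    sleep min_delay
    remaining = max_delay - min_delay
    @mutex.synchronize { @not_full.wait(@mutex, remaining) } if remaining > 0
  end
end
```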
When we initially merged #3678, we discussed the appropriate setting for this default `max_keep_alive` value and determined we didn't have a terribly good method for setting it. Therefore, we picked a magic number. With #3715, I discovered that the setting `export RUBY_MN_THREADS=1` had disastrous effects on Puma 7.0.0.pre1 performance, but increasing this `max_keep_alive` value dramatically increased performance. The default value with nginx is 1000 http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests. Why not go full nginx and use 1000? Prudence is a virtue. The test was modified to not require sending 1000 requests to observe the keepalive max behavior. This change effectively duplicated the behavior of `test_set_max_keep_alive`, so I removed the redundancy.
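For anyone who needs to tune this today, a hedged config sketch (the `max_keep_alive` DSL comes from this PR; check your version's docs for the current default):

```ruby
# config/puma.rb
# Raise the per-connection keep-alive request cap, e.g. when running with
# RUBY_MN_THREADS=1 as described above.
max_keep_alive 1000
```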
Description
Fixes the long-tail problem. When Puma has a large request load, keep-alive connections will have response times where the longest 5% are excessive. Fixing this issue involves changing how successive keep-alive requests are routed and also changing how requests are pulled from the OS's backlog.
This PR replaces PR #3506. See also issue #3487.
Closes #3487
The long-tail issue (refresher)
Puma has a number of 'bins' it processes responses (app.call) in. The number of bins is equal to the number of workers times the maximum thread count, and it can be as low as one. Each bin can only process one response at a time.
Assume the app's response takes 10 mS to process and assume that the bins do not interfere with or block each other. Each can process 100 responses a second (1/0.01). Let's assume that Puma is running two workers with a maximum thread setting of 3. There are six bins (2 workers * 3 threads), so the server can process a maximum of 600 responses a second.
Now, assume the server is idle and it receives a burst of 30 requests. Since it can only process 6 responses at a time, the last six responses will take 50 mS. Note that dynamic burst testing is difficult, so testing must rely on 'steady state' testing.
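The arithmetic, spelled out:

```ruby
# 6 bins, 10 mS per response, a burst of 30 requests hitting an idle server.
bins, response_ms, burst = 6, 10, 30

waves = (burst.to_f / bins).ceil             # => 5 waves of work
latencies = (1..waves).map { |w| w * response_ms }
# => [10, 20, 30, 40, 50] - the last six responses complete after 50 mS
```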
Regarding the long-tail issue, if a server receives more requests than it can concurrently process, it should distribute the delay across all responses. Currently, with keep-alive requests, the delay is concentrated in approximately the top 5% of responses. Depending on the load, they may be 10 times longer than the 50% response time.
Updates / Changes
Add `backlog_max` and `reactor_max` to stats - can be used in CI, as it may show how many client connections are being held in Puma.
Update benchmarks to output the above.
Add test to check if a large request body (greater than the 64kB read size) is read on first pass.
Add a variable delay to the select loop for each worker. Delay is based on ThreadPool size and Reactor size.
Cleanup and remove code no longer needed.
'client.rb - continue reading request body if readable' - fix so large request bodies are read if available.
Small test fixups.
General
An issue with server code is that compute power and network speed increase over the years. The keep-alive code in Puma (`Client.reset`) used a read timeout of 200 mS and was committed in 2012. Jump to 2025, and a 200 mS delay to wait for a request is a very long time.
Benchmarks - when using local http tools (wrk, hey, ab), there is currently no allowance for network latency.