
Conversation

MSP-Greg (Member)

Description

Fixes the long-tail problem. When Puma is under a large request load, keep-alive connections show response times where the slowest 5% are excessive. The fix changes how successive keep-alive requests are routed and how requests are pulled from the OS's backlog.

This PR replaces PR #3506. See also issue #3487.

Closes #3487

The long-tail issue (refresher)

Puma processes responses (app.call) in a number of 'bins'. The number of bins equals the number of workers times the maximum thread count, and it can be as low as one. Each bin can only process one response at a time.

Assume the app's response takes 10 ms to process, and assume that the bins do not interfere with or block each other. Each bin can then process 100 responses per second (1/0.01). Let's assume that Puma is running two workers with a maximum thread setting of 3. There are six bins (2 workers * 3 threads), so the server can process a maximum of 600 responses per second.

Now, assume the server is idle and it receives a burst of 30 requests. Since it can only process 6 responses at a time, the last six responses will take 50 ms. Note that dynamic burst testing is difficult, so testing must rely on 'steady state' testing.
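
For anyone who wants to play with the arithmetic, here is a tiny Ruby sketch of the idealized numbers above (this is not Puma code):

# Idealized throughput/burst math from the example above; not Puma internals.
workers     = 2
max_threads = 3
response_s  = 0.010                  # 10 ms per app.call
bins        = workers * max_threads  # 6 concurrent 'bins'
max_rps     = bins / response_s      # 600 responses per second

# A burst of 30 requests is processed in ceil(30 / 6) = 5 waves, so the
# slowest responses wait roughly 5 * 10 ms = 50 ms.
burst      = 30
waves      = (burst.to_f / bins).ceil
slowest_ms = waves * response_s * 1000
puts "max rps: #{max_rps.round}, slowest response in a burst of #{burst}: ~#{slowest_ms.round} ms"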

Regarding the long-tail issue: if a server receives more requests than it can concurrently process, it should distribute the delay across all responses. Currently, with keep-alive requests, the delay is concentrated in approximately the slowest 5% of responses. Depending on the load, those responses may take 10 times longer than the median (50%) response time.

Updates / Changes

  • Add backlog_max and reactor_max to stats - they can be used in CI, and may show how many client connections are being held in Puma.

  • Update benchmarks to output the above.

  • Add a test to check that a large request body (greater than the 64 kB read size) is read on the first pass.

  • Add a variable delay to the select loop for each worker. The delay is based on the ThreadPool size and Reactor size.

  • Clean up and remove code that is no longer needed.

  • 'client.rb - continue reading request body if readable' - fix so that large request bodies are read when available.

  • Small test fixups.

General

  • An issue with server code is that compute power and network speed increase over the years. The keep-alive code in Puma (Client.reset) used a read timeout of 200 ms and was committed in 2012. Jump to 2025, and a 200 ms delay while waiting for a request is a very long time (see the sketch after this list).

  • Benchmarks - when using local HTTP tools (wrk, hey, ab), there is currently no allowance for network latency.
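
As a rough illustration of the per-request wait mentioned in the first point, here is a minimal sketch using IO.select; it is not Puma's actual Client code, just the shape of the idea:

# Minimal sketch of waiting for the next keep-alive request with a bounded
# read timeout; Puma's actual Client code differs.
def read_next_request(io, timeout: 0.2)   # 0.2 == the old 200 ms wait
  ready = IO.select([io], nil, nil, timeout)
  return nil unless ready                 # nothing arrived within the timeout
  io.read_nonblock(65_536)                # 64 kB chunk, matching the read size above
rescue IO::WaitReadable, EOFError
  nil
end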

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@MSP-Greg (Member Author)

I've created two gists with test/benchmark data.

The first gist contains summaries generated from the 'raw data' gist; an example is below. The 'Mult' column is the number of concurrent requests sent to the server. For the run below there are 40 'bins' (8 workers * 5 threads), so each row shows behavior under various request overloads. The 'rps %' (requests per second) column is normalized to 100%, and the latency columns are normalized to 1. '00-long-tail-pr-2' is the fork branch, and '74b7f010d793' is Puma master with the changes needed to run the benchmarks.

Runs were done for both keep-alive and non-keep-alive connections. Each file contains runs for 8, 4, 2, and 0 workers. The files/docs are named by the response time, which is varied from 3 to 600 ms.

The table below shows current master latency with a 3 ms response time. Note the increase in times for 95% through 100% when Puma is overloaded (Mult 1.50 through 4.00).

Branch: 74b7f010d793     Puma: -w8  -t5:5  dly 0.003
                                ───────────────────── Hey Latency ───────────────────── Long Tail
Mult/Conn  requests   rps %     10%    25%    50%    75%    90%    95%    99%    100%   100% / 10%
4.00 160    16000      69.6     0.28   0.28   0.28   0.29   1.76  10.65  11.01  17.62     64.06
3.00 120    12000      71.2     0.37   0.37   0.38   0.40   1.52   9.86  10.39  16.12     43.97
2.00  80     8000      69.9     0.55   0.55   0.55   0.57   0.65   7.83   8.18  17.85     32.45
1.50  60     6000      71.1     0.73   0.73   0.73   0.76   0.89   6.09   6.98  14.73     20.09
1.00  40     4000      73.2     1.10   1.10   1.10   1.13   1.20   1.30   2.30  12.73     11.58
0.50  20     2000      43.1     1.10   1.10   1.13   1.13   1.17   1.20   1.47   3.10      2.82

The second gist contains the raw logs from the benchmarks. The logs also have sections with info about the maximum size of the Reactor and ThreadPool, along with details of the request spread across workers.

Branch: 00-long-tail-pr-2 Puma: -w8  -t5:5  dly 0.003    ───────────────── Worker Request Info ─────────────────
                         ── Reactor ──   ── Backlog ──    Std                       % deviation
Mult/Conn  requests       Min     Max     Min     Max     Dev                       from 12.50%
4.00 160    16000         13      16      17      27      0.49    -0.6  -0.4  -0.4  -0.2   0.0   0.1   0.8   0.8
3.00 120    12000         15      18      12      17      0.74    -1.4  -0.6  -0.1  -0.1   0.2   0.3   0.4   1.3
2.00  80     8000          6       8       6      12      0.99    -1.2  -0.8  -0.8  -0.6  -0.2   0.7   1.3   1.6
1.50  60     6000          7       8       4       8      0.85    -1.1  -0.9  -0.5  -0.4  -0.1   0.8   0.9   1.3
1.00  40     4000          3       4       2       4      4.66    -9.0  -5.0  -2.4   1.4   2.2   4.0   4.2   4.6
0.50  20     2000          2       3       2       2     25.98   -50.0 -20.0 -10.0 -10.0  10.0  20.0  30.0  30.0

@schneems (Contributor) left a comment

I've asked for some modifications, nothing truly blocking, mostly additions and cleanups. I want to release what's on main before merging this, though.

I'm also proposing a clean release of 6.6, and we can target main to be v7 with this as the headline change (#3679). I would like to release this as a pre-release and fast-follow with a real release if no major issues appear. If you wouldn't mind, please start thinking of a new codename.

# @todo remove each once Ruby 2.5 is no longer supported
status.tr('}{"', '').strip.split(", ").each do |kv|
cntr = 0
ary = kv.split(':').each do |t|
Contributor:

Assigned but unused ary

Suggested change
- ary = kv.split(':').each do |t|
+ kv.split(':').each do |t|

'max_threads' => max_threads,
'requests_count' => 0,
'busy_threads' => 0,
'reactor_max' => 0,
Contributor:

Request: Add reactor_max and backlog_max to the stats.md. Based on your PR, I believe this description matches current behavior:

* reactor_max: the maximum observed number of requests held in Puma's "reactor", which is used for asynchronously buffering request bodies. This stat is reset on every call, so it's the maximum value observed since the last stat call.
* backlog_max: the maximum number of requests that have been fully buffered by the reactor and placed in a ready queue, but have not yet been picked up by a server thread. This stat is reset on every call, so it's the maximum value observed since the last stat call.
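
For example, something like the following could poll them from inside the app; this is a hedged sketch that assumes stats are available to the process (in cluster mode the keys appear under each worker's last_status rather than at the top level):

require "json"

# Hedged sketch: watching the new peak stats. Both values reset on every
# stats call, so poll periodically to observe peaks.
stats = JSON.parse(Puma.stats)
puts "reactor_max: #{stats['reactor_max']}, backlog_max: #{stats['backlog_max']}"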

Optional: We use the language "backlog" in the context of stats here to mean "backlog for threads", but we also use it in the DSL to mean "number of allowed connections that can queue on the socket":

# * Set the socket backlog depth with +backlog+, default is 1024.

Since it's harder to change the language in the stats (that we're already using) we could update/change the usage in the DSL to "socket queue depth" to minimize confusion in the future.

Member Author:

See 233fc496f97f

I added the lines to docs/stats.md.

As to the DSL text, maybe we should mention that the value is used with TCPServer#listen and controls the OS's socket backlog/queue? Both terms seem to be used in the Ruby docs, and a lot of OS docs use backlog. I think it can be queried on some OSs.
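
For illustration, this is the mechanism being referred to (a minimal sketch, not Puma's binder code):

require "socket"

# The configured value ends up as the backlog argument to TCPServer#listen,
# i.e. the OS's socket queue depth for not-yet-accepted connections.
server = TCPServer.new("127.0.0.1", 0)
server.listen(1024)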

Comment on lines 1367 to 1368
body = nil
content_length = nil
Contributor:

Assigned but not read/used

Suggested change
- body = nil
- content_length = nil


#
# The default is 10.
def max_fast_inline(num_of_requests)
Contributor:

Add a deprecation warning to stderr here please

Suggested change
  def max_fast_inline(num_of_requests)
+   warn "[WARNING] `max_fast_inline` is deprecated, use `max_keep_alive` instead"


@MSP-Greg (Member Author)

@schneems

Good day.

I think I may pull two of the last commits (client.rb - continue reading request body if readable and CI: test_puma_server.rb - add tests for proper request body reading) from this PR. I noticed the issue while testing.

Currently, if a request body is larger than 64 kB, Puma will push the request into the reactor. I thought that was odd behavior. If the socket is readable, Puma should continue to read the socket and see if the complete request body has been sent.
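
For illustration only, the continue-reading idea is roughly the following (a hedged sketch, not the actual client.rb change):

CHUNK = 65_536  # the 64 kB read size mentioned above

# Keep reading while the socket is readable; the real logic lives in Puma's
# Client, this only sketches the idea.
def drain_readable(io, buffer)
  loop { buffer << io.read_nonblock(CHUNK) }
rescue IO::WaitReadable
  buffer  # no more data right now; hand the client off to the reactor
rescue EOFError
  buffer  # client closed its side; whatever we have is the full body
end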

I fixed the issue, but the test often fails. It passes on retry on Ubuntu, but often fails completely on macOS. I've seen the same locally when running tests in parallel.

A few other items:

  1. The code will close a keep-alive connection after 25 requests. It can be adjusted through config (see the config sketch after this list). Not sure if 25 is a good default value.
  2. As you may recall, there have been some previous iterations of this regarding the select-loop delay. I settled on this as the minimal amount that seems to work with various app response times. I tested from 3 ms to 600 ms.
  3. Since I've only tested with my local systems, some real-world testing would be helpful.
  4. Most of the testing I've done uses hey. When running ab or wrk locally, I get intermittent, exceedingly long responses at the high end (95% is okay). But I see the same with several Puma versions, along with this PR. Maybe something odd with my system; not sure. I also use Ruby head for almost all my testing.
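
For reference, adjusting that limit would look roughly like this in a Puma config file; max_keep_alive is the DSL method this PR adds, and 100 here is just an example value:

# config/puma.rb -- sketch of tuning the keep-alive request limit
workers 2
threads 5, 5

# Close a keep-alive connection after this many requests (PR default: 25).
max_keep_alive 100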

@MSP-Greg force-pushed the 00-long-tail-pr-2 branch 2 times, most recently from 9831d30 to e170f5d on July 30, 2025 00:38
@schneems (Contributor)

> The code will close a keep-alive connection after 25 requests. It can be adjusted through config. Not sure if 25 is a good default value.

Seems good. We can tune this in the future. My approach would be looking at what other tools have done and seeing if there's some discussion around pros/cons. I don't think this process should be blocking, though.

Nginx uses 1000: https://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests. The docs mention memory being a tradeoff. I'm not sure what the memory overhead of a keep-alive connection is in Puma. I would also think the bigger memory concern is the total number of retained requests, and closing after N requests is only a proxy for reducing that number.

> As you may recall, there have been some previous iterations of this regarding the select-loop delay. I settled on this as the minimal amount that seems to work with various app response times. I tested from 3 ms to 600 ms.

I'm good with your best judgement here. IMHO the old values were tuned for the old behavior. I think you've done a lot with benchmarks and friends. My preference is to get this into the real world and get feedback.

> Since I've only tested with my local systems, some real-world testing would be helpful.

Agreed.

> Most of the testing I've done uses hey. When running ab or wrk locally, I get intermittent, exceedingly long responses at the high end (95% is okay). But I see the same with several Puma versions, along with this PR. Maybe something odd with my system; not sure. I also use Ruby head for almost all my testing.

Good to note. Since you're seeing it in older versions, I wouldn't consider this a regression, so it's non-blocking. Might be worth opening an issue to track it, ideally isolating it to some specific conditions.

@MSP-Greg force-pushed the 00-long-tail-pr-2 branch 2 times, most recently from 87816cd to 6145745 on July 31, 2025 15:13
@MSP-Greg force-pushed the 00-long-tail-pr-2 branch from 6145745 to 2736ebd on July 31, 2025 17:22
@MSP-Greg (Member Author)

If anyone is interested in trying this, I believe the following change to the Gemfile will work:

gem "puma", github: "puma/puma", ref: "pull/3678/head"

@schneems merged commit 1a55514 into puma:master on Jul 31, 2025 (83 checks passed).
@MSP-Greg (Member Author) commented Aug 1, 2025

This is included in 7.0.0.pre1, which is available as a gem.

@MSP-Greg removed the waiting-for-review label Aug 1, 2025
@joshuay03 mentioned this pull request Aug 2, 2025
@papes1ns mentioned this pull request Aug 18, 2025
jrochkind added a commit to sciencehistory/scihist_digicoll that referenced this pull request Aug 18, 2025
Heroku will be updating to their Router 2.0 with persistent HTTP connections to the app. This fix is needed to avoid a performance regression when they do so. puma/puma#3678

Normally I wouldn't update to a pre-release Puma version, but it seems the best option here.
@jrochkind commented Aug 25, 2025

Is it possible that this fix results in much lower puma resource consumption/higher performance even if you aren't getting traffic using keep-alive?

I have upgraded my app to puma 7.0.0.pre1. It is still running on Heroku, but still on "Router 1.0" (Server: Cowboy), which means there should not be persistent connections/keep-alive involved, if I understand it right.

Nonetheless, it looks to me like, for the same volume of traffic, we are using many fewer max/peak Puma workers. We were usually using 4-5 before, with fairly regular higher spikes to 7-15. Now it's usually just 1-2 Puma workers for the same traffic, with seldom any spikes, and they are seldom higher than 5.

I don't have quite enough data to be sure of this -- this is just using the Heroku "metrics" dashboard, assuming it is working properly (including its monitoring of the number of Puma workers), and without having captured actual data from before we deployed puma 7.0.0.pre1 (but I know we used to have more than 1-2 threads in use).

But thought I'd put this out there in case anyone else is seeing anything. Is it even plausible that "long tail" improvements could apply to more than just keep-alive? Whatever is going on, it is certainly very welcome that my puma thread usage has gone down so much.

(Or, another possibility: Puma 7.0.0 has broken Heroku's 'puma pool usage' monitor!)

Comment on lines +372 to +381
# only use delay when clustered and busy
if pool.busy_threads >= @max_threads
if @clustered
delay = 0.0001 * ((@reactor&.reactor_size || 0) + pool.busy_threads * 1.5)/max_flt
sleep delay
else
# use small sleep for busy single worker
sleep 0.0001
end
end
@joshuay03 (Contributor) commented Aug 29, 2025

This seems to replace efficient condition variable waits with sleep-based delays. Before this change, pool.wait_until_not_full and pool.wait_for_less_busy_worker used @not_full.wait @mutex which could wake up immediately when threads became available. Now we have calculated sleep delays that always wait the full duration even if threads free up right away.

In cluster mode this might be particularly bad because the delay calculation becomes more complex (factoring in reactor size and weighted busy threads), potentially creating longer delays that can't be interrupted when worker threads finish. The old condition variable approach might be more responsive under load.
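
For reference, the condition-variable pattern being described is roughly the following (a hedged sketch with hypothetical names, not Puma's ThreadPool internals):

# Hedged sketch: a wait that can be cut short when capacity frees up.
class CapacityGate
  def initialize
    @mutex    = Mutex.new
    @not_full = ConditionVariable.new
  end

  # Block for up to max_wait seconds, waking early if capacity is signaled.
  def wait_for_capacity(max_wait)
    @mutex.synchronize { @not_full.wait(@mutex, max_wait) }
  end

  # Called when a worker thread finishes a request and frees a slot.
  def signal_capacity
    @mutex.synchronize { @not_full.broadcast }
  end
end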

Contributor:

One other issue I noticed with that delay calculation: it includes @reactor&.reactor_size which represents clients waiting to be buffered, but those could be stuck due to slow clients or network issues rather than actual system load. So you might get artificially high delays when dealing with slow clients, even if the thread pool isn't actually overloaded.

@schneems (Contributor)

> Is it possible that this fix results in much lower puma resource consumption/higher performance even if you aren't getting traffic using keep-alive?

FWIW I noticed that my memory went down when I deployed this, but I made the mistake of changing several things at a time so it's not conclusive.

For feedback or general discussion please post in #3697. Only people subscribed to this thread will get comments on it, and I like to focus comments here on the act of approving or closing the PR. Now that it's merged, if we can move comments somewhere else, that would be better.

> this seems to replace efficient condition

Could you open an issue or PR for discussion? I think that we always want to wait a tiny bit, which triggers the GitLab patch behavior (to better balance out worker scheduling by the OS). We could replace it with a condition variable plus always waiting a set amount, or some other kind of delay. I think perhaps there's also a bit of tuning of the accept rate going on here based on load (so workers with more current capacity will call accept slightly faster), but it's all tangled.

Either way, please move the conversation elsewhere, and it would be good to talk about "when someone wants to modify or optimize this code, what should they be looking at?" There's also a bunch of benchmarks checked in now; perhaps one of them shows some differences.

schneems added a commit that referenced this pull request Sep 2, 2025
When we initially merged #3678, we discussed the appropriate setting for this default `max_keep_alive` value and determined we didn't have a terribly good method for setting it. Therefore, we picked a magic number.

With #3715, I discovered that the setting `export RUBY_MN_THREADS=1` had disastrous effects on Puma 7.0.0.pre1 performance, but increasing this `max_keep_alive` value dramatically increased performance. The default value with nginx is 1000 http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests. Why not go full nginx and use 1000? Prudence is a virtue.

The test was modified to not require sending 1000 requests to observe the keepalive max behavior. This change effectively duplicated the behavior of `test_set_max_keep_alive`, so I removed the redundancy.
schneems added commits that referenced this pull request Sep 3, 2025
Successfully merging this pull request may close these issues:

  • Perf issue: High longtail response times with KeepAlive=true when running with (or without) workers