
Conversation

MSP-Greg (Member)

Description

Fixes the long-tail problem. When Puma is under a large request load, keep-alive connections show response times where the slowest 5% are excessive. The fix changes how successive keep-alive requests are routed and how requests are pulled from the OS's backlog.

This PR replaces PR #3506. See also issue #3487.

Closes #3487

The long-tail issue (refresher)

Puma processes responses (app.call) in a number of 'bins'. The number of bins equals the number of workers times the maximum thread count, and it can be as low as one. Each bin can only process one response at a time.

Assume the app's response takes 10 ms to process, and assume that the bins do not interfere with or block each other. Each bin can then process 100 responses per second (1/0.01). Let's assume that Puma is running two workers with a maximum thread setting of 3. There are six bins (2 workers * 3 threads), so the server can process a maximum of 600 responses per second.

Now, assume the server is idle and it receives a burst of 30 requests. Since it can only process 6 responses at a time, the last six responses will take 50 ms. Note that dynamic burst testing is difficult, so testing must rely on 'steady state' testing.
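
For anyone who wants to play with the arithmetic, here is a tiny Ruby sketch of the idealized numbers above (this is not Puma code):

# Idealized throughput/burst math from the example above; not Puma internals.
workers     = 2
max_threads = 3
response_s  = 0.010                  # 10 ms per app.call
bins        = workers * max_threads  # 6 concurrent 'bins'
max_rps     = bins / response_s      # 600 responses per second

# A burst of 30 requests is processed in ceil(30 / 6) = 5 waves, so the
# slowest responses wait roughly 5 * 10 ms = 50 ms.
burst      = 30
waves      = (burst.to_f / bins).ceil
slowest_ms = waves * response_s * 1000
puts "max rps: #{max_rps.round}, slowest response in a burst of #{burst}: ~#{slowest_ms.round} ms"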

Regarding the long-tail issue: if a server receives more requests than it can concurrently process, it should distribute the delay across all responses. Currently, with keep-alive requests, the delay is concentrated in approximately the slowest 5% of responses. Depending on the load, those responses may take 10 times longer than the median (50%) response time.

Updates / Changes

  • Add backlog_max and reactor_max to stats - they can be used in CI, and may show how many client connections are being held in Puma.

  • Update benchmarks to output the above.

  • Add a test to check that a large request body (greater than the 64 kB read size) is read on the first pass.

  • Add a variable delay to the select loop for each worker. The delay is based on the ThreadPool size and Reactor size.

  • Clean up and remove code that is no longer needed.

  • 'client.rb - continue reading request body if readable' - fix so that large request bodies are read when available.

  • Small test fixups.

General

  • An issue with server code is that compute power and network speed increase over the years. The keep-alive code in Puma (Client.reset) used a read timeout of 200 ms and was committed in 2012. Jump to 2025, and a 200 ms delay while waiting for a request is a very long time (see the sketch after this list).

  • Benchmarks - when using local HTTP tools (wrk, hey, ab), there is currently no allowance for network latency.
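
As a rough illustration of the per-request wait mentioned in the first point, here is a minimal sketch using IO.select; it is not Puma's actual Client code, just the shape of the idea:

# Minimal sketch of waiting for the next keep-alive request with a bounded
# read timeout; Puma's actual Client code differs.
def read_next_request(io, timeout: 0.2)   # 0.2 == the old 200 ms wait
  ready = IO.select([io], nil, nil, timeout)
  return nil unless ready                 # nothing arrived within the timeout
  io.read_nonblock(65_536)                # 64 kB chunk, matching the read size above
rescue IO::WaitReadable, EOFError
  nil
end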

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@MSP-Greg (Member Author)

I've created two gists with test/benchmark data.

The first gist contains summaries generated from the 'raw data' gist; an example is below. The 'Mult' column is the number of concurrent requests sent to the server. For the run below there are 40 'bins' (8 workers * 5 threads), so each row shows behavior under various request overloads. The 'rps %' (requests per second) column is normalized to 100%, and the latency columns are normalized to 1. '00-long-tail-pr-2' is the fork branch, and '74b7f010d793' is Puma master with the changes needed to run the benchmarks.

Runs were done for both keep-alive and non-keep-alive connections. Each file contains runs for 8, 4, 2, and 0 workers. The files/docs are named by the response time, which is varied from 3 to 600 ms.

The table below shows current master latency with a 3 ms response time. Note the increase in times for 95% through 100% when Puma is overloaded (Mult 1.50 through 4.00).

Branch: 74b7f010d793     Puma: -w8  -t5:5  dly 0.003
                                ───────────────────── Hey Latency ───────────────────── Long Tail
Mult/Conn  requests   rps %     10%    25%    50%    75%    90%    95%    99%    100%   100% / 10%
4.00 160    16000      69.6     0.28   0.28   0.28   0.29   1.76  10.65  11.01  17.62     64.06
3.00 120    12000      71.2     0.37   0.37   0.38   0.40   1.52   9.86  10.39  16.12     43.97
2.00  80     8000      69.9     0.55   0.55   0.55   0.57   0.65   7.83   8.18  17.85     32.45
1.50  60     6000      71.1     0.73   0.73   0.73   0.76   0.89   6.09   6.98  14.73     20.09
1.00  40     4000      73.2     1.10   1.10   1.10   1.13   1.20   1.30   2.30  12.73     11.58
0.50  20     2000      43.1     1.10   1.10   1.13   1.13   1.17   1.20   1.47   3.10      2.82

The second gist contains the raw logs from the benchmarks. The logs also have sections with info about the maximum size of the Reactor and ThreadPool, along with details of the request spread across workers.

Branch: 00-long-tail-pr-2 Puma: -w8  -t5:5  dly 0.003    ───────────────── Worker Request Info ─────────────────
                         ── Reactor ──   ── Backlog ──    Std                       % deviation
Mult/Conn  requests       Min     Max     Min     Max     Dev                       from 12.50%
4.00 160    16000         13      16      17      27      0.49    -0.6  -0.4  -0.4  -0.2   0.0   0.1   0.8   0.8
3.00 120    12000         15      18      12      17      0.74    -1.4  -0.6  -0.1  -0.1   0.2   0.3   0.4   1.3
2.00  80     8000          6       8       6      12      0.99    -1.2  -0.8  -0.8  -0.6  -0.2   0.7   1.3   1.6
1.50  60     6000          7       8       4       8      0.85    -1.1  -0.9  -0.5  -0.4  -0.1   0.8   0.9   1.3
1.00  40     4000          3       4       2       4      4.66    -9.0  -5.0  -2.4   1.4   2.2   4.0   4.2   4.6
0.50  20     2000          2       3       2       2     25.98   -50.0 -20.0 -10.0 -10.0  10.0  20.0  30.0  30.0

@schneems (Contributor) left a comment

I've asked for some modifications, nothing truly blocking, mostly additions and cleanups. I want to release what's on main before merging this, though.

I'm also proposing a clean release of 6.6, and we can target main to be v7 with this as the headline change (#3679). I would like to release this as a pre-release and fast-follow with a real release if no major issues appear. If you wouldn't mind, please start thinking of a new codename.

# @todo remove each once Ruby 2.5 is no longer supported
status.tr('}{"', '').strip.split(", ").each do |kv|
cntr = 0
ary = kv.split(':').each do |t|
Contributor:

Assigned but unused ary

Suggested change
- ary = kv.split(':').each do |t|
+ kv.split(':').each do |t|

'max_threads' => max_threads,
'requests_count' => 0,
'busy_threads' => 0,
'reactor_max' => 0,
Contributor:

Request: Add reactor_max and backlog_max to the stats.md. Based on your PR, I believe this description matches current behavior:

* reactor_max: the maximum observed number of requests held in Puma's "reactor", which is used for asynchronously buffering request bodies. This stat is reset on every call, so it's the maximum value observed since the last stat call.
* backlog_max: the maximum number of requests that have been fully buffered by the reactor and placed in a ready queue, but have not yet been picked up by a server thread. This stat is reset on every call, so it's the maximum value observed since the last stat call.
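
For example, something like the following could poll them from inside the app; this is a hedged sketch that assumes stats are available to the process (in cluster mode the keys appear under each worker's last_status rather than at the top level):

require "json"

# Hedged sketch: watching the new peak stats. Both values reset on every
# stats call, so poll periodically to observe peaks.
stats = JSON.parse(Puma.stats)
puts "reactor_max: #{stats['reactor_max']}, backlog_max: #{stats['backlog_max']}"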

Optional: We use the language "backlog" in the context of stats here to mean "backlog for threads", but we also use it in the DSL to mean "number of allowed connections that can queue on the socket":

# * Set the socket backlog depth with +backlog+, default is 1024.

Since it's harder to change the language in the stats (that we're already using) we could update/change the usage in the DSL to "socket queue depth" to minimize confusion in the future.

Member Author:

See 233fc496f97f

I added the lines to docs/stats.md.

As to the DSL text, maybe we should mention that the value is used with TCPServer#listen and controls the OS's socket backlog/queue? Both terms seem to be used in the Ruby docs, and a lot of OS docs use backlog. I think it can be queried on some OSs.
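
For illustration, this is the mechanism being referred to (a minimal sketch, not Puma's binder code):

require "socket"

# The configured value ends up as the backlog argument to TCPServer#listen,
# i.e. the OS's socket queue depth for not-yet-accepted connections.
server = TCPServer.new("127.0.0.1", 0)
server.listen(1024)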

Comment on lines 1367 to 1368
body = nil
content_length = nil
Contributor:

Assigned but not read/used

Suggested change
- body = nil
- content_length = nil


#
# The default is 10.
def max_fast_inline(num_of_requests)
Contributor:

Add a deprecation warning to stderr here please

Suggested change
  def max_fast_inline(num_of_requests)
+   warn "[WARNING] `max_fast_inline` is deprecated, use `max_keep_alive` instead"


@MSP-Greg (Member Author)

@schneems

Good day.

I think I may pull two of the last commits (client.rb - continue reading request body if readable and CI: test_puma_server.rb - add tests for proper request body reading) from this PR. I noticed the issue while testing.

Currently, if a request body is larger than 64 kB, Puma will push the request into the reactor. I thought that was odd behavior. If the socket is readable, Puma should continue to read the socket and see if the complete request body has been sent.
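
For illustration only, the continue-reading idea is roughly the following (a hedged sketch, not the actual client.rb change):

CHUNK = 65_536  # the 64 kB read size mentioned above

# Keep reading while the socket is readable; the real logic lives in Puma's
# Client, this only sketches the idea.
def drain_readable(io, buffer)
  loop { buffer << io.read_nonblock(CHUNK) }
rescue IO::WaitReadable
  buffer  # no more data right now; hand the client off to the reactor
rescue EOFError
  buffer  # client closed its side; whatever we have is the full body
end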

I fixed the issue, but the test often fails. It passes on retry on Ubuntu, but often fails completely on macOS. I've seen the same locally when running tests in parallel.

A few other items:

  1. The code will close a keep-alive connection after 25 requests. It can be adjusted through config (see the config sketch after this list). Not sure if 25 is a good default value.
  2. As you may recall, there have been some previous iterations of this regarding the select-loop delay. I settled on this as the minimal amount that seems to work with various app response times. I tested from 3 ms to 600 ms.
  3. Since I've only tested with my local systems, some real-world testing would be helpful.
  4. Most of the testing I've done uses hey. When running ab or wrk locally, I get intermittent, exceedingly long responses at the high end (95% is okay). But I see the same with several Puma versions, along with this PR. Maybe something odd with my system; not sure. I also use Ruby head for almost all my testing.
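
For reference, adjusting that limit would look roughly like this in a Puma config file; max_keep_alive is the DSL method this PR adds, and 100 here is just an example value:

# config/puma.rb -- sketch of tuning the keep-alive request limit
workers 2
threads 5, 5

# Close a keep-alive connection after this many requests (PR default: 25).
max_keep_alive 100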

@MSP-Greg force-pushed the 00-long-tail-pr-2 branch 2 times, most recently from 9831d30 to e170f5d on July 30, 2025 00:38
@schneems (Contributor)

> The code will close a keep-alive connection after 25 requests. It can be adjusted through config. Not sure if 25 is a good default value.

Seems good. We can tune this in the future. My approach would be looking at what other tools have done and seeing if there's some discussion around pros/cons. I don't think this process should be blocking, though.

Nginx uses 1000: https://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests. The docs mention memory being a tradeoff. I'm not sure what the memory overhead of a keep-alive connection is in Puma. I would also think the bigger memory concern is the total number of retained requests, and closing after N requests is only a proxy for reducing that number.

> As you may recall, there have been some previous iterations of this regarding the select-loop delay. I settled on this as the minimal amount that seems to work with various app response times. I tested from 3 ms to 600 ms.

I'm good with your best judgement here. IMHO the old values were tuned for the old behavior. I think you've done a lot with benchmarks and friends. My preference is to get this into the real world and get feedback.

> Since I've only tested with my local systems, some real-world testing would be helpful.

Agreed.

> Most of the testing I've done uses hey. When running ab or wrk locally, I get intermittent, exceedingly long responses at the high end (95% is okay). But I see the same with several Puma versions, along with this PR. Maybe something odd with my system; not sure. I also use Ruby head for almost all my testing.

Good to note. Since you're seeing it in older versions, I wouldn't consider this a regression, so it's non-blocking. Might be worth opening an issue to track it, ideally isolating it to some specific conditions.

@MSP-Greg force-pushed the 00-long-tail-pr-2 branch 2 times, most recently from 87816cd to 6145745 on July 31, 2025 15:13
@MSP-Greg force-pushed the 00-long-tail-pr-2 branch from 6145745 to 2736ebd on July 31, 2025 17:22
@MSP-Greg (Member Author)

If anyone is interested in trying this, I believe the following change to the Gemfile will work:

gem "puma", github: "puma/puma", ref: "pull/3678/head"

@schneems merged commit 1a55514 into puma:master on Jul 31, 2025 (83 checks passed).
@MSP-Greg (Member Author) commented Aug 1, 2025

This is included in 7.0.0.pre1, which is available as a gem.

@MSP-Greg removed the waiting-for-review label Aug 1, 2025
@joshuay03 mentioned this pull request Aug 2, 2025
@papes1ns mentioned this pull request Aug 18, 2025
jrochkind added a commit to sciencehistory/scihist_digicoll that referenced this pull request Aug 18, 2025
Heroku will be updating to their Router 2.0 with persistent HTTP connections to the app. This fix is needed to avoid a performance regression when they do so. puma/puma#3678

Normally I wouldn't update to a pre-release Puma version, but it seems the best option here.
@jrochkind commented Aug 25, 2025

Is it possible that this fix results in much lower puma resource consumption/higher performance even if you aren't getting traffic using keep-alive?

I have upgraded my app to puma 7.0.0.pre1. It is still running on Heroku, but still on "Router 1.0" (Server: Cowboy), which means there should not be persistent connections/keep-alive involved, if I understand it right.

Nonetheless, it looks to me like, for the same volume of traffic, we are using many fewer max/peak Puma workers. We were usually using 4-5 before, with fairly regular higher spikes to 7-15. Now it's usually just 1-2 Puma workers for the same traffic, with seldom any spikes, and they are seldom higher than 5.

I don't have quite enough data to be sure of this -- this is just using the Heroku "metrics" dashboard, assuming it is working properly (including its monitoring of the number of Puma workers), and without having captured actual data from before we deployed puma 7.0.0.pre1 (but I know we used to have more than 1-2 threads in use).

But thought I'd put this out there in case anyone else is seeing anything. Is it even plausible that "long tail" improvements could apply to more than just keep-alive? Whatever is going on, it is certainly very welcome that my puma thread usage has gone down so much.

(Or, another possibility: Puma 7.0.0 has broken Heroku's 'puma pool usage' monitor!)

Comment on lines +372 to +381
# only use delay when clustered and busy
if pool.busy_threads >= @max_threads
if @clustered
delay = 0.0001 * ((@reactor&.reactor_size || 0) + pool.busy_threads * 1.5)/max_flt
sleep delay
else
# use small sleep for busy single worker
sleep 0.0001
end
end
@joshuay03 (Contributor) commented Aug 29, 2025

This seems to replace efficient condition variable waits with sleep-based delays. Before this change, pool.wait_until_not_full and pool.wait_for_less_busy_worker used @not_full.wait @mutex which could wake up immediately when threads became available. Now we have calculated sleep delays that always wait the full duration even if threads free up right away.

In cluster mode this might be particularly bad because the delay calculation becomes more complex (factoring in reactor size and weighted busy threads), potentially creating longer delays that can't be interrupted when worker threads finish. The old condition variable approach might be more responsive under load.
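
For reference, the condition-variable pattern being described is roughly the following (a hedged sketch with hypothetical names, not Puma's ThreadPool internals):

# Hedged sketch: a wait that can be cut short when capacity frees up.
class CapacityGate
  def initialize
    @mutex    = Mutex.new
    @not_full = ConditionVariable.new
  end

  # Block for up to max_wait seconds, waking early if capacity is signaled.
  def wait_for_capacity(max_wait)
    @mutex.synchronize { @not_full.wait(@mutex, max_wait) }
  end

  # Called when a worker thread finishes a request and frees a slot.
  def signal_capacity
    @mutex.synchronize { @not_full.broadcast }
  end
end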

Contributor:

One other issue I noticed with that delay calculation: it includes @reactor&.reactor_size which represents clients waiting to be buffered, but those could be stuck due to slow clients or network issues rather than actual system load. So you might get artificially high delays when dealing with slow clients, even if the thread pool isn't actually overloaded.

@schneems (Contributor)

> Is it possible that this fix results in much lower puma resource consumption/higher performance even if you aren't getting traffic using keep-alive?

FWIW I noticed that my memory went down when I deployed this, but I made the mistake of changing several things at a time so it's not conclusive.

For feedback or general discussion please post in #3697. Only people subscribed to this thread will get comments on it, and I like to focus comments here on the act of approving or closing the PR. Now that it's merged, if we can move comments somewhere else, that would be better.

> this seems to replace efficient condition

Could you open an issue or PR for discussion? I think that we always want to wait a tiny bit, which triggers the GitLab patch behavior (to better balance out worker scheduling by the OS). We could replace it with a condition variable plus always waiting a set amount, or some other kind of delay. I think perhaps there's also a bit of tuning of the accept rate going on here based on load (so workers with more current capacity will call accept slightly faster), but it's all tangled.

Either way, please move the conversation elsewhere, and it would be good to talk about "when someone wants to modify or optimize this code, what should they be looking at?" There's also a bunch of benchmarks checked in now; perhaps one of them shows some differences.

schneems added a commit that referenced this pull request Sep 2, 2025
When we initially merged #3678, we discussed the appropriate setting for this default `max_keep_alive` value and determined we didn't have a terribly good method for setting it. Therefore, we picked a magic number.

With #3715, I discovered that the setting `export RUBY_MN_THREADS=1` had disastrous effects on Puma 7.0.0.pre1 performance, but increasing this `max_keep_alive` value dramatically increased performance. The default value with nginx is 1000 http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests. Why not go full nginx and use 1000? Prudence is a virtue.

The test was modified to not require sending 1000 requests to observe the keepalive max behavior. This change effectively duplicated the behavior of `test_set_max_keep_alive`, so I removed the redundancy.
schneems added commits that referenced this pull request Sep 3, 2025
Successfully merging this pull request may close these issues:

  • Perf issue: High longtail response times with KeepAlive=true when running with (or without) workers