feat(gateway): concurrency and timeout limits #994

lidel · 2025-08-11T00:54:22Z

This PR adds built-in gateway limiter middleware, configuration, and tests:

- adds MaxConcurrentRequests
  - TL;DR: a semaphore limits the number of requests in flight; excess requests get an HTTP 429 Too Many Requests response
  - closes gateway: limit the number of requests in flight #881
- adds RetrievalTimeout
  - TL;DR: enforces a maximum duration for content retrieval:
    - Time to first byte: If the gateway cannot start writing the response within this duration (e.g., stuck searching for providers), a 504 Gateway Timeout is returned.
    - Time between non-empty writes: After the first byte, the timeout resets each time new bytes are written to the client. If the gateway cannot write additional data within this duration after the last successful write, the response is terminated.
  - closes gateway: ability to set response write timeout #679
  - also helps with A timeout is required when fetching blocks #908 (if the gateway is used as the high-level interface for retrieval)
- adds optional Config.MetricsRegistry
  - TL;DR: allows removing the dependency on the global registry, which allows for multiple instances (e.g. in parallel tests)
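
For readers unfamiliar with the pattern, the MaxConcurrentRequests behaviour described above can be sketched with a buffered channel used as a counting semaphore. This is only an illustration under stated assumptions: `limitConcurrency` and `demoStatuses` are hypothetical names, not the middleware added in this PR, and the real implementation may differ in how it reports errors and records metrics.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// limitConcurrency is a hypothetical helper (not the boxo API): it admits at
// most max requests at a time and rejects the rest with HTTP 429.
func limitConcurrency(next http.Handler, max int) http.Handler {
	sem := make(chan struct{}, max) // buffered channel used as a counting semaphore
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // a slot was free: take it
			defer func() { <-sem }() // release the slot when the request finishes
			next.ServeHTTP(w, r)
		default: // no free slot: fail fast instead of queueing
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		}
	})
}

// demoStatuses keeps one request in flight with max=1 and returns the status
// codes seen by that request and by a second, concurrent one.
func demoStatuses() (first, second int) {
	entered := make(chan struct{})
	release := make(chan struct{})
	h := limitConcurrency(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(entered) // signal that the only slot is now taken
		<-release      // stay in flight until the test lets us finish
		w.WriteHeader(http.StatusOK)
	}), 1)

	rec1 := httptest.NewRecorder()
	done := make(chan struct{})
	go func() {
		defer close(done)
		h.ServeHTTP(rec1, httptest.NewRequest(http.MethodGet, "/ipfs/example", nil))
	}()
	<-entered

	// The second request arrives while the first still holds the slot.
	rec2 := httptest.NewRecorder()
	h.ServeHTTP(rec2, httptest.NewRequest(http.MethodGet, "/ipfs/example", nil))

	close(release)
	<-done
	return rec1.Code, rec2.Code
}

func main() {
	first, second := demoStatuses()
	fmt.Println(first, second) // prints "200 429"
}
```

Rejecting immediately, rather than queueing, keeps excess load from piling up behind slow retrievals, which matches the "excess gets HTTP 429" description.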

TODO

- refactor new metrics to not be globals, and be returned by initializeMiddlewareMetrics
- Kubo PR with conformance passing: feat(config): Gateway.RetrievalTimeout|MaxConcurrentRequests kubo#10905
- Rainbow PR with conformance passing: feat: retrieval-timeout & max-concurrent-requests rainbow#285
  - fix regression in only-if-cached (e109608)

References

Creates surface to address Update error pages on the gateways to surface debugging information — IPFS/2025 ipshipyard/roadmaps#14

lidel

I will fix racy metrics tests and lint after some sleep, but dropping some notes for early review

lidel · 2025-08-11T01:58:23Z

gateway/gateway.go

+// Default values for gateway configuration limits
+const (
+	// DefaultRetrievalTimeout is the default maximum duration for initial content retrieval
+	// (time to first byte) and subsequent writes to the HTTP response body.
+	DefaultRetrievalTimeout = 30 * time.Second
+
+	// DefaultMaxConcurrentRequests is the default maximum number of concurrent HTTP requests
+	// that the gateway will process.
+	DefaultMaxConcurrentRequests = 1024


ℹ️ Do these make sense as defaults? My rationale:

- 30s feels solid (that is what we've been using at ipfs.io under Nginx's proxy_read_timeout)
- 1024 is 2x the default from Nginx (worker_connections 512)

Of course, users can adjust or disable these limits; the goal here is a sane default aimed at desktop users and YOLO deployments.
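
To make the RetrievalTimeout semantics concrete (a 504 if nothing is written in time, and a clock that restarts on every non-empty write), here is a minimal sketch. It assumes a hypothetical `timeoutWriter` wrapper and `withRetrievalTimeout` helper; these are illustrative names and do not reproduce the middleware in gateway/middleware_retrieval_timeout.go.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// timeoutWriter is a hypothetical wrapper for illustration only.
type timeoutWriter struct {
	http.ResponseWriter
	mu       sync.Mutex
	timer    *time.Timer
	timeout  time.Duration
	cancel   context.CancelFunc
	started  bool // headers or body already sent
	timedOut bool // the deadline passed without progress
	finished bool // handler returned; the timer must not touch the response
}

func (tw *timeoutWriter) WriteHeader(code int) {
	tw.mu.Lock()
	defer tw.mu.Unlock()
	if tw.timedOut || tw.started {
		return
	}
	tw.started = true
	tw.ResponseWriter.WriteHeader(code)
}

func (tw *timeoutWriter) Write(p []byte) (int, error) {
	tw.mu.Lock()
	defer tw.mu.Unlock()
	if tw.timedOut {
		return 0, http.ErrHandlerTimeout
	}
	tw.started = true
	n, err := tw.ResponseWriter.Write(p)
	if n > 0 {
		tw.timer.Reset(tw.timeout) // progress was made: restart the clock
	}
	return n, err
}

// expire runs when no bytes were written for a full timeout period.
func (tw *timeoutWriter) expire() {
	tw.mu.Lock()
	defer tw.mu.Unlock()
	if tw.finished {
		return
	}
	tw.timedOut = true
	if !tw.started { // nothing sent yet, so a clean 504 is still possible
		tw.started = true
		http.Error(tw.ResponseWriter, "retrieval timed out", http.StatusGatewayTimeout)
	}
	tw.cancel() // unblock a handler that is waiting on the request context
}

func withRetrievalTimeout(next http.Handler, d time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithCancel(r.Context())
		defer cancel()
		tw := &timeoutWriter{ResponseWriter: w, timeout: d, cancel: cancel}
		tw.timer = time.AfterFunc(d, tw.expire)
		next.ServeHTTP(tw, r.WithContext(ctx))
		tw.timer.Stop()
		tw.mu.Lock()
		tw.finished = true
		tw.mu.Unlock()
	})
}

// demoTimeout returns the status of a handler that never writes and of one that writes at once.
func demoTimeout() (stuckCode, fastCode int) {
	stuck := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-r.Context().Done() // simulates a provider search that never completes
	})
	fast := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	})
	run := func(h http.Handler) int {
		rec := httptest.NewRecorder()
		withRetrievalTimeout(h, 50*time.Millisecond).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
		return rec.Code
	}
	return run(stuck), run(fast)
}

func main() {
	fmt.Println(demoTimeout()) // prints "504 200"
}
```

Once bytes have been sent, the status code can no longer change, so this sketch only fails later writes and cancels the request context. How the actual middleware terminates an in-progress response is not shown here.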

Extra notes on MaxConcurrentRequests and Nginx

While testing the fix on staging, I had to adjust this default because our boxes can handle a higher load.

Nginx configuration on the staging box:

- worker_processes: auto (8 workers based on 8 CPU cores)
- worker_connections: 1024 per worker
- Total nginx capacity: 8 workers x 1024 = 8192 total connections

For a reverse proxy, each proxied request uses 2 connections in nginx:

- Client → nginx (port 443)
- nginx → kubo (port 8080)

So, iiuc, nginx on staging box 02 can proxy ~8192/2 = 4096 concurrent requests.
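
The arithmetic above can be written down as a small helper, which may be handy when picking a MaxConcurrentRequests value for a given nginx setup. `proxyCapacity` is a hypothetical name introduced for this note, and it assumes, as stated above, that every proxied request holds exactly two nginx connections.

```go
package main

import "fmt"

// proxyCapacity estimates how many requests nginx can proxy at once, assuming
// each proxied request uses two connections (client → nginx, nginx → kubo).
func proxyCapacity(workers, workerConnections int) int {
	const connsPerProxiedRequest = 2
	return workers * workerConnections / connsPerProxiedRequest
}

func main() {
	fmt.Println(proxyCapacity(8, 1024)) // staging box: prints 4096
	fmt.Println(proxyCapacity(1, 512))  // one worker, worker_connections 512: prints 256
}
```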

I've increased ipfs config Gateway.MaxConcurrentRequests --json 4096 and the success rate issue went away: the HTTP 200 success rate is on par with control box 01, which runs the previous release (0.36), but the impact on CPU/memory is much lower.

I think 1024 may be a sensible default (most deployments run on weaker hardware and do not carry ipfs.io-level traffic), so 1024 is actually reasonable for:

- stock nginx (256 proxy capacity)
- lightly tuned nginx (1-2k proxy capacity)
- single nginx worker setups

I'll just make sure it is properly documented. Update: b42675e

After some more thought, I raised the default in chore: DefaultMaxConcurrentRequests = 4096

Rationale in e170d19: adjusting to a higher value to avoid headaches in production environments and users complaining about HTTP 429s. This should act as a failsafe of last resort for now; we can adjust it later, but for now 4k makes the rollout easier and less disruptive.

gateway/middleware_ratelimit.go

gateway/middleware_retrieval_timeout.go

codecov · 2025-08-11T18:51:25Z

Codecov Report

❌ Patch coverage is 69.97930% with 145 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.50%. Comparing base (44a4890) to head (9aff4dd).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| gateway/backend_blocks.go | 5.88% | 63 Missing and 1 partial ⚠️ |
| gateway/middleware_retrieval_timeout.go | 76.47% | 31 Missing and 5 partials ⚠️ |
| gateway/handler.go | 70.00% | 11 Missing and 4 partials ⚠️ |
| gateway/middleware_metrics.go | 88.59% | 11 Missing and 2 partials ⚠️ |
| gateway/handler_codec.go | 0.00% | 5 Missing ⚠️ |
| gateway/errors.go | 90.32% | 2 Missing and 1 partial ⚠️ |
| gateway/handler_ipns_record.go | 0.00% | 3 Missing ⚠️ |
| gateway/handler_unixfs_dir.go | 0.00% | 3 Missing ⚠️ |
| gateway/metrics.go | 85.00% | 0 Missing and 3 partials ⚠️ |

@@            Coverage Diff             @@
##             main     #994      +/-   ##
==========================================
+ Coverage   61.39%   61.50%   +0.10%     
==========================================
  Files         254      257       +3     
  Lines       31731    32161     +430     
==========================================
+ Hits        19481    19780     +299     
- Misses      10644    10765     +121     
- Partials     1606     1616      +10

| Files with missing lines | Coverage Δ | |
|---|---|---|
| gateway/gateway.go | `83.54% <100.00%> (ø)` | |
| gateway/handler_car.go | `79.79% <100.00%> (-0.11%)` | ⬇️ |
| gateway/handler_tar.go | `84.12% <100.00%> (ø)` | |
| gateway/middleware_ratelimit.go | `100.00% <100.00%> (ø)` | |
| gateway/errors.go | `86.01% <90.32%> (+1.39%)` | ⬆️ |
| gateway/handler_ipns_record.go | `17.91% <0.00%> (-0.84%)` | ⬇️ |
| gateway/handler_unixfs_dir.go | `64.94% <0.00%> (ø)` | |
| gateway/metrics.go | `80.15% <85.00%> (-2.71%)` | ⬇️ |
| gateway/handler_codec.go | `61.68% <0.00%> (-0.59%)` | ⬇️ |
| gateway/middleware_metrics.go | `88.59% <88.59%> (ø)` | |
... and 3 more

... and 6 files with indirect coverage changes

This aims to fix the failure from: https://github.com/ipfs/rainbow/actions/runs/16892227598/job/47854552215?pr=285#step:13:753

Range request regression fix: modified BlocksBackend.Get() in backend_blocks.go to optimize range requests by:
- Detecting when a range request is made
- Only fetching the root block first to check if it's a UnixFS file
- Using lazy loading to fetch only the blocks needed for the requested range

This avoids the timeout issue when missing blocks exist outside the requested range.

gateway/backend_blocks.go

gateway/handler.go

gateway/backend_blocks.go

gateway/errors.go

gammazero · 2025-08-13T02:40:24Z

gateway/gateway.go

+// Default values for gateway configuration limits
+const (
+	// DefaultRetrievalTimeout is the default maximum duration for initial content retrieval
+	// (time to first byte) and subsequent writes to the HTTP response body.
+	DefaultRetrievalTimeout = 30 * time.Second
+
+	// DefaultMaxConcurrentRequests is the default maximum number of concurrent HTTP requests
+	// that the gateway will process.
+	DefaultMaxConcurrentRequests = 1024


gateway/middleware_ratelimit_test.go

gateway/middleware_retrieval_timeout.go

lidel · 2025-08-14T16:19:04Z

Fix from d2d586c seems to work on staging (02). No issues in the past 30m. Will let it run for an hour or two just to be sure.

lidel · 2025-08-14T20:11:46Z

I've rebased instead of merge by mistake, then restored to original (reviewed) version, apologies for noise. 🙈

lidel · 2025-08-14T20:54:53Z

All green, merging.
(Added docs + raised DefaultMaxConcurrentRequests = 4096 to make rollout easier)

* feat(gateway): concurrency and timeout limits (Depends on ipfs/boxo#994)
* chore: boxo master with final boxo#994 (this includes race-condition fixes from ipfs/boxo#994 and increased `DefaultMaxConcurrentRequests = 4096`)
* docs: concise config.md and changelog

feat(gw): concurrency and timeout limits

cc104f6

adds MaxConcurrentRequests closes #881 adds RetrievalTimeout Closes #679

lidel changed the title ~~feat(gw): concurrency and timeout limits~~ feat(gateway): concurrency and timeout limits Aug 11, 2025

lidel added 3 commits August 11, 2025 03:09

chore: lint

a5e29fa

refactor(gateway): Config.MetricsRegistry

d50bbe7

allow passing custom registry instead of the global one useful for testing and deployments with multiple gateway instances

chore: more lint

59b40ec

lidel mentioned this pull request Aug 9, 2025

Release 0.37 ipfs/kubo#10867

Closed

51 tasks

lidel self-assigned this Aug 11, 2025

lidel commented Aug 11, 2025

This was referenced Aug 11, 2025

feat(gateway): implement request concurrency limit #887

Closed

gateway: ability to set response write timeout #679

Closed

feat(gateway): add configurable response write timeout #812

Closed

lidel added a commit to ipfs/kubo that referenced this pull request Aug 11, 2025

feat(gateway): concurrency and timeout limits

a6b8c2c

Depends on ipfs/boxo#994

lidel mentioned this pull request Aug 11, 2025

feat(config): Gateway.RetrievalTimeout|MaxConcurrentRequests ipfs/kubo#10905

Merged

4 tasks

refactor: remove global metrics

43405e9

now they always init per handler

lidel force-pushed the feat-additional-gateway-limits branch from 731675e to e74051a Compare August 11, 2025 19:13

chore: gci formatting

209da44

go fmt and gci were flip-flopping due to comment in newline and conflicting rules, this removes the problem

lidel force-pushed the feat-additional-gateway-limits branch from e74051a to 209da44 Compare August 11, 2025 19:27

lidel added a commit to ipfs/rainbow that referenced this pull request Aug 11, 2025

feat: retrieval-timeout & max-concurrent-requests

a2f2bad

Depends on ipfs/boxo#994

lidel mentioned this pull request Aug 11, 2025

feat: retrieval-timeout & max-concurrent-requests ipfs/rainbow#285

Merged

lidel marked this pull request as ready for review August 11, 2025 19:43

lidel requested a review from a team as a code owner August 11, 2025 19:43

lidel added 2 commits August 11, 2025 23:06

fix(conformance): only-if-cached returns HTTP 412

e109608

aims to fix problem found in rainbow: https://github.com/ipfs/rainbow/actions/runs/16890332583/job/47848231793?pr=285#step:13:917

lidel commented Aug 11, 2025

gateway/backend_blocks.go

lidel commented Aug 11, 2025

gateway/handler.go

gammazero self-requested a review August 12, 2025 14:49

gammazero reviewed Aug 13, 2025

lidel added 2 commits August 13, 2025 22:20

chore: add visibility into failed requests

0705edd

#994 (comment)

refactor: simplify timeout for loop

9a506f6

removed: - unnecessary for loop in timeout goroutine (both select cases return) - unused done and timeoutSignal fields from timeoutWriter struct

lidel force-pushed the feat-additional-gateway-limits branch from 073955a to d2d586c Compare August 14, 2025 16:18

docs: MaxConcurrentRequests tuning guidance

b42675e

gammazero approved these changes Aug 14, 2025

fix(gateway): prevent user input in error logs

35bc575

replace dynamic error messages with static ones to avoid logging user-controlled Accept headers, query parameters, and paths.
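
As a rough illustration of the idea in this commit (not the actual change), the difference between a dynamic and a static error message can be sketched as follows. All names here (`errUnsupportedFormat`, `dynamicError`, `staticError`, `leaks`) are hypothetical and exist only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errUnsupportedFormat is a hypothetical static sentinel: its text never
// depends on request data, so logging it cannot echo user input.
var errUnsupportedFormat = errors.New("unsupported response format requested")

// dynamicError embeds the raw Accept header in the message, so whatever the
// client sent ends up in logs.
func dynamicError(accept string) error {
	return fmt.Errorf("unsupported Accept header: %q", accept)
}

// staticError ignores the header value and returns the fixed sentinel.
func staticError(accept string) error {
	return errUnsupportedFormat
}

// leaks reports whether the error text contains the given user-controlled value.
func leaks(err error, userInput string) bool {
	return strings.Contains(err.Error(), userInput)
}

func main() {
	accept := "application/x-attacker-controlled"
	fmt.Println(leaks(dynamicError(accept), accept)) // prints true
	fmt.Println(leaks(staticError(accept), accept))  // prints false
}
```

A static sentinel also lets callers match the failure with errors.Is instead of parsing message text.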

lidel added a commit that referenced this pull request Aug 14, 2025

chore: add visibility into failed requests

bb58999

#994 (comment)

lidel added a commit that referenced this pull request Aug 14, 2025

refactor: simplify http.Error

4ce3900

#994 (comment)

lidel added a commit that referenced this pull request Aug 14, 2025

refactor: avoiding unnecessary []byte conversions

d713d43

#994 (comment)

lidel added a commit that referenced this pull request Aug 14, 2025

refactor for → switch

a8761d4

#994 (comment)

lidel added a commit that referenced this pull request Aug 14, 2025

fix: cleanup goroutine in tests

1a7747d

#994 (comment)

lidel added a commit that referenced this pull request Aug 14, 2025

test: 412 when not cached for only-if-cached

237e3c5

#994 (comment)

lidel force-pushed the feat-additional-gateway-limits branch from 35bc575 to d0e215c Compare August 14, 2025 19:53

Merge branch 'main' into feat-additional-gateway-limits

2a327db

lidel force-pushed the feat-additional-gateway-limits branch from d0e215c to 2a327db Compare August 14, 2025 20:11

lidel added 3 commits August 14, 2025 22:20

docs: godoc for gateway package

6f67f1e

chore: DefaultMaxConcurrentRequests = 4096

e170d19

adjusting to higher value to avoid headache in production environments and users complaining for HTTP 429s. this should act as failsafe of last resort for now, we can adjust later but for now 4k makes rollout easier and less disruptive

chore: fix gateway/handler_car_test.go

9aff4dd

lidel removed the status/blocked (Unable to be worked further until needs are met) label Aug 14, 2025

lidel mentioned this pull request Aug 14, 2025

gateway: make it easier to retry from HTML error page for 504 Timeouts #427

Open

lidel merged commit 54b62d4 into main Aug 14, 2025
18 checks passed

lidel deleted the feat-additional-gateway-limits branch August 14, 2025 21:08

lidel added a commit to ipfs/rainbow that referenced this pull request Aug 14, 2025

chore: boxo master with final boxo#994

53e23ab

this includes race-condition fixes from ipfs/boxo#994 and increased DefaultMaxConcurrentRequests = 4096

lidel mentioned this pull request Aug 14, 2025

chore: boxo master with final boxo#994 ipfs/rainbow#287

Merged

lidel added a commit to ipfs/rainbow that referenced this pull request Aug 14, 2025

chore: boxo master with final boxo#994 (#287)

f3b0630

this includes race-condition fixes from ipfs/boxo#994 and increased DefaultMaxConcurrentRequests = 4096

lidel added a commit to ipfs/kubo that referenced this pull request Aug 14, 2025

chore: boxo master with final boxo#994

1a1f5c9

this includes race-condition fixes from ipfs/boxo#994 and increased `DefaultMaxConcurrentRequests = 4096`

lidel mentioned this pull request Jun 29, 2025

Update error pages on the gateways to surface debugging information — IPFS/2025 ipshipyard/roadmaps#14

Open

BrewTestBot mentioned this pull request Aug 27, 2025

ipfs 0.37.0 Homebrew/homebrew-core#235214

Merged
