Skip to content

Conversation

lidel
Copy link
Member

@lidel lidel commented Aug 11, 2025

This PR adds built-in gateway limiter middleware, configuration, and tests:

  • adds MaxConcurrentRequests
  • adds RetrievalTimeout
    • TLDR enforces a maximum duration for content retrieval:
      • Time to first byte: If the gateway cannot start writing the response within this duration (e.g., stuck searching for providers), a 504 Gateway Timeout is returned.
      • Time between non-empty writes: After the first byte, the timeout resets each time new bytes are written to the client. If the gateway cannot write additional data within this duration after the last successful write, the response is terminated.
    • closes gateway: ability to set response write timeout #679
    • also helps with A timeout is required when fetching blocks #908 (if gateway is used as the high level interface for retrieval)
  • adds optional Config.MetricsRegistry
    • TLDR allows removing dependency on global registry, allowing for muyltiple instances (e.g. in parallel tests)

TODO

References

adds MaxConcurrentRequests
closes #881

adds RetrievalTimeout
Closes #679
@lidel lidel changed the title feat(gw): concurrency and timeout limits feat(gateway): concurrency and timeout limits Aug 11, 2025
lidel added 3 commits August 11, 2025 03:09
allow passing custom registry instead of the global one
useful for testing and deployments with multiple gateway instances
@lidel lidel mentioned this pull request Aug 9, 2025
51 tasks
@lidel lidel self-assigned this Aug 11, 2025
Copy link
Member Author

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix racy metrics tests and lint after some sleep, but dropping some notes for early review

Comment on lines 20 to 43
// Default values for gateway configuration limits
const (
// DefaultRetrievalTimeout is the default maximum duration for initial content retrieval
// (time to first byte) and subsequent writes to the HTTP response body.
DefaultRetrievalTimeout = 30 * time.Second

// DefaultMaxConcurrentRequests is the default maximum number of concurrent HTTP requests
// that the gateway will process.
DefaultMaxConcurrentRequests = 1024
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ Do these make sense for defaults? My rationale:

Of course users can adjust/disable, this default is about "sane default" aiming at desktop users, and YOLO deployments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 🚀

Copy link
Member Author

@lidel lidel Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra notes on MaxConcurrentRequests and Nginx

While testing fix on staging I had to adjust this default because our boxes can handle higher load:

nginx Configuration on staging box:

  • worker_processes: auto (8 workers based on 8 CPU cores)
  • worker_connections: 1024 per worker
  • Total nginx capacity: 8 workers x 1024 = 8192 total connections

For Reverse Proxy:
Each proxied request uses 2 connections in nginx:

  1. Client → nginx (port 443)
  2. nginx → kubo (port 8080)

So iiuc nginx on staging box 02 can proxy ~ 8192/2 = 4096 concurrent requests

I've increased ipfs config Gateway.MaxConcurrentRequests --json 4096 and success rate issue went away – HTTP 200s success rate is on par with control box 01 that runs previous release (0.36), but impact on CPU/memory is much lower:

I think 1024 may be a sensible default (most of deployments run on weaker hardware and do not back % of traffic of ipfs.io) so 1024 is actually reasonable for:

  • stock nginx (256 proxy capacity)
  • lightly tuned nginx (1-2k proxy capacity)
  • single nginx worker setups

I'll just make sure it is properly documented. Update: b42675e

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some thinking adjusted chore: DefaultMaxConcurrentRequests = 4096

Rationale in e170d19: adjusting to higher value to avoid headache in production environments and users complaining for HTTP 429s. this should act as failsafe of last resort for now, we can adjust later but for now 4k makes rollout easier and less disruptive

now they always init per handler
Copy link

codecov bot commented Aug 11, 2025

Codecov Report

❌ Patch coverage is 69.97930% with 145 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.50%. Comparing base (44a4890) to head (9aff4dd).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
gateway/backend_blocks.go 5.88% 63 Missing and 1 partial ⚠️
gateway/middleware_retrieval_timeout.go 76.47% 31 Missing and 5 partials ⚠️
gateway/handler.go 70.00% 11 Missing and 4 partials ⚠️
gateway/middleware_metrics.go 88.59% 11 Missing and 2 partials ⚠️
gateway/handler_codec.go 0.00% 5 Missing ⚠️
gateway/errors.go 90.32% 2 Missing and 1 partial ⚠️
gateway/handler_ipns_record.go 0.00% 3 Missing ⚠️
gateway/handler_unixfs_dir.go 0.00% 3 Missing ⚠️
gateway/metrics.go 85.00% 0 Missing and 3 partials ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #994      +/-   ##
==========================================
+ Coverage   61.39%   61.50%   +0.10%     
==========================================
  Files         254      257       +3     
  Lines       31731    32161     +430     
==========================================
+ Hits        19481    19780     +299     
- Misses      10644    10765     +121     
- Partials     1606     1616      +10     
Files with missing lines Coverage Δ
gateway/gateway.go 83.54% <100.00%> (ø)
gateway/handler_car.go 79.79% <100.00%> (-0.11%) ⬇️
gateway/handler_tar.go 84.12% <100.00%> (ø)
gateway/middleware_ratelimit.go 100.00% <100.00%> (ø)
gateway/errors.go 86.01% <90.32%> (+1.39%) ⬆️
gateway/handler_ipns_record.go 17.91% <0.00%> (-0.84%) ⬇️
gateway/handler_unixfs_dir.go 64.94% <0.00%> (ø)
gateway/metrics.go 80.15% <85.00%> (-2.71%) ⬇️
gateway/handler_codec.go 61.68% <0.00%> (-0.59%) ⬇️
gateway/middleware_metrics.go 88.59% <88.59%> (ø)
... and 3 more

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lidel lidel force-pushed the feat-additional-gateway-limits branch from 731675e to e74051a Compare August 11, 2025 19:13
go fmt and gci were flip-flopping due to comment in newline and
conflicting rules, this removes the problem
@lidel lidel force-pushed the feat-additional-gateway-limits branch from e74051a to 209da44 Compare August 11, 2025 19:27
lidel added a commit to ipfs/rainbow that referenced this pull request Aug 11, 2025
@lidel lidel marked this pull request as ready for review August 11, 2025 19:43
@lidel lidel requested a review from a team as a code owner August 11, 2025 19:43
lidel added 2 commits August 11, 2025 23:06
This aims to fix failure from:
https://github.com/ipfs/rainbow/actions/runs/16892227598/job/47854552215?pr=285#step:13:753

Range request regression fix: Modified BlocksBackend.Get()
in backend_blocks.go to optimize range requests by:
- Detecting when a range request is made
- Only fetching the root block first to check if it's a UnixFS file
- Using lazy loading to fetch only the blocks needed for the requested range
- This avoids the timeout issue when missing blocks exist outside the requested range
@gammazero gammazero self-requested a review August 12, 2025 14:49
Comment on lines 20 to 43
// Default values for gateway configuration limits
const (
// DefaultRetrievalTimeout is the default maximum duration for initial content retrieval
// (time to first byte) and subsequent writes to the HTTP response body.
DefaultRetrievalTimeout = 30 * time.Second

// DefaultMaxConcurrentRequests is the default maximum number of concurrent HTTP requests
// that the gateway will process.
DefaultMaxConcurrentRequests = 1024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 🚀

lidel added 2 commits August 13, 2025 22:20
removed:
- unnecessary for loop in timeout goroutine (both select cases return)
- unused done and timeoutSignal fields from timeoutWriter struct
@lidel lidel force-pushed the feat-additional-gateway-limits branch from 073955a to d2d586c Compare August 14, 2025 16:18
@lidel
Copy link
Member Author

lidel commented Aug 14, 2025

Fix from d2d586c seems to work on staging (02). No issues in the past 30m. Will let it run for an hour or two just to be sure.

replace dynamic error messages with static ones to avoid logging
user-controlled Accept headers, query parameters, and paths.
lidel added a commit that referenced this pull request Aug 14, 2025
lidel added a commit that referenced this pull request Aug 14, 2025
lidel added a commit that referenced this pull request Aug 14, 2025
lidel added a commit that referenced this pull request Aug 14, 2025
@lidel lidel force-pushed the feat-additional-gateway-limits branch from 35bc575 to d0e215c Compare August 14, 2025 19:53
@lidel lidel force-pushed the feat-additional-gateway-limits branch from d0e215c to 2a327db Compare August 14, 2025 20:11
@lidel
Copy link
Member Author

lidel commented Aug 14, 2025

I've rebased instead of merge by mistake, then restored to original (reviewed) version, apologies for noise. 🙈

lidel added 3 commits August 14, 2025 22:20
adjusting to higher value to avoid headache in production environments
and users complaining for HTTP 429s. this should  act as failsafe of
last resort for now, we can adjust later but for now 4k makes rollout
easier and less disruptive
@lidel lidel removed the status/blocked Unable to be worked further until needs are met label Aug 14, 2025
@lidel
Copy link
Member Author

lidel commented Aug 14, 2025

All green, merging.
(Added docs + raised DefaultMaxConcurrentRequests = 4096 to make rollout easier)

@lidel lidel merged commit 54b62d4 into main Aug 14, 2025
18 checks passed
@lidel lidel deleted the feat-additional-gateway-limits branch August 14, 2025 21:08
lidel added a commit to ipfs/rainbow that referenced this pull request Aug 14, 2025
this includes race-condition fixes from
ipfs/boxo#994
and increased
DefaultMaxConcurrentRequests = 4096
lidel added a commit to ipfs/rainbow that referenced this pull request Aug 14, 2025
this includes race-condition fixes from
ipfs/boxo#994
and increased
DefaultMaxConcurrentRequests = 4096
lidel added a commit to ipfs/kubo that referenced this pull request Aug 14, 2025
this includes race-condition fixes from ipfs/boxo#994
and increased `DefaultMaxConcurrentRequests = 4096`
lidel added a commit to ipfs/kubo that referenced this pull request Aug 15, 2025
* feat(gateway): concurrency and timeout limits

Depends on ipfs/boxo#994

* chore: boxo master with final boxo#994

this includes race-condition fixes from ipfs/boxo#994
and increased `DefaultMaxConcurrentRequests = 4096`

* docs: concise config.md and changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gateway: limit the number of requests in flight gateway: ability to set response write timeout
2 participants