feat(gateway): concurrency and timeout limits #994
Conversation
allow passing a custom registry instead of the global one; useful for testing and for deployments with multiple gateway instances
I will fix racy metrics tests and lint after some sleep, but dropping some notes for early review
gateway/gateway.go (Outdated)

// Default values for gateway configuration limits
const (
	// DefaultRetrievalTimeout is the default maximum duration for initial content retrieval
	// (time to first byte) and subsequent writes to the HTTP response body.
	DefaultRetrievalTimeout = 30 * time.Second

	// DefaultMaxConcurrentRequests is the default maximum number of concurrent HTTP requests
	// that the gateway will process.
	DefaultMaxConcurrentRequests = 1024
ℹ️ Do these make sense for defaults? My rationale:
- 30s feels solid (that is what we've been using at ipfs.io under Nginx's proxy_read_timeout)
- 1024 is 2x the default from Nginx (worker_connections 512)

Of course users can adjust/disable; this is about a "sane default" aimed at desktop users and YOLO deployments.
🤷 🚀
Extra notes on MaxConcurrentRequests and Nginx
While testing the fix on staging I had to adjust this default because our boxes can handle higher load:
nginx configuration on staging box:
- worker_processes: auto (8 workers based on 8 CPU cores)
- worker_connections: 1024 per worker
- Total nginx capacity: 8 workers x 1024 = 8192 total connections

For reverse proxying, each proxied request uses 2 connections in nginx:
- Client → nginx (port 443)
- nginx → kubo (port 8080)

So IIUC nginx on staging box 02 can proxy ~8192/2 = 4096 concurrent requests.
I've increased it with ipfs config Gateway.MaxConcurrentRequests --json 4096 and the success rate issue went away: the HTTP 200 success rate is on par with control box 01, which runs the previous release (0.36), but the impact on CPU/memory is much lower.
I think 1024 may still be a sensible default (most deployments run on weaker hardware and do not back the share of traffic that ipfs.io does), so 1024 is actually reasonable for:
- stock nginx (256 proxy capacity)
- lightly tuned nginx (1-2k proxy capacity)
- single nginx worker setups
I'll just make sure it is properly documented. Update: b42675e
After some thinking, adjusted: chore: DefaultMaxConcurrentRequests = 4096
Rationale in e170d19: adjusting to a higher value to avoid headaches in production environments and users complaining about HTTP 429s. This should act as a failsafe of last resort for now; we can adjust later, but 4k makes the rollout easier and less disruptive.
now they always init per handler
Codecov Report ❌ Patch coverage is

@@ Coverage Diff @@
##             main     #994      +/-   ##
==========================================
+ Coverage   61.39%   61.50%   +0.10%
==========================================
  Files         254      257       +3
  Lines       31731    32161     +430
==========================================
+ Hits        19481    19780     +299
- Misses      10644    10765     +121
- Partials     1606     1616      +10

... and 6 files with indirect coverage changes
Force-pushed 731675e to e74051a: go fmt and gci were flip-flopping due to a comment on a newline and conflicting rules; this removes the problem.
Force-pushed e74051a to 209da44: aims to fix the problem found in rainbow: https://github.com/ipfs/rainbow/actions/runs/16890332583/job/47848231793?pr=285#step:13:917
This aims to fix the failure from: https://github.com/ipfs/rainbow/actions/runs/16892227598/job/47854552215?pr=285#step:13:753

Range request regression fix: modified BlocksBackend.Get() in backend_blocks.go to optimize range requests by:
- detecting when a range request is made
- only fetching the root block first to check if it's a UnixFS file
- using lazy loading to fetch only the blocks needed for the requested range

This avoids the timeout issue when missing blocks exist outside the requested range.
removed:
- unnecessary for loop in timeout goroutine (both select cases return)
- unused done and timeoutSignal fields from the timeoutWriter struct
Force-pushed 073955a to d2d586c.

Fix from d2d586c seems to work on staging (02). No issues in the past 30m. Will let it run for an hour or two just to be sure.
replace dynamic error messages with static ones, to avoid logging user-controlled Accept headers, query parameters, and paths
Force-pushed 35bc575 to d0e215c.
Force-pushed d0e215c to 2a327db.

I've rebased instead of merging by mistake, then restored to the original (reviewed) version. Apologies for the noise. 🙈
All green, merging.
this includes race-condition fixes from ipfs/boxo#994 and increased DefaultMaxConcurrentRequests = 4096
* feat(gateway): concurrency and timeout limits. Depends on ipfs/boxo#994
* chore: boxo master with final boxo#994. This includes race-condition fixes from ipfs/boxo#994 and increased `DefaultMaxConcurrentRequests = 4096`
* docs: concise config.md and changelog
This PR adds built-in gateway limiter middleware, configuration, and tests:
- MaxConcurrentRequests: over-limit requests get an HTTP 429 Too Many Requests response
- RetrievalTimeout
- Config.MetricsRegistry

TODO:
- initializeMiddlewareMetrics
- Gateway.RetrievalTimeout|MaxConcurrentRequests docs in kubo#10905
- only-if-cached (e109608)