-
Notifications
You must be signed in to change notification settings - Fork 131
feat(gateway): add retrieval state tracking for timeout diagnostics #1015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add diagnostic tracking for IPFS content retrieval to provide meaningful error context when timeouts occur. This helps users understand exactly where and why retrieval failed. Key additions: - New boxo/retrieval package with RetrievalState for tracking retrieval phases - Tracks provider discovery stats (found/attempted/connected) - Records failed provider peer IDs for debugging - Monotonic phase progression through retrieval stages - Thread-safe concurrent access with atomic operations Integration points: - Gateway middleware adds RetrievalState to request context - Provider query manager tracks provider discovery phase and stats - Bitswap client preserves state through session creation - Gateway backends set appropriate phases during path resolution The RetrievalState.Summary() provides human-readable diagnostics like: - "No providers found for the CID (phase: provider discovery)" - "Found 5 provider(s), connected to 2, but they did not return content" - "Timeout occurred after finding 3 provider(s) and connecting to 1" This significantly improves debugging of timeout errors by showing users whether the issue was finding providers, connecting to them, or retrieving blocks from connected peers.
Codecov Report❌ Patch coverage is @@ Coverage Diff @@
## main #1015 +/- ##
==========================================
+ Coverage 60.47% 60.58% +0.11%
==========================================
Files 267 268 +1
Lines 33276 33409 +133
==========================================
+ Hits 20124 20242 +118
- Misses 11485 11494 +9
- Partials 1667 1673 +6
... and 16 files with indirect coverage changes 🚀 New features to boost your workflow:
|
Include failed peer IDs when connection attempts fail, not just when providers don't return content. This provides more debugging info for understanding which specific peers couldn't be reached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a couple minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok code-wise...
I have some questions on handling of failures... for example, the resolution phase can trigger block fetches to be able to resolve a path, but I'm guessing a failure to do so will not provide information about failed peers etc ?
Also, PhaseDataRetrieval is only set once some bytes have started flowing. I don't think we can show a custom error page with info from that moment, so this phase would not be seen by the user right?
There is a rather large gap between "Connected" and "DataRetrieval". I.e. the connected peer may fail to send the block, or the HTTP request may 404 even if if the provider record pointed to someone. That makes me wonder if there should be another step along the lines of DataRequested
, but I'm guessing that pass the StateTracking down to that level as CIDs from many sessions are mashed up into wantlists.
- remove pre-allocation of failedProviders slice - use slices.Clone instead of make+copy pattern
wrap block fetching and path resolution errors with retrieval state to provide detailed diagnostics about failed retrievals. gateway users now get insights into provider discovery, connection attempts, and which peers failed. this same pattern can be extended to cli tools for better timeout debugging across ipfs.
ensure GetBlock method in BlocksBackend also propagates retrieval diagnostics when block fetching fails, providing consistent error context across all gateway block operations
@gammazero addressed feedback, thanks! @hsanjuan good catch: the resolution phase can trigger block fetches, and the code was already correctly tracking it, we just did not forward the state from context to the final error message. Fixed in ab647d5 by opportunistically wrapping errors with extra diagnostics (if present in context), and then fixed it further in #1023 by tracking found peers separately from ones that explicitly failed, and use them when timeout hits without ANY peer being reachable.
correct, the last phase is there mostly for completeness This PR is just beginning:
|
Just to be sure this did not break anything, created PR in Kubo with this PR: Merging. Update: continued in: |
use errors.As() for type extraction, errors.Is() for wrapped errors
This PR adds diagnostic tracking for IPFS content retrieval to provide meaningful error context when timeouts occur. This helps users understand exactly where and why retrieval failed.
We can refine presentation in future PRs.
TLDR
RetrievalState
for tracking retrieval phasesIntegration points
The high level flow is:
RetrievalState
:retrieval.ContextWithState(r.Context())
blockservice.ContextWithSession(ctx, bb.blockService)
sesEx.NewSession(s.sesctx)
FindProvidersAsync
, it passes this contextProviderQueryManager
receives the context and can access theRetrievalState
:retrieval.StateFromContext(ctx)
ProvidersFound.Add(1)
call inproviderquerymanager
updates theRetrievalState
that will be included in the 504 timeout error message etcRetrievalState.Summary()
provides human-readable diagnostics like seen in "Demo" section belowIn practice, I had to make sure we manually pass/preserve state in two places, where
context.Background()
is used instead ofr.Context()
of the HTTP request:routing/providerquerymanager/providerquerymanager.go
: when creating a new context for provider queries, now copies the RetrievalState from the original request contextbitswap/client/client.go
: when creating a session context in GetBlocks(), now preserves the RetrievalState from the request contextDemo
Basic test where there are no providers for a CID:
Extra test where there was one provider in DHT, but I've shut it down, so Gateway found it, but failed to connect. We show up to 3 sample peers that failed: