Skip to content

Conversation

lidel
Copy link
Member

@lidel lidel commented Aug 27, 2025

This PR adds diagnostic tracking for IPFS content retrieval to provide meaningful error context when timeouts occur. This helps users understand exactly where and why retrieval failed.

We can refine presentation in future PRs.

TLDR

  • New boxo/retrieval package with RetrievalState for tracking retrieval phases
  • Tracks provider discovery stats (found/attempted/connected)
  • Records failed provider peer IDs for debugging
  • Monotonic phase progression through retrieval stages
  • Thread-safe concurrent access with atomic operations

Integration points

The high level flow is:

  1. Gateway handler creates context with RetrievalState: retrieval.ContextWithState(r.Context())
  2. Gateway wraps context for request with session: blockservice.ContextWithSession(ctx, bb.blockService)
  3. When blocks are fetched, blockservice creates exchange session with this context: sesEx.NewSession(s.sesctx)
  4. Bitswap session manager creates session with this context
  5. When the bitswap session calls FindProvidersAsync, it passes this context
  6. ProviderQueryManager receives the context and can access the RetrievalState: retrieval.StateFromContext(ctx)
  7. The ProvidersFound.Add(1) call in providerquerymanager updates the RetrievalState that will be included in the 504 timeout error message etc
  8. Retrieval continues (or timeouts)
  9. The RetrievalState.Summary() provides human-readable diagnostics like seen in "Demo" section below

In practice, I had to make sure we manually pass/preserve state in two places, where context.Background() is used instead of r.Context() of the HTTP request:

  1. In routing/providerquerymanager/providerquerymanager.go: when creating a new context for provider queries, now copies the RetrievalState from the original request context
  2. In bitswap/client/client.go: when creating a session context in GetBlocks(), now preserves the RetrievalState from the request context

Demo

Basic test where there are no providers for a CID:

Screen Shot 2025-08-27 at 04 47 27

Extra test where there was one provider in DHT, but I've shut it down, so Gateway found it, but failed to connect. We show up to 3 sample peers that failed:

Screen Shot 2025-08-27 at 05 00 53
Before After
image image

lidel added 2 commits August 27, 2025 04:44
Add diagnostic tracking for IPFS content retrieval to provide meaningful
error context when timeouts occur. This helps users understand exactly
where and why retrieval failed.

Key additions:
- New boxo/retrieval package with RetrievalState for tracking retrieval phases
- Tracks provider discovery stats (found/attempted/connected)
- Records failed provider peer IDs for debugging
- Monotonic phase progression through retrieval stages
- Thread-safe concurrent access with atomic operations

Integration points:
- Gateway middleware adds RetrievalState to request context
- Provider query manager tracks provider discovery phase and stats
- Bitswap client preserves state through session creation
- Gateway backends set appropriate phases during path resolution

The RetrievalState.Summary() provides human-readable diagnostics like:
- "No providers found for the CID (phase: provider discovery)"
- "Found 5 provider(s), connected to 2, but they did not return content"
- "Timeout occurred after finding 3 provider(s) and connecting to 1"

This significantly improves debugging of timeout errors by showing users
whether the issue was finding providers, connecting to them, or retrieving
blocks from connected peers.
Copy link

codecov bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 74.57627% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.58%. Comparing base (30868de) to head (a7bc35d).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...uting/providerquerymanager/providerquerymanager.go 19.04% 12 Missing and 5 partials ⚠️
gateway/backend_blocks.go 23.07% 9 Missing and 1 partial ⚠️
retrieval/state.go 92.10% 7 Missing and 2 partials ⚠️
gateway/backend_car.go 25.00% 4 Missing and 2 partials ⚠️
bitswap/client/client.go 40.00% 2 Missing and 1 partial ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1015      +/-   ##
==========================================
+ Coverage   60.47%   60.58%   +0.11%     
==========================================
  Files         267      268       +1     
  Lines       33276    33409     +133     
==========================================
+ Hits        20124    20242     +118     
- Misses      11485    11494       +9     
- Partials     1667     1673       +6     
Files with missing lines Coverage Δ
gateway/middleware_retrieval_timeout.go 78.44% <100.00%> (+1.97%) ⬆️
bitswap/client/client.go 78.27% <40.00%> (+0.87%) ⬆️
gateway/backend_car.go 48.64% <25.00%> (-0.22%) ⬇️
retrieval/state.go 92.10% <92.10%> (ø)
gateway/backend_blocks.go 40.24% <23.07%> (-1.84%) ⬇️
...uting/providerquerymanager/providerquerymanager.go 82.97% <19.04%> (-4.21%) ⬇️

... and 16 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Include failed peer IDs when connection attempts fail, not just when
providers don't return content. This provides more debugging info for
understanding which specific peers couldn't be reached.
@lidel lidel mentioned this pull request Aug 27, 2025
27 tasks
@lidel lidel requested a review from gammazero August 27, 2025 03:14
@lidel lidel marked this pull request as ready for review August 27, 2025 03:18
@lidel lidel requested a review from a team as a code owner August 27, 2025 03:18
Copy link
Contributor

@gammazero gammazero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a couple minor comments.

Copy link
Contributor

@hsanjuan hsanjuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok code-wise...

I have some questions on handling of failures... for example, the resolution phase can trigger block fetches to be able to resolve a path, but I'm guessing a failure to do so will not provide information about failed peers etc ?

Also, PhaseDataRetrieval is only set once some bytes have started flowing. I don't think we can show a custom error page with info from that moment, so this phase would not be seen by the user right?

There is a rather large gap between "Connected" and "DataRetrieval". I.e. the connected peer may fail to send the block, or the HTTP request may 404 even if if the provider record pointed to someone. That makes me wonder if there should be another step along the lines of DataRequested, but I'm guessing that pass the StateTracking down to that level as CIDs from many sessions are mashed up into wantlists.

- remove pre-allocation of failedProviders slice
- use slices.Clone instead of make+copy pattern
wrap block fetching and path resolution errors with retrieval state
to provide detailed diagnostics about failed retrievals. gateway users
now get insights into provider discovery, connection attempts, and
which peers failed. this same pattern can be extended to cli tools
for better timeout debugging across ipfs.
ensure GetBlock method in BlocksBackend also propagates retrieval
diagnostics when block fetching fails, providing consistent error
context across all gateway block operations
@lidel
Copy link
Member Author

lidel commented Sep 1, 2025

@gammazero addressed feedback, thanks!

@hsanjuan good catch: the resolution phase can trigger block fetches, and the code was already correctly tracking it, we just did not forward the state from context to the final error message. Fixed in ab647d5 by opportunistically wrapping errors with extra diagnostics (if present in context), and then fixed it further in #1023 by tracking found peers separately from ones that explicitly failed, and use them when timeout hits without ANY peer being reachable.

Also, PhaseDataRetrieval is only set once some bytes have started flowing.

correct, the last phase is there mostly for completeness

This PR is just beginning:

  • In the future we may use it. not sure for what, maybe to provide more meaningful truncation error when we timeout in the middle of a big CAR stream (?format=car) when some of internal blocks can't be retrieved and HTTP 200 response stream dies.
  • Future PRs will narrow the gap between "Connected" and "DataRetrieval"
  • There will also be better UX with buttons to "Retry" (force page reload) and "Inspect CID at check.ipfs.network". I have local working code, but will submit it after this one lands, to make reviews easier.

@lidel
Copy link
Member Author

lidel commented Sep 1, 2025

Just to be sure this did not break anything, created PR in Kubo with this PR:

Merging.

Update: continued in:

use errors.As() for type extraction, errors.Is() for wrapped errors
@lidel lidel merged commit 8b6b1d2 into main Sep 1, 2025
17 checks passed
@lidel lidel deleted the feat/retrieval-timeout-diagnostics branch September 1, 2025 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update error pages on the gateways to surface debugging information — IPFS/2025
3 participants