
dapplion (Collaborator) commented:

Issue Addressed

Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.

Issues with current unstable:

  • Peer prioritization counts batch requests per peer. This accounting is now broken: data columns by range requests are not counted
  • Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers
  • The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common, and we want to be flexible and handle it more gracefully than hard-failing everything

Proposed Changes

  • Upstream peer prioritization to the network context, which knows exactly how many active requests each peer has (including columns by range)
  • Upstream peer selection to the network context; `block_components_by_range_request` now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer` (see the sketch after this list)
  • Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly:
    • Range sync: leaves the batch in `AwaitingDownload` state and does nothing. TODO: we should have some mechanism to fail the chain if it's stale for too long
    • Backfill sync: pauses the sync until another peer joins
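
To make the upstreamed peer selection concrete, here is a minimal sketch of how the network context could pick a peer from a chain's peer set and surface `RpcRequestSendError::NoPeer`. Everything except that error name is a simplified stand-in (integer `PeerId`, a bare `active_requests` map, a `select_peer` helper); the real Lighthouse types differ.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in; Lighthouse uses libp2p's `PeerId`.
type PeerId = u64;

/// Returned when no peer in the supplied set can serve the request.
#[derive(Debug)]
enum RpcRequestSendError {
    NoPeer,
}

/// Minimal stand-in for the network context: it sees every active request
/// per peer, including data columns by range, so the accounting stays correct.
struct SyncNetworkContext {
    active_requests: HashMap<PeerId, usize>,
}

impl SyncNetworkContext {
    /// Select the least-loaded peer from the syncing chain's peer set
    /// (not from the global pool). An empty set yields `NoPeer`, which
    /// range sync handles by leaving the batch in `AwaitingDownload`.
    fn select_peer(&self, peers: &HashSet<PeerId>) -> Result<PeerId, RpcRequestSendError> {
        peers
            .iter()
            .min_by_key(|peer| self.active_requests.get(*peer).copied().unwrap_or(0))
            .copied()
            .ok_or(RpcRequestSendError::NoPeer)
    }
}

fn main() {
    let ctx = SyncNetworkContext { active_requests: HashMap::new() };
    // With no peers, the caller gets a recoverable error instead of a hard failure.
    assert!(matches!(
        ctx.select_peer(&HashSet::new()),
        Err(RpcRequestSendError::NoPeer)
    ));
}
```

Picking the peer with the fewest in-flight requests is just one plausible prioritization; the key point is that the selection lives where all request types are visible.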

TODOs

  • Re-add a mechanism to de-prioritize bad peers. Before, we tracked peers that failed a specific batch and de-prioritized those. This is a bit trickier now because data columns by range requests deal with peer groups. Instead we could:
    • Count failures per peer in the network context and use those (see the sketch after this list). I really wonder about the value of tracking failures on a specific batch vs all batches
    • Use the existing app peer score, which is a proxy for the count of failures
  • Add tests :)
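
For the first option, a rough sketch of per-peer failure counting in the network context follows. The names here (`PeerFailures`, `register_failure`, `prioritize`) are hypothetical, not existing Lighthouse APIs; this only illustrates counting failures across all batches and sorting candidates by that count.

```rust
use std::collections::HashMap;

/// Simplified stand-in; Lighthouse uses libp2p's `PeerId`.
type PeerId = u64;

/// Hypothetical tracker living in the network context: one failure counter
/// per peer across all batches, instead of a failure set per batch.
#[derive(Default)]
struct PeerFailures {
    failures: HashMap<PeerId, u32>,
}

impl PeerFailures {
    /// Record that `peer` failed to download or process a batch.
    fn register_failure(&mut self, peer: PeerId) {
        *self.failures.entry(peer).or_insert(0) += 1;
    }

    /// Order candidates so the least-failing peers are tried first:
    /// bad peers are de-prioritized but never excluded outright.
    fn prioritize(&self, mut peers: Vec<PeerId>) -> Vec<PeerId> {
        peers.sort_by_key(|peer| self.failures.get(peer).copied().unwrap_or(0));
        peers
    }
}

fn main() {
    let mut tracker = PeerFailures::default();
    tracker.register_failure(1);
    tracker.register_failure(1);
    tracker.register_failure(2);
    // Peer 3 has never failed, so it is tried first; peer 1 last.
    assert_eq!(tracker.prioritize(vec![1, 2, 3]), vec![3, 2, 1]);
}
```

The app-peer-score option would instead sort by the existing score, reusing state the node already maintains.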

Note: this touches the mainnet path!

dapplion added the work-in-progress and das labels on Jan 25, 2025
dapplion requested a review from jxs as a code owner on January 25, 2025
AgeManning (Member) left a comment:

I didn't review this, but just wanted to throw in my 2 cents: I think bringing the logic up into the network context is a good idea.

There are still some TODOs in here; I guess these can be addressed later.

```rust
}
// TODO(das): previously here we de-prioritized peers that had failed to download or
// process a batch
self.send_batch(network, batch_id)
```

It looks like this function does nothing now and can just be replaced with send_batch? (Unless implementing the TODO)

mergify bot commented on Feb 3, 2025:

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

jimmygchen added the syncing and waiting-on-author labels on Feb 3, 2025
dapplion (Collaborator, Author) commented on Feb 4, 2025

mergify bot deleted the branch sigp:sync-active-request-byrange on February 5, 2025.
mergify bot closed this on Feb 5, 2025.
jimmygchen (Member) commented:

@dapplion this PR got closed due to the base branch being deleted, would you mind re-opening it against unstable?

mergify bot pushed a commit that referenced this pull request on May 7, 2025:
- Re-opens #6864 targeting unstable

Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.

Issues with current unstable:

- Peer prioritization counts batch requests per peer. This accounting is now broken: data columns by range requests are not counted
- Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers
- The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common, and we want to be flexible and handle it more gracefully than hard-failing everything


### Proposed Changes

- [x] Upstream peer prioritization to the network context, which knows exactly how many active requests each peer has (including columns by range)
- [x] Upstream peer selection to the network context; `block_components_by_range_request` now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`
- [ ] Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly
  - [ ] Range sync: leaves the batch in `AwaitingDownload` state and does nothing. **TODO**: we should have some mechanism to fail the chain if it's stale for too long. **EDIT**: Not done in this PR
  - [x] Backfill sync: pauses the sync until another peer joins. **EDIT**: Same logic as unstable

### TODOs

- [ ] Add tests :)
- [x] Manually test backfill sync

Note: this touches the mainnet path!