
dapplion (Collaborator) commented:

Issue Addressed

Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.

Issues with current unstable:

  • Peer prioritization counts batch requests per peer. This accounting is now broken: data columns by range requests are not counted
  • Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers
  • The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common, and we want to be flexible and handle it more gracefully than hard-failing everything

Proposed Changes

  • Upstream peer prioritization to the network context, which knows exactly how many active requests each peer has (including columns by range)
  • Upstream peer selection to the network context; `block_components_by_range_request` now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer` (see the sketch after this list)
  • Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly:
    • Range sync: leaves the batch in `AwaitingDownload` state and does nothing. TODO: we should have some mechanism to fail the chain if it's stale for too long
    • Backfill sync: pauses the sync until another peer joins
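
To make the upstreamed peer selection concrete, here is a minimal sketch of how the network context could pick a peer from a chain's peer set and surface `RpcRequestSendError::NoPeer`. Everything except that error name is a simplified stand-in (integer `PeerId`, a bare `active_requests` map, a `select_peer` helper); the real Lighthouse types differ.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in; Lighthouse uses libp2p's `PeerId`.
type PeerId = u64;

/// Returned when no peer in the supplied set can serve the request.
#[derive(Debug)]
enum RpcRequestSendError {
    NoPeer,
}

/// Minimal stand-in for the network context: it sees every active request
/// per peer, including data columns by range, so the accounting stays correct.
struct SyncNetworkContext {
    active_requests: HashMap<PeerId, usize>,
}

impl SyncNetworkContext {
    /// Select the least-loaded peer from the syncing chain's peer set
    /// (not from the global pool). An empty set yields `NoPeer`, which
    /// range sync handles by leaving the batch in `AwaitingDownload`.
    fn select_peer(&self, peers: &HashSet<PeerId>) -> Result<PeerId, RpcRequestSendError> {
        peers
            .iter()
            .min_by_key(|peer| self.active_requests.get(*peer).copied().unwrap_or(0))
            .copied()
            .ok_or(RpcRequestSendError::NoPeer)
    }
}

fn main() {
    let ctx = SyncNetworkContext { active_requests: HashMap::new() };
    // With no peers, the caller gets a recoverable error instead of a hard failure.
    assert!(matches!(
        ctx.select_peer(&HashSet::new()),
        Err(RpcRequestSendError::NoPeer)
    ));
}
```

Picking the peer with the fewest in-flight requests is just one plausible prioritization; the key point is that the selection lives where all request types are visible.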

TODOs

  • Re-add a mechanism to de-prioritize bad peers. Before, we tracked peers that failed a specific batch and de-prioritized those. This is a bit trickier now because data columns by range requests deal with peer groups. Instead we could:
    • Count failures per peer in the network context and use those (see the sketch after this list). I really wonder about the value of tracking failures on a specific batch vs all batches
    • Use the existing app peer score, which is a proxy for the count of failures
  • Add tests :)
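
For the first option, a rough sketch of per-peer failure counting in the network context follows. The names here (`PeerFailures`, `register_failure`, `prioritize`) are hypothetical, not existing Lighthouse APIs; this only illustrates counting failures across all batches and sorting candidates by that count.

```rust
use std::collections::HashMap;

/// Simplified stand-in; Lighthouse uses libp2p's `PeerId`.
type PeerId = u64;

/// Hypothetical tracker living in the network context: one failure counter
/// per peer across all batches, instead of a failure set per batch.
#[derive(Default)]
struct PeerFailures {
    failures: HashMap<PeerId, u32>,
}

impl PeerFailures {
    /// Record that `peer` failed to download or process a batch.
    fn register_failure(&mut self, peer: PeerId) {
        *self.failures.entry(peer).or_insert(0) += 1;
    }

    /// Order candidates so the least-failing peers are tried first:
    /// bad peers are de-prioritized but never excluded outright.
    fn prioritize(&self, mut peers: Vec<PeerId>) -> Vec<PeerId> {
        peers.sort_by_key(|peer| self.failures.get(peer).copied().unwrap_or(0));
        peers
    }
}

fn main() {
    let mut tracker = PeerFailures::default();
    tracker.register_failure(1);
    tracker.register_failure(1);
    tracker.register_failure(2);
    // Peer 3 has never failed, so it is tried first; peer 1 last.
    assert_eq!(tracker.prioritize(vec![1, 2, 3]), vec![3, 2, 1]);
}
```

The app-peer-score option would instead sort by the existing score, reusing state the node already maintains.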

Note: this touches the mainnet path!

dapplion added the work-in-progress and das labels on Jan 25, 2025
dapplion requested a review from jxs as a code owner on January 25, 2025
AgeManning (Member) left a comment:

I didn't review this, but just wanted to throw in my 2 cents: I think bringing the logic up into the network context is a good idea.

There are still some TODOs in here; I guess these can be addressed later.

```rust
}
// TODO(das): previously here we de-prioritized peers that had failed to download or
// process a batch
self.send_batch(network, batch_id)
```

It looks like this function does nothing now and can just be replaced with send_batch? (Unless implementing the TODO)

mergify bot commented on Feb 3, 2025:

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

jimmygchen added the syncing and waiting-on-author labels on Feb 3, 2025
dapplion (Collaborator, Author) commented on Feb 4, 2025

mergify bot deleted the branch sigp:sync-active-request-byrange on February 5, 2025.
mergify bot closed this on Feb 5, 2025.
jimmygchen (Member) commented:

@dapplion this PR got closed due to the base branch being deleted, would you mind re-opening it against unstable?

mergify bot pushed a commit that referenced this pull request on May 7, 2025:
- Re-opens #6864 targeting unstable

Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.

Issues with current unstable:

- Peer prioritization counts batch requests per peer. This accounting is now broken: data columns by range requests are not counted
- Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers
- The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common, and we want to be flexible and handle it more gracefully than hard-failing everything


### Proposed Changes

- [x] Upstream peer prioritization to the network context, which knows exactly how many active requests each peer has (including columns by range)
- [x] Upstream peer selection to the network context; `block_components_by_range_request` now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`
- [ ] Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly
  - [ ] Range sync: leaves the batch in `AwaitingDownload` state and does nothing. **TODO**: we should have some mechanism to fail the chain if it's stale for too long. **EDIT**: Not done in this PR
  - [x] Backfill sync: pauses the sync until another peer joins. **EDIT**: Same logic as unstable

### TODOs

- [ ] Add tests :)
- [x] Manually test backfill sync

Note: this touches the mainnet path!