
Conversation

dapplion
Collaborator

Issue Addressed

Closes #6895

We need sync to retry custody requests when a peer CGC updates. A higher CGC can result in a data column subnet peer count increasing from 0 to 1, allowing requests to happen.

Proposed Changes

Add a new sync event, SyncMessage::UpdatedPeerCgc. It's sent by the router when a metadata response updates a peer's known CGC.
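A minimal sketch of that idea, with simplified stand-in types (PeerId and the CGC-tracking map here are illustrative, not Lighthouse's actual signatures):

```rust
use std::collections::HashMap;

// Simplified stand-in for libp2p's PeerId, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct PeerId(pub u64);

#[derive(Debug, PartialEq)]
pub enum SyncMessage {
    /// A peer's custody group count (CGC) changed: sync should retry
    /// custody requests that were stalled waiting for column peers.
    UpdatedPeerCgc(PeerId),
}

/// Router-side handling of a metadata response: emit the sync event
/// only when the reported CGC differs from what we already knew.
pub fn on_metadata_response(
    known_cgc: &mut HashMap<PeerId, u64>,
    peer: PeerId,
    reported_cgc: u64,
) -> Option<SyncMessage> {
    match known_cgc.insert(peer, reported_cgc) {
        Some(prev) if prev == reported_cgc => None, // unchanged: no event
        _ => Some(SyncMessage::UpdatedPeerCgc(peer)),
    }
}
```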

@dapplion dapplion requested a review from jxs as a code owner February 11, 2025 02:48
@dapplion dapplion requested a review from jimmygchen February 11, 2025 02:48
@dapplion dapplion added ready-for-review The code is ready for review das Data Availability Sampling labels Feb 11, 2025
Member

@jimmygchen jimmygchen left a comment

Overall looks good to me! I think we need to attempt to make progress on range sync as well?

@@ -483,6 +486,13 @@ impl<T: BeaconChainTypes> SyncManager<T> {
}
}

fn updated_peer_cgc(&mut self, _peer_id: PeerId) {
Member

I think we want to resume by range request as well?

Collaborator Author

Fixed, I resume range sync as well.
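For illustration, the shape of the fixed handler might look like the toy sketch below (the helper names and flags are hypothetical; the real SyncManager re-drives its lookup and range-sync components):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PeerId(pub u64);

/// Toy stand-in for SyncManager, tracking only whether each sync
/// component was prodded to make progress.
#[derive(Default)]
pub struct SyncManager {
    pub lookups_resumed: bool,
    pub range_sync_resumed: bool,
}

impl SyncManager {
    /// A peer's CGC increased: a column subnet that had zero peers may
    /// now have one, so retry both block lookups and range sync.
    pub fn updated_peer_cgc(&mut self, _peer_id: PeerId) {
        self.retry_custody_lookups();
        self.resume_range_sync();
    }

    fn retry_custody_lookups(&mut self) {
        self.lookups_resumed = true; // real code would retry pending custody lookups
    }

    fn resume_range_sync(&mut self) {
        self.range_sync_resumed = true; // real code would resume stalled chains
    }
}
```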

Member

Nice, I'll run some tests today to confirm this fixes the issue.

Member

Doesn't look like this fixes the issue. I still don't see range requests until a few minutes later, when we add a new finalized chain.

Right after startup, waiting for custody peers

Feb 12 04:33:07.181 DEBG Waiting for peers to be available on sampling column subnets, chain: 1, service: range_sync, service: sync, module: network::sync::range_sync::chain:1057

Got peer metadata response after 15s

Feb 12 04:33:22.271 DEBG Obtained peer's metadata, new_seq_no: 6, peer_id: 16Uiu2HAmJyaVGkRGR9ACqompkoge8T2x4KFH4KbzDkh7zz6uN2JX, service: libp2p, module: lighthouse_network::peer_manager:732

No range requests until ~5 mins later

Feb 12 04:38:07.303 DEBG Finalization sync peer joined, peer_id: 16Uiu2HAmJyaVGkRGR9ACqompkoge8T2x4KFH4KbzDkh7zz6uN2JX, service: range_sync, service: sync, module: network::sync::range_sync::range:143
Feb 12 04:38:07.305 DEBG New chain added to sync, id: 2, from: 38, to: 1071, end_root: 0x3be00d7ce6e52f7938fd588d909055f72469d0f09ce545d7b23077f2d6b40e8a, current_target: 38, batches: 0, peers: 1, state: Stopped, sync_type: Finalized, peer_id: 16Uiu2HAmJyaVGkRGR9ACqompkoge8T2x4KFH4KbzDkh7zz6uN2JX, service: range_sync, service: sync, module: network::sync::range_sync::chain_collection:506
Feb 12 04:38:07.306 DEBG Sync RPC request sent, id: 4/3/RangeSync/39/1, peer: 16Uiu2HAmAAZ5wP6fvpe1b9tWNmgA2Wn8MsrNYVhkw5WohcAoaHKR, epoch: 39, slots: 32, method: BlocksByRange, service: sync, module: network::sync::network_context:788
Feb 12 04:38:07.307 DEBG Sync RPC request sent, id: 5/3/RangeSync/39/1, peer: 16Uiu2HAm9PijSZpm5QUphXRoBtkhUZPkGJ4Rgxk4Bny91oZPYZLG, columns: [74, 30, 39, 19, 63, 41, 52, 47, 58], epoch: 39, slots: 32, method: DataColumnsByRange, service: sync, module: network::sync::network_context:870

Collaborator Author

Did you see this log? "Received updated peer CGC message"

Member

I don't recall seeing it. I think I was using the right locally-built image, but it could be worth re-testing to confirm if you have time.

Member

@dapplion could you retest this change?

Member

I've retested this, and it doesn't seem to trigger a retry after obtaining the peers' metadata; the "Received updated peer CGC message" log was not observed.

May 06 07:16:46.854 DEBUG Waiting for peers to be available on custody column subnets  chain: 1, service: "range_sync"
...
# obtained peers metadata
May 06 07:17:01.389 DEBUG Obtained peer's metadata                      peer_id: 16Uiu2HAkz7SLRbDFNscs6RFDrD37oB5yCR9UaBcVpAiy1G54BuQW, new_seq_no: 6, service: "network"
May 06 07:17:01.393 DEBUG Obtained peer's metadata                      peer_id: 16Uiu2HAmKjMmG6VJG2oi5Gd7x4iszsv4Fm9JZWeEad62kTVF9dke, new_seq_no: 132, service: "network"
...
# no range request until ~5 mins later when a new chain is added
May 06 07:21:46.397 DEBUG New chain added to sync                       peer_id: "16Uiu2HAkz7SLRbDFNscs6RFDrD37oB5yCR9UaBcVpAiy1G54BuQW", sync_type: Finalized, id: 5, start_epoch: 0, target_head_slot: 193, target_head_root: 0xc1802494a0935c8bee449a412f982cc1903ec791565d9cf9eaab233f2cd2bc90, component: "range_sync"
May 06 07:21:46.397 DEBUG Sync RPC request sent                         method: "DataColumnsByRange", slots: 32, epoch: 1, columns: [8, 69, 99, 90, 9, 84, 24, 71, 12, 121, 50, 68, 127, 66, 36, 26, 41, 107, 47, 52, 108, 98, 70, 100, 54, 35, 21, 60, 49, 120, 10, 72, 44, 93, 67, 0, 122, 19, 43, 62, 97, 115, 59, 95, 48, 11, 101, 116, 34, 20, 25, 18, 3, 124, 110, 57, 46, 83, 77, 105, 81, 106, 1, 45, 104, 22, 29, 53, 6, 32, 13, 119, 4, 78, 63, 86, 92, 7, 114, 73, 16, 17, 109, 28, 75, 102, 80, 40, 79, 51, 125, 38, 85, 89, 42, 39, 126, 113, 87, 96, 37, 88, 64, 112, 14, 118, 76, 117, 58, 82, 27, 94, 74, 23, 30, 111, 15, 2, 123, 33, 55, 5, 65, 91, 31, 56], peer: 16Uiu2HAmKjMmG6VJG2oi5Gd7x4iszsv4Fm9JZWeEad62kTVF9dke, id: 8/6/RangeSync/1/1, chain: 1, service: "range_sync"
May 06 07:21:46.397 DEBUG Sync RPC request sent                         method: "DataColumnsByRange", slots: 32, epoch: 1, columns: [61, 103], peer: 16Uiu2HAkz7SLRbDFNscs6RFDrD37oB5yCR9UaBcVpAiy1G54BuQW, id: 9/6/RangeSync/1/1, chain: 1, service: "range_sync"


@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits their implementation. peerdas-devnet-4 and removed ready-for-review The code is ready for review labels Feb 11, 2025
@dapplion dapplion added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Feb 11, 2025
@dapplion dapplion requested a review from jimmygchen February 11, 2025 17:00
@jimmygchen jimmygchen added under-review A reviewer has only partially completed a review. and removed ready-for-review The code is ready for review labels Feb 12, 2025

mergify bot commented Mar 10, 2025

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed under-review A reviewer has only partially completed a review. labels Apr 1, 2025
@jimmygchen jimmygchen self-assigned this May 6, 2025
@jimmygchen jimmygchen added the under-review A reviewer has only partially completed a review. label May 6, 2025
@jimmygchen jimmygchen removed the waiting-on-author The reviewer has suggested changes and awaits their implementation. label May 6, 2025
Member

@AgeManning AgeManning left a comment

From what I can see here, you only care if the custody columns get updated.

I think we should try to maintain all our internal consistencies, i.e. internal requests stay internal, otherwise it can get very confusing.

The solution you have here has redundant cases: if it's not updated, we send the event up to the router, which then throws it away.

To keep everything consistent, I think we should do the following:

  1. Add a new event to NetworkEvent, i.e.
    NetworkEvent::PeerUpdatedCustodyGroupCount(PeerId)
  2. In the handling of a metadata response, if the CGC is updated, return the above event
  3. Handle this event and send it to sync

This has the benefits of:

  1. Not sending redundant messages with redundant information (i.e. the original metadata)
  2. We maintain the logic that internal messages stay internal
  3. I think it makes the purpose of the message, and why it's there, a bit clearer.
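The three steps above can be sketched as follows. The types are simplified stand-ins (the real NetworkEvent and SyncMessage enums carry many more variants, and the handler signatures differ), but the flow is the one suggested: emit the event only on an actual change, then translate it into a sync message at the boundary.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PeerId(pub u64);

/// Step 1: the new network-level event (name taken from the suggestion above).
#[derive(Debug, PartialEq)]
pub enum NetworkEvent {
    PeerUpdatedCustodyGroupCount(PeerId),
}

#[derive(Debug, PartialEq)]
pub enum SyncMessage {
    UpdatedPeerCgc(PeerId),
}

/// Step 2: while handling a metadata response, emit the event only if
/// the CGC actually changed; otherwise stay silent.
pub fn handle_metadata_response(prev_cgc: u64, new_cgc: u64, peer: PeerId) -> Option<NetworkEvent> {
    (new_cgc != prev_cgc).then(|| NetworkEvent::PeerUpdatedCustodyGroupCount(peer))
}

/// Step 3: the router translates the network event into a sync message.
pub fn route(event: NetworkEvent) -> SyncMessage {
    match event {
        NetworkEvent::PeerUpdatedCustodyGroupCount(peer) => SyncMessage::UpdatedPeerCgc(peer),
    }
}
```

This keeps the redundant "no change" case out of the router entirely, which is the point of the suggestion.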

@jimmygchen jimmygchen requested a review from pawanjay176 May 8, 2025 07:06
@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed under-review A reviewer has only partially completed a review. peerdas-devnet-4 labels May 8, 2025
@jimmygchen
Member

Thanks @pawanjay176 @AgeManning! Yeah, makes sense. I've updated the code as per your suggestions. I think it's much cleaner.

Collaborator Author

@dapplion dapplion left a comment

✅ This approach looks much cleaner! Thanks @AgeManning

Member

@pawanjay176 pawanjay176 left a comment

Nice!

@pawanjay176 pawanjay176 added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels May 8, 2025
mergify bot added a commit that referenced this pull request May 8, 2025

mergify bot commented May 8, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

@jimmygchen
Member

@mergify requeue


mergify bot commented May 9, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

mergify bot added a commit that referenced this pull request May 9, 2025
@mergify mergify bot merged commit a497ec6 into sigp:unstable May 9, 2025
31 checks passed