
Conversation

dapplion
Collaborator

Issue Addressed

Closes #6895

We need sync to retry custody requests when a peer CGC updates. A higher CGC can result in a data column subnet peer count increasing from 0 to 1, allowing requests to happen.

Proposed Changes

Add a new sync event, SyncMessage::UpdatedPeerCgc. It's sent by the router when a metadata response updates a peer's known CGC.
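A minimal sketch of that idea, with simplified stand-in types (PeerId and the CGC-tracking map here are illustrative, not Lighthouse's actual signatures):

```rust
use std::collections::HashMap;

// Simplified stand-in for libp2p's PeerId, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct PeerId(pub u64);

#[derive(Debug, PartialEq)]
pub enum SyncMessage {
    /// A peer's custody group count (CGC) changed: sync should retry
    /// custody requests that were stalled waiting for column peers.
    UpdatedPeerCgc(PeerId),
}

/// Router-side handling of a metadata response: emit the sync event
/// only when the reported CGC differs from what we already knew.
pub fn on_metadata_response(
    known_cgc: &mut HashMap<PeerId, u64>,
    peer: PeerId,
    reported_cgc: u64,
) -> Option<SyncMessage> {
    match known_cgc.insert(peer, reported_cgc) {
        Some(prev) if prev == reported_cgc => None, // unchanged: no event
        _ => Some(SyncMessage::UpdatedPeerCgc(peer)),
    }
}
```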

@dapplion dapplion requested a review from jxs as a code owner February 11, 2025 02:48
@dapplion dapplion requested a review from jimmygchen February 11, 2025 02:48
@dapplion dapplion added ready-for-review The code is ready for review das Data Availability Sampling labels Feb 11, 2025
Member

@jimmygchen jimmygchen left a comment

Overall looks good to me! I think we need to attempt to make progress on range sync as well?

@@ -483,6 +486,13 @@ impl<T: BeaconChainTypes> SyncManager<T> {
}
}

fn updated_peer_cgc(&mut self, _peer_id: PeerId) {
Member

I think we want to resume by range request as well?

Collaborator Author

Fixed, I resume range sync as well.
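For illustration, the shape of the fixed handler might look like the toy sketch below (the helper names and flags are hypothetical; the real SyncManager re-drives its lookup and range-sync components):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PeerId(pub u64);

/// Toy stand-in for SyncManager, tracking only whether each sync
/// component was prodded to make progress.
#[derive(Default)]
pub struct SyncManager {
    pub lookups_resumed: bool,
    pub range_sync_resumed: bool,
}

impl SyncManager {
    /// A peer's CGC increased: a column subnet that had zero peers may
    /// now have one, so retry both block lookups and range sync.
    pub fn updated_peer_cgc(&mut self, _peer_id: PeerId) {
        self.retry_custody_lookups();
        self.resume_range_sync();
    }

    fn retry_custody_lookups(&mut self) {
        self.lookups_resumed = true; // real code would retry pending custody lookups
    }

    fn resume_range_sync(&mut self) {
        self.range_sync_resumed = true; // real code would resume stalled chains
    }
}
```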

Member

Nice, I'll run some tests today to confirm this fixes the issue.

Member

Doesn't look like this fixes the issue. I still don't see range requests until a few minutes later, when we add a new finalized chain.

Right after startup, waiting for custody peers

Feb 12 04:33:07.181 DEBG Waiting for peers to be available on sampling column subnets, chain: 1, service: range_sync, service: sync, module: network::sync::range_sync::chain:1057

Got peer metadata response after 15s

Feb 12 04:33:22.271 DEBG Obtained peer's metadata, new_seq_no: 6, peer_id: 16Uiu2HAmJyaVGkRGR9ACqompkoge8T2x4KFH4KbzDkh7zz6uN2JX, service: libp2p, module: lighthouse_network::peer_manager:732

No range requests until ~5 mins later

Feb 12 04:38:07.303 DEBG Finalization sync peer joined, peer_id: 16Uiu2HAmJyaVGkRGR9ACqompkoge8T2x4KFH4KbzDkh7zz6uN2JX, service: range_sync, service: sync, module: network::sync::range_sync::range:143
Feb 12 04:38:07.305 DEBG New chain added to sync, id: 2, from: 38, to: 1071, end_root: 0x3be00d7ce6e52f7938fd588d909055f72469d0f09ce545d7b23077f2d6b40e8a, current_target: 38, batches: 0, peers: 1, state: Stopped, sync_type: Finalized, peer_id: 16Uiu2HAmJyaVGkRGR9ACqompkoge8T2x4KFH4KbzDkh7zz6uN2JX, service: range_sync, service: sync, module: network::sync::range_sync::chain_collection:506
Feb 12 04:38:07.306 DEBG Sync RPC request sent, id: 4/3/RangeSync/39/1, peer: 16Uiu2HAmAAZ5wP6fvpe1b9tWNmgA2Wn8MsrNYVhkw5WohcAoaHKR, epoch: 39, slots: 32, method: BlocksByRange, service: sync, module: network::sync::network_context:788
Feb 12 04:38:07.307 DEBG Sync RPC request sent, id: 5/3/RangeSync/39/1, peer: 16Uiu2HAm9PijSZpm5QUphXRoBtkhUZPkGJ4Rgxk4Bny91oZPYZLG, columns: [74, 30, 39, 19, 63, 41, 52, 47, 58], epoch: 39, slots: 32, method: DataColumnsByRange, service: sync, module: network::sync::network_context:870

Collaborator Author

Did you see this log? "Received updated peer CGC message"

Member

I don't recall seeing it. I think I was using the right locally-built image, but it could be worth re-testing to confirm if you have time.

Member

@dapplion could you retest this change?

Member

I've retested this, and it doesn't seem to trigger a retry after obtaining the peers' metadata; the "Received updated peer CGC message" log was not observed.

May 06 07:16:46.854 DEBUG Waiting for peers to be available on custody column subnets  chain: 1, service: "range_sync"
...
# obtained peers metadata
May 06 07:17:01.389 DEBUG Obtained peer's metadata                      peer_id: 16Uiu2HAkz7SLRbDFNscs6RFDrD37oB5yCR9UaBcVpAiy1G54BuQW, new_seq_no: 6, service: "network"
May 06 07:17:01.393 DEBUG Obtained peer's metadata                      peer_id: 16Uiu2HAmKjMmG6VJG2oi5Gd7x4iszsv4Fm9JZWeEad62kTVF9dke, new_seq_no: 132, service: "network"
...
# no range request until ~5 mins later when a new chain is added
May 06 07:21:46.397 DEBUG New chain added to sync                       peer_id: "16Uiu2HAkz7SLRbDFNscs6RFDrD37oB5yCR9UaBcVpAiy1G54BuQW", sync_type: Finalized, id: 5, start_epoch: 0, target_head_slot: 193, target_head_root: 0xc1802494a0935c8bee449a412f982cc1903ec791565d9cf9eaab233f2cd2bc90, component: "range_sync"
May 06 07:21:46.397 DEBUG Sync RPC request sent                         method: "DataColumnsByRange", slots: 32, epoch: 1, columns: [8, 69, 99, 90, 9, 84, 24, 71, 12, 121, 50, 68, 127, 66, 36, 26, 41, 107, 47, 52, 108, 98, 70, 100, 54, 35, 21, 60, 49, 120, 10, 72, 44, 93, 67, 0, 122, 19, 43, 62, 97, 115, 59, 95, 48, 11, 101, 116, 34, 20, 25, 18, 3, 124, 110, 57, 46, 83, 77, 105, 81, 106, 1, 45, 104, 22, 29, 53, 6, 32, 13, 119, 4, 78, 63, 86, 92, 7, 114, 73, 16, 17, 109, 28, 75, 102, 80, 40, 79, 51, 125, 38, 85, 89, 42, 39, 126, 113, 87, 96, 37, 88, 64, 112, 14, 118, 76, 117, 58, 82, 27, 94, 74, 23, 30, 111, 15, 2, 123, 33, 55, 5, 65, 91, 31, 56], peer: 16Uiu2HAmKjMmG6VJG2oi5Gd7x4iszsv4Fm9JZWeEad62kTVF9dke, id: 8/6/RangeSync/1/1, chain: 1, service: "range_sync"
May 06 07:21:46.397 DEBUG Sync RPC request sent                         method: "DataColumnsByRange", slots: 32, epoch: 1, columns: [61, 103], peer: 16Uiu2HAkz7SLRbDFNscs6RFDrD37oB5yCR9UaBcVpAiy1G54BuQW, id: 9/6/RangeSync/1/1, chain: 1, service: "range_sync"


@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits their implementation. peerdas-devnet-4 and removed ready-for-review The code is ready for review labels Feb 11, 2025
@dapplion dapplion added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Feb 11, 2025
@dapplion dapplion requested a review from jimmygchen February 11, 2025 17:00
@jimmygchen jimmygchen added under-review A reviewer has only partially completed a review. and removed ready-for-review The code is ready for review labels Feb 12, 2025

mergify bot commented Mar 10, 2025

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed under-review A reviewer has only partially completed a review. labels Apr 1, 2025
@jimmygchen jimmygchen self-assigned this May 6, 2025
@jimmygchen jimmygchen added the under-review A reviewer has only partially completed a review. label May 6, 2025
@jimmygchen jimmygchen removed the waiting-on-author The reviewer has suggested changes and awaits their implementation. label May 6, 2025
Member

@AgeManning AgeManning left a comment

From what I can see here, you only care if the custody columns get updated.

I think we should try to maintain all our internal consistencies, i.e. internal requests stay internal, otherwise it can get very confusing.

The solution you have here has redundant cases: if it's not updated, we send the event up to the router, which then throws it away.

To keep everything consistent, I think we should do the following:

  1. Add a new event to NetworkEvent, i.e.
    NetworkEvent::PeerUpdatedCustodyGroupCount(PeerId)
  2. In the handling of a metadata response, if the CGC is updated, return the above event
  3. Handle this event and send it to sync

This has the benefits of:

  1. Not sending redundant messages with redundant information (i.e. the original metadata)
  2. We maintain the logic that internal messages stay internal
  3. I think it makes the purpose of the message, and why it's there, a bit clearer.
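The three steps above can be sketched as follows. The types are simplified stand-ins (the real NetworkEvent and SyncMessage enums carry many more variants, and the handler signatures differ), but the flow is the one suggested: emit the event only on an actual change, then translate it into a sync message at the boundary.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PeerId(pub u64);

/// Step 1: the new network-level event (name taken from the suggestion above).
#[derive(Debug, PartialEq)]
pub enum NetworkEvent {
    PeerUpdatedCustodyGroupCount(PeerId),
}

#[derive(Debug, PartialEq)]
pub enum SyncMessage {
    UpdatedPeerCgc(PeerId),
}

/// Step 2: while handling a metadata response, emit the event only if
/// the CGC actually changed; otherwise stay silent.
pub fn handle_metadata_response(prev_cgc: u64, new_cgc: u64, peer: PeerId) -> Option<NetworkEvent> {
    (new_cgc != prev_cgc).then(|| NetworkEvent::PeerUpdatedCustodyGroupCount(peer))
}

/// Step 3: the router translates the network event into a sync message.
pub fn route(event: NetworkEvent) -> SyncMessage {
    match event {
        NetworkEvent::PeerUpdatedCustodyGroupCount(peer) => SyncMessage::UpdatedPeerCgc(peer),
    }
}
```

This keeps the redundant "no change" case out of the router entirely, which is the point of the suggestion.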

@jimmygchen jimmygchen requested a review from pawanjay176 May 8, 2025 07:06
@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed under-review A reviewer has only partially completed a review. peerdas-devnet-4 labels May 8, 2025
@jimmygchen
Member

Thanks @pawanjay176 @AgeManning! Yeah, makes sense. I've updated the code as per your suggestions. I think it's much cleaner.

Collaborator Author

@dapplion dapplion left a comment

✅ This approach looks much cleaner! Thanks @AgeManning

Member

@pawanjay176 pawanjay176 left a comment

Nice!

@pawanjay176 pawanjay176 added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels May 8, 2025
mergify bot added a commit that referenced this pull request May 8, 2025

mergify bot commented May 8, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

@jimmygchen
Member

@mergify requeue


mergify bot commented May 9, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

mergify bot added a commit that referenced this pull request May 9, 2025
@mergify mergify bot merged commit a497ec6 into sigp:unstable May 9, 2025
31 checks passed