Skip to content

Conversation

dapplion
Copy link
Collaborator

@dapplion dapplion commented Feb 6, 2025

Proposed Changes

Currently we track a key metric PEERS_PER_COLUMN_SUBNET in a getter good_peers_on_sampling_subnets. Another PR #6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the metrics spawned task which is refreshed every 5 seconds.

I also added a few more useful metrics. The total set and intended usage is:

  • sync_peers_per_column_subnet: Track health of overall subnets in your node
  • sync_peers_per_custody_column_subnet: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
  • sync_column_subnets_with_zero_peers: Is equivalent to the Grafana query count(sync_peers_per_column_subnet == 0) by (instance). We may prefer to skip it, but I believe it's the most important metric as if sync_column_subnets_with_zero_peers > 0 your node stalls.
  • sync_custody_column_subnets_with_zero_peers: count(sync_peers_per_custody_column_subnet == 0) by (instance)

@dapplion dapplion requested a review from jxs as a code owner February 6, 2025 14:55
@dapplion dapplion requested a review from jimmygchen February 6, 2025 14:57
@dapplion dapplion added ready-for-review The code is ready for review das Data Availability Sampling labels Feb 6, 2025
Copy link
Member

@jimmygchen jimmygchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just added a comment.

"sync_column_subnets_with_zero_peers",
"Current count of total column subnets with zero peers",
)
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just use sync_peers_per_column_subnet for this?

and the query:

count(sync_peers_per_column_subnet == 0)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I mentioned that in the description, we can skip the metric if we are ok doing a Grafana query

"sync_custody_column_subnets_with_zero_peers",
"Current count of custody column subnets of this node with zero peers",
)
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, could use sync_column_subnets_with_zero_peers?

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Feb 7, 2025
@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Feb 10, 2025
mergify bot added a commit that referenced this pull request Feb 10, 2025
@michaelsproul
Copy link
Member

Removing from merge queue temporarily while we merge v7 PRs

@michaelsproul michaelsproul added blocked and removed ready-for-merge This PR is ready to merge. labels Feb 10, 2025
@michaelsproul
Copy link
Member

@mergify dequeue

Copy link

mergify bot commented Feb 10, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #6928 has been dequeued by a dequeue command

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

Copy link

mergify bot commented Feb 10, 2025

dequeue

✅ The pull request has been removed from the queue default

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed blocked labels Feb 10, 2025
@jimmygchen
Copy link
Member

@mergify requeue

Copy link

mergify bot commented Feb 10, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

mergify bot added a commit that referenced this pull request Feb 11, 2025
Copy link

mergify bot commented Feb 11, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

@jimmygchen
Copy link
Member

@mergify requeue

Copy link

mergify bot commented Feb 11, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

mergify bot added a commit that referenced this pull request Feb 11, 2025
@mergify mergify bot merged commit d603881 into sigp:unstable Feb 11, 2025
31 checks passed
@dapplion dapplion deleted the peerdas-peer-metrics branch February 11, 2025 16:47
pawanjay176 pushed a commit to pawanjay176/lighthouse that referenced this pull request Feb 21, 2025
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR sigp#6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the metrics spawned task which is refreshed every 5 seconds.

I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: Track health of overall subnets in your node
- `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: Is equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric as if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
eserilev pushed a commit to eserilev/lighthouse that referenced this pull request Mar 5, 2025
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR sigp#6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the metrics spawned task which is refreshed every 5 seconds.

I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: Track health of overall subnets in your node
- `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: Is equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric as if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
das Data Availability Sampling ready-for-merge This PR is ready to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants