Nvidia/Mellanox expose ROCE ECN information on sysfs on the path #695

Merged
merged 1 commit into from
Jul 3, 2025

dasturiasArista
Contributor

@dasturiasArista dasturiasArista commented Jan 28, 2025

Nvidia/Mellanox expose ROCE ECN information in sysfs under the path:
/sys/class/net/<interface>/ecn/<protocol>/

There are two protocols: Reaction Point (rp) and Notification Point (np).

Each protocol exposes a list of attributes:
/sys/class/net/<interface>/ecn/<protocol>/params/<requested attribute>

Each protocol also reports whether ECN is enabled per priority (where X is the
priority):
/sys/class/net/<interface>/ecn/<protocol>/enable/X

This is documented here:
https://docs.nvidia.com/networking/display/mlnxofedv571020/explicit+congestion+notification+(ecn)

The attributes are documented here:
https://enterprise-support.nvidia.com/s/article/dcqcn-parameters
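
For illustration, here is a minimal sketch of how the layout described above could be walked. It is written in Python and is not the code from this change. The function names `read_ecn` and `_read_dir` are hypothetical, and the sketch assumes that every entry under `params/` and `enable/` is a small text file whose content can be read as a string.

```python
from pathlib import Path


def _read_dir(directory):
    """Return {file name: stripped content} for each regular file in directory."""
    values = {}
    if not directory.is_dir():
        return values
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            try:
                values[entry.name] = entry.read_text().strip()
            except OSError:
                # Assumption: unreadable attributes are skipped, not fatal.
                continue
    return values


def read_ecn(interface, sysfs_root="/sys/class/net"):
    """Collect ECN settings for one interface, keyed by protocol (rp, np)."""
    result = {}
    ecn_dir = Path(sysfs_root) / interface / "ecn"
    for protocol in ("rp", "np"):
        proto_dir = ecn_dir / protocol
        if not proto_dir.is_dir():
            continue  # this protocol is not exposed for the interface
        result[protocol] = {
            # <protocol>/params/<requested attribute>
            "params": _read_dir(proto_dir / "params"),
            # <protocol>/enable/X, where X is the priority
            "enable": _read_dir(proto_dir / "enable"),
        }
    return result
```

Calling `read_ecn("eth0")` (the interface name is only an example) returns a nested dictionary, or an empty one when the interface has no `ecn` directory.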

Signed-off-by: Diego Asturias <dasturias@arista.com>
@discordianfish
Member

LGTM in general, but it makes me wonder where to draw the line on which vendor-specific stuff to include here and which not. I feel this is probably relevant enough to be included, but I'm not sure. @SuperQ wdyt?

@dasturiasArista
Contributor Author

Just to be clear, this is Nvidia-specific only because other vendors aren't exposing these values through sysfs. These are pretty generic values for ROCEv2. For example, https://docs.broadcom.com/doc/NCC-WP1XX has information on Broadcom's congestion control, which also uses ECN. Ideally, other vendors would also expose these values in a sysfs path instead of relying on proprietary utilities. I'm not aware of whether other vendors that implement ROCEv2 plan to integrate with sysfs in the future (or already have), but this seems fairly generic in terms of ROCEv2.

@discordianfish
Member

@dasturiasArista Thanks for clarifying. I think this is a good argument for including it here.

Member

@SuperQ SuperQ left a comment

LGTM

@SuperQ SuperQ merged commit a5f79dd into prometheus:master Jul 3, 2025
9 checks passed