Incomplete error checking leads to InfiniBand parse failures #704

@mfuller-lambda

Hello, procfs maintainers! 👋

In a new environment being built out at my company, we have observed that node_exporter fails to collect InfiniBand metrics from our hosts, logging only this line:

time=2025-03-03T19:11:05.508Z level=ERROR source=collector.go:168 msg="collector failed" name=infiniband duration_seconds=0.113365034 err="error obtaining InfiniBand class info: failed to read file \"/sys/class/infiniband/mlx5_0/ports/1/counters/VL15_dropped\": invalid argument"

Initially believing this was a node_exporter issue, I peeled back the code path and discovered that the problem stems from procfs, in particular parseInfiniBandCounters (on L322 here, specifically).

I was actually quite surprised to see that procfs does not leverage os.ReadFile but instead issues a syscall directly. That code path returns a syscall.EINVAL, not os.ErrInvalid. I see the note and reasoning why that is in https://github.com/prometheus/procfs/blob/868112d62466a723be5986bfb288f40670ff98eb/internal/util/sysreadfile.go 👍

On very bleeding-edge hardware and configurations, attempting to read certain ConnectX-7 IB device counters can fail with this invalid argument error (syscall.EINVAL).

I have a very naive fix for this I'll PR shortly, but wanted to file an issue first for community awareness and in case my to-be-proposed mitigation is not the direction you'd like to go.
