Hello, procfs maintainers! 👋
In a new environment buildout happening at my company, we have observed that node_exporter fails to collect any InfiniBand metrics from our hosts, emitting only this log line:
time=2025-03-03T19:11:05.508Z level=ERROR source=collector.go:168 msg="collector failed" name=infiniband duration_seconds=0.113365034 err="error obtaining InfiniBand class info: failed to read file \"/sys/class/infiniband/mlx5_0/ports/1/counters/VL15_dropped\": invalid argument"
Initially believing this was a node_exporter issue, I peeled back the code path and discovered that the problem stems from procfs, in particular parseInfiniBandCounters (on L322 here, specifically).
I was actually quite surprised to see that procfs does not leverage os.ReadFile but instead issues a syscall directly. That code path returns a syscall.EINVAL, not os.ErrInvalid. I see the note and reasoning why that is in https://github.com/prometheus/procfs/blob/868112d62466a723be5986bfb288f40670ff98eb/internal/util/sysreadfile.go 👍
In very bleeding-edge configurations and hardware, it's possible that attempting to read certain ConnectX-7 IB device counters will yield this syscall invalid argument.
I have a very naive fix for this I'll PR shortly, but wanted to file an issue first for community awareness and in case my to-be-proposed mitigation is not the direction you'd like to go.