
[Bug]: Lambda: nvidia-smi sometimes fails with Failed to initialize NVML: Unknown Error #2601

@un-def

Description


Steps to reproduce

  1. Start a run on Lambda GPU instance
  2. SSH into the container: ssh <run-name>
  3. Run nvidia-smi

Actual behaviour

Sometimes nvidia-smi fails with the error: Failed to initialize NVML: Unknown Error. To reliably trigger the issue:

  1. SSH into the host: ssh <run-name>-host
  2. Run sudo systemctl daemon-reload
  3. Run nvidia-smi inside the container again
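A quick way to check for the failure from the host is sketched below. This is only an illustration, not part of the report: it assumes the docker CLI is available on the host and that the name of the run container is known ("my-run" is a placeholder).

```python
import subprocess

# Placeholder container name; substitute the actual run container.
CONTAINER = "my-run"

# Run nvidia-smi inside the container, same as the reproduction steps do via SSH.
result = subprocess.run(
    ["docker", "exec", CONTAINER, "nvidia-smi"],
    capture_output=True,
    text=True,
)

# After `sudo systemctl daemon-reload` on the host, the command fails and the
# output contains: Failed to initialize NVML: Unknown Error
if result.returncode != 0:
    print(result.stdout + result.stderr)
```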

Expected behaviour

No response

dstack version

0.19.7

Server logs

Additional information

NVIDIA/nvidia-container-toolkit#48

```python
import json
import shlex

DOCKER_DAEMON_CONFIG = {
    "runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}},
    # Workaround for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
    "exec-opts": ["native.cgroupdriver=cgroupfs"],
}
SETUP_COMMANDS = [
    "ufw allow ssh",
    "ufw allow from 10.0.0.0/8",
    "ufw allow from 172.16.0.0/12",
    "ufw allow from 192.168.0.0/16",
    "ufw default deny incoming",
    "ufw default allow outgoing",
    "ufw enable",
    'sed -i "s/.*AllowTcpForwarding.*/AllowTcpForwarding yes/g" /etc/ssh/sshd_config',
    "service ssh restart",
    f"echo {shlex.quote(json.dumps(DOCKER_DAEMON_CONFIG))} > /etc/docker/daemon.json",
    "service docker restart",
]
```

Labels: bug