Closed
Labels
bug (Something isn't working)
Description
Steps to reproduce
- Start a run on a Lambda GPU instance
- SSH into the container: ssh <run-name>
- Run nvidia-smi
Actual behaviour
Sometimes nvidia-smi fails with the error: Failed to initialize NVML: Unknown Error. To reliably trigger the issue:
- SSH into the host: ssh <run-name>-host
- Run sudo systemctl daemon-reload
- Run nvidia-smi inside the container again
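The check in the steps above can be scripted, for example to poll the container after reloading systemd on the host. This is a minimal sketch, assuming nvidia-smi is on the PATH inside the container; the helper names is_nvml_failure and check_gpu are hypothetical and not part of dstack.

```python
import subprocess

# Error text as quoted in this issue.
NVML_ERROR = "Failed to initialize NVML: Unknown Error"


def is_nvml_failure(output: str) -> bool:
    """Return True if the given nvidia-smi output contains the NVML error."""
    return NVML_ERROR in output


def check_gpu() -> bool:
    """Run nvidia-smi and return True if the GPU still appears accessible.

    Assumption: this runs inside the container. A missing nvidia-smi
    binary is treated as "GPU not accessible".
    """
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return False
    combined = result.stdout + result.stderr
    return result.returncode == 0 and not is_nvml_failure(combined)
```

Calling check_gpu() before and after running sudo systemctl daemon-reload on the host should show whether the container has lost access to the GPU.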
Expected behaviour
No response
dstack version
0.19.7
Server logs
Additional information
NVIDIA/nvidia-container-toolkit#48
dstack/src/dstack/_internal/core/backends/nebius/compute.py
Lines 44 to 61 in c5d1bd5
DOCKER_DAEMON_CONFIG = {
    "runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}},
    # Workaround for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
    "exec-opts": ["native.cgroupdriver=cgroupfs"],
}
SETUP_COMMANDS = [
    "ufw allow ssh",
    "ufw allow from 10.0.0.0/8",
    "ufw allow from 172.16.0.0/12",
    "ufw allow from 192.168.0.0/16",
    "ufw default deny incoming",
    "ufw default allow outgoing",
    "ufw enable",
    'sed -i "s/.*AllowTcpForwarding.*/AllowTcpForwarding yes/g" /etc/ssh/sshd_config',
    "service ssh restart",
    f"echo {shlex.quote(json.dumps(DOCKER_DAEMON_CONFIG))} > /etc/docker/daemon.json",
    "service docker restart",
]
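To see what the workaround in the Nebius excerpt writes to /etc/docker/daemon.json, the sketch below renders the same echo command and parses the payload back. DOCKER_DAEMON_CONFIG is copied from the excerpt; render_daemon_json_command and uses_cgroupfs are hypothetical helpers for illustration, not dstack functions.

```python
import json
import shlex

# Copied from the excerpt above (Nebius backend).
DOCKER_DAEMON_CONFIG = {
    "runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}},
    "exec-opts": ["native.cgroupdriver=cgroupfs"],
}


def render_daemon_json_command(config: dict) -> str:
    """Build the shell command that writes the config, as in SETUP_COMMANDS."""
    return f"echo {shlex.quote(json.dumps(config))} > /etc/docker/daemon.json"


def uses_cgroupfs(daemon_json: str) -> bool:
    """Return True if a daemon.json document selects the cgroupfs driver."""
    config = json.loads(daemon_json)
    return "native.cgroupdriver=cgroupfs" in config.get("exec-opts", [])


command = render_daemon_json_command(DOCKER_DAEMON_CONFIG)
# shlex.split undoes the quoting as a POSIX shell would, so the second
# token is the JSON document that echo would print.
payload = shlex.split(command)[1]
print(command)
print("cgroupfs driver configured:", uses_cgroupfs(payload))
```

The uses_cgroupfs helper could also be pointed at the contents of an existing /etc/docker/daemon.json on a host to check whether this workaround is already in place there.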