
[Bug]: Lambda: nvidia-smi sometimes fails with Failed to initialize NVML: Unknown Error #2601

@un-def

Description


Steps to reproduce

  1. Start a run on Lambda GPU instance
  2. SSH into the container: ssh <run-name>
  3. Run nvidia-smi

Actual behaviour

Sometimes nvidia-smi fails with the error: Failed to initialize NVML: Unknown Error. To reliably trigger the issue:

  1. SSH into the host: ssh <run-name>-host
  2. Run sudo systemctl daemon-reload
  3. Run nvidia-smi inside the container again
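A quick way to check for the failure from the host is sketched below. This is only an illustration, not part of the report: it assumes the docker CLI is available on the host and that the name of the run container is known ("my-run" is a placeholder).

```python
import subprocess

# Placeholder container name; substitute the actual run container.
CONTAINER = "my-run"

# Run nvidia-smi inside the container, same as the reproduction steps do via SSH.
result = subprocess.run(
    ["docker", "exec", CONTAINER, "nvidia-smi"],
    capture_output=True,
    text=True,
)

# After `sudo systemctl daemon-reload` on the host, the command fails and the
# output contains: Failed to initialize NVML: Unknown Error
if result.returncode != 0:
    print(result.stdout + result.stderr)
```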

Expected behaviour

No response

dstack version

0.19.7

Server logs

Additional information

NVIDIA/nvidia-container-toolkit#48

```python
import json
import shlex

DOCKER_DAEMON_CONFIG = {
    "runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}},
    # Workaround for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
    "exec-opts": ["native.cgroupdriver=cgroupfs"],
}
SETUP_COMMANDS = [
    "ufw allow ssh",
    "ufw allow from 10.0.0.0/8",
    "ufw allow from 172.16.0.0/12",
    "ufw allow from 192.168.0.0/16",
    "ufw default deny incoming",
    "ufw default allow outgoing",
    "ufw enable",
    'sed -i "s/.*AllowTcpForwarding.*/AllowTcpForwarding yes/g" /etc/ssh/sshd_config',
    "service ssh restart",
    f"echo {shlex.quote(json.dumps(DOCKER_DAEMON_CONFIG))} > /etc/docker/daemon.json",
    "service docker restart",
]
```

Labels: bug