Description
Steps to reproduce
Environment
There are two devices:
- A local PC with the dstack server and fleet running.
- A remote Tenstorrent server (QuietBox Wormhole) running Ubuntu Server, reachable over SSH.
The local PC runs the dstack server with the following configurations:
Fleet configuration:
fleet.dstack.yml :
type: fleet
name: wormwhole-fleet
ssh_config:
user: ....
identity_file: ......
port: .....
hosts:
- .....
Development environment:
.dstack.yml:
type: dev-environment
name: cursor
image: ghcr.io/tenstorrent/tt-inference-server/tt-metal-qwen25-72b-deepseekr1-llama3-70b-src-base-vllm:hf-llama-b9564bf
ide: vscode
resources:
gpu: n300:8
I ran dstack apply -f .dstack.yml and got the following error:
# BACKEND RESOURCES INSTANCE TYPE PRICE
1 ssh (remote) cpu=32 mem=503GB disk=3336GB n300:24GB:8 instance $0 idle
Submit the run cursor? [y/n]: y
NAME BACKEND RESOURCES INSTANCE TYPE PRICE STATUS SUBMITTED ERROR
cursor ssh (remote) cpu=32 mem=503GB disk=3336GB n300:24GB:8 instance $0 exited (None) 12:02
cursor provisioning completed (failed)
Exited (none)
Check dstack logs -d cursor for more details.
When I checked dstack logs -d cursor, it showed nothing. So I checked /root/.dstack/shim.log on the Tenstorrent server, which showed: error gathering device information while adding custom device "/dev/tenstorrent/4": no such file or directory.
Actual behaviour
The application is supposed to create an environment using the specified number of Tenstorrent cards. I guess /dev/tenstorrent/4 refers to one of the cards that need to be specified.
Expected behaviour
Since /dev/tenstorrent/4 cannot be used to access the specified cards, I think the way cards are addressed might be the issue. When I recently deployed an LLM on specific Tenstorrent cards, using /dev/tenstorrent/4 produced a similar error. I then used the bus_id values reported by tt-smi to specify the cards, and it seemed to work. But when I monitored card usage on the server, I found that all cards were in use, so this method did not work either. Further investigation is needed. This is the tt-smi snapshot (shortened, keeping the bus_id fields):
{
"time": "2025-06-11T03:37:23.927792",
"host_info": {
"OS": "Linux",
"Distro": "Ubuntu 22.04.5 LTS",
"Kernel": "5.15.0-141-generic",
"Hostname": "TT-QuietBox",
"Platform": "x86_64",
"Python": "3.10.12",
"Memory": "503.45 GB",
"Driver": "TT-KMD 1.34"
},
"host_sw_vers": {
"tt_smi": "3.0.20",
"pyluwen": "0.7.2"
},
"device_info": [
{
"smbus_telem": {
...
},
"board_info": {
"bus_id": "0000:c1:00.0",
"board_type": "n300 L",
"board_id": "10001451172208f",
"coords": "(1, 0, 0, 0)",
"dram_status": true,
"dram_speed": "12G",
"pcie_speed": 4,
"pcie_width": "16"
},
"telemetry": {
...
},
"firmwares": {
...
},
"limits": {
...
}
},
{
"smbus_telem": {
...
},
"board_info": {
"bus_id": "0000:01:00.0",
"board_type": "n300 L",
"board_id": "100014511722053",
"coords": "(1, 1, 0, 0)",
"dram_status": true,
"dram_speed": "12G",
"pcie_speed": 4,
"pcie_width": "16"
},
"telemetry": {
...
},
"firmwares": {
...
},
"limits": {
...
}
},
{
"smbus_telem": {
...
},
"board_info": {
"bus_id": "0000:02:00.0",
"board_type": "n300 L",
"board_id": "10001451172209c",
"coords": "(2, 1, 0, 0)",
"dram_status": true,
"dram_speed": "12G",
"pcie_speed": 4,
"pcie_width": "16"
},
"telemetry": {
...
},
"firmwares": {
...
},
"limits": {
...
}
},
{
"smbus_telem": {
...
},
"board_info": {
"bus_id": "0000:41:00.0",
"board_type": "n300 L",
"board_id": "100014511722058",
"coords": "(2, 0, 0, 0)",
"dram_status": true,
"dram_speed": "12G",
"pcie_speed": 4,
"pcie_width": "16"
},
"telemetry": {
...
},
"firmwares": {
...
},
"limits": {
...
}
},
{
"smbus_telem": {
...
},
"board_info": {
"bus_id": "N/A",
"board_type": "n300 R",
"board_id": "10001451172208f",
"coords": "(0, 0, 0, 0)",
"dram_status": true,
"dram_speed": "12G",
"pcie_speed": "N/A",
"pcie_width": "N/A"
},
"telemetry": {
...
},
"firmwares": {
...
},
"limits": {
...
}
},
{
"smbus_telem": {
...
},
"board_info": {
"bus_id": "N/A",
"board_type": "n300 R",
"board_id": "100014511722053",
"coords": "(0, 1, 0, 0)",
"dram_status": true,
"dram_speed": "12G",
"pcie_speed": "N/A",
"pcie_width": "N/A"
},
"telemetry": {
...
},
"firmwares": {
...
},
"limits": {
...
}
},
{
"smbus_telem": {
...
},
"board_info": {
"bus_id": "N/A",
"board_type": "n300 R",
"board_id": "10001451172209c",
"coords": "(3, 1, 0, 0)",
"dram_status": true,
"dram_speed": "12G",
"pcie_speed": "N/A",
"pcie_width": "N/A"
},
"telemetry": {
...
},
"firmwares": {
...
},
"limits": {
...
}
},
{
"smbus_telem": {
...
},
"board_info": {
"bus_id": "N/A",
"board_type": "n300 R",
"board_id": "100014511722058",
"coords": "(3, 0, 0, 0)",
"dram_status": true,
"dram_speed": "12G",
"pcie_speed": "N/A",
"pcie_width": "N/A"
},
"telemetry": {
...
},
"firmwares": {
...
},
"limits": {
...
}
}
]
}
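As one way to read the snapshot above, the following sketch splits the boards by whether they report a PCIe bus_id. The helper name split_boards is hypothetical, and the data is trimmed to the bus_id and board_type fields copied from the snapshot. Whether only boards with a bus_id get a /dev/tenstorrent/<N> node is an assumption that this issue does not confirm.

```python
def split_boards(snapshot: dict):
    """Split tt-smi boards into (with_bus_id, without_bus_id).

    Hypothetical helper: a board is counted as PCIe-attached when its
    board_info.bus_id is anything other than "N/A".
    """
    with_bus_id, without_bus_id = [], []
    for device in snapshot.get("device_info", []):
        info = device.get("board_info", {})
        if info.get("bus_id", "N/A") == "N/A":
            without_bus_id.append(info)
        else:
            with_bus_id.append(info)
    return with_bus_id, without_bus_id


# Values copied from the snapshot in this issue, other fields omitted.
snapshot = {
    "device_info": [
        {"board_info": {"bus_id": "0000:c1:00.0", "board_type": "n300 L"}},
        {"board_info": {"bus_id": "0000:01:00.0", "board_type": "n300 L"}},
        {"board_info": {"bus_id": "0000:02:00.0", "board_type": "n300 L"}},
        {"board_info": {"bus_id": "0000:41:00.0", "board_type": "n300 L"}},
        {"board_info": {"bus_id": "N/A", "board_type": "n300 R"}},
        {"board_info": {"bus_id": "N/A", "board_type": "n300 R"}},
        {"board_info": {"bus_id": "N/A", "board_type": "n300 R"}},
        {"board_info": {"bus_id": "N/A", "board_type": "n300 R"}},
    ]
}

pcie, remote = split_boards(snapshot)
print(len(pcie), "boards with a PCIe bus_id,", len(remote), "without")
```

In this snapshot the four "n300 L" entries carry a bus_id and the four "n300 R" entries do not, which may be relevant to why a device index of 4 or above fails.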
dstack version
0.19.12
Server logs
Logs from /root/.dstack/shim.log on the Tenstorrent server:
time=2025-06-11T03:02:19.771053Z level=debug status=200 method=POST endpoint=/api/tasks
time=2025-06-11T03:02:19.771141Z level=debug msg=acquired GPU(s) task=ba0b0939-adef-4378-9a7f-00469476b6e6 gpus=[2 3 4 5 6 7 0 1]
time=2025-06-11T03:02:19.771856Z level=debug msg=Preparing volumes
time=2025-06-11T03:02:19.771894Z level=debug msg=Pulling image
time=2025-06-11T03:02:19.777264Z level=debug msg=Creating container task=ba0b0939-adef-4378-9a7f-00469476b6e6 name=cursor-0-0-780389b9
time=2025-06-11T03:02:19.87606Z level=debug msg=Running container task=ba0b0939-adef-4378-9a7f-00469476b6e6 name=cursor-0-0-780389b9
time=2025-06-11T03:02:19.936285Z level=error msg=failed to run container err=Error response from daemon: error gathering device information while adding custom device "/dev/tenstorrent/4": no such file or directory
time=2025-06-11T03:02:19.93972Z level=debug msg=released GPU(s) gpus=[2 3 4 5 6 7 0 1] task=ba0b0939-adef-4378-9a7f-00469476b6e6
time=2025-06-11T03:02:19.939793Z level=error msg=failed to run err=Error response from daemon: error gathering device information while adding custom device "/dev/tenstorrent/4": no such file or directory task=ba0b0939-adef-4378-9a7f-00469476b6e6
time=2025-06-11T03:02:20.919138Z level=debug method=GET endpoint=/api/healthcheck status=200
time=2025-06-11T03:02:25.422853Z level=debug status=200 method=GET endpoint=/api/healthcheck
time=2025-06-11T03:02:25.44999Z level=debug endpoint=/api/tasks/ba0b0939-adef-4378-9a7f-00469476b6e6 status=200 method=GET
time=2025-06-11T03:02:27.670201Z level=debug method=GET endpoint=/api/healthcheck status=200
time=2025-06-11T03:02:28.791535Z level=debug method=GET endpoint=/api/healthcheck status=200
time=2025-06-11T03:02:28.818306Z level=debug msg=locked task=ba0b0939-adef-4378-9a7f-00469476b6e6
time=2025-06-11T03:02:28.818348Z level=debug msg=terminating task=ba0b0939-adef-4378-9a7f-00469476b6e6
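To narrow this down on the server, a small sketch like the one below could compare the GPU indices the shim acquired with the device nodes that actually exist. The function names are hypothetical, and the mapping of index N to /dev/tenstorrent/N is an assumption based only on the path in the error message above.

```python
import os
import re


def acquired_gpus(log_line: str) -> list[int]:
    """Extract the indices from a shim log field such as 'gpus=[2 3 4]'."""
    match = re.search(r"gpus=\[([\d ]*)\]", log_line)
    return [int(x) for x in match.group(1).split()] if match else []


def missing_device_nodes(indices, dev_dir="/dev/tenstorrent"):
    """Return the indices that have no <dev_dir>/<index> entry.

    Assumption: each acquired index N maps to <dev_dir>/N, as the
    error message in the log suggests.
    """
    return sorted(
        i for i in indices
        if not os.path.exists(os.path.join(dev_dir, str(i)))
    )


# Log line copied from shim.log above.
line = (
    "time=2025-06-11T03:02:19.771141Z level=debug msg=acquired GPU(s) "
    "task=ba0b0939-adef-4378-9a7f-00469476b6e6 gpus=[2 3 4 5 6 7 0 1]"
)
indices = acquired_gpus(line)
print("acquired:", indices)
print("missing nodes:", missing_device_nodes(indices))
```

Run on the Tenstorrent server, this lists which of the eight acquired indices have no matching device node. On any other machine it simply reports all of them as missing.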
Additional information
No response