[Bug]: error gathering device information while adding custom device "/dev/tenstorrent/4": no such file or directory #2787

@manproai

Description

Steps to reproduce

Environment

There are two machines:

  1. A local PC running the dstack server and the fleet.
  2. A remote Tenstorrent server (QuietBox Wormhole) running Ubuntu Server, reachable over SSH.

The local PC runs the dstack server with the following configurations:

Fleet configuration:

fleet.dstack.yml :

type: fleet
name: wormwhole-fleet

ssh_config:
  user: ....
  identity_file: ......
  port: .....

  hosts:
    - .....

Development environment:

.dstack.yml:

type: dev-environment
name: cursor


image: ghcr.io/tenstorrent/tt-inference-server/tt-metal-qwen25-72b-deepseekr1-llama3-70b-src-base-vllm:hf-llama-b9564bf

ide: vscode

resources:
  gpu: n300:8

I ran dstack apply -f .dstack.yml and got the following error:

 #  BACKEND       RESOURCES                                 INSTANCE TYPE  PRICE       
 1  ssh (remote)  cpu=32 mem=503GB disk=3336GB n300:24GB:8  instance       $0     idle 

Submit the run cursor? [y/n]: y
 NAME    BACKEND       RESOURCES                                 INSTANCE TYPE  PRICE  STATUS         SUBMITTED  ERROR 
 cursor  ssh (remote)  cpu=32 mem=503GB disk=3336GB n300:24GB:8  instance       $0     exited (None)  12:02            

cursor provisioning completed (failed)
Exited (none)
Check dstack logs -d cursor for more details.

dstack logs -d cursor showed nothing, so I checked /root/.dstack/shim.log on the Tenstorrent server. It contained: error gathering device information while adding custom device "/dev/tenstorrent/4": no such file or directory.

Actual behaviour

The application is supposed to create an environment using the specified number of Tenstorrent cards. My guess is that /dev/tenstorrent/4 refers to one of the cards being requested.

Expected behaviour

Since /dev/tenstorrent/4 cannot be used to address the requested cards, I think the way cards are addressed might be the issue. When I recently deployed an LLM on specific Tenstorrent cards, using /dev/tenstorrent/4 produced a similar error. I then used the bus_id reported by tt-smi to specify the cards, and it appeared to work. However, when I monitored card usage on the server, all cards were in use, so that method did not work either. This needs further investigation. Here is the tt-smi snapshot (shortened, keeping the bus_id fields):

{
    "time": "2025-06-11T03:37:23.927792",
    "host_info": {
        "OS": "Linux",
        "Distro": "Ubuntu 22.04.5 LTS",
        "Kernel": "5.15.0-141-generic",
        "Hostname": "TT-QuietBox",
        "Platform": "x86_64",
        "Python": "3.10.12",
        "Memory": "503.45 GB",
        "Driver": "TT-KMD 1.34"
    },
    "host_sw_vers": {
        "tt_smi": "3.0.20",
        "pyluwen": "0.7.2"
    },
    "device_info": [
        {
            "smbus_telem": {
               ..
            },
            "board_info": {
                "bus_id": "0000:c1:00.0",
                "board_type": "n300 L",
                "board_id": "10001451172208f",
                "coords": "(1, 0, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": 4,
                "pcie_width": "16"
            },
            "telemetry": {
                ...
            },
            "firmwares": {
                ...
            },
            "limits": {
                ...
            }
        },
        {
            "smbus_telem": {
               ...
            },
            "board_info": {
                "bus_id": "0000:01:00.0",
                "board_type": "n300 L",
                "board_id": "100014511722053",
                "coords": "(1, 1, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": 4,
                "pcie_width": "16"
            },
            "telemetry": {
               ...
            },
            "firmwares": {
                ...
            },
            "limits": {
                ...
            }
        },
        {
            "smbus_telem": {
                ...
            },
            "board_info": {
                "bus_id": "0000:02:00.0",
                "board_type": "n300 L",
                "board_id": "10001451172209c",
                "coords": "(2, 1, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": 4,
                "pcie_width": "16"
            },
            "telemetry": {
                ...
            },
            "firmwares": {
                ...
            },
            "limits": {
                ...
            }
        },
        {
            "smbus_telem": {
                ...
            },
            "board_info": {
                "bus_id": "0000:41:00.0",
                "board_type": "n300 L",
                "board_id": "100014511722058",
                "coords": "(2, 0, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": 4,
                "pcie_width": "16"
            },
            "telemetry": {
                ...
            },
            "firmwares": {
                ...
            },
            "limits": {
                ...
            }
        },
        {
            "smbus_telem": {
                ...
            },
            "board_info": {
                "bus_id": "N/A",
                "board_type": "n300 R",
                "board_id": "10001451172208f",
                "coords": "(0, 0, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": "N/A",
                "pcie_width": "N/A"
            },
            "telemetry": {
                ...
            },
            "firmwares": {
                ...
            },
            "limits": {
                ...
            }
        },
        {
            "smbus_telem": {
                ...
            },
            "board_info": {
                "bus_id": "N/A",
                "board_type": "n300 R",
                "board_id": "100014511722053",
                "coords": "(0, 1, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": "N/A",
                "pcie_width": "N/A"
            },
            "telemetry": {
                ...
            },
            "firmwares": {
                ...
            },
            "limits": {
                ...
            }
        },
        {
            "smbus_telem": {
                ...
            },
            "board_info": {
                "bus_id": "N/A",
                "board_type": "n300 R",
                "board_id": "10001451172209c",
                "coords": "(3, 1, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": "N/A",
                "pcie_width": "N/A"
            },
            "telemetry": {
               ...
            },
            "firmwares": {
               ...
            },
            "limits": {
               ...
            }
        },
        {
            "smbus_telem": {
                ...
            },
            "board_info": {
                "bus_id": "N/A",
                "board_type": "n300 R",
                "board_id": "100014511722058",
                "coords": "(3, 0, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": "N/A",
                "pcie_width": "N/A"
            },
            "telemetry": {
                ...
            },
            "firmwares": {
                ...
            },
            "limits": {
                ...
            }
        }
    ]
}
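
In the snapshot above, only the four "n300 L" entries report a PCIe bus_id, while the four "n300 R" entries report "N/A". The following is a minimal Python sketch for counting the PCIe-visible boards in such a snapshot. The function name pcie_visible_boards and the inline sample are illustrative assumptions, not part of tt-smi or dstack. Whether the number of PCIe-visible boards equals the number of nodes under /dev/tenstorrent on this host is also an assumption that has not been verified here.

```python
import json


def pcie_visible_boards(snapshot: dict) -> list[dict]:
    """Return the board_info entries whose bus_id is a real PCIe address.

    Entries with bus_id "N/A" (or with no bus_id at all) are skipped.
    """
    boards = [dev.get("board_info", {}) for dev in snapshot.get("device_info", [])]
    return [b for b in boards if b.get("bus_id", "N/A") != "N/A"]


if __name__ == "__main__":
    # Tiny illustrative sample in the same shape as the snapshot above;
    # in practice, load the saved snapshot file with json.load().
    sample = json.loads("""
    {
      "device_info": [
        {"board_info": {"bus_id": "0000:c1:00.0", "board_type": "n300 L"}},
        {"board_info": {"bus_id": "N/A", "board_type": "n300 R"}}
      ]
    }
    """)
    for board in pcie_visible_boards(sample):
        print(board["bus_id"], board["board_type"])
```

Run against the full snapshot, this would list the four "n300 L" entries, which can then be compared with the eight devices requested by gpu: n300:8.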

dstack version

0.19.12

Server logs

Contents of /root/.dstack/shim.log on the Tenstorrent server:

time=2025-06-11T03:02:19.771053Z level=debug status=200 method=POST endpoint=/api/tasks
time=2025-06-11T03:02:19.771141Z level=debug msg=acquired GPU(s) task=ba0b0939-adef-4378-9a7f-00469476b6e6 gpus=[2 3 4 5 6 7 0 1]
time=2025-06-11T03:02:19.771856Z level=debug msg=Preparing volumes
time=2025-06-11T03:02:19.771894Z level=debug msg=Pulling image
time=2025-06-11T03:02:19.777264Z level=debug msg=Creating container task=ba0b0939-adef-4378-9a7f-00469476b6e6 name=cursor-0-0-780389b9
time=2025-06-11T03:02:19.87606Z level=debug msg=Running container task=ba0b0939-adef-4378-9a7f-00469476b6e6 name=cursor-0-0-780389b9
time=2025-06-11T03:02:19.936285Z level=error msg=failed to run container err=Error response from daemon: error gathering device information while adding custom device "/dev/tenstorrent/4": no such file or directory
time=2025-06-11T03:02:19.93972Z level=debug msg=released GPU(s) gpus=[2 3 4 5 6 7 0 1] task=ba0b0939-adef-4378-9a7f-00469476b6e6
time=2025-06-11T03:02:19.939793Z level=error msg=failed to run err=Error response from daemon: error gathering device information while adding custom device "/dev/tenstorrent/4": no such file or directory task=ba0b0939-adef-4378-9a7f-00469476b6e6
time=2025-06-11T03:02:20.919138Z level=debug method=GET endpoint=/api/healthcheck status=200
time=2025-06-11T03:02:25.422853Z level=debug status=200 method=GET endpoint=/api/healthcheck
time=2025-06-11T03:02:25.44999Z level=debug endpoint=/api/tasks/ba0b0939-adef-4378-9a7f-00469476b6e6 status=200 method=GET
time=2025-06-11T03:02:27.670201Z level=debug method=GET endpoint=/api/healthcheck status=200
time=2025-06-11T03:02:28.791535Z level=debug method=GET endpoint=/api/healthcheck status=200
time=2025-06-11T03:02:28.818306Z level=debug msg=locked task=ba0b0939-adef-4378-9a7f-00469476b6e6
time=2025-06-11T03:02:28.818348Z level=debug msg=terminating task=ba0b0939-adef-4378-9a7f-00469476b6e6
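
The log above shows eight GPU indices being acquired (gpus=[2 3 4 5 6 7 0 1]) before the container fails on "/dev/tenstorrent/4". A minimal Python sketch for checking which of those indices have a matching device node on the host follows. It assumes that index N maps to the path /dev/tenstorrent/N, which is suggested by the error message but not confirmed by the dstack source; the function names are hypothetical helpers written for this example.

```python
import re
from pathlib import Path

# Matches the "gpus=[2 3 4 ...]" field as it appears in shim.log lines.
GPUS_RE = re.compile(r"gpus=\[([\d ]*)\]")


def acquired_gpu_indices(log_line: str) -> list[int]:
    """Extract the GPU indices from a single shim.log line, in logged order."""
    match = GPUS_RE.search(log_line)
    if match is None:
        return []
    return [int(token) for token in match.group(1).split()]


def missing_device_nodes(indices, dev_dir="/dev/tenstorrent") -> list[int]:
    """Return the sorted indices that have no node under dev_dir.

    Assumption: index N corresponds to the path <dev_dir>/N.
    """
    base = Path(dev_dir)
    return sorted(i for i in indices if not (base / str(i)).exists())


if __name__ == "__main__":
    line = "level=debug msg=acquired GPU(s) gpus=[2 3 4 5 6 7 0 1]"
    indices = acquired_gpu_indices(line)
    print("acquired:", indices)
    print("missing under /dev/tenstorrent:", missing_device_nodes(indices))
```

Running this on the Tenstorrent server would show whether the indices 4 through 7 lack device nodes, which would be consistent with the error but is not established by the logs alone.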

Additional information

No response

Labels

bug