
[Bug]: Server Validation Error when applying service configuration with ssh fleets in dstack sky #2547

@Bihan

Description


Steps to reproduce

I created a dstack SSH fleet with Hot Aisle's single MI300X GPU using dstack Sky. When I then applied the service configuration, a server validation error occurred.

Steps to reproduce:

  1. dstack apply -f hotaisle.fleet.yml
  2. dstack apply -f .dstack.yml

Configurations:

  1. .dstack.yml
type: service
name: deepseek-r1-amd

image: rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250410
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  - MAX_MODEL_LEN=126432
commands:
  - vllm serve $MODEL_ID
    --max-model-len $MAX_MODEL_LEN
    --trust-remote-code
port: 8000

model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

volumes:
  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub

resources:
    gpu: mi300x
    disk: 300GB..

  2. hotaisle.fleet.yml

type: fleet
# The name is optional, if not specified, generated randomly
name: hotaisle-fleet

# Uncomment if instances are interconnected
#placement: cluster

# SSH credentials for the on-prem servers
ssh_config:
  user: hotaisle
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 23.183.40.75

Actual behaviour


[shim.log](https://github.com/user-attachments/files/19850803/shim.log)

$ dstack apply -f .dstack.yml
Project          dstack-team-pool                         
 User             Bihan                                    
 Configuration    .dstack.yml                              
 Type             service                                  
 Resources        2..xCPU, 8GB.., 1xmi300x, 300GB.. (disk) 
 Max price        -                                        
 Max duration     -                                        
 Spot policy      on-demand                                
 Retry policy     -                                        
 Creation policy  reuse-or-create                          
 Idle duration    5m                                       
 Reservation      -                                        

 #  BACKEND  REGION  INSTANCE  RESOURCES                                         SPOT  PRICE       
 1  ssh      remote  instance  8xCPU, 220GB, 1xMI300X (192GB), 11149.2GB (disk)  no    $0     idle 

Finished run deepseek-r1-amd already exists.
Override the run? [y/n]: y
Server validation error: 
{'detail': [{'loc': ['body',
                     'plan',
                     'current_resource',
                     'run_spec',
                     'configuration',
                     'ServiceConfigurationRequest',
                     'tags'],
             'msg': 'extra fields not permitted',
             'type': 'value_error.extra'},
            {'loc': ['body',
                     'plan',
                     'current_resource',
                     'run_spec',
                     'profile',
                     'tags'],
             'msg': 'extra fields not permitted',
             'type': 'value_error.extra'},
            {'loc': ['body',
                     'plan',
                     'current_resource',
                     'run_spec',
                     '__root__'],
             'msg': 'Missing configuration',
             'type': 'value_error'}]}

Expected behaviour

Submit the run deepseek-r1-amd? [y/n]: y
 NAME             BACKEND       RESOURCES                                         PRICE  STATUS   SUBMITTED 
 deepseek-r1-amd  ssh (remote)  8xCPU, 220GB, 1xMI300X (192GB), 11451.2GB (disk)  $0.0   running  09:35     

deepseek-r1-amd provisioning completed (running)
Service is published at:
  https://deepseek-r1-amd.dstack-team-pool.sky.dstack.ai/
Model deepseek-ai/DeepSeek-R1-Distill-Llama-70B is published at:
  https://gateway.dstack-team-pool.sky.dstack.ai/

INFO 04-22 03:54:14 [__init__.py:239] Automatically detected platform rocm.
INFO 04-22 03:54:15 [api_server.py:1034] vLLM API server version 0.8.3.dev349+gb8498bc4a
INFO 04-22 03:54:15 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='deepseek-ai/DeepSeek-R1-Distill-Llama-70B', config='', 
...
...
...
INFO 04-22 03:54:38 [model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1-Distill-Llama-70B...
INFO 04-22 03:54:38 [weight_utils.py:265] Using model weights format ['*.safetensors']
model-00003-of-000017.safetensors: 100% 1.58G/1.58G [00:10<00:00, 149MB/s] 
model-00005-of-000017.safetensors: 100% 8.42G/8.42G [00:55<00:00, 152MB/s] 
model-00007-of-000017.safetensors: 100% 8.42G/8.42G [00:56<00:00, 150MB/s] 
model-00002-of-000017.safetensors: 100% 8.69G/8.69G [00:57<00:00, 151MB/s] 
model-00004-of-000017.safetensors: 100% 8.69G/8.69G [00:57<00:00, 150MB/s] 
model-00006-of-000017.safetensors: 100% 8.69G/8.69G [00:57<00:00, 150MB/s] 
model-00008-of-000017.safetensors: 100% 8.69G/8.69G [00:57<00:00, 150MB/s] 
model-00001-of-000017.safetensors: 100% 8.95G/8.95G [00:58<00:00, 152MB/s] 
model-00009-of-000017.safetensors: 100% 8.42G/8.42G [00:56<00:00, 150MB/s]
model-00011-of-000017.safetensors: 100% 8.42G/8.42G [00:53<00:00, 156MB/s] 
model-00010-of-000017.safetensors: 100% 8.69G/8.69G [00:55<00:00, 157MB/s]
model-00013-of-000017.safetensors: 100% 8.42G/8.42G [00:53<00:00, 158MB/s]
model-00015-of-000017.safetensors: 100% 8.42G/8.42G [00:53<00:00, 157MB/s]
model-00012-of-000017.safetensors: 100% 8.69G/8.69G [00:54<00:00, 160MB/s] 
model-00014-of-000017.safetensors: 100% 8.69G/8.69G [00:54<00:00, 158MB/s]
model-00016-of-000017.safetensors: 100% 8.69G/8.69G [00:54<00:00, 160MB/s]
model-00017-of-000017.safetensors: 100% 10.5G/10.5G [00:58<00:00, 181MB/s] 
...
...

dstack version

Dstack Repo Version: 2102b1b

Server logs

Additional information

Works perfectly well with dstack CLI 0.19.4.

Labels

bug (Something isn't working)
