[Bug] Can't benchmark deepseek_v2 with dummy weights #4405

### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.

### Describe the bug

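`bench_offline_throughput` crashes with `--load-format=dummy` on `--model-path=deepseek-ai/DeepSeek-V2-Lite`. The scheduler fails during CUDA graph capture with `AttributeError: 'NoneType' object has no attribute 'dtype'`, raised at `self.w_kc.dtype` in `forward_absorb` (`deepseek_v2.py`, line 649).

The same command works when loading real weights (no `--load-format=dummy`).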

### Reproduction


```
python -m sglang.bench_offline_throughput --load-format=dummy --model-path=deepseek-ai/DeepSeek-V2-Lite --trust-remote-code
```

```
INFO 03-14 00:57:48 __init__.py:190] Automatically detected platform cuda.
[2025-03-14 00:57:52] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-V2-Lite', tokenizer_path='deepseek-ai/DeepSeek-V2-Lite', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='dummy', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='deepseek-ai/DeepSeek-V2-Lite', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=24544728, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, 
disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False)
INFO 03-14 00:57:56 __init__.py:190] Automatically detected platform cuda.
INFO 03-14 00:57:56 __init__.py:190] Automatically detected platform cuda.
[2025-03-14 00:58:01 TP0] MLA optimization is turned on. Use triton backend.
[2025-03-14 00:58:01 TP0] Init torch distributed begin.
[2025-03-14 00:58:02 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-03-14 00:58:02 TP0] Load weight begin. avail mem=78.60 GB
[2025-03-14 00:58:02 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-14 00:58:02 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=49.16 GB, mem usage=29.44 GB.
[2025-03-14 00:58:02 TP0] Memory pool end. avail mem=6.89 GB
[2025-03-14 00:58:02 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=6.79 GB
  0%|                                                                                                           | 0/23 [00:00<?, ?it/s]
[2025-03-14 00:58:03 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1714, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 218, in __init__
    self.tp_worker = TpWorkerClass(
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 74, in __init__
    self.model_runner = ModelRunner(
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 166, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 207, in initialize
    self.init_cuda_graphs()
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 881, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 251, in __init__
    self.capture()
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 323, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 402, in capture_one_batch_size
    run_once()
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 395, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 1086, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 1040, in forward
    hidden_states, residual = layer(
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 976, in forward
    hidden_states = self.self_attn(
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 585, in forward
    return self.forward_absorb(positions, hidden_states, forward_batch)
  File "/home/user/sgl-repro/.venv/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 649, in forward_absorb
    if self.w_kc.dtype == torch.float8_e4m3fnuz:
AttributeError: 'NoneType' object has no attribute 'dtype'
```
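The traceback ends at `self.w_kc.dtype` with `w_kc` being `None`. One possible explanation, which I have not confirmed against the sglang source, is that `w_kc` is derived in a post-processing step that only runs on the real weight-loading path and is skipped for dummy weights. The toy sketch below illustrates that failure pattern only. Every name in it except `w_kc` and `forward_absorb` is hypothetical (`ToyTensor`, `ToyAttention`, `post_load`, `load_real`, `load_dummy`, `run_post_load`), and it does not use any sglang code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyTensor:
    """Hypothetical stand-in for a tensor; only `dtype` matters here."""
    values: List[float]
    dtype: str = "bfloat16"


class ToyAttention:
    """Hypothetical layer whose `w_kc` is derived after weights are loaded."""

    def __init__(self) -> None:
        self.kv_b_proj = ToyTensor([0.0] * 4)      # allocated at construction
        self.w_kc: Optional[ToyTensor] = None      # filled in by post_load()

    def forward_absorb(self) -> str:
        # Same shape as the failing access: no None check before `.dtype`.
        return self.w_kc.dtype


def post_load(layer: ToyAttention) -> None:
    """Assumed post-processing step that derives w_kc from kv_b_proj."""
    layer.w_kc = ToyTensor(layer.kv_b_proj.values[:2], layer.kv_b_proj.dtype)


def load_real(layer: ToyAttention) -> None:
    """Real path: read a (pretend) checkpoint, then derive w_kc."""
    layer.kv_b_proj = ToyTensor([1.0, 2.0, 3.0, 4.0])
    post_load(layer)


def load_dummy(layer: ToyAttention, run_post_load: bool = False) -> None:
    """Dummy path: fill placeholder values; post-processing is optional."""
    layer.kv_b_proj = ToyTensor([0.5] * 4)
    if run_post_load:
        post_load(layer)


if __name__ == "__main__":
    real = ToyAttention()
    load_real(real)
    print("real weights:", real.forward_absorb())

    dummy = ToyAttention()
    load_dummy(dummy)  # post-processing skipped, w_kc stays None
    try:
        dummy.forward_absorb()
    except AttributeError as exc:
        print("dummy weights:", exc)
```

If this is what happens, running the same post-processing after dummy initialization (the `run_post_load=True` branch in the sketch) would avoid the `None` access. Whether that is the right fix for sglang is for the maintainers to judge.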

### Environment

Python: 3.10.14 (main, Jan 18 2025, 03:01:18) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 535.161.08
PyTorch: 2.5.1+cu124
sglang: 0.4.4
sgl_kernel: 0.0.5
flashinfer: 0.2.3
triton: 3.1.0
transformers: 4.48.3
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.13
fastapi: 0.115.11
hf_transfer: 0.1.9
huggingface_hub: 0.29.3
interegular: 0.3.3
modelscope: 1.23.2
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.3.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.66.3
tiktoken: 0.9.0
anthropic: 0.49.0
decord: 0.6.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7  NIC8     CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS   SYS      0-87    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS   SYS      0-87    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS   SYS      0-87    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS   SYS      0-87    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB   PHB      88-175  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB   PHB      88-175  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB   PHB      88-175  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB   PHB      88-175  1               N/A
NIC0    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    SYS     SYS     SYS   SYS
NIC1    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE     X      PHB     PHB     PHB     SYS     SYS     SYS   SYS
NIC2    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB      X      PHB     PHB     SYS     SYS     SYS   SYS
NIC3    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB      X      PHB     SYS     SYS     SYS   SYS
NIC4    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB      X      SYS     SYS     SYS   SYS
NIC5    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS      X      PHB     PHB   PHB
NIC6    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB      X      PHB   PHB
NIC7    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB      X    PHB
NIC8    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB    X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8


Hypervisor vendor: KVM
ulimit soft: 1024

