[Bug] DP + MTP init failed with deepseek r1

### Checklist

- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose instead. Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.

### Describe the bug

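Server initialization fails when DP attention (`--enable-dp-attention --dp 4`) is combined with MTP speculative decoding (`--speculative-algo NEXTN`) on DeepSeek R1. During draft CUDA graph capture, `forward_batch.global_num_tokens_gpu` is `None` when `get_dp_local_info` passes it to `torch.cumsum`, which raises a `TypeError`.

Error message: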

[2025-03-26 14:26:06 DP3 TP12] Capture draft cuda graph begin. This can take up to several minutes. avail mem=15.75 GB
[2025-03-26 14:26:06 DP2 TP8] Capture draft cuda graph begin. This can take up to several minutes. avail mem=16.09 GB
[2025-03-26 14:26:06 DP2 TP9] Capture draft cuda graph begin. This can take up to several minutes. avail mem=15.76 GB
[2025-03-26 14:26:06 DP3 TP13] Capture draft cuda graph begin. This can take up to several minutes. avail mem=16.09 GB
[2025-03-26 14:26:06 DP3 TP15] Capture draft cuda graph begin. This can take up to several minutes. avail mem=16.09 GB
[2025-03-26 14:26:06 DP2 TP11] Capture draft cuda graph begin. This can take up to several minutes. avail mem=15.75 GB
[2025-03-26 14:26:06 DP3 TP14] Capture draft cuda graph begin. This can take up to several minutes. avail mem=15.75 GB
[2025-03-26 14:26:06 DP2 TP10] Capture draft cuda graph begin. This can take up to several minutes. avail mem=16.10 GB
[2025-03-26 14:26:08 DP2 TP9] Scheduler hit an exception: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1972, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 256, in __init__
    self.draft_worker = EAGLEWorker(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 143, in __init__
    self.init_cuda_graphs()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 207, in init_cuda_graphs
    self.cuda_graph_runner = EAGLEDraftCudaGraphRunner(self)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 82, in __init__
    self.capture()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 103, in capture
    CudaGraphRunner.capture(self)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 348, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 166, in capture_one_batch_size
    run_once()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 156, in run_once
    ret = self.eagle_worker.draft_forward(forward_batch)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 401, in draft_forward
    logits_output = self.draft_model_runner.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/deepseek_nextn.py", line 159, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/deepseek_nextn.py", line 110, in forward
    hidden_states, residual = self.decoder(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 1085, in forward
    dp_gather_partial(hidden_states, local_hidden_states, forward_batch)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/dp_attention.py", line 221, in dp_gather_partial
    _dp_gather(global_tokens, local_tokens, forward_batch, is_partial=True)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/dp_attention.py", line 187, in _dp_gather
    local_start_pos, local_num_tokens = get_dp_local_info(forward_batch)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/dp_attention.py", line 127, in get_dp_local_info
    cumtokens = torch.cumsum(forward_batch.global_num_tokens_gpu, dim=0)
TypeError: cumsum() received an invalid combination of arguments - got (NoneType, dim=int), but expected one of:
 * (Tensor input, int dim, *, torch.dtype dtype = None, Tensor out = None)
 * (Tensor input, name dim, *, torch.dtype dtype = None, Tensor out = None)
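
The last frames of the traceback suggest that each DP rank derives its start offset and token count from a cumulative sum over the per-rank token counts, and that this input is unset during draft graph capture. Below is a minimal pure-Python sketch of that step, not the sglang source: `dp_local_info` and its parameters are hypothetical names, and the explicit `None` check is only an illustration of where the failure could be reported more clearly, not a proposed fix.

```python
from itertools import accumulate
from typing import Optional, Sequence, Tuple


def dp_local_info(
    global_num_tokens: Optional[Sequence[int]], dp_rank: int
) -> Tuple[int, int]:
    """Return (local_start_pos, local_num_tokens) for one DP rank.

    Hypothetical stand-in for the cumsum step seen in the traceback.
    """
    if global_num_tokens is None:
        # The state the draft CUDA graph capture appears to hit.
        raise ValueError("global_num_tokens is not set for this forward batch")
    # Running total of tokens across DP ranks.
    cumtokens = list(accumulate(global_num_tokens))
    # Rank 0 starts at offset 0; rank k starts where rank k-1 ends.
    local_start_pos = 0 if dp_rank == 0 else cumtokens[dp_rank - 1]
    return local_start_pos, global_num_tokens[dp_rank]


print(dp_local_info([4, 4, 4, 4], dp_rank=2))  # (8, 4)
try:
    dp_local_info(None, dp_rank=2)
except ValueError as exc:
    print(f"rejected: {exc}")
```

With four DP ranks holding 4 tokens each, rank 2 starts at offset 8. Passing `None` fails immediately with a descriptive message instead of a `TypeError` deep inside the gather.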

### Reproduction

Node 0:
python -m sglang.launch_server --served-model-name auto --port 11086 --model-path ${MODEL_PATH} --tp 16 --dist-init-addr ${DRIVER_IP}:50000 --nnodes 2 --node-rank 0 --trust-remote-code --disable-radix-cache --max-running-requests 130 --cuda-graph-bs 1 4 8 16 32 64 96 128 --cuda-graph-max-bs 130 --speculative-algo NEXTN --speculative-draft $MTP_MODEL --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-dp-attention --dp 4

Node 1:
python -m sglang.launch_server --served-model-name auto --port 11086 --model-path ${MODEL_PATH} --tp 16 --dist-init-addr ${DRIVER_IP}:50000 --nnodes 2 --node-rank 1 --trust-remote-code --disable-radix-cache --max-running-requests 130 --cuda-graph-bs 1 4 8 16 32 64 96 128 --cuda-graph-max-bs 130 --speculative-algo NEXTN --speculative-draft $MTP_MODEL --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-dp-attention --dp 4
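
The two launch commands are identical except for `--node-rank`. Here is a small sketch of a helper that builds both from one template, which may make the reproduction easier to script. The helper name `build_launch_cmd` and its parameters are hypothetical; the flags and values are copied from the commands above, and the placeholder strings stand in for the real paths and address.

```python
from typing import List


def build_launch_cmd(
    node_rank: int, model_path: str, driver_ip: str, mtp_model: str
) -> List[str]:
    """Build the launch command for one node (hypothetical helper)."""
    return [
        "python", "-m", "sglang.launch_server",
        "--served-model-name", "auto",
        "--port", "11086",
        "--model-path", model_path,
        "--tp", "16",
        "--dist-init-addr", f"{driver_ip}:50000",
        "--nnodes", "2",
        "--node-rank", str(node_rank),  # the only per-node difference
        "--trust-remote-code",
        "--disable-radix-cache",
        "--max-running-requests", "130",
        "--cuda-graph-bs", "1", "4", "8", "16", "32", "64", "96", "128",
        "--cuda-graph-max-bs", "130",
        "--speculative-algo", "NEXTN",
        "--speculative-draft", mtp_model,
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
        "--enable-dp-attention",
        "--dp", "4",
    ]


for rank in (0, 1):
    # Placeholders are left unexpanded; substitute real values before running.
    cmd = build_launch_cmd(rank, "${MODEL_PATH}", "${DRIVER_IP}", "$MTP_MODEL")
    print(" ".join(cmd))
```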

### Environment

Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.61
CUDA Driver Version: 535.183.06
PyTorch: 2.7.0.dev20250303+cu128
sglang: 0.4.4.post1
sgl_kernel: 0.0.5.post3
flashinfer: 0.2.3
triton: 3.2.0
transformers: 4.46.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.10
fastapi: 0.115.6
hf_transfer: Module Not Found
huggingface_hub: 0.26.5
interegular: 0.3.3
modelscope: Module Not Found
orjson: 3.10.12
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.6
multipart: 0.0.19
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.7.2.post1.dev11+g0e33649b5.d20250305
openai: 1.58.1
tiktoken: 0.9.0
anthropic: Module Not Found
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    SYS     SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    PIX     SYS     SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     PIX     NODE    48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     NODE    NODE    48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     NODE    PIX     48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     NODE    NODE    48-95,144-191   1               N/A
NIC0    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    SYS     SYS
NIC1    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS      X      NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3


ulimit soft: 1048576


[Bug] DP + MTP init failed with deepseek r1 #4783
