Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
Environment: Nvidia H200 * 8 (single node)
Docker Images: lmsysorg/sglang:dev-deepep (for testing DeepEP); lmsysorg/sglang:v0.4.4.post1-cu125 (for testing EPMoE; newer versions break with --enable-ep-moe)
CUDA: release 12.4, V12.4.131
Model: deepseek-ai/DeepSeek-R1
Case: MLA: DP=8, MoE: EP=8
Server:
EPMoE:
SGL_ENABLE_JIT_DEEPGEMM=0 python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --trust-remote-code --tp 8 --ep-size 8 --enable-ep-moe --enable-torch-compile --moe-dense-tp-size 1 --disable-cuda-graph --enable-dp-attention --dp-size 8
The same failure also occurs in the DeepEP case with the newest Docker image (lmsysorg/sglang:dev-deepep):
DeepEP:
SGL_ENABLE_JIT_DEEPGEMM=0 python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --trust-remote-code --tp 8 --ep-size 8 --enable-deepep-moe --deepep-mode normal --enable-torch-compile --moe-dense-tp-size 1 --disable-cuda-graph --enable-dp-attention --dp-size 8
Client:
python3 benchmark/gsm8k/bench_sglang.py --port 30000 --parallel 1400 --num-questions 1400
Result (with --enable-dp-attention --dp-size 8):
Accuracy: 0.015
Invalid: 0.173
Latency: 115.351 s
Output throughput: 5255.786 token/s
Result without --enable-dp-attention --dp-size 8, for comparison:
Accuracy: 0.956
Invalid: 0.000
Latency: 136.963 s
Output throughput: 929.664 token/s
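To sanity-check whether individual generations are garbled without rerunning the full GSM8K benchmark, a single greedy request against the server's native /generate endpoint can be inspected (a minimal sketch; assumes the server launched above is listening on port 30000, and the prompt is only illustrative):
curl -s http://127.0.0.1:30000/generate -H "Content-Type: application/json" -d '{"text": "Question: What is 12 + 35? Answer:", "sampling_params": {"temperature": 0, "max_new_tokens": 64}}'
Inspecting the returned text for garbled tokens should mirror the accuracy drop reported above.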
Server Error:
EPMoE:
[2025-06-10 07:42:53 DP5 TP5] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 112, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 143, in forward_thread_func
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 172, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 946, in forward
return self.forward_decode(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 897, in forward_decode
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1079, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1041, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 980, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 202, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 620, in forward
final_hidden_states = self.quant_method.apply(
File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 928, in apply
return fused_experts(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 924, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in call
return self.op(*args, **(kwargs or {}))
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 793, in inplace_fused_experts
fused_experts_impl(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1076, in fused_experts_impl
sorted_token_ids, expert_ids, num_tokens_post_padded = moe_align_block_size(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 453, in moe_align_block_size
sorted_ids.fill_(topk_ids.numel())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 111, in forward_thread_func
with torch.get_device_module(self.device).stream(self.forward_stream):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 595, in exit
torch.cuda.set_stream(self.src_prev_stream) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 962, in _patched_set_stream
prev_set_stream(stream)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 636, in set_stream
_set_stream_by_id(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py", line 618, in _set_stream_by_id
torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
[rank5]:[E610 07:42:53.345100746 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9077cb9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9077c636e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9077da5a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f902dc25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f902dc2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f902dc31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f902dc3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f907815e5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7f911b3f8ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f911b489a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9077cb9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9077c636e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9077da5a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f902dc25726 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f902dc2a3f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f902dc31b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f902dc3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f907815e5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7f911b3f8ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f911b489a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9077cb9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7f902d8a071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f907815e5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7f911b3f8ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f911b489a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Fatal Python error: Aborted
Thread 0x00007f8bd4fc9640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f8bd47c8640 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1400 in watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f8bd37c6640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f911b363480 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 174 in resolve_batch_result
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 190 in process_batch_result_decode
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1247 in process_batch_result
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 533 in event_loop_overlap
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1757 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "", line 1 in
Reproduction
Docker Image:
lmsysorg/sglang:dev-deepep (For testing DeepEP)
lmsysorg/sglang:v0.4.4.post1-cu125 (for testing EPMoE; newer versions break with --enable-ep-moe)
CUDA: release 12.4, V12.4.131
Model: deepseek-ai/DeepSeek-R1
Case: MLA: DP=8, MoE: EP=8
Server:
EPMoE:
SGL_ENABLE_JIT_DEEPGEMM=0 python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --trust-remote-code --tp 8 --ep-size 8 --enable-ep-moe --enable-torch-compile --moe-dense-tp-size 1 --disable-cuda-graph --enable-dp-attention --dp-size 8
The same failure also occurs in the DeepEP case with the newest Docker image (lmsysorg/sglang:dev-deepep):
DeepEP:
SGL_ENABLE_JIT_DEEPGEMM=0 python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --trust-remote-code --tp 8 --ep-size 8 --enable-deepep-moe --deepep-mode normal --enable-torch-compile --moe-dense-tp-size 1 --disable-cuda-graph --enable-dp-attention --dp-size 8
Client:
python3 benchmark/gsm8k/bench_sglang.py --port 30000 --parallel 1400 --num-questions 1400
Environment
Environment: Nvidia H200 * 8 (single node)
Docker Image:
lmsysorg/sglang:dev-deepep (for testing DeepEP); lmsysorg/sglang:v0.4.4.post1-cu125 (for testing EPMoE; newer versions break with --enable-ep-moe)
CUDA: release 12.4, V12.4.131
Model: deepseek-ai/DeepSeek-R1
Case: MLA: DP=8, MoE: EP=8
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS NODE 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE SYS SYS SYS SYS NODE 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE PIX NODE SYS SYS SYS SYS NODE 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS NODE 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE SYS 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE PIX NODE NODE SYS 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE PIX NODE SYS 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE NODE PIX SYS 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS NODE
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS NODE
NIC2 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS NODE
NIC3 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS NODE
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS
NIC5 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS
NIC6 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS
NIC7 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS
NIC8 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_5
NIC4: mlx5_6
NIC5: mlx5_7
NIC6: mlx5_8
NIC7: mlx5_9
NIC8: mlx5_bond_0