Skip to content

[Bug] fused_moe OOM when run deepseek-r1 with --speculative-algo NEXTN #3633

@Lzhang-hub

Description

@Lzhang-hub

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When use --speculative-algo NEXTN, server got OOM after run about 30min, error log:

opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching.  Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
[2025-02-17 09:44:25 TP15] Scheduler hit an exception: Traceback (most recent call last):
  File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 478, in event_loop_normal
    result = self.run_batch(batch)
  File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1089, in run_batch
    ) = self.draft_worker.forward_batch_speculative_generation(batch)
  File "/kesgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 143, in forward_batch_speculative_generation
    logits_output, next_token_ids = self.target_worker.forward_batch_generation(
  File "/kesgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/kesgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
    return self.forward_extend(forward_batch)
  File "/kesgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
    return self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 871, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 832, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 787, in forward
    hidden_states = self.mlp(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
    self.experts(hidden_states=hidden_states, router_logits=router_logits)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
    final_hidden_states = self.quant_method.apply(
  File "/kesgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
    return fused_experts(
  File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 851, in fused_experts
    torch.ops.sglang.inplace_fused_experts(
  File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 731, in inplace_fused_experts
    fused_experts_impl(
  File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 946, in fused_experts_impl
    intermediate_cache3 = torch.empty(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.69 GiB. GPU 7 has a total capacity of 95.00 GiB of which 638.31 MiB is free. Process 2047768 has 94.37 GiB memory in use. Of the allocat
ed memory 86.76 GiB is allocated by PyTorch, with 3.50 GiB allocated in private pools (e.g., CUDA Graphs), and 2.48 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is la
rge try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variable
s)

Reproduction

I use 2 H20*8 run model deepseek-r1, rank0 server start cmd:

python3 -m sglang.launch_server --model $MODEL_PATH --tp 16 --nccl-init-addr $SERVER_HOST:$SERVER_PORT --port 8000 --host 0.0.0.0 --nnodes 2 --node-rank 0 --trust-remote-code --speculative-algo NEXTN --speculative-draft $MTP_MODEL_PATH --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --watchdog-timeout 1800 --enable-torch-compile --torch-compile-max-bs 2  --log-requests --served-model-name deepseek-r1 --mem-fraction-static 0.9

Environment

Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.07
PyTorch: 2.5.1+cu124
sglang: 0.4.3
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post1+cu124torch2.5
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 23.1
psutil: 6.1.1
pydantic: 2.10.1
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.61.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX     PHB     NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    PHB     PIX     NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    96-191,288-383  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    96-191,288-383  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     NODE    NODE    PIX     PHB     96-191,288-383  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     NODE    NODE    PHB     PIX     96-191,288-383  1               N/A
NIC0    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC1    NODE    PIX     PHB     NODE    SYS     SYS     SYS     SYS     NODE     X      PHB     NODE    SYS     SYS     SYS     SYS
NIC2    NODE    PHB     PIX     NODE    SYS     SYS     SYS     SYS     NODE    PHB      X      NODE    SYS     SYS     SYS     SYS
NIC3    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE
NIC5    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    PIX     PHB     SYS     SYS     SYS     SYS     NODE    NODE     X      PHB
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    PHB     PIX     SYS     SYS     SYS     SYS     NODE    NODE    PHB      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
  NIC4: mlx5_bond_4
  NIC5: mlx5_bond_5
  NIC6: mlx5_bond_6
  NIC7: mlx5_bond_7


ulimit soft: 1048576

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions