-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Closed
Labels
Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
When use --speculative-algo NEXTN, server got OOM after run about 30min, error log:
opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
warnings.warn('resource_tracker: process died unexpectedly, '
[2025-02-17 09:44:25 TP15] Scheduler hit an exception: Traceback (most recent call last):
File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
scheduler.event_loop_normal()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 478, in event_loop_normal
result = self.run_batch(batch)
File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1089, in run_batch
) = self.draft_worker.forward_batch_speculative_generation(batch)
File "/kesgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 143, in forward_batch_speculative_generation
logits_output, next_token_ids = self.target_worker.forward_batch_generation(
File "/kesgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/kesgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
return self.forward_extend(forward_batch)
File "/kesgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
return self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 871, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 832, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 787, in forward
hidden_states = self.mlp(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
final_hidden_states = self.quant_method.apply(
File "/kesgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
return fused_experts(
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 851, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 731, in inplace_fused_experts
fused_experts_impl(
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 946, in fused_experts_impl
intermediate_cache3 = torch.empty(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.69 GiB. GPU 7 has a total capacity of 95.00 GiB of which 638.31 MiB is free. Process 2047768 has 94.37 GiB memory in use. Of the allocat
ed memory 86.76 GiB is allocated by PyTorch, with 3.50 GiB allocated in private pools (e.g., CUDA Graphs), and 2.48 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is la
rge try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variable
s)
Reproduction
I use 2 H20*8 run model deepseek-r1, rank0 server start cmd:
python3 -m sglang.launch_server --model $MODEL_PATH --tp 16 --nccl-init-addr $SERVER_HOST:$SERVER_PORT --port 8000 --host 0.0.0.0 --nnodes 2 --node-rank 0 --trust-remote-code --speculative-algo NEXTN --speculative-draft $MTP_MODEL_PATH --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --watchdog-timeout 1800 --enable-torch-compile --torch-compile-max-bs 2 --log-requests --served-model-name deepseek-r1 --mem-fraction-static 0.9
Environment
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.07
PyTorch: 2.5.1+cu124
sglang: 0.4.3
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post1+cu124torch2.5
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 23.1
psutil: 6.1.1
pydantic: 2.10.1
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.61.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX PHB NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE PHB PIX NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE 96-191,288-383 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE PIX NODE NODE 96-191,288-383 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE PIX PHB 96-191,288-383 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE PHB PIX 96-191,288-383 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE PIX PHB NODE SYS SYS SYS SYS NODE X PHB NODE SYS SYS SYS SYS
NIC2 NODE PHB PIX NODE SYS SYS SYS SYS NODE PHB X NODE SYS SYS SYS SYS
NIC3 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE
NIC5 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE
NIC6 SYS SYS SYS SYS NODE NODE PIX PHB SYS SYS SYS SYS NODE NODE X PHB
NIC7 SYS SYS SYS SYS NODE NODE PHB PIX SYS SYS SYS SYS NODE NODE PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
NIC4: mlx5_bond_4
NIC5: mlx5_bond_5
NIC6: mlx5_bond_6
NIC7: mlx5_bond_7
ulimit soft: 1048576