Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose instead. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
============= error msg =======
[2025-03-26 14:26:06 DP3 TP12] Capture draft cuda graph begin. This can take up to several minutes. avail mem=15.75 GB
[2025-03-26 14:26:06 DP2 TP8] Capture draft cuda graph begin. This can take up to several minutes. avail mem=16.09 GB
[2025-03-26 14:26:06 DP2 TP9] Capture draft cuda graph begin. This can take up to several minutes. avail mem=15.76 GB
[2025-03-26 14:26:06 DP3 TP13] Capture draft cuda graph begin. This can take up to several minutes. avail mem=16.09 GB
[2025-03-26 14:26:06 DP3 TP15] Capture draft cuda graph begin. This can take up to several minutes. avail mem=16.09 GB
[2025-03-26 14:26:06 DP2 TP11] Capture draft cuda graph begin. This can take up to several minutes. avail mem=15.75 GB
[2025-03-26 14:26:06 DP3 TP14] Capture draft cuda graph begin. This can take up to several minutes. avail mem=15.75 GB
[2025-03-26 14:26:06 DP2 TP10] Capture draft cuda graph begin. This can take up to several minutes. avail mem=16.10 GB
[2025-03-26 14:26:08 DP2 TP9] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1972, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 256, in __init__
self.draft_worker = EAGLEWorker(
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 143, in __init__
self.init_cuda_graphs()
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 207, in init_cuda_graphs
self.cuda_graph_runner = EAGLEDraftCudaGraphRunner(self)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 82, in __init__
self.capture()
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 103, in capture
CudaGraphRunner.capture(self)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 348, in capture
) = self.capture_one_batch_size(bs, forward)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 166, in capture_one_batch_size
run_once()
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 156, in run_once
ret = self.eagle_worker.draft_forward(forward_batch)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 401, in draft_forward
logits_output = self.draft_model_runner.model.forward(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/deepseek_nextn.py", line 159, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/deepseek_nextn.py", line 110, in forward
hidden_states, residual = self.decoder(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 1085, in forward
dp_gather_partial(hidden_states, local_hidden_states, forward_batch)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/dp_attention.py", line 221, in dp_gather_partial
_dp_gather(global_tokens, local_tokens, forward_batch, is_partial=True)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/dp_attention.py", line 187, in _dp_gather
local_start_pos, local_num_tokens = get_dp_local_info(forward_batch)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/dp_attention.py", line 127, in get_dp_local_info
cumtokens = torch.cumsum(forward_batch.global_num_tokens_gpu, dim=0)
TypeError: cumsum() received an invalid combination of arguments - got (NoneType, dim=int), but expected one of:
- (Tensor input, int dim, *, torch.dtype dtype = None, Tensor out = None)
- (Tensor input, name dim, *, torch.dtype dtype = None, Tensor out = None)
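The last frame fails because `forward_batch.global_num_tokens_gpu` is `None` when the draft CUDA graph is captured with DP attention enabled. Below is a minimal sketch of what that step appears to compute, assuming the field holds one token count per DP rank. It is not sglang's implementation: `dp_local_info` is a hypothetical name, and plain Python ints stand in for the GPU tensor so the sketch runs without torch.

```python
from itertools import accumulate
from typing import Optional, Sequence, Tuple


def dp_local_info(
    global_num_tokens: Optional[Sequence[int]], dp_rank: int
) -> Tuple[int, int]:
    """Return (start offset, token count) for one DP rank.

    Hypothetical stand-in for the cumsum step in the traceback; plain ints
    replace the GPU tensor so the sketch runs without torch.
    """
    if global_num_tokens is None:
        # The state the traceback shows: the per-rank counts were never set.
        raise ValueError(
            "global_num_tokens is None; it must be populated before the DP gather"
        )
    cumtokens = list(accumulate(global_num_tokens))  # inclusive prefix sums
    start = 0 if dp_rank == 0 else cumtokens[dp_rank - 1]
    return start, global_num_tokens[dp_rank]
```

With counts `[3, 5, 2, 4]`, `dp_local_info([3, 5, 2, 4], 2)` returns `(8, 2)`. Passing `None` raises immediately with a readable message instead of the `cumsum()` argument error above. A guard like this only makes the failure clearer; it does not populate the missing field.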
Reproduction
--- node0
python -m sglang.launch_server --served-model-name auto --port 11086 --model-path ${MODEL_PATH} --tp 16 --dist-init-addr ${DRIVER_IP}:50000 --nnodes 2 --node-rank 0 --trust-remote-code --disable-radix-cache --max-running-requests 130 --cuda-graph-bs 1 4 8 16 32 64 96 128 --cuda-graph-max-bs 130 --speculative-algo NEXTN --speculative-draft $MTP_MODEL --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-dp-attention --dp 4
--- node1
python -m sglang.launch_server --served-model-name auto --port 11086 --model-path ${MODEL_PATH} --tp 16 --dist-init-addr ${DRIVER_IP}:50000 --nnodes 2 --node-rank 1 --trust-remote-code --disable-radix-cache --max-running-requests 130 --cuda-graph-bs 1 4 8 16 32 64 96 128 --cuda-graph-max-bs 130 --speculative-algo NEXTN --speculative-draft $MTP_MODEL --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-dp-attention --dp 4
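For reading the log prefixes against these flags, here is a small sketch of the rank arithmetic. It assumes TP ranks are assigned to DP groups and to nodes in contiguous blocks, which is consistent with the `DP2 TP8` to `DP3 TP15` prefixes in the log but is not confirmed here from sglang's source. The helper names are hypothetical.

```python
def dp_rank_of(tp_rank: int, tp_size: int = 16, dp_size: int = 4) -> int:
    """Map a TP rank to its DP-attention group, assuming contiguous blocks."""
    if tp_size % dp_size != 0:
        raise ValueError("tp_size must be divisible by dp_size")
    return tp_rank // (tp_size // dp_size)


def node_of(tp_rank: int, tp_size: int = 16, nnodes: int = 2) -> int:
    """Map a TP rank to its node, assuming ranks are split evenly across nodes."""
    if tp_size % nnodes != 0:
        raise ValueError("tp_size must be divisible by nnodes")
    return tp_rank // (tp_size // nnodes)
```

Under these assumptions, `[dp_rank_of(r) for r in range(8, 16)]` gives `[2, 2, 2, 2, 3, 3, 3, 3]` and `node_of(9)` gives `1`, so every log line shown above would come from node1.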
Environment
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.61
CUDA Driver Version: 535.183.06
PyTorch: 2.7.0.dev20250303+cu128
sglang: 0.4.4.post1
sgl_kernel: 0.0.5.post3
flashinfer: 0.2.3
triton: 3.2.0
transformers: 4.46.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.10
fastapi: 0.115.6
hf_transfer: Module Not Found
huggingface_hub: 0.26.5
interegular: 0.3.3
modelscope: Module Not Found
orjson: 3.10.12
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.6
multipart: 0.0.19
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.7.2.post1.dev11+g0e33649b5.d20250305
openai: 1.58.1
tiktoken: 0.9.0
anthropic: Module Not Found
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  SYS   SYS   0-47,96-143    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  PIX   SYS   SYS   0-47,96-143    0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   PIX   NODE  48-95,144-191  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   NODE  NODE  48-95,144-191  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   NODE  PIX   48-95,144-191  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   NODE  NODE  48-95,144-191  1              N/A
NIC0  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  SYS   SYS
NIC1  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  X     SYS   SYS
NIC2  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   X     NODE
NIC3  SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   NODE  X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
ulimit soft: 1048576