Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please start a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
The server loads the W8A8 Block FP8 kernel config (... block_shape=[128, 128].json for W8A8 Block FP8 kernel ...) and then crashes during CUDA graph capture:
[2025-04-23 02:05:08 TP15] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 275, in __init__
self.capture()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 359, in capture
) = self.capture_one_batch_size(bs, forward)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 451, in capture_one_batch_size
run_once()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 444, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1470, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1394, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1181, in forward
return self.forward_ffn_with_full_input(
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1210, in forward_ffn_with_full_input
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 632, in forward
return self.forward_absorb(
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 745, in forward_absorb
attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 97, in forward
return forward_batch.attn_backend.forward(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 68, in forward
return self.forward_decode(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashattention_backend.py", line 1055, in forward_decode
result = flash_attn_with_kvcache(
File "/usr/local/lib/python3.10/dist-packages/sgl_kernel/flash_attn.py", line 170, in flash_attn_with_kvcache
out, softmax_lse, *rest = torch.ops.sgl_kernel.fwd.default(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 723, in __call__
return self._op(*args, **kwargs)
RuntimeError: HeaddimV > 256 requires fp16 and bf16 data type
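For context on the error: DeepSeek-R1's absorbed MLA decode path (forward_absorb -> attn_mqa) uses a value head dimension equal to kv_lora_rank (512), and the message suggests the FA3 kernel only accepts value head dims above 256 when the KV data is fp16/bf16, so the fp8_e4m3 KV cache trips the check. Below is a minimal sketch of that guard as implied by the error message; check_fa3_headdim_v is an illustrative name, not a real sgl_kernel function.

import torch

# Sketch of the constraint implied by the error message (assumption: this
# mirrors the FA3 kernel's guard; it is not the actual sgl_kernel source).
def check_fa3_headdim_v(kv_cache_dtype: torch.dtype, head_dim_v: int) -> None:
    if head_dim_v > 256 and kv_cache_dtype not in (torch.float16, torch.bfloat16):
        raise RuntimeError("HeaddimV > 256 requires fp16 and bf16 data type")

check_fa3_headdim_v(torch.bfloat16, head_dim_v=512)       # ok: bf16 KV cache
check_fa3_headdim_v(torch.float8_e4m3fn, head_dim_v=512)  # raises, as in the log above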
Reproduction
nohup python3 -m sglang.launch_server \
--model-path /path_to_DeepSeek-R1 \
--tp 16 \
--dist-init-addr xxxx:20000 \
--nnodes 2 \
--node-rank 0 \
--trust-remote-code \
--host 0.0.0.0 \
--schedule-policy fcfs \
--chunked-prefill-size 32768 \
--max-running-requests 16 \
--disable-overlap-schedule \
--port 8080 \
--disable-radix-cache \
--mem-fraction-static 0.8 \
--attention-backend fa3 \
--kv-cache-dtype fp8_e4m3
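A possible workaround, offered only as a sketch (assumption: the fp8 KV cache is what trips the FA3 check; this is not a verified fix): launch with the same flags but without --kv-cache-dtype fp8_e4m3, so the KV cache stays in the model dtype (the launcher's default is auto), e.g.
python3 -m sglang.launch_server \
--model-path /path_to_DeepSeek-R1 \
--tp 16 \
--dist-init-addr xxxx:20000 \
--nnodes 2 \
--node-rank 0 \
--trust-remote-code \
--attention-backend fa3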
Environment
sglang v0.4.5.post3