
[Bug] FA3 KV-Cache-Fp8 #5651

@pengcuo

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please start a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

block_shape=[128, 128].json for W8A8 Block FP8 ke...
[2025-04-23 02:05:08 TP15] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 275, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 359, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 451, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 444, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1470, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1394, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1181, in forward
    return self.forward_ffn_with_full_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1210, in forward_ffn_with_full_input
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 632, in forward
    return self.forward_absorb(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 745, in forward_absorb
    attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 97, in forward
    return forward_batch.attn_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 68, in forward
    return self.forward_decode(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashattention_backend.py", line 1055, in forward_decode
    result = flash_attn_with_kvcache(
  File "/usr/local/lib/python3.10/dist-packages/sgl_kernel/flash_attn.py", line 170, in flash_attn_with_kvcache
    out, softmax_lse, *rest = torch.ops.sgl_kernel.fwd.default(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
RuntimeError: HeaddimV > 256 requires fp16 and bf16 data type
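
For reference, DeepSeek-R1's absorbed MLA decode path uses a value head dim of 512 (the kv_lora_rank), which is above the 256 limit named in the error, so the fp8_e4m3 KV cache appears to trip the FA3 kernel's dtype check. The sketch below only re-states that check for illustration; it is not the actual kernel code, and the helper name and dims are my assumptions.

```python
# Illustrative sketch of the failing check, NOT the sgl_kernel implementation.
# Assumption: DeepSeek-R1 absorbed MLA decode uses head_dim_v = 512 (kv_lora_rank).
import torch

def check_fa3_kvcache_dtype(head_dim_v: int, kv_cache_dtype: torch.dtype) -> None:
    # FA3 only allows value head dims above 256 with fp16/bf16 KV caches.
    if head_dim_v > 256 and kv_cache_dtype not in (torch.float16, torch.bfloat16):
        raise RuntimeError("HeaddimV > 256 requires fp16 and bf16 data type")

check_fa3_kvcache_dtype(head_dim_v=512, kv_cache_dtype=torch.bfloat16)  # passes
try:
    check_fa3_kvcache_dtype(head_dim_v=512, kv_cache_dtype=torch.float8_e4m3fn)
except RuntimeError as e:
    print(e)  # matches the error in the traceback above
```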

Reproduction

nohup python3 -m sglang.launch_server \
--model-path /path_to_DeepSeek-R1 \
--tp 16 \
--dist-init-addr xxxx:20000 \
--nnodes 2 \
--node-rank 0 \
--trust-remote-code \
--host 0.0.0.0 \
--schedule-policy fcfs \
--chunked-prefill-size 32768 \
--max-running-requests 16 \
--disable-overlap-schedule \
--port 8080 \
--disable-radix-cache  \
--mem-fraction-static 0.8 \
--attention-backend fa3 \
--kv-cache-dtype fp8_e4m3
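
Note that the exception is raised while the scheduler captures CUDA graphs during startup (see the cuda_graph_runner frames in the traceback), so no client request is needed to trigger it; the launch simply never becomes ready. The sketch below is a minimal readiness probe I use to confirm whether startup succeeded after changing flags; it assumes sglang's /health endpoint and uses an illustrative host and timeout, so treat it as a sketch rather than part of the reproduction.

```python
# Minimal readiness poll: returns True once the server answers /health.
# Host and timeout are illustrative assumptions; port matches --port above.
import time
import requests

def wait_until_ready(base_url: str = "http://127.0.0.1:8080", timeout_s: int = 600) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet (or crashed during CUDA graph capture)
        time.sleep(5)
    return False

print("server ready:", wait_until_ready())
```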

Environment

sglang v0.4.5.post3
