
[Bug] FA3 KV-Cache-Fp8 #5651

@pengcuo

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please start a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

block_shape=[128, 128].json for W8A8 Block FP8 ke...
[2025-04-23 02:05:08 TP15] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 275, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 359, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 451, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 444, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1470, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1394, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1181, in forward
    return self.forward_ffn_with_full_input(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1210, in forward_ffn_with_full_input
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 632, in forward
    return self.forward_absorb(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 745, in forward_absorb
    attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 97, in forward
    return forward_batch.attn_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 68, in forward
    return self.forward_decode(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashattention_backend.py", line 1055, in forward_decode
    result = flash_attn_with_kvcache(
  File "/usr/local/lib/python3.10/dist-packages/sgl_kernel/flash_attn.py", line 170, in flash_attn_with_kvcache
    out, softmax_lse, *rest = torch.ops.sgl_kernel.fwd.default(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
RuntimeError: HeaddimV > 256 requires fp16 and bf16 data type
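
For reference, DeepSeek-R1's absorbed MLA decode path uses a value head dim of 512 (the kv_lora_rank), which is above the 256 limit named in the error, so the fp8_e4m3 KV cache appears to trip the FA3 kernel's dtype check. The sketch below only re-states that check for illustration; it is not the actual kernel code, and the helper name and dims are my assumptions.

```python
# Illustrative sketch of the failing check, NOT the sgl_kernel implementation.
# Assumption: DeepSeek-R1 absorbed MLA decode uses head_dim_v = 512 (kv_lora_rank).
import torch

def check_fa3_kvcache_dtype(head_dim_v: int, kv_cache_dtype: torch.dtype) -> None:
    # FA3 only allows value head dims above 256 with fp16/bf16 KV caches.
    if head_dim_v > 256 and kv_cache_dtype not in (torch.float16, torch.bfloat16):
        raise RuntimeError("HeaddimV > 256 requires fp16 and bf16 data type")

check_fa3_kvcache_dtype(head_dim_v=512, kv_cache_dtype=torch.bfloat16)  # passes
try:
    check_fa3_kvcache_dtype(head_dim_v=512, kv_cache_dtype=torch.float8_e4m3fn)
except RuntimeError as e:
    print(e)  # matches the error in the traceback above
```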

Reproduction

nohup python3 -m sglang.launch_server \
--model-path /path_to_DeepSeek-R1 \
--tp 16 \
--dist-init-addr xxxx:20000 \
--nnodes 2 \
--node-rank 0 \
--trust-remote-code \
--host 0.0.0.0 \
--schedule-policy fcfs \
--chunked-prefill-size 32768 \
--max-running-requests 16 \
--disable-overlap-schedule \
--port 8080 \
--disable-radix-cache  \
--mem-fraction-static 0.8 \
--attention-backend fa3 \
--kv-cache-dtype fp8_e4m3
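
Note that the exception is raised while the scheduler captures CUDA graphs during startup (see the cuda_graph_runner frames in the traceback), so no client request is needed to trigger it; the launch simply never becomes ready. The sketch below is a minimal readiness probe I use to confirm whether startup succeeded after changing flags; it assumes sglang's /health endpoint and uses an illustrative host and timeout, so treat it as a sketch rather than part of the reproduction.

```python
# Minimal readiness poll: returns True once the server answers /health.
# Host and timeout are illustrative assumptions; port matches --port above.
import time
import requests

def wait_until_ready(base_url: str = "http://127.0.0.1:8080", timeout_s: int = 600) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet (or crashed during CUDA graph capture)
        time.sleep(5)
    return False

print("server ready:", wait_until_ready())
```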

Environment

sglang v0.4.5.post3
