[Bug] FMHA using flashinfer cutlass on Blackwell has low accuracy result

### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.

### Describe the bug

When the `BatchPrefillWithRaggedKVCacheWrapper` backend is set to `"cutlass"` in the flashinfer attention backend, GSM8K accuracy for Llama-3.1-8B-Instruct is very low:
```
Accuracy: 0.018
Invalid: 0.110
Latency: 45.625 s
Output throughput: 12136.862 token/s
```
For comparison, the Triton backend gives the following result on the same test:
```
Accuracy: 0.788
Invalid: 0.001
Latency: 16.626 s
Output throughput: 8002.240 token/s
```
According to @yzh119, inserting a synchronization before run resolves the issue. Because that synchronization adds overhead that should be avoided, a proper fix still needs further modification.
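As a rough illustration of the workaround, here is a minimal sketch that forces a device synchronization right before a kernel launch. The helper name `run_with_sync` and its `sync_fn` parameter are hypothetical and not part of sglang or flashinfer, and the exact place where the synchronization was inserted is an assumption, since it is not spelled out here. By default the helper uses `torch.cuda.synchronize`.

```python
from typing import Any, Callable, Optional


def run_with_sync(
    run_fn: Callable[..., Any],
    *args: Any,
    sync_fn: Optional[Callable[[], None]] = None,
    **kwargs: Any,
) -> Any:
    """Hypothetical helper: synchronize the device, then call run_fn.

    sync_fn defaults to torch.cuda.synchronize; it is injectable so the
    call ordering can be checked on a machine without a GPU.
    """
    if sync_fn is None:
        import torch  # imported lazily so the sketch loads without torch

        sync_fn = torch.cuda.synchronize
    sync_fn()  # wait for all previously queued GPU work to finish
    return run_fn(*args, **kwargs)


# Stand-ins for the real synchronization and attention call (illustration only).
calls = []


def fake_sync() -> None:
    calls.append("sync")


def fake_run(x: int) -> int:
    calls.append("run")
    return x * 2


result = run_with_sync(fake_run, 21, sync_fn=fake_sync)
```

In a real server, `run_fn` would presumably be the ragged prefill wrapper's attention call. A blocking synchronization on every call stalls the host, which is the overhead mentioned above, so this is only useful for confirming the diagnosis.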

### Reproduction

FlashInfer was built from source on the latest main.
Set the `BatchPrefillWithRaggedKVCacheWrapper` backend to `"cutlass"` in `flashinfer_backend.py`:
```
        self.prefill_wrapper_ragged = BatchPrefillWithRaggedKVCacheWrapper(
            self.workspace_buffer, "NHD", backend="cutlass"
        )
```
Run server with:
`python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --trust-remote   --attention-backend flashinfer`
Run test with:
`python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319`
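To compare backends without reading the numbers by eye, the benchmark summary can be parsed into a dictionary. This is a small sketch that assumes the summary keeps the `Key: value` layout shown in the bug description. The function name `parse_bench_output` is hypothetical, and the inputs are the two outputs pasted in this issue.

```python
import re


def parse_bench_output(text: str) -> dict:
    """Parse 'Key: value [unit]' lines into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z ]+):\s*([0-9.]+)", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


cutlass = parse_bench_output(
    """Accuracy: 0.018
Invalid: 0.110
Latency: 45.625 s
Output throughput: 12136.862 token/s"""
)
triton = parse_bench_output(
    """Accuracy: 0.788
Invalid: 0.001
Latency: 16.626 s
Output throughput: 8002.240 token/s"""
)

# Difference in accuracy between the two backends on the same test.
accuracy_gap = triton["Accuracy"] - cutlass["Accuracy"]
print(f"accuracy gap: {accuracy_gap:.3f}")
```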


### Environment

Python: 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.41
CUDA Driver Version: 570.148.08
PyTorch: 2.7.0+cu128
sglang: 0.4.6.post5
sgl_kernel: 0.1.5
flashinfer_python: 0.2.5
triton: 3.3.0
transformers: 4.52.3
torchao: 0.9.0
numpy: 2.1.2
aiohttp: 3.12.6
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.3
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.83.0
tiktoken: 0.9.0
anthropic: Module Not Found
litellm: Module Not Found
decord: Module Not Found
