@hebiao064 (Collaborator) commented Mar 22, 2025

Co-authored with @qingquansong

Roadmap Issue is here

Motivation

Support FlashAttention 3 (FA3) as an attention backend by using FA3's flash_attn_with_kvcache kernel.
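As a rough illustration of the kind of call this backend builds on, here is a minimal decode-time sketch against flash_attn_with_kvcache. The tensor names, shapes, and import path are assumptions for illustration, not the actual SGLang integration, and the exact keyword arguments may differ between flash-attn builds.

```python
# Illustrative sketch only -- not the SGLang FA3 backend code.
# Shows the general shape of a decode-step call into flash_attn_with_kvcache.
import torch
from flash_attn_interface import flash_attn_with_kvcache  # import path depends on how FA3 is built/installed

batch, heads, head_dim, max_ctx = 4, 32, 128, 4096

# One new query token per sequence (decode), plus a preallocated KV cache.
q = torch.randn(batch, 1, heads, head_dim, dtype=torch.bfloat16, device="cuda")
k_cache = torch.zeros(batch, max_ctx, heads, head_dim, dtype=torch.bfloat16, device="cuda")
v_cache = torch.zeros_like(k_cache)
cache_seqlens = torch.full((batch,), 100, dtype=torch.int32, device="cuda")  # tokens already in the cache

out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    cache_seqlens=cache_seqlens,
    causal=True,
    # a window_size argument enables sliding-window attention, if the installed build exposes it
)
```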

Conclusion:

  • Prefill throughput is on par with the current baseline
  • Decode throughput is slightly higher than the current baseline
  • Accuracy is slightly better than the current baseline
  • FlashInfer will OOM when batch size and input size are large, while FA3 won't

What has been supported:

  • MHA models like Llama/Qwen/Gemma
  • CUDA Graph
  • Sliding Window (tested with Gemma 2)

TODO in this PR:

  • Remove the clang-format change; it is currently blocking our commit for some reason
  • Add a check to fail the launch-server command if the GPU is below Hopper (see the sketch after this list)
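A minimal sketch of what such a guard could look like, assuming a plain PyTorch capability check; the function name and where it hooks into the server launch path are hypothetical:

```python
# Hypothetical guard: reject the FA3 attention backend on pre-Hopper GPUs.
# FA3 kernels target SM90 (Hopper), i.e. compute capability >= (9, 0).
import torch

def assert_fa3_supported() -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("The FA3 attention backend requires a CUDA device.")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (9, 0):
        raise RuntimeError(
            "The FA3 attention backend requires Hopper (SM90) or newer; "
            f"got compute capability {major}.{minor}."
        )
```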

Next Steps after this PR:

  • Figure out how to build FA3 into SGLang: @hebiao064 @zhyncs
  • Page Size > 1
  • Support Multimodal
  • Support Speculative Decoding
  • Support MLA for DeepSeek-like models
  • Support FP8

Benchmark on Latency

Note: I benchmarked various input/output lengths; the graphs below are an aggregated view by batch size. Please check the sheet for details.

[Screenshots: latency benchmark charts, aggregated by batch size]

Benchmark on Accuracy

GSM8K: python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

| Model | FA3 Accuracy | FlashInfer Accuracy |
|---|---|---|
| Meta-Llama-3.1-8B-Instruct | 0.793 | 0.789 |
| Qwen2.5-7B-Instruct | 0.823 | 0.789 |
| Gemma-2-9B | 0.724 (Torch Native is 0.730) | 0.132 (potential bug!) |


@hebiao064 marked this pull request as ready for review March 22, 2025 23:50
@hebiao064 requested a review from merrymercy as a code owner March 22, 2025 23:50
@qingquansong (Collaborator) commented Mar 23, 2025

> Great work. What is the problem when the code is captured under CUDA graph? Is there a single test to reproduce it?

@yiakwy-xpu-ml-framework-team Just added the CUDA graph support; tested on a single GPU, haven't tested on more GPUs yet. Will do more testing later. Thanks!

@qingquansong force-pushed the support_fa3_as_attention_backend branch from ba01eb7 to dbb8090 on March 23, 2025 06:13
@qingquansong force-pushed the support_fa3_as_attention_backend branch 2 times, most recently from 6e1e8e7 to 218495d on March 23, 2025 20:28
@qingquansong force-pushed the support_fa3_as_attention_backend branch from 218495d to 8a7c328 on March 23, 2025 20:33
@hebiao064 (Collaborator, Author) commented:

[Diagram: sgl_fa3_cuda_graph]
Note for myself and future maintainers: added a diagram describing how CUDA graph works for the FA3 backend.
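For context on what the diagram describes, below is a generic capture/replay sketch of the CUDA graph mechanism in plain PyTorch. It is not the SGLang FA3 integration; it only illustrates the pattern: the fixed-shape decode step is captured once into a graph, and each subsequent step copies fresh data into the static buffers and replays the graph.

```python
# Generic CUDA graph capture/replay pattern (illustrative, not SGLang code).
import torch

static_q = torch.randn(4, 4096, device="cuda")        # fixed-shape "decode input" buffer
step = torch.nn.Linear(4096, 4096, device="cuda")     # stand-in for the captured decode step

# Warm up on a side stream before capture, as required by CUDA graph capture rules.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = step(static_q)
torch.cuda.current_stream().wait_stream(s)

# Capture the fixed-shape computation once.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = step(static_q)

# Each decode step: refill the static input buffer in place, then replay.
static_q.copy_(torch.randn_like(static_q))
g.replay()
result = static_out.clone()  # static_out is overwritten on every replay
```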

@hebiao064 (Collaborator, Author) commented:

> Current benchmark only contains the workload with input=1024 and output=512.
>
> It could be really helpful if benchmark results on more workloads can be provided, including long inputs + short outputs and short inputs + long outputs.

Added two graphs to the PR description; detailed data is here: https://docs.google.com/spreadsheets/d/14SjCU5Iphf2EsD4cZJqsYKQn8YbPPt0ZA5viba3gB1Y/edit?gid=0#gid=0

@hebiao064 (Collaborator, Author) commented:

Moved from the PR description to a comment for reference:

Benchmark throughput with random dataset and shared-prefix dataset

ShareGPT Dataset
[Chart: throughput on the ShareGPT dataset]

Shared Prefix Dataset
[Chart: throughput on the shared-prefix dataset]

@hebiao064 changed the title from "[WIP] Support FA3 as Attention backend" to "Support FA3 as Attention backend" on Mar 24, 2025
@hebiao064 changed the title from "Support FA3 as Attention backend" to "Support FA3 as Attention backend by using --attention-backend fa3" on Mar 24, 2025
@zhyncs merged commit 5d7edc8 into sgl-project:main on Mar 24, 2025
1 of 18 checks passed
@FlamingoPg (Collaborator) commented:

> [Diagram: sgl_fa3_cuda_graph] Note for myself and future maintainers: added a diagram describing how CUDA graph works for the FA3 backend.

cool!
