@hebiao064 (Collaborator) commented Mar 22, 2025

Co-authored with @qingquansong

Roadmap Issue is here

Motivation

Support FlashAttention 3 (FA3) as an attention backend by using FA3's flash_attn_with_kvcache kernel.
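As a rough illustration of the kind of call this backend builds on, here is a minimal decode-time sketch against flash_attn_with_kvcache. The tensor names, shapes, and import path are assumptions for illustration, not the actual SGLang integration, and the exact keyword arguments may differ between flash-attn builds.

```python
# Illustrative sketch only -- not the SGLang FA3 backend code.
# Shows the general shape of a decode-step call into flash_attn_with_kvcache.
import torch
from flash_attn_interface import flash_attn_with_kvcache  # import path depends on how FA3 is built/installed

batch, heads, head_dim, max_ctx = 4, 32, 128, 4096

# One new query token per sequence (decode), plus a preallocated KV cache.
q = torch.randn(batch, 1, heads, head_dim, dtype=torch.bfloat16, device="cuda")
k_cache = torch.zeros(batch, max_ctx, heads, head_dim, dtype=torch.bfloat16, device="cuda")
v_cache = torch.zeros_like(k_cache)
cache_seqlens = torch.full((batch,), 100, dtype=torch.int32, device="cuda")  # tokens already in the cache

out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    cache_seqlens=cache_seqlens,
    causal=True,
    # a window_size argument enables sliding-window attention, if the installed build exposes it
)
```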

Conclusion:

  • Prefill throughput is on par with the current baseline
  • Decode throughput is slightly higher than the current baseline
  • Accuracy is slightly better than the current baseline
  • FlashInfer will OOM when batch size and input size are large, while FA3 won't

What has been supported:

  • MHA models like Llama/Qwen/Gemma
  • CUDA Graph
  • Sliding Window (tested with Gemma 2)

TODO in this PR:

  • Remove the clang-format change; it is currently blocking our commit for some reason
  • Add a check to fail the launch-server command if the GPU is below Hopper (see the sketch after this list)
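A minimal sketch of what such a guard could look like, assuming a plain PyTorch capability check; the function name and where it hooks into the server launch path are hypothetical:

```python
# Hypothetical guard: reject the FA3 attention backend on pre-Hopper GPUs.
# FA3 kernels target SM90 (Hopper), i.e. compute capability >= (9, 0).
import torch

def assert_fa3_supported() -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("The FA3 attention backend requires a CUDA device.")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (9, 0):
        raise RuntimeError(
            "The FA3 attention backend requires Hopper (SM90) or newer; "
            f"got compute capability {major}.{minor}."
        )
```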

Next Steps after this PR:

  • Figure out how to build FA3 into SGLang: @hebiao064 @zhyncs
  • Page Size > 1
  • Support Multimodal
  • Support Speculative Decoding
  • Support MLA for DeepSeek-like models
  • Support FP8

Benchmark on Latency

Note: I benchmarked various input/output lengths; the graphs below are an aggregated view by batch size. Please check the sheet for details.

[Screenshots: latency benchmark charts, aggregated by batch size]

Benchmark on Accuracy

GSM8K: python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

| Model | FA3 Accuracy | FlashInfer Accuracy |
|---|---|---|
| Meta-Llama-3.1-8B-Instruct | 0.793 | 0.789 |
| Qwen2.5-7B-Instruct | 0.823 | 0.789 |
| Gemma-2-9B | 0.724 (Torch Native is 0.730) | 0.132 (potential bug!) |


@hebiao064 marked this pull request as ready for review March 22, 2025 23:50
@hebiao064 requested a review from merrymercy as a code owner March 22, 2025 23:50
@qingquansong (Collaborator) commented Mar 23, 2025

> Great work. What is the problem when the code is captured under CUDA graph? Is there a single test to reproduce it?

@yiakwy-xpu-ml-framework-team Just added the CUDA graph support; tested on a single GPU, haven't tested on more GPUs yet. Will do more testing later. Thanks!

@qingquansong force-pushed the support_fa3_as_attention_backend branch from ba01eb7 to dbb8090 on March 23, 2025 06:13
@qingquansong force-pushed the support_fa3_as_attention_backend branch 2 times, most recently from 6e1e8e7 to 218495d on March 23, 2025 20:28
@qingquansong force-pushed the support_fa3_as_attention_backend branch from 218495d to 8a7c328 on March 23, 2025 20:33
@hebiao064 (Collaborator, Author) commented:

[Diagram: sgl_fa3_cuda_graph]
Note for myself and future maintainers: added a diagram describing how CUDA graph works for the FA3 backend.
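For context on what the diagram describes, below is a generic capture/replay sketch of the CUDA graph mechanism in plain PyTorch. It is not the SGLang FA3 integration; it only illustrates the pattern: the fixed-shape decode step is captured once into a graph, and each subsequent step copies fresh data into the static buffers and replays the graph.

```python
# Generic CUDA graph capture/replay pattern (illustrative, not SGLang code).
import torch

static_q = torch.randn(4, 4096, device="cuda")        # fixed-shape "decode input" buffer
step = torch.nn.Linear(4096, 4096, device="cuda")     # stand-in for the captured decode step

# Warm up on a side stream before capture, as required by CUDA graph capture rules.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = step(static_q)
torch.cuda.current_stream().wait_stream(s)

# Capture the fixed-shape computation once.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = step(static_q)

# Each decode step: refill the static input buffer in place, then replay.
static_q.copy_(torch.randn_like(static_q))
g.replay()
result = static_out.clone()  # static_out is overwritten on every replay
```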

@hebiao064 (Collaborator, Author) commented:

> Current benchmark only contains the workload with input=1024 and output=512.
>
> It could be really helpful if benchmark results on more workloads can be provided, including long inputs + short outputs and short inputs + long outputs.

Added two graphs to the PR description; detailed data is here: https://docs.google.com/spreadsheets/d/14SjCU5Iphf2EsD4cZJqsYKQn8YbPPt0ZA5viba3gB1Y/edit?gid=0#gid=0

@hebiao064 (Collaborator, Author) commented:

Moved from the PR description to a comment for reference:

Benchmark throughput with random dataset and shared-prefix dataset

ShareGPT Dataset
[Chart: throughput on the ShareGPT dataset]

Shared Prefix Dataset
[Chart: throughput on the shared-prefix dataset]

@hebiao064 changed the title from "[WIP] Support FA3 as Attention backend" to "Support FA3 as Attention backend" on Mar 24, 2025
@hebiao064 changed the title from "Support FA3 as Attention backend" to "Support FA3 as Attention backend by using --attention-backend fa3" on Mar 24, 2025
@zhyncs merged commit 5d7edc8 into sgl-project:main on Mar 24, 2025
1 of 18 checks passed
@FlamingoPg (Collaborator) commented:

> [Diagram: sgl_fa3_cuda_graph] Note for myself and future maintainers: added a diagram describing how CUDA graph works for the FA3 backend.

cool!
