Support FA3 as Attention backend by using --attention-backend fa3
#4680
Conversation
@yiakwy-xpu-ml-framework-team Just added the CUDA graph support; tested on a single GPU, haven't tested on more GPUs yet. Will do more testing later. Thanks!
Added two graphs to the PR description; detailed data is here: https://docs.google.com/spreadsheets/d/14SjCU5Iphf2EsD4cZJqsYKQn8YbPPt0ZA5viba3gB1Y/edit?gid=0#gid=0
--attention-backend fa3
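For reference, a typical launch command with the new flag would look something like the following (the model path is just a placeholder):
python3 -m sglang.launch_server --model-path <model-path> --attention-backend fa3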
Co-authored with @qingquansong
Roadmap Issue is here
Motivation
Support FA3 as attention backend by using fa3's flash_attn_with_kvcache.
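To illustrate the kind of call the backend builds on, here is a minimal sketch of a single decode step against a paged KV cache through flash_attn_with_kvcache. It uses the FA2-style Python interface as a stand-in; the FA3 (hopper) build exposes a similar entry point, though argument names and constraints may differ slightly, and this is not the exact wiring used in this PR. Shapes, page size, and the block table are illustrative assumptions.

```python
# Sketch of one decode step against a paged KV cache via flash_attn_with_kvcache
# (FA2-style interface; the FA3/hopper build is similar).
import torch
from flash_attn import flash_attn_with_kvcache

batch, heads, kv_heads, head_dim = 4, 32, 8, 128   # GQA: 32 query heads, 8 KV heads
page_size, pages_per_seq = 256, 4                  # FA2 paged KV requires page size % 256 == 0

# One new query token per sequence for this decode step.
q = torch.randn(batch, 1, heads, head_dim, dtype=torch.bfloat16, device="cuda")

# Paged KV cache laid out as (num_pages, page_size, kv_heads, head_dim).
k_cache = torch.randn(batch * pages_per_seq, page_size, kv_heads, head_dim,
                      dtype=torch.bfloat16, device="cuda")
v_cache = torch.randn_like(k_cache)

# Per-sequence page table (logical page -> physical page) and current KV lengths.
block_table = torch.arange(batch * pages_per_seq, dtype=torch.int32,
                           device="cuda").reshape(batch, pages_per_seq)
cache_seqlens = torch.full((batch,), 200, dtype=torch.int32, device="cuda")

out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    cache_seqlens=cache_seqlens,
    block_table=block_table,
    causal=True,
)
print(out.shape)  # torch.Size([4, 1, 32, 128])
```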
Conclusion:
What has been supported:
TODO in this PR:
Next Steps after this PR:
Benchmark on Latency
Note: I benchmarked on various input/output lengths; the graph is an aggregated view based on batch size. Please check out the sheet for details.
Benchmark on Accuracy
GSM 8K:
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Modifications
Checklist