feat: support flashinfer mla attention for deepseek v3 #3550
Conversation
TL;DR: for the long-context use case, throughput improves roughly 4x (10526.88 / 2679.07 = 3.93). The Triton backend's performance in the ShareGPT scenario was acceptable with short prompt lengths, but it deteriorated significantly as the prompt length increased. The main purpose of the FlashInfer MLA backend is to resolve those performance issues with long prompt lengths.
# server
## flashinfer backend
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --trust-remote-code --enable-flashinfer-mla --disable-radix-cache --tp 8
## triton backend
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --trust-remote-code --disable-radix-cache --tp 8
# client
## random range ratio 0.0, random input 32000, random output 100
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 32000 --random-output 100 --request-rate 1 --num-prompt 60
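The ShareGPT comparison mentioned in the TL;DR can be reproduced with the same benchmarking tool. The sketch below is illustrative only: `--dataset-name sharegpt` is a standard bench_serving option, but the prompt count and request rate here are assumptions, not values taken from this PR.
## sharegpt scenario (illustrative parameters, not from this PR)
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 200 --request-rate 4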
# mmlu
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
Great work! Prefix caching is coming in the other PR too, along with Docker images for 0.4.3.post2 -- trying it out!
Great work! And I wonder whether the FlashInfer MLA backend supports DeepSeek V2.5/V2 or not. @zhyncs
Do AMD GPUs support the long-context FlashInfer MLA attention optimization?
Motivation
Kudos to @yzh119. Throughout the integration process, we identified and resolved numerous issues with exceptional support from the FlashInfer team. SGLang is currently the first open-source LLM inference engine to integrate FlashInfer's new MLA attention.
ref https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.2.1
This version should be used with --enable-flashinfer-mla --disable-radix-cache; follow-up updates will add support for the prefix cache. For other LLM engines: if you refer to this PR, please include "Adapted from https://github.com/sgl-project/sglang/pull/3550/files", thank you :-)
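As a quick sanity check after launching the server with these flags, the OpenAI-compatible endpoint can be queried directly. This is a minimal sketch that assumes the default SGLang port 30000; the prompt and token budget are arbitrary.
## smoke test against the OpenAI-compatible endpoint (assumes the default port 30000)
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Explain MLA attention in one sentence."}], "max_tokens": 100}'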
Modifications
Checklist