feat: support flashinfer mla attention for deepseek v3 #3550
Conversation
TL;DR: for the long-context use case, throughput improves roughly 4x (10526.88 / 2679.07 = 3.93). The Triton backend's performance in the ShareGPT scenario was acceptable with short prompt lengths, but it deteriorated significantly as the prompt length increased. The main purpose of the FlashInfer MLA backend is to resolve those performance issues with long prompt lengths.
# server
## flashinfer backend
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --trust-remote-code --enable-flashinfer-mla --disable-radix-cache --tp 8
## triton backend
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --trust-remote-code --disable-radix-cache --tp 8
# client
## random range ratio 0.0, random input 32000, random output 100
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 32000 --random-output 100 --request-rate 1 --num-prompt 60
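The ShareGPT comparison mentioned in the TL;DR can be reproduced with the same benchmarking tool. The sketch below is illustrative only: `--dataset-name sharegpt` is a standard bench_serving option, but the prompt count and request rate here are assumptions, not values taken from this PR.
## sharegpt scenario (illustrative parameters, not from this PR)
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 200 --request-rate 4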
# mmlu
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
Great work! Prefix caching is coming in the other PR too, along with Docker images for 0.4.3.post2 -- trying it out!
Great work! And I wonder whether the FlashInfer MLA backend supports DeepSeek V2.5/V2 or not. @zhyncs
Do AMD GPUs support the long-context FlashInfer MLA attention optimization?
Motivation
Kudos to @yzh119. Throughout the integration process, we identified and resolved numerous issues with exceptional support from the FlashInfer team. SGLang is currently the first open-source LLM inference engine to integrate FlashInfer's new MLA attention.
ref https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.2.1
This version should be used with --enable-flashinfer-mla --disable-radix-cache; follow-up updates will add support for the prefix cache. For other LLM engines: if you refer to this PR, please include "Adapted from https://github.com/sgl-project/sglang/pull/3550/files", thank you :-)
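As a quick sanity check after launching the server with these flags, the OpenAI-compatible endpoint can be queried directly. This is a minimal sketch that assumes the default SGLang port 30000; the prompt and token budget are arbitrary.
## smoke test against the OpenAI-compatible endpoint (assumes the default port 30000)
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Explain MLA attention in one sentence."}], "max_tokens": 100}'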
Modifications
Checklist