
Conversation

@zhyncs (Member) commented Feb 13, 2025

Motivation

Kudos to @yzh119! Throughout the integration process, we identified and resolved numerous issues with exceptional support from the FlashInfer team. SGLang is currently the first open-source LLM inference engine to incorporate FlashInfer's new MLA attention.

ref https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.2.1

This version should be used with --enable-flashinfer-mla --disable-radix-cache; follow-up updates will add support for the prefix cache.

For other LLM engines, if you refer to this PR, please include "Adapted from https://github.com/sgl-project/sglang/pull/3550/files", thank you :-)


@zhyncs (Member, Author) commented Feb 13, 2025

TL;DR: for the long-context use case, throughput improves about 4x (10526.88 / 2679.07 = 3.93); see the quick check after the results below.

The Triton backend's performance in the ShareGPT scenario was acceptable with short prompt lengths, but it deteriorated significantly as prompt length increased. The main purpose of the FlashInfer MLA backend is to address these performance issues with long prompt lengths.

# server

## flashinfer backend
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --trust-remote-code --enable-flashinfer-mla --disable-radix-cache --tp 8

## triton backend
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --trust-remote-code --disable-radix-cache --tp 8
# client

## random range ratio 0.0, random input 32000, random output 100
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 32000 --random-output 100 --request-rate 1 --num-prompt 60
# results

## flashinfer
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    1.0       
Max request concurrency:                 not set
Successful requests:                     60        
Benchmark duration (s):                  89.26     
Total input tokens:                      936860    
Total generated tokens:                  2790      
Total generated tokens (retokenized):    2759      
Request throughput (req/s):              0.67      
Input token throughput (tok/s):          10495.63  
Output token throughput (tok/s):         31.26     
Total token throughput (tok/s):          10526.88  
Concurrency:                             27.30     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   40610.86  
Median E2E Latency (ms):                 40717.92  
---------------Time to First Token----------------
Mean TTFT (ms):                          11720.58  
Median TTFT (ms):                        8502.60   
P99 TTFT (ms):                           31531.74  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          747.50    
Median TPOT (ms):                        684.83    
P99 TPOT (ms):                           1808.22   
---------------Inter-token Latency----------------
Mean ITL (ms):                           634.72    
Median ITL (ms):                         183.78    
P99 ITL (ms):                            7371.57   
==================================================

## triton
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    1.0       
Max request concurrency:                 not set
Successful requests:                     60        
Benchmark duration (s):                  350.74    
Total input tokens:                      936860    
Total generated tokens:                  2790      
Total generated tokens (retokenized):    2769      
Request throughput (req/s):              0.17      
Input token throughput (tok/s):          2671.12   
Output token throughput (tok/s):         7.95      
Total token throughput (tok/s):          2679.07   
Concurrency:                             46.35     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   270925.53 
Median E2E Latency (ms):                 300357.01 
---------------Time to First Token----------------
Mean TTFT (ms):                          151196.95 
Median TTFT (ms):                        152912.14 
P99 TTFT (ms):                           300844.83 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4438.80   
Median TPOT (ms):                        2765.61   
P99 TPOT (ms):                           38287.21  
---------------Inter-token Latency----------------
Mean ITL (ms):                           2630.43   
Median ITL (ms):                         79.59     
P99 ITL (ms):                            30336.83  
==================================================
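As a quick check on the TL;DR figure, the speedup falls directly out of the two total-token-throughput numbers above. A minimal Python snippet (the constants are copied from the benchmark reports; this is just arithmetic, not part of the benchmark itself):

# Sanity check of the ~4x claim; constants copied from the reports above.
flashinfer_total_tok_s = 10526.88
triton_total_tok_s = 2679.07
print(f"speedup: {flashinfer_total_tok_s / triton_total_tok_s:.2f}x")  # 3.93x

# Mean E2E latency shows an even larger gap at this request rate:
print(f"latency ratio: {270925.53 / 40610.86:.2f}x")  # 6.67x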

@zhyncs (Member, Author) commented Feb 13, 2025

MMLU

python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.800
subject: anatomy, #q:135, acc: 0.859
subject: astronomy, #q:152, acc: 0.934
subject: business_ethics, #q:100, acc: 0.860
subject: clinical_knowledge, #q:265, acc: 0.921
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.660
subject: college_computer_science, #q:100, acc: 0.840
subject: college_mathematics, #q:100, acc: 0.780
subject: college_medicine, #q:173, acc: 0.861
subject: college_physics, #q:102, acc: 0.824
subject: computer_security, #q:100, acc: 0.900
subject: conceptual_physics, #q:235, acc: 0.945
subject: econometrics, #q:114, acc: 0.781
subject: electrical_engineering, #q:145, acc: 0.869
subject: elementary_mathematics, #q:378, acc: 0.944
subject: formal_logic, #q:126, acc: 0.817
subject: global_facts, #q:100, acc: 0.720
subject: high_school_biology, #q:310, acc: 0.955
subject: high_school_chemistry, #q:203, acc: 0.882
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.873
subject: high_school_geography, #q:198, acc: 0.949
subject: high_school_government_and_politics, #q:193, acc: 0.974
subject: high_school_macroeconomics, #q:390, acc: 0.926
subject: high_school_mathematics, #q:270, acc: 0.759
subject: high_school_microeconomics, #q:238, acc: 0.954
subject: high_school_physics, #q:151, acc: 0.834
subject: high_school_psychology, #q:545, acc: 0.956
subject: high_school_statistics, #q:216, acc: 0.847
subject: high_school_us_history, #q:204, acc: 0.961
subject: high_school_world_history, #q:237, acc: 0.941
subject: human_aging, #q:223, acc: 0.861
subject: human_sexuality, #q:131, acc: 0.931
subject: international_law, #q:121, acc: 0.959
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.902
subject: machine_learning, #q:112, acc: 0.839
subject: management, #q:103, acc: 0.922
subject: marketing, #q:234, acc: 0.957
subject: medical_genetics, #q:100, acc: 0.970
subject: miscellaneous, #q:783, acc: 0.957
subject: moral_disputes, #q:346, acc: 0.855
subject: moral_scenarios, #q:895, acc: 0.800
subject: nutrition, #q:306, acc: 0.915
subject: philosophy, #q:311, acc: 0.920
subject: prehistory, #q:324, acc: 0.935
subject: professional_accounting, #q:282, acc: 0.869
subject: professional_law, #q:1534, acc: 0.718
subject: professional_medicine, #q:272, acc: 0.945
subject: professional_psychology, #q:612, acc: 0.902
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.861
subject: sociology, #q:201, acc: 0.940
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.584
subject: world_religions, #q:171, acc: 0.936
Total latency: 580.525
Average accuracy: 0.874
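
For reference, the reported average is consistent with a question-count-weighted mean of the per-subject accuracies. A minimal Python sketch, assuming the average is weighted by #q (only a handful of the 57 subjects are inlined here for brevity; extending the list to all rows above reproduces ~0.874):

subjects = [  # (name, #q, acc) rows copied from the output above
    ("abstract_algebra", 100, 0.800),
    ("professional_law", 1534, 0.718),
    ("miscellaneous", 783, 0.957),
    ("moral_scenarios", 895, 0.800),
]
total_q = sum(q for _, q, _ in subjects)
weighted_acc = sum(q * acc for _, q, acc in subjects) / total_q
print(f"weighted average over {total_q} questions: {weighted_acc:.3f}")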

zhyncs merged commit 70f894b into main on Feb 14, 2025 (3 of 18 checks passed)
zhyncs deleted the zhyncs/mla branch on February 14, 2025 at 00:50
@HaiShaw (Collaborator) commented Feb 14, 2025

ITL seems to be hurt a little; any insight?
Maybe run with a longer --output 1000 to check?
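
For reference, such a run could reuse the client command from the benchmark above with a longer output length (a hypothetical invocation; every flag already appears in the earlier command):

python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 32000 --random-output 1000 --request-rate 1 --num-prompt 60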

@pseudotensor commented Feb 20, 2025

Great work! Prefix caching is in the other PR too, and there are Docker images for 0.4.3.post2 -- trying it out!

@liangzelang

Great work! And I wonder whether the FlashInfer MLA backend supports DeepSeek V2.5/V2. @zhyncs

@dinggh commented Mar 10, 2025

Do AMD GPUs support the long-context FlashInfer MLA attention optimization?
