
Conversation

@yhyang201 yhyang201 commented Aug 2, 2025

Motivation

Support DP Attention for step3_vl

Modifications

Prior to this PR, DP Attention was already supported for the LLM component.

This update extends DP Attention support to the vision model component of the VLM.

In vision.py, world_size and tp_size have been replaced with attn_tp_size. This change does not affect VLM models that already use standard tensor parallelism.
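For context, here is a minimal sketch of what this change amounts to (not the exact diff in this PR; the helper path and the simplified class shape are assumptions): head partitioning keys off the attention TP size instead of the global TP world size, so it stays correct when DP Attention shrinks the attention TP group.

```python
# Minimal sketch, not the actual vision.py diff. Assumes a helper like
# sglang.srt.layers.dp_attention.get_attention_tp_size() that returns the
# size of the attention TP group (equal to tp_size when DP Attention is off).
import torch.nn as nn

from sglang.srt.layers.dp_attention import get_attention_tp_size


class VisionAttentionSketch(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        # Before: heads were split over the global TP world size.
        attn_tp_size = get_attention_tp_size()
        assert num_heads % attn_tp_size == 0
        # Each attention-TP rank now owns this many vision attention heads.
        self.num_heads_per_partition = num_heads // attn_tp_size
        self.head_dim = embed_dim // num_heads
```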

For the vision model in step3_vl, when DP Attention is enabled:

  1. The Attention component uses SGLang's DP Attention logic.
  2. For the FFN component, since the vision model is a dense architecture, tensor parallelism is applied within each attention DP group, unlike the LLM's FFN, which is tensor-parallel across the full world size.

In other words, each Attn TP Group (i.e., DP Group) loads a full copy of the vision model; the toy sketch below illustrates this layout.
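The sketch uses the launch flags from the benchmark below (--tp 8 --dp 8). It is plain Python with no SGLang imports, and the rank-to-group mapping is an assumption made only to show why every DP group ends up holding a full vision-model copy.

```python
# Toy illustration of the parallel layout described above (assumed mapping,
# not SGLang's actual group-construction code).
tp_size, dp_size = 8, 8
attn_tp_size = tp_size // dp_size  # attention-TP ranks per DP group (here: 1)

for rank in range(tp_size):
    attn_dp_rank = rank // attn_tp_size  # which attention DP group this GPU is in
    attn_tp_rank = rank % attn_tp_size   # its rank inside that group
    # The vision FFN is tensor-parallel only across attn_tp_size ranks,
    # so each attention DP group holds a complete copy of the vision model.
    print(f"GPU {rank}: attention DP group {attn_dp_rank}, attention TP rank {attn_tp_rank}")
```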

Please note that this setup may not reflect best practice, and is intended solely to ensure that DP Attention can function correctly in step3_vl.

Accuracy Test

Benchmark & Profiling

 python -m sglang.launch_server --model-path stepfun-ai/step3-fp8 --enable-multimodal --enable-dp-attention --tp 8 --dp 8 --trust-remote-code --mem-fraction-static 0.8


curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1000

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  77.48
Total input tokens:                      502493
Total generated tokens:                  499251
Total generated tokens (retokenized):    497840
Request throughput (req/s):              12.91
Input token throughput (tok/s):          6485.58
Output token throughput (tok/s):         6443.74
Total token throughput (tok/s):          12929.32
Concurrency:                             665.05
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   51526.79
Median E2E Latency (ms):                 54199.91
---------------Time to First Token----------------
Mean TTFT (ms):                          9328.53
Median TTFT (ms):                        9057.79
P99 TTFT (ms):                           17728.52
---------------Inter-Token Latency----------------
Mean ITL (ms):                           84.69
Median ITL (ms):                         72.46
P95 ITL (ms):                            85.48
P99 ITL (ms):                            102.15
Max ITL (ms):                            16096.83
==================================================



curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang --dataset-name mmmu --num-prompts 500 --random-output 1000

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     498
Benchmark duration (s):                  50.05
Total input tokens:                      33237
Total generated tokens:                  498000
Total generated tokens (retokenized):    494981
Request throughput (req/s):              9.95
Input token throughput (tok/s):          664.08
Output token throughput (tok/s):         9950.15
Total token throughput (tok/s):          10614.24
Concurrency:                             496.84
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   49933.01
Median E2E Latency (ms):                 49931.56
---------------Time to First Token----------------
Mean TTFT (ms):                          1342.71
Median TTFT (ms):                        1437.51
P99 TTFT (ms):                           1816.03
---------------Inter-Token Latency----------------
Mean ITL (ms):                           48.65
Median ITL (ms):                         48.42
P95 ITL (ms):                            52.76
P99 ITL (ms):                            57.66
Max ITL (ms):                            1130.78
==================================================



curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 5 --max-concurrency 1 --output-file res.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 20 --max-concurrency 4 --output-file res.jsonl 
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 80 --max-concurrency 16 --output-file res.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 160 --max-concurrency 32 --output-file res.jsonl

python3 test/srt/parse_results.py res.jsonl
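parse_results.py itself is not part of this PR; the sketch below is only a guess at what such a script could look like, assuming each line of res.jsonl is a JSON record whose keys match the table columns below, with per_user_throughput derived as output_throughput / max_concurrency.

```python
# Rough sketch of a res.jsonl summarizer (assumed field names, not the
# actual test/srt/parse_results.py). Usage: python3 parse_results.py res.jsonl
import json
import sys

from tabulate import tabulate

COLS = [
    "max_concurrency", "input_throughput", "output_throughput",
    "mean_ttft_ms", "median_ttft_ms", "p99_ttft_ms",
    "mean_tpot_ms", "median_tpot_ms", "p99_tpot_ms",
]

rows = []
with open(sys.argv[1]) as f:
    for i, line in enumerate(f):
        rec = json.loads(line)
        row = [i] + [rec.get(c) for c in COLS]
        # Per-user throughput: output tokens/s divided by the concurrency level.
        row.append(rec["output_throughput"] / rec["max_concurrency"])
        rows.append(row)

print(tabulate(rows, headers=[""] + COLS + ["per_user_throughput"],
               tablefmt="grid", floatfmt=".3f"))
```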


+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |             43.247 |              43.247 |        200.871 |          200.299 |       207.934 |         22.943 |           22.952 |        23.044 |                43.247 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |            161.095 |             161.095 |        433.940 |          379.289 |       546.825 |         24.416 |           24.439 |        24.697 |                40.274 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |            499.001 |             499.001 |        645.112 |          692.549 |       787.564 |         31.444 |           31.355 |        31.839 |                31.188 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |            947.501 |             947.501 |        889.175 |          891.117 |      1325.534 |         32.909 |           32.872 |        33.537 |                29.609 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+



========================================
Baseline: the same benchmarks without --enable-dp-attention
 python -m sglang.launch_server --model-path stepfun-ai/step3-fp8 --enable-multimodal --tp 8 --trust-remote-code --mem-fraction-static 0.8


curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1000

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  248.04
Total input tokens:                      502493
Total generated tokens:                  499251
Total generated tokens (retokenized):    498015
Request throughput (req/s):              4.03
Input token throughput (tok/s):          2025.83
Output token throughput (tok/s):         2012.75
Total token throughput (tok/s):          4038.58
Concurrency:                             599.36
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   148667.98
Median E2E Latency (ms):                 162750.46
---------------Time to First Token----------------
Mean TTFT (ms):                          64073.61
Median TTFT (ms):                        33524.83
P99 TTFT (ms):                           175645.98
---------------Inter-Token Latency----------------
Mean ITL (ms):                           169.78
Median ITL (ms):                         141.00
P95 ITL (ms):                            294.74
P99 ITL (ms):                            308.06
Max ITL (ms):                            7152.99
==================================================




curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang --dataset-name mmmu --num-prompts 500 --random-output 1000


============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     498
Benchmark duration (s):                  171.23
Total input tokens:                      33237
Total generated tokens:                  498000
Total generated tokens (retokenized):    494559
Request throughput (req/s):              2.91
Input token throughput (tok/s):          194.11
Output token throughput (tok/s):         2908.37
Total token throughput (tok/s):          3102.48
Concurrency:                             425.13
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   146174.79
Median E2E Latency (ms):                 141497.03
---------------Time to First Token----------------
Mean TTFT (ms):                          17318.98
Median TTFT (ms):                        1609.99
P99 TTFT (ms):                           145094.11
---------------Inter-Token Latency----------------
Mean ITL (ms):                           129.02
Median ITL (ms):                         137.79
P95 ITL (ms):                            149.56
P99 ITL (ms):                            197.43
Max ITL (ms):                            34289.14
==================================================



curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 5 --max-concurrency 1 --output-file res.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 20 --max-concurrency 4 --output-file res.jsonl 
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 80 --max-concurrency 16 --output-file res.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 160 --max-concurrency 32 --output-file res.jsonl

python3 test/srt/parse_results.py res.jsonl



+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  4 |             1.000 |             77.300 |              77.300 |        169.867 |          171.495 |       177.300 |         12.776 |           12.776 |        12.777 |                77.300 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  5 |             4.000 |            266.279 |             266.279 |        313.875 |          325.059 |       552.380 |         14.719 |           14.705 |        14.918 |                66.570 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  6 |            16.000 |            723.906 |             723.906 |        745.947 |          780.635 |       927.365 |         21.371 |           21.336 |        21.944 |                45.244 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  7 |            32.000 |           1226.775 |            1226.775 |       1063.028 |         1116.789 |      1505.019 |         25.038 |           25.033 |        25.887 |                38.337 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

Checklist


@mickqian mickqian changed the title Support DP Attention for step3_vl feat: Support DP Attention for step3_vl Aug 2, 2025

@JustinTong0323 JustinTong0323 left a comment


Awesome

@ispobock ispobock merged commit 00da906 into sgl-project:main Aug 3, 2025
htiennv pushed a commit to htiennv/sglang that referenced this pull request Aug 5, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025