
Conversation

@yhyang201 yhyang201 commented Aug 2, 2025

Motivation

Support DP Attention for step3_vl

Modifications

Prior to this PR, DP Attention was already supported for the LLM component.

This update extends DP Attention support to the vision model component of the VLM.

In vision.py, world_size and tp_size have been replaced with attn_tp_size. This change does not affect VLM models that already use standard tensor parallelism.
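For context, here is a minimal sketch of what this change amounts to (not the exact diff in this PR; the helper path and the simplified class shape are assumptions): head partitioning keys off the attention TP size instead of the global TP world size, so it stays correct when DP Attention shrinks the attention TP group.

```python
# Minimal sketch, not the actual vision.py diff. Assumes a helper like
# sglang.srt.layers.dp_attention.get_attention_tp_size() that returns the
# size of the attention TP group (equal to tp_size when DP Attention is off).
import torch.nn as nn

from sglang.srt.layers.dp_attention import get_attention_tp_size


class VisionAttentionSketch(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        # Before: heads were split over the global TP world size.
        attn_tp_size = get_attention_tp_size()
        assert num_heads % attn_tp_size == 0
        # Each attention-TP rank now owns this many vision attention heads.
        self.num_heads_per_partition = num_heads // attn_tp_size
        self.head_dim = embed_dim // num_heads
```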

For the vision model in step3_vl, when DP Attention is enabled:

  1. The Attention component uses SGLang's DP Attention logic.
  2. For the FFN component, since the vision model is a dense architecture, tensor parallelism is applied within each attention DP group, unlike the LLM's FFN, which is tensor-parallel across the full world size.

In other words, each Attn TP Group (i.e., DP Group) loads a full copy of the vision model; the toy sketch below illustrates this layout.
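The sketch uses the launch flags from the benchmark below (--tp 8 --dp 8). It is plain Python with no SGLang imports, and the rank-to-group mapping is an assumption made only to show why every DP group ends up holding a full vision-model copy.

```python
# Toy illustration of the parallel layout described above (assumed mapping,
# not SGLang's actual group-construction code).
tp_size, dp_size = 8, 8
attn_tp_size = tp_size // dp_size  # attention-TP ranks per DP group (here: 1)

for rank in range(tp_size):
    attn_dp_rank = rank // attn_tp_size  # which attention DP group this GPU is in
    attn_tp_rank = rank % attn_tp_size   # its rank inside that group
    # The vision FFN is tensor-parallel only across attn_tp_size ranks,
    # so each attention DP group holds a complete copy of the vision model.
    print(f"GPU {rank}: attention DP group {attn_dp_rank}, attention TP rank {attn_tp_rank}")
```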

Please note that this setup may not reflect best practice, and is intended solely to ensure that DP Attention can function correctly in step3_vl.

Accuracy Test

Benchmark & Profiling

 python -m sglang.launch_server --model-path stepfun-ai/step3-fp8 --enable-multimodal --enable-dp-attention --tp 8 --dp 8 --trust-remote-code --mem-fraction-static 0.8


curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1000

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  77.48
Total input tokens:                      502493
Total generated tokens:                  499251
Total generated tokens (retokenized):    497840
Request throughput (req/s):              12.91
Input token throughput (tok/s):          6485.58
Output token throughput (tok/s):         6443.74
Total token throughput (tok/s):          12929.32
Concurrency:                             665.05
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   51526.79
Median E2E Latency (ms):                 54199.91
---------------Time to First Token----------------
Mean TTFT (ms):                          9328.53
Median TTFT (ms):                        9057.79
P99 TTFT (ms):                           17728.52
---------------Inter-Token Latency----------------
Mean ITL (ms):                           84.69
Median ITL (ms):                         72.46
P95 ITL (ms):                            85.48
P99 ITL (ms):                            102.15
Max ITL (ms):                            16096.83
==================================================



curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang --dataset-name mmmu --num-prompts 500 --random-output 1000

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     498
Benchmark duration (s):                  50.05
Total input tokens:                      33237
Total generated tokens:                  498000
Total generated tokens (retokenized):    494981
Request throughput (req/s):              9.95
Input token throughput (tok/s):          664.08
Output token throughput (tok/s):         9950.15
Total token throughput (tok/s):          10614.24
Concurrency:                             496.84
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   49933.01
Median E2E Latency (ms):                 49931.56
---------------Time to First Token----------------
Mean TTFT (ms):                          1342.71
Median TTFT (ms):                        1437.51
P99 TTFT (ms):                           1816.03
---------------Inter-Token Latency----------------
Mean ITL (ms):                           48.65
Median ITL (ms):                         48.42
P95 ITL (ms):                            52.76
P99 ITL (ms):                            57.66
Max ITL (ms):                            1130.78
==================================================



curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 5 --max-concurrency 1 --output-file res.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 20 --max-concurrency 4 --output-file res.jsonl 
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 80 --max-concurrency 16 --output-file res.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 160 --max-concurrency 32 --output-file res.jsonl

python3 test/srt/parse_results.py res.jsonl
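parse_results.py itself is not part of this PR; the sketch below is only a guess at what such a script could look like, assuming each line of res.jsonl is a JSON record whose keys match the table columns below, with per_user_throughput derived as output_throughput / max_concurrency.

```python
# Rough sketch of a res.jsonl summarizer (assumed field names, not the
# actual test/srt/parse_results.py). Usage: python3 parse_results.py res.jsonl
import json
import sys

from tabulate import tabulate

COLS = [
    "max_concurrency", "input_throughput", "output_throughput",
    "mean_ttft_ms", "median_ttft_ms", "p99_ttft_ms",
    "mean_tpot_ms", "median_tpot_ms", "p99_tpot_ms",
]

rows = []
with open(sys.argv[1]) as f:
    for i, line in enumerate(f):
        rec = json.loads(line)
        row = [i] + [rec.get(c) for c in COLS]
        # Per-user throughput: output tokens/s divided by the concurrency level.
        row.append(rec["output_throughput"] / rec["max_concurrency"])
        rows.append(row)

print(tabulate(rows, headers=[""] + COLS + ["per_user_throughput"],
               tablefmt="grid", floatfmt=".3f"))
```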


+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |             43.247 |              43.247 |        200.871 |          200.299 |       207.934 |         22.943 |           22.952 |        23.044 |                43.247 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             4.000 |            161.095 |             161.095 |        433.940 |          379.289 |       546.825 |         24.416 |           24.439 |        24.697 |                40.274 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |            16.000 |            499.001 |             499.001 |        645.112 |          692.549 |       787.564 |         31.444 |           31.355 |        31.839 |                31.188 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |            32.000 |            947.501 |             947.501 |        889.175 |          891.117 |      1325.534 |         32.909 |           32.872 |        33.537 |                29.609 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+



========================================
Baseline: the same benchmarks without --enable-dp-attention
 python -m sglang.launch_server --model-path stepfun-ai/step3-fp8 --enable-multimodal --tp 8 --trust-remote-code --mem-fraction-static 0.8


curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1000

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  248.04
Total input tokens:                      502493
Total generated tokens:                  499251
Total generated tokens (retokenized):    498015
Request throughput (req/s):              4.03
Input token throughput (tok/s):          2025.83
Output token throughput (tok/s):         2012.75
Total token throughput (tok/s):          4038.58
Concurrency:                             599.36
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   148667.98
Median E2E Latency (ms):                 162750.46
---------------Time to First Token----------------
Mean TTFT (ms):                          64073.61
Median TTFT (ms):                        33524.83
P99 TTFT (ms):                           175645.98
---------------Inter-Token Latency----------------
Mean ITL (ms):                           169.78
Median ITL (ms):                         141.00
P95 ITL (ms):                            294.74
P99 ITL (ms):                            308.06
Max ITL (ms):                            7152.99
==================================================




curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang --dataset-name mmmu --num-prompts 500 --random-output 1000


============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     498
Benchmark duration (s):                  171.23
Total input tokens:                      33237
Total generated tokens:                  498000
Total generated tokens (retokenized):    494559
Request throughput (req/s):              2.91
Input token throughput (tok/s):          194.11
Output token throughput (tok/s):         2908.37
Total token throughput (tok/s):          3102.48
Concurrency:                             425.13
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   146174.79
Median E2E Latency (ms):                 141497.03
---------------Time to First Token----------------
Mean TTFT (ms):                          17318.98
Median TTFT (ms):                        1609.99
P99 TTFT (ms):                           145094.11
---------------Inter-Token Latency----------------
Mean ITL (ms):                           129.02
Median ITL (ms):                         137.79
P95 ITL (ms):                            149.56
P99 ITL (ms):                            197.43
Max ITL (ms):                            34289.14
==================================================



curl http://127.0.0.1:30000/flush_cache     
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 5 --max-concurrency 1 --output-file res.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 20 --max-concurrency 4 --output-file res.jsonl 
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 80 --max-concurrency 16 --output-file res.jsonl
curl http://127.0.0.1:30000/flush_cache
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1 --num-prompts 160 --max-concurrency 32 --output-file res.jsonl

python3 test/srt/parse_results.py res.jsonl



+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  4 |             1.000 |             77.300 |              77.300 |        169.867 |          171.495 |       177.300 |         12.776 |           12.776 |        12.777 |                77.300 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  5 |             4.000 |            266.279 |             266.279 |        313.875 |          325.059 |       552.380 |         14.719 |           14.705 |        14.918 |                66.570 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  6 |            16.000 |            723.906 |             723.906 |        745.947 |          780.635 |       927.365 |         21.371 |           21.336 |        21.944 |                45.244 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  7 |            32.000 |           1226.775 |            1226.775 |       1063.028 |         1116.789 |      1505.019 |         25.038 |           25.033 |        25.887 |                38.337 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

Checklist


@mickqian mickqian changed the title Support DP Attention for step3_vl feat: Support DP Attention for step3_vl Aug 2, 2025

@JustinTong0323 JustinTong0323 left a comment


Awesome

@ispobock ispobock merged commit 00da906 into sgl-project:main Aug 3, 2025
htiennv pushed a commit to htiennv/sglang that referenced this pull request Aug 5, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025