Conversation

@trevor-m (Collaborator) commented on Jul 31, 2025

Motivation

By performing layernorm before the all-gather, each rank operates on only 1/DP of the tokens, reducing the computation time of the normalization.

Modifications

Perform layernorm before the DP gather in the layer communicator; a sketch of the reordering follows below.
Currently this path is only enabled when DP == TP.
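
Because layernorm normalizes each token independently along the hidden dimension, normalizing the local DP shard before the gather produces the same result as normalizing the full gathered tensor; only the per-rank work changes. Below is a minimal sketch of the reordering, assuming a PyTorch DP process group; the function and variable names are illustrative and are not the actual `_gather_hidden_states_and_residual` code.

```python
# Illustrative sketch only, not the actual SGLang communicator code.
import torch
import torch.distributed as dist


def gather_then_norm(local_hidden, norm, dp_group):
    """Old order: all-gather first, then every rank normalizes DP * local tokens."""
    world = dist.get_world_size(dp_group)
    gathered = torch.empty(world * local_hidden.shape[0], local_hidden.shape[1],
                           dtype=local_hidden.dtype, device=local_hidden.device)
    dist.all_gather_into_tensor(gathered, local_hidden, group=dp_group)
    return norm(gathered)  # norm runs over all gathered tokens


def norm_then_gather(local_hidden, norm, dp_group):
    """New order: normalize only the local 1/DP shard, then all-gather the result."""
    world = dist.get_world_size(dp_group)
    normed = norm(local_hidden)  # norm runs over only the local tokens
    gathered = torch.empty(world * normed.shape[0], normed.shape[1],
                           dtype=normed.dtype, device=normed.device)
    dist.all_gather_into_tensor(gathered, normed, group=dp_group)
    return gathered
```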

Accuracy Test

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 --enable-flashinfer-cutlass-moe --enable-ep-moe --ep-size 8 --dp 8 --enable-dp-attention
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=30000
Accuracy: 0.959
Invalid: 0.000
Latency: 22.890 s
Output throughput: 6335.267 token/s

Benchmark & Profiling

Speedup: 3.79% end to end (total token throughput 27310.02 → 28345.37 tok/s, a 1.0379x ratio)

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 --enable-flashinfer-cutlass-moe --enable-ep-moe --ep-size 8 --dp 8 --enable-dp-attention
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 1024 --random-input 1024 --random-output 1024 --random-range-ratio 1 --max-concurrency 1024

BEFORE

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  76.79
Total input tokens:                      1048576
Total generated tokens:                  1048576
Total generated tokens (retokenized):    1046065
Request throughput (req/s):              13.33
Input token throughput (tok/s):          13655.01
Output token throughput (tok/s):         13655.01
Total token throughput (tok/s):          27310.02
Concurrency:                             1021.18
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   76579.19
Median E2E Latency (ms):                 76556.41
---------------Time to First Token----------------
Mean TTFT (ms):                          11978.29
Median TTFT (ms):                        11876.27
P99 TTFT (ms):                           21513.72
---------------Inter-Token Latency----------------
Mean ITL (ms):                           63.15
Median ITL (ms):                         53.14
P95 ITL (ms):                            69.91
P99 ITL (ms):                            77.50
Max ITL (ms):                            18880.99
==================================================

AFTER

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  73.99
Total input tokens:                      1048576
Total generated tokens:                  1048576
Total generated tokens (retokenized):    1044921
Request throughput (req/s):              13.84
Input token throughput (tok/s):          14172.69
Output token throughput (tok/s):         14172.69
Total token throughput (tok/s):          28345.37
Concurrency:                             1021.65
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   73816.06
Median E2E Latency (ms):                 73829.85
---------------Time to First Token----------------
Mean TTFT (ms):                          10411.04
Median TTFT (ms):                        10263.18
P99 TTFT (ms):                           19516.86
---------------Inter-Token Latency----------------
Mean ITL (ms):                           61.98
Median ITL (ms):                         52.59
P95 ITL (ms):                            67.60
P99 ITL (ms):                            74.52
Max ITL (ms):                            17244.45
==================================================

@gemini-code-assist (bot, Contributor) left a comment

Summary of Changes

Hello @trevor-m, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a critical performance optimization within the data parallel communication layer. By strategically moving the layer normalization step to occur before the all-gather operation, the system can perform computations on smaller, local data chunks, leading to reduced overhead and improved overall throughput. The changes are validated by significant benchmark improvements without any reported accuracy regressions.

Highlights

  • Performance Optimization: Reordered the Layer Normalization (layernorm) operation to occur before the Data Parallel (DP) gather operation within the _gather_hidden_states_and_residual function in python/sglang/srt/layers/communicator.py.
  • Efficiency Improvement: This change allows layernorm to operate on a smaller, local subset of tokens (specifically, 1/DPth of the tokens), which significantly reduces the computational time required for the normalization step.
  • Code Refinement: Removed an unnecessary dp_scatter operation, simplifying the data flow within the communicator layer.
  • Benchmark Results: Achieved a notable 3.79% end-to-end speedup, with total token throughput increasing from approximately 27310 tokens/s to 28345 tokens/s, demonstrating the effectiveness of the optimization.

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request aims to improve performance by moving the Layer Normalization operation before the all-gather in data parallel settings. The benchmark results show a speedup.

My review identifies a critical correctness issue. The change, combined with existing code, leads to inconsistent tensors across tensor parallel ranks when tp_size is not equal to dp_size. This will cause silent numerical errors in such configurations.
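
For reference, the DP == TP restriction mentioned in the PR description is one way to avoid this. A hypothetical sketch of such a guard follows (placeholder names, not the actual communicator code):

```python
# Hypothetical guard, illustrative only (dp_gather / layernorm are placeholder callables).
def maybe_norm_before_gather(hidden_states, layernorm, dp_gather, dp_size, tp_size):
    if dp_size == tp_size:
        # Each TP rank owns exactly one DP shard, so the per-rank norm is safe.
        hidden_states = layernorm(hidden_states)
        hidden_states = dp_gather(hidden_states)
    else:
        # Fall back to the original order to keep all TP ranks consistent.
        hidden_states = dp_gather(hidden_states)
        hidden_states = layernorm(hidden_states)
    return hidden_states
```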

@trevor-m trevor-m changed the title Do layernorm before allgather for DP Draft: Do layernorm before allgather for DP Jul 31, 2025
@trevor-m trevor-m changed the title Draft: Do layernorm before allgather for DP Do layernorm before allgather for DP Jul 31, 2025
@trevor-m trevor-m changed the title Do layernorm before allgather for DP Do layernorm before allgather for DP attention Jul 31, 2025
@kaixih (Collaborator) commented on Aug 1, 2025

LGTM! Thx.

@kushanam (Collaborator) commented on Aug 1, 2025

@ch-wan could you please take a look? tnx

@zhyncs zhyncs self-assigned this Aug 1, 2025
@ch-wan (Collaborator) left a comment

LGTM

@ch-wan ch-wan merged commit 32f2815 into sgl-project:main Aug 3, 2025
60 of 64 checks passed
htiennv pushed a commit to htiennv/sglang that referenced this pull request Aug 5, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025