[Feature] Coarse-grained computation-communication overlap (Throughput Improvement) #6923
Coarse-grained computation-communication hiding
Hello @Jackie-bytestar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR, titled "Update linear.py", focuses on optimizing the `Linear` layer within the sglang runtime (srt). The primary goal, as stated in the description, is to improve throughput by implementing compute-communication hiding. This is achieved by chunking the input data for large tensors and overlapping the computation of one chunk with the communication (specifically, the all-reduce operation in tensor parallelism) of the previous chunk using separate CUDA streams and events. The author, Jackie-bytestar, has provided benchmark results showing a noticeable improvement in throughput (from ~21444 tok/s to ~21821 tok/s) on a 4×H200 setup with a Qwen2.5-72B model and a specific benchmark command, which validates the effectiveness of this approach.
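For readers who want to see the shape of this pattern, here is a minimal, self-contained sketch of chunked compute/all-reduce overlap using separate CUDA streams and events. The function and parameter names (`chunked_matmul_allreduce`, `all_reduce_fn`, `num_chunks`) are illustrative only and are not the code in this PR.

```python
import torch

def chunked_matmul_allreduce(input_, weight, all_reduce_fn, num_chunks=3):
    """Minimal sketch: overlap the matmul of chunk i+1 (default stream)
    with the all-reduce of chunk i (communication stream)."""
    chunks = torch.chunk(input_, num_chunks, dim=0)
    comm_stream = torch.cuda.Stream()
    outputs = []

    for chunk in chunks:
        # Compute this chunk's partial output on the default stream.
        out = torch.matmul(chunk, weight.t())
        ev = torch.cuda.Event()
        ev.record()  # marks the point where this chunk's matmul is done
        outputs.append(out)

        # Run the all-reduce on a separate stream; it waits only for this
        # chunk's matmul, so the next chunk's matmul can run concurrently.
        with torch.cuda.stream(comm_stream):
            comm_stream.wait_event(ev)
            out.record_stream(comm_stream)  # keep the allocator aware of cross-stream use
            all_reduce_fn(out)              # e.g. an in-place torch.distributed.all_reduce

    # Do not read the outputs until all communication has completed.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return torch.cat(outputs, dim=0)
```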
Highlights
- Performance Optimization: The core of this PR is a performance optimization targeting the `Linear` layer, aiming to improve throughput.
- Compute-Communication Hiding: The optimization implements compute-communication hiding by processing large inputs in chunks and overlapping the matrix multiplication (compute) with the tensor-parallel all-reduce (communication) using CUDA streams.
- Chunking and CUDA Streams: The implementation splits large input tensors into chunks, performs the linear operation on each chunk, and uses a dedicated CUDA stream for the all-reduce operation to enable overlap with the computation of the next chunk.
- Dynamic Chunk Size: A default chunk size is introduced, but for very large inputs (>= 16384), the chunk size is dynamically adjusted based on the input dimension.
Changelog
- python/sglang/srt/layers/linear.py
  - Added a `chunk_size` attribute to the `Linear` class, initialized to 5120.
  - Modified the `forward` method to implement chunking logic for inputs with size >= 16384.
  - Dynamically set `self.chunk_size` to `input_.size(0) // 3` for large inputs.
  - Introduced a separate CUDA stream (`comm_stream`) for the `tensor_model_parallel_all_reduce` operation.
  - Added CUDA events (`compute_events`, `comm_events`) to synchronize computation and communication streams.
  - Implemented a loop to process input chunks, performing computation and overlapping communication where applicable.
  - Added logic to fall back to the original non-chunked forward pass for small inputs (fewer than 2 chunks), as illustrated in the sketch after this changelog.
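Put together, the chunk-size selection and fallback behavior described above roughly amounts to the following simplified sketch. The constant and helper names are illustrative; only the values 5120, 16384, and 3 come from this changelog.

```python
DEFAULT_CHUNK_SIZE = 5120        # default chunk_size introduced by this PR
LARGE_INPUT_THRESHOLD = 16384    # inputs at or above this size get a dynamic chunk size

def select_chunk_size(num_tokens: int) -> int:
    """Sketch of the chunk-size rule described in the changelog."""
    if num_tokens >= LARGE_INPUT_THRESHOLD:
        return num_tokens // 3   # dynamic chunk size for very large inputs
    return DEFAULT_CHUNK_SIZE

def use_chunked_path(num_tokens: int, chunk_size: int) -> bool:
    """Fall back to the original forward pass when there are fewer than 2 chunks."""
    return num_tokens // chunk_size >= 2
```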
Code Review
This pull request introduces chunking and CUDA stream utilization in the `RowParallelLinear` layer's `forward` method, aiming to improve throughput by overlapping computation and communication. The benchmark results show a modest improvement, which is a good step.
However, there are a few critical and medium-severity issues that need addressing, including a potential `UnboundLocalError` for `output_bias`, the use of magic numbers, and questions regarding the CUDA stream synchronization logic. Additionally, the unit test checklist item in the PR description is unchecked, which is a concern for new complex logic like this.
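For context, the `UnboundLocalError` can be avoided by defining `output_bias` on the chunked path as well. A minimal sketch, assuming the layer follows the usual `skip_bias_add` convention of row-parallel linear layers (the class and attribute names below are illustrative stand-ins, not the exact sglang code):

```python
import torch

class RowParallelLinearSketch(torch.nn.Module):
    """Toy stand-in for the real RowParallelLinear, showing only the
    output_bias handling that the chunked path also needs."""

    def __init__(self, in_features: int, out_features: int, skip_bias_add: bool = False):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))
        self.bias = torch.nn.Parameter(torch.zeros(out_features))
        self.skip_bias_add = skip_bias_add

    def forward(self, input_: torch.Tensor):
        output = torch.nn.functional.linear(input_, self.weight)
        # Define output_bias on every path (chunked or not) before returning,
        # so `return output, output_bias` can never raise UnboundLocalError.
        output_bias = self.bias if self.skip_bias_add else None
        if not self.skip_bias_add:
            output = output + self.bias
        return output, output_bias
```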
Summary of Findings
- Potential `UnboundLocalError` for `output_bias`: In the chunked execution path of the `forward` method, `output_bias` is not defined before being returned, which will likely cause a runtime error. This is a critical issue.
- CUDA Stream Synchronization in Non-Communicating Path: The CUDA stream synchronization logic in the `else` branch of the chunking loop (when no `all_reduce` is performed) might serialize chunk processing rather than achieving pipelining. Clarification on the intended behavior is needed. (Severity: medium)
- Use of Magic Numbers: Several numerical values (e.g., `5120`, `16384`, `3`) related to chunking are hardcoded. These should ideally be named constants or configurable parameters with clear justification. (Severity: medium)
- Mutability of `self.chunk_size` in `forward`: The instance attribute `self.chunk_size` is modified within the `forward` method, which could lead to unexpected side effects if the layer instance is reused. Using a local variable for the current operation's chunk size is recommended. (Severity: medium)
- Unit Testing for New Logic: The PR description's checklist item for adding unit tests is unchecked. Given the complexity of the new chunking and CUDA stream logic, comprehensive unit tests are crucial to ensure correctness and prevent regressions (see the test sketch after this list). (Severity: medium)
- Code Duplication: There is some duplication of the core input processing and quantization logic between the non-chunked path and the chunked path within the `forward` method. While potentially acceptable for clarity or minor performance reasons, it slightly impacts maintainability. (Severity: low, not commented directly due to review settings)
- Code Comments for Stream Logic: The new CUDA stream handling logic, especially the synchronization strategy, could benefit from inline comments to improve understanding and maintainability. (Severity: low, not commented directly due to review settings)
Merge Readiness
This pull request introduces a promising approach to improve throughput. However, there is a critical bug concerning `output_bias` that must be fixed. Additionally, there are medium-severity concerns regarding magic numbers, instance variable mutation, and CUDA stream synchronization that should be addressed. It is also highly recommended to add unit tests for the new complex logic.
Due to these issues, particularly the critical bug, I recommend that changes be made before this PR is merged. I am not authorized to approve pull requests.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Coarse-grained computation-communication hiding
Motivation
Compute-communication hiding by means of chunking and CUDA streams to improve throughput.
Checklist
Experiment setup
4×H200
command:
Results
The baseline is SGLang Release v0.4.6
Repeated 5 times; the following throughput results were obtained (via `bench_offline_throughput.py`):
Baseline: 21457.22, 21460.65, 21452.04, 21460.67, 21392.38 tok/s
PR: 21854.43, 21846.43, 21789.11, 21831.50, 21784.51 tok/s
Average:
baseline: 21444.592 tok/s
PR: 21821.196 tok/s
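For reference, the reported averages and the relative gain can be reproduced from the raw numbers above:

```python
baseline = [21457.22, 21460.65, 21452.04, 21460.67, 21392.38]
pr = [21854.43, 21846.43, 21789.11, 21831.50, 21784.51]

baseline_avg = sum(baseline) / len(baseline)   # 21444.592 tok/s
pr_avg = sum(pr) / len(pr)                     # 21821.196 tok/s
gain_pct = (pr_avg / baseline_avg - 1) * 100   # ~1.76 % throughput improvement
print(f"{baseline_avg:.3f} -> {pr_avg:.3f} tok/s (+{gain_pct:.2f}%)")
```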