[Feature] Coarse-grained computation-communication overlap (Throughput Improvement) #6923
Coarse-grained computation-communication hiding
Hello @Jackie-bytestar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR, titled "Update linear.py", focuses on optimizing the `Linear` layer within the sglang runtime (srt). The primary goal, as stated in the description, is to improve throughput by implementing compute-communication hiding. This is achieved by chunking the input data for large tensors and overlapping the computation of one chunk with the communication (specifically, the all-reduce operation in tensor parallelism) of the previous chunk using separate CUDA streams and events. The author, Jackie-bytestar, has provided benchmark results showing a noticeable improvement in throughput (from ~21444 tok/s to ~21821 tok/s) on a 4×H200 setup with a Qwen2.5-72B model and a specific benchmark command, which validates the effectiveness of this approach.
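For readers who want to see the shape of this pattern, here is a minimal, self-contained sketch of chunked compute/all-reduce overlap using separate CUDA streams and events. The function and parameter names (`chunked_matmul_allreduce`, `all_reduce_fn`, `num_chunks`) are illustrative only and are not the code in this PR.

```python
import torch

def chunked_matmul_allreduce(input_, weight, all_reduce_fn, num_chunks=3):
    """Minimal sketch: overlap the matmul of chunk i+1 (default stream)
    with the all-reduce of chunk i (communication stream)."""
    chunks = torch.chunk(input_, num_chunks, dim=0)
    comm_stream = torch.cuda.Stream()
    outputs = []

    for chunk in chunks:
        # Compute this chunk's partial output on the default stream.
        out = torch.matmul(chunk, weight.t())
        ev = torch.cuda.Event()
        ev.record()  # marks the point where this chunk's matmul is done
        outputs.append(out)

        # Run the all-reduce on a separate stream; it waits only for this
        # chunk's matmul, so the next chunk's matmul can run concurrently.
        with torch.cuda.stream(comm_stream):
            comm_stream.wait_event(ev)
            out.record_stream(comm_stream)  # keep the allocator aware of cross-stream use
            all_reduce_fn(out)              # e.g. an in-place torch.distributed.all_reduce

    # Do not read the outputs until all communication has completed.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return torch.cat(outputs, dim=0)
```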
Highlights
- Performance Optimization: The core of this PR is a performance optimization targeting the `Linear` layer, aiming to improve throughput.
- Compute-Communication Hiding: The optimization implements compute-communication hiding by processing large inputs in chunks and overlapping the matrix multiplication (compute) with the tensor-parallel all-reduce (communication) using CUDA streams.
- Chunking and CUDA Streams: The implementation splits large input tensors into chunks, performs the linear operation on each chunk, and uses a dedicated CUDA stream for the all-reduce operation to enable overlap with the computation of the next chunk.
- Dynamic Chunk Size: A default chunk size is introduced, but for very large inputs (>= 16384), the chunk size is dynamically adjusted based on the input dimension.
Changelog
- python/sglang/srt/layers/linear.py
  - Added a `chunk_size` attribute to the `Linear` class, initialized to 5120.
  - Modified the `forward` method to implement chunking logic for inputs with size >= 16384.
  - Dynamically set `self.chunk_size` to `input_.size(0) // 3` for large inputs.
  - Introduced a separate CUDA stream (`comm_stream`) for the `tensor_model_parallel_all_reduce` operation.
  - Added CUDA events (`compute_events`, `comm_events`) to synchronize computation and communication streams.
  - Implemented a loop to process input chunks, performing computation and overlapping communication where applicable.
  - Added logic to fall back to the original non-chunked forward pass for small inputs (fewer than 2 chunks), as illustrated in the sketch after this changelog.
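Put together, the chunk-size selection and fallback behavior described above roughly amounts to the following simplified sketch. The constant and helper names are illustrative; only the values 5120, 16384, and 3 come from this changelog.

```python
DEFAULT_CHUNK_SIZE = 5120        # default chunk_size introduced by this PR
LARGE_INPUT_THRESHOLD = 16384    # inputs at or above this size get a dynamic chunk size

def select_chunk_size(num_tokens: int) -> int:
    """Sketch of the chunk-size rule described in the changelog."""
    if num_tokens >= LARGE_INPUT_THRESHOLD:
        return num_tokens // 3   # dynamic chunk size for very large inputs
    return DEFAULT_CHUNK_SIZE

def use_chunked_path(num_tokens: int, chunk_size: int) -> bool:
    """Fall back to the original forward pass when there are fewer than 2 chunks."""
    return num_tokens // chunk_size >= 2
```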
Code Review
This pull request introduces chunking and CUDA stream utilization in the `RowParallelLinear` layer's `forward` method, aiming to improve throughput by overlapping computation and communication. The benchmark results show a modest improvement, which is a good step.
However, there are a few critical and medium-severity issues that need addressing, including a potential `UnboundLocalError` for `output_bias`, the use of magic numbers, and questions regarding the CUDA stream synchronization logic. Additionally, the unit test checklist item in the PR description is unchecked, which is a concern for new complex logic like this.
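For context, the `UnboundLocalError` can be avoided by defining `output_bias` on the chunked path as well. A minimal sketch, assuming the layer follows the usual `skip_bias_add` convention of row-parallel linear layers (the class and attribute names below are illustrative stand-ins, not the exact sglang code):

```python
import torch

class RowParallelLinearSketch(torch.nn.Module):
    """Toy stand-in for the real RowParallelLinear, showing only the
    output_bias handling that the chunked path also needs."""

    def __init__(self, in_features: int, out_features: int, skip_bias_add: bool = False):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))
        self.bias = torch.nn.Parameter(torch.zeros(out_features))
        self.skip_bias_add = skip_bias_add

    def forward(self, input_: torch.Tensor):
        output = torch.nn.functional.linear(input_, self.weight)
        # Define output_bias on every path (chunked or not) before returning,
        # so `return output, output_bias` can never raise UnboundLocalError.
        output_bias = self.bias if self.skip_bias_add else None
        if not self.skip_bias_add:
            output = output + self.bias
        return output, output_bias
```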
Summary of Findings
- Potential `UnboundLocalError` for `output_bias`: In the chunked execution path of the `forward` method, `output_bias` is not defined before being returned, which will likely cause a runtime error. This is a critical issue.
- CUDA Stream Synchronization in Non-Communicating Path: The CUDA stream synchronization logic in the `else` branch of the chunking loop (when no `all_reduce` is performed) might serialize chunk processing rather than achieving pipelining. Clarification on the intended behavior is needed. (Severity: medium)
- Use of Magic Numbers: Several numerical values (e.g., `5120`, `16384`, `3`) related to chunking are hardcoded. These should ideally be named constants or configurable parameters with clear justification. (Severity: medium)
- Mutability of `self.chunk_size` in `forward`: The instance attribute `self.chunk_size` is modified within the `forward` method, which could lead to unexpected side effects if the layer instance is reused. Using a local variable for the current operation's chunk size is recommended. (Severity: medium)
- Unit Testing for New Logic: The PR description's checklist item for adding unit tests is unchecked. Given the complexity of the new chunking and CUDA stream logic, comprehensive unit tests are crucial to ensure correctness and prevent regressions (see the test sketch after this list). (Severity: medium)
- Code Duplication: There is some duplication of the core input processing and quantization logic between the non-chunked path and the chunked path within the `forward` method. While potentially acceptable for clarity or minor performance reasons, it slightly impacts maintainability. (Severity: low, not commented directly due to review settings)
- Code Comments for Stream Logic: The new CUDA stream handling logic, especially the synchronization strategy, could benefit from inline comments to improve understanding and maintainability. (Severity: low, not commented directly due to review settings)
Merge Readiness
This pull request introduces a promising approach to improve throughput. However, there is a critical bug concerning `output_bias` that must be fixed. Additionally, there are medium-severity concerns regarding magic numbers, instance variable mutation, and CUDA stream synchronization that should be addressed. It is also highly recommended to add unit tests for the new complex logic.
Due to these issues, particularly the critical bug, I recommend that changes be made before this PR is merged. I am not authorized to approve pull requests.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Coarse-grained computation-communication hiding
Motivation
Compute-communication hiding by means of chunking and CUDA streams to improve throughput.
Checklist
Experiment setup
4×H200
command:
Results
The baseline is SGLang Release v0.4.6
Repeated 5 times; the following throughput results were obtained (via `bench_offline_throughput.py`):
Baseline: 21457.22, 21460.65, 21452.04, 21460.67, 21392.38 tok/s
PR: 21854.43, 21846.43, 21789.11, 21831.50, 21784.51 tok/s
Average:
baseline: 21444.592 tok/s
PR: 21821.196 tok/s
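For reference, the reported averages and the relative gain can be reproduced from the raw numbers above:

```python
baseline = [21457.22, 21460.65, 21452.04, 21460.67, 21392.38]
pr = [21854.43, 21846.43, 21789.11, 21831.50, 21784.51]

baseline_avg = sum(baseline) / len(baseline)   # 21444.592 tok/s
pr_avg = sum(pr) / len(pr)                     # 21821.196 tok/s
gain_pct = (pr_avg / baseline_avg - 1) * 100   # ~1.76 % throughput improvement
print(f"{baseline_avg:.3f} -> {pr_avg:.3f} tok/s (+{gain_pct:.2f}%)")
```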