
Conversation

@Jacki1223 Jacki1223 commented Jun 6, 2025

Coarse-grained computation-communication overlap

Motivation

Hide communication behind computation by splitting large inputs into chunks and running the tensor-parallel all-reduce on a separate CUDA stream, in order to improve throughput.

Checklist

Experiment setup

4×H200

command:

export CUDA_VISIBLE_DEVICES=4,5,6,7
model_path=/models/Qwen2.5-72B-Instruct-FP8-PER-TENSOR
dataset_path=/data/ShareGPT_V3_unfiltered_cleaned_split.json
python -m sglang.bench_offline_throughput \
        --model-path ${model_path} \
        --dataset-name random \
        --random-input 1024 \
        --random-output 1 \
        --num-prompts 1000 \
        --dataset-path  ${dataset_path} \
        --chunked-prefill-size 16384 \
        --tp-size 4

Results

The baseline is SGLang Release v0.4.6

Repeated the test 5 times; bench_offline_throughput.py reports the following throughput:
Baseline: 21457.22 tok/s, 21460.65 tok/s, 21452.04 tok/s, 21460.67 tok/s, 21392.38 tok/s
PR: 21854.43 tok/s, 21846.43 tok/s, 21789.11 tok/s, 21831.50 tok/s, 21784.51 tok/s

Averages:

Baseline: 21444.592 tok/s
PR: 21821.196 tok/s (≈1.76% improvement over the baseline)


@gemini-code-assist gemini-code-assist bot left a comment


Hello @Jackie-bytestar, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR, titled "Update linear.py", focuses on optimizing the Linear layer within the sglang runtime (srt). The primary goal, as stated in the description, is to improve throughput by implementing compute-communication hiding. This is achieved by chunking the input data for large tensors and overlapping the computation of one chunk with the communication (specifically, the all-reduce operation in tensor parallelism) of the previous chunk using separate CUDA streams and events. The author, Jackie-bytestar, has provided benchmark results showing a noticeable improvement in throughput (from ~21444 tok/s to ~21821 tok/s) on a 4xH200 setup with a Qwen2.5-72B model and a specific benchmark command, which validates the effectiveness of this approach.

Highlights

  • Performance Optimization: The core of this PR is a performance optimization targeting the Linear layer, aiming to improve throughput.
  • Compute-Communication Hiding: The optimization implements compute-communication hiding by processing large inputs in chunks and overlapping the matrix multiplication (compute) with the tensor parallel all-reduce (communication) using CUDA streams.
  • Chunking and CUDA Streams: The implementation involves splitting large input tensors into chunks, performing the linear operation on each chunk, and using a dedicated CUDA stream for the all-reduce operation to enable overlap with the computation of the next chunk (a minimal pattern for this overlap is sketched after this list).
  • Dynamic Chunk Size: A default chunk size is introduced, but for very large inputs (>= 16384), the chunk size is dynamically adjusted based on the input dimension.
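
Below is a tiny, hypothetical illustration of the event/stream handshake described above; it is not the PR's code. It assumes torch.distributed has already been initialized with an NCCL process group, and unrolls just two chunks so the overlap is visible: while chunk 0's all-reduce runs on the communication stream, chunk 1's matmul proceeds on the default stream.

```python
import torch
import torch.distributed as dist

def overlapped_two_chunks(x0, x1, w, comm_stream):
    y0 = x0 @ w.t()                              # chunk 0 compute (default stream)
    e0 = torch.cuda.Event()
    e0.record()                                  # marks "chunk 0 output is ready"

    with torch.cuda.stream(comm_stream):
        comm_stream.wait_event(e0)               # don't reduce before y0 exists
        dist.all_reduce(y0)                      # overlaps with chunk 1's matmul below
        c0 = torch.cuda.Event()
        c0.record()

    y1 = x1 @ w.t()                              # chunk 1 compute (default stream)
    e1 = torch.cuda.Event()
    e1.record()

    with torch.cuda.stream(comm_stream):
        comm_stream.wait_event(e1)
        dist.all_reduce(y1)
        c1 = torch.cuda.Event()
        c1.record()

    # The default stream must not read the reduced chunks until the comm stream is done.
    torch.cuda.current_stream().wait_event(c0)
    torch.cuda.current_stream().wait_event(c1)
    return torch.cat([y0, y1], dim=0)
```

(Production code would also need to keep y0/y1 alive across streams, e.g. via Tensor.record_stream, so the caching allocator does not reuse their memory early.)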

Changelog

  • python/sglang/srt/layers/linear.py
    • Added a chunk_size attribute to the Linear class, initialized to 5120.
    • Modified the forward method to implement chunking logic for inputs with size >= 16384.
    • Dynamically set self.chunk_size to input_.size(0)//3 for large inputs.
    • Introduced a separate CUDA stream (comm_stream) for the tensor_model_parallel_all_reduce operation.
    • Added CUDA events (compute_events, comm_events) to synchronize computation and communication streams.
    • Implemented a loop to process input chunks, performing computation and overlapping communication where applicable.
    • Added logic to fall back to the original non-chunked forward pass for small inputs (fewer than 2 chunks). A sketch of how these pieces fit together follows.
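
The following sketch puts the changelog items together in one place. It is a simplified stand-in for the modified RowParallelLinear.forward, not the PR's actual diff: the class name, the plain matmul in place of the quantized linear method, and the direct dist.all_reduce call (instead of tensor_model_parallel_all_reduce) are assumptions made to keep the example self-contained.

```python
import torch
import torch.distributed as dist
from torch import nn

class ChunkedRowParallelLinearSketch(nn.Module):
    DEFAULT_CHUNK_SIZE = 5120        # default chunk size, as in the changelog
    LARGE_INPUT_THRESHOLD = 16384    # inputs at least this long get a dynamic chunk size

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.chunk_size = self.DEFAULT_CHUNK_SIZE
        self.comm_stream = torch.cuda.Stream()   # dedicated stream for the all-reduce

    def forward(self, input_):
        n = input_.size(0)
        chunk_size = self.chunk_size
        if n >= self.LARGE_INPUT_THRESHOLD:
            chunk_size = n // 3                  # dynamic chunk size for very large inputs

        if n < 2 * chunk_size:
            # Fewer than 2 chunks: original, non-chunked path.
            out = torch.matmul(input_, self.weight.t())
            dist.all_reduce(out)
            return out

        outputs, comm_events = [], []
        for chunk in torch.split(input_, chunk_size, dim=0):
            out = torch.matmul(chunk, self.weight.t())    # compute on the default stream
            compute_event = torch.cuda.Event()
            compute_event.record()
            with torch.cuda.stream(self.comm_stream):
                self.comm_stream.wait_event(compute_event)
                dist.all_reduce(out)                      # overlaps the next chunk's matmul
                comm_event = torch.cuda.Event()
                comm_event.record()
            comm_events.append(comm_event)
            outputs.append(out)

        # Re-join the default stream only after every chunk's all-reduce has been ordered.
        for event in comm_events:
            torch.cuda.current_stream().wait_event(event)
        return torch.cat(outputs, dim=0)
```
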
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces chunking and CUDA stream utilization in the RowParallelLinear layer's forward method, aiming to improve throughput by overlapping computation and communication. The benchmark results show a modest improvement, which is a good step.

However, there are a few critical and medium-severity issues that need addressing, including a potential UnboundLocalError for output_bias, the use of magic numbers, and questions regarding CUDA stream synchronization logic. Additionally, the unit test checklist item in the PR description is unchecked, which is a concern for new complex logic like this.

Summary of Findings

  • Potential UnboundLocalError for output_bias: In the chunked execution path of the forward method, output_bias is not defined before being returned, which will likely cause a runtime error. This is a critical issue.
  • CUDA Stream Synchronization in Non-Communicating Path: The CUDA stream synchronization logic in the else branch of the chunking loop (when no all_reduce is performed) might serialize chunk processing rather than achieving pipelining. Clarification on the intended behavior is needed. (Severity: medium)
  • Use of Magic Numbers: Several numerical values (e.g., 5120, 16384, 3) related to chunking are hardcoded. These should ideally be named constants or configurable parameters with clear justification. (Severity: medium)
  • Mutability of self.chunk_size in forward: The instance attribute self.chunk_size is modified within the forward method, which could lead to unexpected side effects if the layer instance is reused. Using a local variable for the current operation's chunk size is recommended (one possible layout is sketched after this list). (Severity: medium)
  • Unit Testing for New Logic: The PR description's checklist for adding unit tests is unchecked. Given the complexity of the new chunking and CUDA stream logic, comprehensive unit tests are crucial to ensure correctness and prevent regressions. (Severity: medium)
  • Code Duplication: There's some duplication of the core input processing and quantization logic between the non-chunked path and the chunked path within the forward method. While potentially acceptable for clarity or minor performance reasons, it slightly impacts maintainability. (Severity: low, not commented directly due to review settings)
  • Code Comments for Stream Logic: The new CUDA stream handling logic, especially the synchronization strategy, could benefit from inline comments to improve understanding and maintainability. (Severity: low, not commented directly due to review settings)
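
As a concrete illustration of the medium-severity findings above, here is one way the branching could be laid out. It is a sketch under assumed names, not the PR's code: the constants, the RowParallelLinearSketch class, and the _plain_forward/_chunked_forward helpers are hypothetical placeholders. The magic numbers become named constants, the chunk size stays a local variable, and output_bias is bound before branching so every return path defines it.

```python
import torch
from torch import nn

DEFAULT_CHUNK_SIZE = 5120        # was a hardcoded literal
LARGE_INPUT_THRESHOLD = 16384    # was a hardcoded literal
LARGE_INPUT_SPLITS = 3           # was a hardcoded literal

class RowParallelLinearSketch(nn.Module):
    def __init__(self, in_features, out_features, skip_bias_add=False):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.skip_bias_add = skip_bias_add

    def _plain_forward(self, input_):
        # Placeholder for the original path (quantized matmul + tensor-parallel all-reduce).
        return input_ @ self.weight.t()

    def _chunked_forward(self, input_, chunk_size):
        # Placeholder for the chunked, stream-overlapped path shown earlier.
        chunks = torch.split(input_, chunk_size, dim=0)
        return torch.cat([self._plain_forward(c) for c in chunks], dim=0)

    def forward(self, input_):
        # Bound once, before branching, so no return path can hit an unbound name.
        output_bias = self.bias if self.skip_bias_add else None

        chunk_size = DEFAULT_CHUNK_SIZE                 # local: self.chunk_size is not mutated
        if input_.size(0) >= LARGE_INPUT_THRESHOLD:
            chunk_size = input_.size(0) // LARGE_INPUT_SPLITS

        if input_.size(0) < 2 * chunk_size:
            output = self._plain_forward(input_)
        else:
            output = self._chunked_forward(input_, chunk_size)
        return output, output_bias
```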

Merge Readiness

This pull request introduces a promising approach to improve throughput. However, there is a critical bug concerning output_bias that must be fixed. Additionally, there are medium-severity concerns regarding magic numbers, instance variable mutation, and CUDA stream synchronization that should be addressed. It's also highly recommended to add unit tests for the new complex logic.

Due to these issues, particularly the critical bug, I recommend that changes be made before this PR is merged. I am not authorized to approve pull requests.

Jacki1223 and others added 3 commits June 6, 2025 19:54
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@Jacki1223 Jacki1223 changed the title from "Update linear.py" to "Coarse-grained computations-communicate hidden" Jun 6, 2025
@Jacki1223 Jacki1223 marked this pull request as ready for review June 7, 2025 06:24
@Jacki1223 Jacki1223 changed the title from "Coarse-grained computations-communicate hidden" to "Coarse-grained computations-communicate hidden(Throughput Improvement)" Jun 10, 2025
@Jacki1223 Jacki1223 changed the title from "Coarse-grained computations-communicate hidden(Throughput Improvement)" to "Coarse-grained computations-communicate overlap(Throughput Improvement)" Jun 10, 2025
@Jacki1223 Jacki1223 changed the title from "Coarse-grained computations-communicate overlap(Throughput Improvement)" to "[Feature] Coarse-grained computations-communicate overlap(Throughput Improvement)" Jun 10, 2025
@Jacki1223 Jacki1223 mentioned this pull request Jun 11, 2025
@Jacki1223 Jacki1223 closed this Jun 16, 2025