[Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations #6994
Conversation
Hello @lifuhuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request authored by @lifuhuang. This PR focuses on optimizing the performance of LoRA inference within the SGLang framework. The primary goals are to eliminate performance bottlenecks caused by CUDA stream synchronizations and reduce redundant computations in the LoRA management logic. The description highlights significant improvements in latency metrics (ITL and TTFT), showing reductions of over 30% compared to the baseline, and includes profiling traces demonstrating the successful removal of synchronization bubbles.
Highlights
- Performance Optimization: The core objective is to significantly improve LoRA inference speed by addressing key performance inhibitors.
- Eliminate Stream Synchronizations: A major focus is removing explicit `cudaStreamSynchronize` calls, which are identified as causing significant delays and bubbles in the execution timeline, as shown in the profiling results.
- Reduce Redundant Computations: Optimizes the preparation of batch information, particularly for CUDA graphs, by pre-calculating and reusing constant values.
- Asynchronous Data Transfer: Introduces the use of pinned memory and non-blocking copies for transferring LoRA adapter metadata (indices, ranks, scalings) from host to device, enabling better overlap of computation and data transfer (see the sketch after this list).
- Memory Pool Simplification: Refactors the LoRA memory pool's buffer slot allocation logic for improved clarity and efficiency.
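The pinned-memory, non-blocking transfer pattern highlighted above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `transfer_adapter_info` implementation; the helper name `stage_and_copy` and the dtypes are assumptions.

```python
import torch

def stage_and_copy(values, device, dtype=torch.int32):
    """Hypothetical helper: stage metadata in pinned host memory, then issue a
    non-blocking host-to-device copy that can overlap with other work already
    enqueued on the current CUDA stream."""
    # Pinned (page-locked) host memory is required for a truly asynchronous copy.
    host_buf = torch.tensor(values, dtype=dtype, pin_memory=True)
    dev_buf = torch.empty_like(host_buf, device=device)
    # non_blocking=True enqueues the copy without forcing a host-side sync.
    dev_buf.copy_(host_buf, non_blocking=True)
    return dev_buf

# e.g. adapter metadata for the adapters used by the current batch
weight_indices = stage_and_copy([0, 0, 1], torch.device("cuda"))
lora_ranks = stage_and_copy([16, 8], torch.device("cuda"))
scalings = stage_and_copy([1.0, 0.5], torch.device("cuda"), dtype=torch.float32)
```

Kernels launched afterwards on the same stream are ordered after the copy, so no explicit synchronization is needed; only a host-side read of the device tensors would still require one.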
Changelog
- python/sglang/srt/lora/lora_manager.py
  - In `init_cuda_graph_batch_info`, `max_len` is initialized to 1 instead of 0 (L84).
  - Pre-calculation and initialization of `seg_lens` and `seg_indptr` for CUDA graphs are moved into `init_cuda_graph_batch_info`, as these values are constant across batches (L92-101).
  - A new helper function `transfer_adapter_info` is added to handle the asynchronous transfer of `weight_indices`, `lora_ranks`, and `scalings` using pinned memory and non-blocking copies (L173-210).
  - The logic in `prepare_lora_batch` for populating `weight_indices`, `lora_ranks`, and `scalings` is replaced by calls to the new `transfer_adapter_info` function for both the CUDA graph (L220-224) and non-CUDA graph paths (L237-241).
  - In the non-CUDA graph path, `max_len` is now calculated from the CPU copy of `extend_seq_lens_cpu` to avoid a device-to-host transfer (L249-254).
- python/sglang/srt/lora/mem_pool.py
  - The `get_available_buffer_slot` helper function is modified to return only the `buffer_id` (L135, L141).
  - The logic for popping the evicted LoRA UID from `self.uid_to_buffer_id` is moved inside the `get_available_buffer_slot` function itself (L140).
  - The main loop calling `get_available_buffer_slot` is updated to expect only a single return value (L149).
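The simplified allocation flow described for `mem_pool.py` could look roughly like this. It is a sketch of the described behavior, not the actual implementation; the class name, `free_slots`, and the `active_uids` parameter are made up, while `get_available_buffer_slot` and `uid_to_buffer_id` come from the changelog above.

```python
from typing import Dict, Set

class LoRABufferSlots:
    """Sketch of a buffer-slot allocator that returns only the buffer_id and
    handles eviction bookkeeping internally."""

    def __init__(self, num_slots: int):
        self.free_slots: Set[int] = set(range(num_slots))
        self.uid_to_buffer_id: Dict[str, int] = {}

    def get_available_buffer_slot(self, active_uids: Set[str]) -> int:
        # Prefer an unused slot.
        if self.free_slots:
            return self.free_slots.pop()
        # Otherwise evict an adapter not needed by the current batch and drop
        # its mapping here, so callers only ever receive a buffer_id.
        for uid, buffer_id in self.uid_to_buffer_id.items():
            if uid not in active_uids:
                self.uid_to_buffer_id.pop(uid)
                return buffer_id
        raise RuntimeError("No free buffer slot for LoRA adapters.")
```

Because eviction is resolved inside the helper, the caller no longer needs to unpack and handle an evicted UID, which is what the single-return-value change above reflects.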
Code Review
This PR brings impressive performance improvements to LoRA inference by strategically eliminating stream synchronizations and pre-computing certain data structures. The new `transfer_adapter_info` function is a good step towards centralizing asynchronous data transfer logic, and the simplification in `mem_pool.py` enhances clarity.
My main concerns revolve around data type consistency for `lora_ranks` and `weight_indices` between the CUDA graph path and the regular path, which could impact correctness if not aligned with kernel expectations.
Given the complexity of asynchronous operations and CUDA graph interactions, I'd also recommend considering if additional unit tests are needed to specifically cover the new data transfer and pre-computation logic, even though the existing test suite might provide some coverage. The "Add unit tests" item in the PR checklist is currently unchecked.
Summary of Findings
- Data Type Inconsistency for `lora_ranks`: `lora_ranks` is `torch.int32` in `cuda_graph_batch_info` but handled as `torch.int64` in `transfer_adapter_info` and the non-CUDA graph path. This could lead to issues if CUDA kernels expect a specific type.
- Data Type Inconsistency for `weight_indices`: `weight_indices` is `torch.int32` in `cuda_graph_batch_info` and for the tensor created in `transfer_adapter_info`, but initialized as `torch.int64` in the non-CUDA graph path. This could cause problems if kernels expect `torch.int32` (a sketch follows this list).
- Unit Testing for Asynchronous Logic: The PR introduces significant changes to data handling with asynchronous operations and CUDA graph interactions. While existing tests might offer some coverage, specific unit tests for this new logic would be beneficial to ensure correctness, especially since the "Add unit tests" checklist item is unchecked. (No comment added due to review settings.)
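A minimal sketch of one way to address the two data-type findings (assumptions: the kernels expect `torch.int32`, and the constant names below are made up) is to pin the dtype in a single place and use it on both paths:

```python
import torch

# Assumed dtypes -- use whatever the LoRA kernels actually expect.
LORA_RANK_DTYPE = torch.int32
WEIGHT_INDEX_DTYPE = torch.int32

# CUDA-graph path: persistent buffers allocated once at init time.
lora_ranks_buf = torch.zeros(64, dtype=LORA_RANK_DTYPE, device="cuda")
weight_indices_buf = torch.zeros(256, dtype=WEIGHT_INDEX_DTYPE, device="cuda")

# Regular path: per-batch tensors built with the same constants, so the two
# paths cannot drift apart.
lora_ranks = torch.tensor([16, 8], dtype=LORA_RANK_DTYPE, device="cuda")
weight_indices = torch.tensor([0, 0, 1], dtype=WEIGHT_INDEX_DTYPE, device="cuda")
```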
Merge Readiness
The pull request demonstrates significant performance gains, which is excellent. However, the identified data type inconsistencies for `lora_ranks` and `weight_indices` are high-severity concerns that should be addressed to ensure correctness and prevent potential issues with CUDA kernel execution. Once these are resolved, and perhaps after considering the addition of targeted unit tests for the new asynchronous logic, the PR will be in a much stronger position for merging. As a reviewer, I am not authorized to approve pull requests, but I recommend addressing these points before this WIP PR is finalized and merged.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
LGTM
Great work! LGTM
This PR improves LoRA inference performance by eliminating stream syncs and redundant computations in LoRAManager (see Modifications below).
These changes help us achieve our goal of making the LoRA batch initialization process free of CUDA syncs, as outlined in #6961 .
Benchmark results show that this PR, combined with my previous PR (#6960), reduces TTFT (P50) by 31.4% and ITL (P50) by 34.0%.
Benchmark Result
Benchmark Script
Before (see the huge bubble caused by cudaStreamSynchronize): [profiler trace screenshot]
After (all cudaStreamSynchronize calls are removed): [profiler trace screenshot]
Comparison between CUDA Graph enabled vs disabled
Modifications
(Generated by Copilot..)
LoRA Batch Processing Enhancements:
- Added a new helper function `transfer_adapter_info` in `lora_manager.py` to handle asynchronous transfer of adapter metadata (weight indices, LoRA ranks, and scalings) from the host to the CUDA device. This reduces synchronization overhead and improves performance.
- Moved the initialization of `seg_lens` and `seg_indptr` for CUDA graphs into `init_cuda_graph_batch_info`. These values are now precomputed and remain constant across batches, reducing redundant computations.
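A rough sketch of the pre-computation idea (assumptions, not the PR's code: CUDA-graph batches are treated as decode batches where every segment has length 1, which is what makes `seg_lens`/`seg_indptr` constant, and `max_bs` is a made-up capture size). It also shows `max_len` on the regular path being derived from a host-side list, per the changelog note about avoiding a device-to-host transfer:

```python
import torch

device = "cuda"
max_bs = 160  # hypothetical maximum CUDA-graph batch size

# Built once at init time: one token per sequence during decode, so segment
# lengths are all 1 and never change across captured batches.
seg_lens = torch.ones(max_bs, dtype=torch.int32, device=device)
seg_indptr = torch.zeros(max_bs + 1, dtype=torch.int32, device=device)
seg_indptr[1:] = torch.cumsum(seg_lens, dim=0)  # [0, 1, 2, ..., max_bs]

# Regular (non-CUDA-graph) path: take max_len from the CPU-side copy of the
# sequence lengths instead of calling .max().item() on a GPU tensor, which
# would force a device-to-host sync.
extend_seq_lens_cpu = [5, 12, 7]  # hypothetical per-request lengths on the host
max_len = max(extend_seq_lens_cpu) if extend_seq_lens_cpu else 1
```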
Memory Management Simplifications:
- Simplified the `get_available_buffer_slot` method in `mem_pool.py` by removing the unnecessary return of evicted LoRA UIDs. Eviction logic is now handled internally, improving code clarity.

Checklist