[Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations #6994
Conversation
Hello @lifuhuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request authored by @lifuhuang. This PR focuses on optimizing the performance of LoRA inference within the SGLang framework. The primary goals are to eliminate performance bottlenecks caused by CUDA stream synchronizations and reduce redundant computations in the LoRA management logic. The description highlights significant improvements in latency metrics (ITL and TTFT), showing reductions of over 30% compared to the baseline, and includes profiling traces demonstrating the successful removal of synchronization bubbles.
Highlights
- Performance Optimization: The core objective is to significantly improve LoRA inference speed by addressing key performance inhibitors.
- Eliminate Stream Synchronizations: A major focus is removing explicit `cudaStreamSynchronize` calls, which are identified as causing significant delays and bubbles in the execution timeline, as shown in the profiling results.
- Reduce Redundant Computations: Optimizes the preparation of batch information, particularly for CUDA graphs, by pre-calculating and reusing constant values.
- Asynchronous Data Transfer: Introduces the use of pinned memory and non-blocking copies for transferring LoRA adapter metadata (indices, ranks, scalings) from host to device, enabling better overlap of computation and data transfer (see the sketch after this list).
- Memory Pool Simplification: Refactors the LoRA memory pool's buffer slot allocation logic for improved clarity and efficiency.
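The pinned-memory, non-blocking transfer pattern highlighted above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `transfer_adapter_info` implementation; the helper name `stage_and_copy` and the dtypes are assumptions.

```python
import torch

def stage_and_copy(values, device, dtype=torch.int32):
    """Hypothetical helper: stage metadata in pinned host memory, then issue a
    non-blocking host-to-device copy that can overlap with other work already
    enqueued on the current CUDA stream."""
    # Pinned (page-locked) host memory is required for a truly asynchronous copy.
    host_buf = torch.tensor(values, dtype=dtype, pin_memory=True)
    dev_buf = torch.empty_like(host_buf, device=device)
    # non_blocking=True enqueues the copy without forcing a host-side sync.
    dev_buf.copy_(host_buf, non_blocking=True)
    return dev_buf

# e.g. adapter metadata for the adapters used by the current batch
weight_indices = stage_and_copy([0, 0, 1], torch.device("cuda"))
lora_ranks = stage_and_copy([16, 8], torch.device("cuda"))
scalings = stage_and_copy([1.0, 0.5], torch.device("cuda"), dtype=torch.float32)
```

Kernels launched afterwards on the same stream are ordered after the copy, so no explicit synchronization is needed; only a host-side read of the device tensors would still require one.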
Changelog
- python/sglang/srt/lora/lora_manager.py
  - In `init_cuda_graph_batch_info`, `max_len` is initialized to 1 instead of 0 (L84).
  - Pre-calculation and initialization of `seg_lens` and `seg_indptr` for CUDA graphs are moved into `init_cuda_graph_batch_info`, as these values are constant across batches (L92-101).
  - A new helper function `transfer_adapter_info` is added to handle the asynchronous transfer of `weight_indices`, `lora_ranks`, and `scalings` using pinned memory and non-blocking copies (L173-210).
  - The logic in `prepare_lora_batch` for populating `weight_indices`, `lora_ranks`, and `scalings` is replaced by calls to the new `transfer_adapter_info` function for both the CUDA graph (L220-224) and non-CUDA graph paths (L237-241).
  - In the non-CUDA graph path, `max_len` is now calculated from the CPU copy of `extend_seq_lens_cpu` to avoid a device-to-host transfer (L249-254).
- python/sglang/srt/lora/mem_pool.py
  - The `get_available_buffer_slot` helper function is modified to return only the `buffer_id` (L135, L141).
  - The logic for popping the evicted LoRA UID from `self.uid_to_buffer_id` is moved inside the `get_available_buffer_slot` function itself (L140).
  - The main loop calling `get_available_buffer_slot` is updated to expect only a single return value (L149).
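The simplified allocation flow described for `mem_pool.py` could look roughly like this. It is a sketch of the described behavior, not the actual implementation; the class name, `free_slots`, and the `active_uids` parameter are made up, while `get_available_buffer_slot` and `uid_to_buffer_id` come from the changelog above.

```python
from typing import Dict, Set

class LoRABufferSlots:
    """Sketch of a buffer-slot allocator that returns only the buffer_id and
    handles eviction bookkeeping internally."""

    def __init__(self, num_slots: int):
        self.free_slots: Set[int] = set(range(num_slots))
        self.uid_to_buffer_id: Dict[str, int] = {}

    def get_available_buffer_slot(self, active_uids: Set[str]) -> int:
        # Prefer an unused slot.
        if self.free_slots:
            return self.free_slots.pop()
        # Otherwise evict an adapter not needed by the current batch and drop
        # its mapping here, so callers only ever receive a buffer_id.
        for uid, buffer_id in self.uid_to_buffer_id.items():
            if uid not in active_uids:
                self.uid_to_buffer_id.pop(uid)
                return buffer_id
        raise RuntimeError("No free buffer slot for LoRA adapters.")
```

Because eviction is resolved inside the helper, the caller no longer needs to unpack and handle an evicted UID, which is what the single-return-value change above reflects.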
Code Review
This PR brings impressive performance improvements to LoRA inference by strategically eliminating stream synchronizations and pre-computing certain data structures. The new `transfer_adapter_info` function is a good step towards centralizing asynchronous data transfer logic, and the simplification in `mem_pool.py` enhances clarity.
My main concerns revolve around data type consistency for `lora_ranks` and `weight_indices` between the CUDA graph path and the regular path, which could impact correctness if not aligned with kernel expectations.
Given the complexity of asynchronous operations and CUDA graph interactions, I'd also recommend considering if additional unit tests are needed to specifically cover the new data transfer and pre-computation logic, even though the existing test suite might provide some coverage. The "Add unit tests" item in the PR checklist is currently unchecked.
Summary of Findings
- Data Type Inconsistency for `lora_ranks`: `lora_ranks` is `torch.int32` in `cuda_graph_batch_info` but handled as `torch.int64` in `transfer_adapter_info` and the non-CUDA graph path. This could lead to issues if CUDA kernels expect a specific type.
- Data Type Inconsistency for `weight_indices`: `weight_indices` is `torch.int32` in `cuda_graph_batch_info` and for the tensor created in `transfer_adapter_info`, but initialized as `torch.int64` in the non-CUDA graph path. This could cause problems if kernels expect `torch.int32` (a sketch follows this list).
- Unit Testing for Asynchronous Logic: The PR introduces significant changes to data handling with asynchronous operations and CUDA graph interactions. While existing tests might offer some coverage, specific unit tests for this new logic would be beneficial to ensure correctness, especially since the "Add unit tests" checklist item is unchecked. (No comment added due to review settings.)
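A minimal sketch of one way to address the two data-type findings (assumptions: the kernels expect `torch.int32`, and the constant names below are made up) is to pin the dtype in a single place and use it on both paths:

```python
import torch

# Assumed dtypes -- use whatever the LoRA kernels actually expect.
LORA_RANK_DTYPE = torch.int32
WEIGHT_INDEX_DTYPE = torch.int32

# CUDA-graph path: persistent buffers allocated once at init time.
lora_ranks_buf = torch.zeros(64, dtype=LORA_RANK_DTYPE, device="cuda")
weight_indices_buf = torch.zeros(256, dtype=WEIGHT_INDEX_DTYPE, device="cuda")

# Regular path: per-batch tensors built with the same constants, so the two
# paths cannot drift apart.
lora_ranks = torch.tensor([16, 8], dtype=LORA_RANK_DTYPE, device="cuda")
weight_indices = torch.tensor([0, 0, 1], dtype=WEIGHT_INDEX_DTYPE, device="cuda")
```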
Merge Readiness
The pull request demonstrates significant performance gains, which is excellent. However, the identified data type inconsistencies for `lora_ranks` and `weight_indices` are high-severity concerns that should be addressed to ensure correctness and prevent potential issues with CUDA kernel execution. Once these are resolved, and perhaps after considering the addition of targeted unit tests for the new asynchronous logic, the PR will be in a much stronger position for merging. As a reviewer, I am not authorized to approve pull requests, but I recommend addressing these points before this WIP PR is finalized and merged.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
LGTM
Great work! LGTM
This PR improves LoRA inference performance by eliminating stream syncs and redundant computations in LoRAManager (see Modifications below).
These changes help us achieve our goal of making the LoRA batch initialization process free of CUDA syncs, as outlined in #6961 .
Benchmark results show that this PR, combined with my previous PR (#6960), reduces TTFT (P50) by 31.4% and ITL (P50) by 34.0%.
Benchmark Result
Benchmark Script
Before (see the huge bubble caused by cudaStreamSynchronize): [profiler trace screenshot]
After (all cudaStreamSynchronize calls are removed): [profiler trace screenshot]
Comparison between CUDA Graph enabled vs disabled
Modifications
(Generated by Copilot..)
LoRA Batch Processing Enhancements:
- Added a new helper function `transfer_adapter_info` in `lora_manager.py` to handle asynchronous transfer of adapter metadata (weight indices, LoRA ranks, and scalings) from the host to the CUDA device. This reduces synchronization overhead and improves performance.
- Moved the initialization of `seg_lens` and `seg_indptr` for CUDA graphs into `init_cuda_graph_batch_info`. These values are now precomputed and remain constant across batches, reducing redundant computations.
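A rough sketch of the pre-computation idea (assumptions, not the PR's code: CUDA-graph batches are treated as decode batches where every segment has length 1, which is what makes `seg_lens`/`seg_indptr` constant, and `max_bs` is a made-up capture size). It also shows `max_len` on the regular path being derived from a host-side list, per the changelog note about avoiding a device-to-host transfer:

```python
import torch

device = "cuda"
max_bs = 160  # hypothetical maximum CUDA-graph batch size

# Built once at init time: one token per sequence during decode, so segment
# lengths are all 1 and never change across captured batches.
seg_lens = torch.ones(max_bs, dtype=torch.int32, device=device)
seg_indptr = torch.zeros(max_bs + 1, dtype=torch.int32, device=device)
seg_indptr[1:] = torch.cumsum(seg_lens, dim=0)  # [0, 1, 2, ..., max_bs]

# Regular (non-CUDA-graph) path: take max_len from the CPU-side copy of the
# sequence lengths instead of calling .max().item() on a GPU tensor, which
# would force a device-to-host sync.
extend_seq_lens_cpu = [5, 12, 7]  # hypothetical per-request lengths on the host
max_len = max(extend_seq_lens_cpu) if extend_seq_lens_cpu else 1
```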
Memory Management Simplifications:
- Simplified the `get_available_buffer_slot` method in `mem_pool.py` by removing the unnecessary return of evicted LoRA UIDs. Eviction logic is now handled internally, improving code clarity.

Checklist