
Conversation

@lifuhuang (Collaborator) commented on Jun 9, 2025

This PR improves LoRA inference performance by:

  • Eliminating unnecessary CUDA stream synchronizations
  • Reducing redundant computations

These changes help us achieve our goal of making the LoRA batch initialization process free of CUDA syncs, as outlined in #6961.

Benchmark results show that this PR, combined with my previous PR (#6960), reduces TTFT (P50) by 31.4% and ITL (P50) by 34.0%.

Benchmark Result

Benchmark Script

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache --lora-paths lora=algoprog/fact-generation-llama-3.1-8b-instruct-lora
python3 -m sglang.bench_serving --backend sglang  --num-prompt 480 --request-rate 8 --lora-name lora
| Metric | Baseline | #6960 (merged) | This PR (including #6960) |
| --- | --- | --- | --- |
| ITL@P95 | 78.42 ms | 68.24 ms (-13.0%) | 52.51 ms (-33.0%) |
| ITL@P50 | 34.36 ms | 32.85 ms (-4.4%) | 22.68 ms (-34.0%) |
| TTFT@P50 | 91.37 ms | 85.52 ms (-6.5%) | 62.65 ms (-31.4%) |

Before (note the large bubble caused by cudaStreamSynchronize): [profiler trace image]

After (all cudaStreamSynchronize calls are removed): [profiler trace image]

Comparison between CUDA Graph enabled vs. disabled

| | Baseline | #6861 (merged) | This PR |
| --- | --- | --- | --- |
| CUDA Graph Enabled | [trace image] | [trace image] | [trace image] |
| CUDA Graph Disabled | [trace image] | [trace image] | [trace image] |

Modifications

(Generated by Copilot.)

LoRA Batch Processing Enhancements:

  • Added a new helper function transfer_adapter_info in lora_manager.py to handle asynchronous transfer of adapter metadata (weight indices, LoRA ranks, and scalings) from the host to the CUDA device. This reduces synchronization overhead and improves performance. A sketch of this transfer pattern follows this list.
  • Improved the initialization of seg_lens and seg_indptr for CUDA graphs in init_cuda_graph_batch_info. These values are now precomputed and remain constant across batches, reducing redundant computations.
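
A minimal sketch of the pinned-memory, non-blocking transfer pattern described in the first bullet. The standalone function, names, and dtypes here are illustrative rather than the actual transfer_adapter_info implementation:

```python
import torch

def transfer_adapter_info_sketch(weight_indices, lora_ranks, scalings, device="cuda"):
    """Stage adapter metadata in pinned host memory, then copy it to the GPU
    with non_blocking=True so the host never waits on the stream."""
    # Pinned (page-locked) CPU tensors let the copy engine overlap the
    # host-to-device transfer with GPU work already in flight.
    weight_indices_host = torch.tensor(weight_indices, dtype=torch.int32, pin_memory=True)
    lora_ranks_host = torch.tensor(lora_ranks, dtype=torch.int32, pin_memory=True)
    scalings_host = torch.tensor(scalings, dtype=torch.float32, pin_memory=True)

    # Non-blocking copies are enqueued on the current stream and return
    # immediately, avoiding the cudaStreamSynchronize bubble shown in the
    # "Before" trace above.
    return (
        weight_indices_host.to(device, non_blocking=True),
        lora_ranks_host.to(device, non_blocking=True),
        scalings_host.to(device, non_blocking=True),
    )
```

The same pattern applies to any per-batch metadata that would otherwise be constructed on the device or copied synchronously.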

Memory Management Simplifications:

  • Simplified the get_available_buffer_slot method in mem_pool.py by removing the unnecessary return of evicted LoRA UIDs. Eviction logic is now handled internally, improving code clarity.
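
A hypothetical sketch of the simplified allocation pattern (class, field, and method names are illustrative, not the actual mem_pool.py code): the helper performs the eviction bookkeeping itself and returns only a buffer id.

```python
class LoRAMemPoolSketch:
    """Illustrative only: eviction is handled inside the slot helper."""

    def __init__(self, num_slots):
        self.free_slots = list(range(num_slots))
        self.uid_to_buffer_id = {}  # active adapter uid -> buffer slot index
        self.cur_uids = set()       # adapters required by the current batch

    def get_available_buffer_slot(self):
        # Prefer an unused slot if one exists.
        if self.free_slots:
            return self.free_slots.pop()
        # Otherwise evict an adapter the current batch does not need and reuse
        # its slot; the uid bookkeeping stays inside this helper, so the
        # caller only ever sees a buffer id.
        for uid in list(self.uid_to_buffer_id):
            if uid not in self.cur_uids:
                return self.uid_to_buffer_id.pop(uid)
        raise RuntimeError("No available buffer slot for LoRA adapter")
```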

@gemini-code-assist (bot) left a comment:

Hello @lifuhuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request authored by @lifuhuang. This PR focuses on optimizing the performance of LoRA inference within the SGLang framework. The primary goals are to eliminate performance bottlenecks caused by CUDA stream synchronizations and reduce redundant computations in the LoRA management logic. The description highlights significant improvements in latency metrics (ITL and TTFT), showing reductions of over 30% compared to the baseline, and includes profiling traces demonstrating the successful removal of synchronization bubbles.

Highlights

  • Performance Optimization: The core objective is to significantly improve LoRA inference speed by addressing key performance inhibitors.
  • Eliminate Stream Synchronizations: A major focus is removing explicit cudaStreamSynchronize calls, which are identified as causing significant delays and bubbles in the execution timeline, as shown in the profiling results.
  • Reduce Redundant Computations: Optimizes the preparation of batch information, particularly for CUDA graphs, by pre-calculating and reusing constant values.
  • Asynchronous Data Transfer: Introduces the use of pinned memory and non-blocking copies for transferring LoRA adapter metadata (indices, ranks, scalings) from host to device, enabling better overlap of computation and data transfer.
  • Memory Pool Simplification: Refactors the LoRA memory pool's buffer slot allocation logic for improved clarity and efficiency.

Changelog

  • python/sglang/srt/lora/lora_manager.py
    • In init_cuda_graph_batch_info, max_len is initialized to 1 instead of 0 (L84).
    • Pre-calculation and initialization of seg_lens and seg_indptr for CUDA graphs are moved into init_cuda_graph_batch_info as these values are constant across batches (L92-101).
    • A new helper function transfer_adapter_info is added to handle the asynchronous transfer of weight_indices, lora_ranks, and scalings using pinned memory and non_blocking copies (L173-210).
    • The logic in prepare_lora_batch for populating weight_indices, lora_ranks, and scalings is replaced by calls to the new transfer_adapter_info function for both CUDA graph (L220-224) and non-CUDA graph paths (L237-241).
    • In the non-CUDA graph path, max_len is now calculated from the CPU copy of extend_seq_lens_cpu to avoid a device-to-host transfer (L249-254); this is illustrated in the sketch after this changelog.
  • python/sglang/srt/lora/mem_pool.py
    • The get_available_buffer_slot helper function is modified to return only the buffer_id (L135, L141).
    • The logic for popping the evicted LoRA UID from self.uid_to_buffer_id is moved inside the get_available_buffer_slot function itself (L140).
    • The main loop calling get_available_buffer_slot is updated to only expect a single return value (L149).
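
A sketch of the precomputation changes listed for lora_manager.py, assuming CUDA-graph batches are decode batches with one token per sequence so the segment layout never changes between batches (variable names are illustrative):

```python
import torch

max_bs = 8
device = "cuda" if torch.cuda.is_available() else "cpu"

# seg_lens is constant (all ones) for decode, so it can be filled once in
# init_cuda_graph_batch_info instead of being recomputed for every batch.
seg_lens = torch.ones(max_bs, dtype=torch.int32, device=device)
# seg_indptr is the prefix sum of seg_lens, also constant: [0, 1, ..., max_bs].
seg_indptr = torch.zeros(max_bs + 1, dtype=torch.int32, device=device)
seg_indptr[1:] = torch.cumsum(seg_lens, dim=0)

# For the non-CUDA-graph path, max_len can come from a host-side list of
# extend lengths, avoiding a .max().item() on a CUDA tensor (which syncs).
extend_seq_lens_cpu = [5, 12, 7, 9]  # hypothetical host-side copy
max_len = max(extend_seq_lens_cpu) if extend_seq_lens_cpu else 1
```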

@gemini-code-assist (bot) left a comment:

Code Review

This PR brings impressive performance improvements to LoRA inference by strategically eliminating stream synchronizations and pre-computing certain data structures. The new transfer_adapter_info function is a good step towards centralizing asynchronous data transfer logic, and the simplification in mem_pool.py enhances clarity.

My main concerns revolve around data type consistency for lora_ranks and weight_indices between the CUDA graph path and the regular path, which could impact correctness if not aligned with kernel expectations.

Given the complexity of asynchronous operations and CUDA graph interactions, I'd also recommend considering if additional unit tests are needed to specifically cover the new data transfer and pre-computation logic, even though the existing test suite might provide some coverage. The "Add unit tests" item in the PR checklist is currently unchecked.

Summary of Findings

  • Data Type Inconsistency for lora_ranks: lora_ranks is torch.int32 in cuda_graph_batch_info but handled as torch.int64 in transfer_adapter_info and the non-CUDA graph path. This could lead to issues if CUDA kernels expect a specific type.
  • Data Type Inconsistency for weight_indices: weight_indices is torch.int32 in cuda_graph_batch_info and for the tensor created in transfer_adapter_info, but initialized as torch.int64 in the non-CUDA graph path. This could cause problems if kernels expect torch.int32. (A dtype-alignment sketch follows this list.)
  • Unit Testing for Asynchronous Logic: The PR introduces significant changes to data handling with asynchronous operations and CUDA graph interactions. While existing tests might offer some coverage, specific unit tests for this new logic would be beneficial to ensure correctness, especially since the 'Add unit tests' checklist item is unchecked. (No comment added due to review settings).
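
A minimal illustration of the dtype alignment suggested by the first two findings. torch.int32 is assumed here; whichever dtype the LoRA kernels actually expect should be used consistently on both paths.

```python
import torch

RANK_DTYPE = torch.int32  # assumed kernel-facing dtype; pick one and keep it everywhere

lora_ranks_graph = torch.zeros(8, dtype=RANK_DTYPE)    # CUDA-graph path
lora_ranks_regular = torch.zeros(8, dtype=RANK_DTYPE)  # regular path (not int64)
assert lora_ranks_graph.dtype == lora_ranks_regular.dtype == torch.int32
```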

Merge Readiness

The pull request demonstrates significant performance gains, which is excellent. However, the identified data type inconsistencies for lora_ranks and weight_indices are high-severity concerns that should be addressed to ensure correctness and prevent potential issues with CUDA kernel execution. Once these are resolved, and perhaps after considering the addition of targeted unit tests for the new asynchronous logic, the PR will be in a much stronger position for merging. As a reviewer, I am not authorized to approve pull requests, but I recommend addressing these points before this WIP PR is finalized and merged.

@lifuhuang changed the title from "[Perf][WIP] Refactor LoRAManager to eliminate stream syncs and redundant computations" to "[Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations" on Jun 11, 2025
@Fridge003 (Collaborator) left a comment:

LGTM

@Qiaolin-Yu (Collaborator) left a comment:

Great work! LGTM

@Fridge003 merged commit 021f76e into main on Jun 11, 2025
115 of 138 checks passed
@Fridge003 deleted the lifuhuang/lora-param branch on June 11, 2025 at 23:18
jianan-gu pushed a commit to jianan-gu/sglang that referenced this pull request Jun 12, 2025
@lifuhuang mentioned this pull request on Jun 16, 2025