Conversation

@sushah13 commented Jun 5, 2025

Motivation

This PR introduces performance optimizations to two key components of the SGLang serving stack: radix_cache.py and schedule_batch.py. The primary goal is to reduce latency and improve throughput by eliminating inefficient data structures and unnecessary computations during request processing and KV cache eviction.

Modifications

Optimizations in radix_cache.py

1. Reduced latency via optimized eviction logic
Before: Average latency over 12 requests: 0.577s
After: Average latency over 12 requests: 0.495s

Evict Method Improvements (~60% speedup, from 31ms → 12ms):

  • Avoid rebuilding the heap with heapq.heapify() on every eviction

    • Previously, the eviction heap was rebuilt from scratch each time (O(n) cost).
    • ✅ Now: Maintain a persistent min-heap (self.evict_heap) and incrementally update it during insertions, lock/unlock updates, and deletions.
  • Avoid full-tree traversal in _collect_leaves()

    • Tree traversal for collecting evictable leaves was costly and unnecessary on each eviction.
    • ✅ Now: The persistent heap contains all evictable leaves, eliminating the need for traversal.
  • Avoid repeated heap rebalancing

    • Each heapq.heappop() call triggered rebalancing.
    • ✅ Now: evict() loop pops only unlocked and valid nodes using a guard, minimizing operations.

Implementation Details (a code sketch follows this list):

  • Added self.evict_heap: List[Tuple[int, TreeNode]] to RadixCache.
  • Heap is updated incrementally:
    • insert(): Add leaf if evictable.
    • inc_lock_ref(): Remove node from heap if it becomes locked.
    • dec_lock_ref(): Re-add node if it becomes unlockable.
    • _delete_leaf(): No-op for heap; eviction logic handles parent-child promotion.
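
A minimal sketch of this scheme, assuming the PR's names (evict_heap, child_key, lock_ref, last_access_time); the class RadixCacheSketch and the _free_kv stub are illustrative stand-ins (the real code frees through token_to_kv_pool_allocator), and for brevity the heap holds nodes directly, ordered via __lt__, rather than the (time, node) tuples described above:

    import heapq
    import time
    from typing import Dict, List, Optional

    class TreeNode:
        def __init__(self) -> None:
            self.children: Dict[int, "TreeNode"] = {}
            self.parent: Optional["TreeNode"] = None
            self.key: List[int] = []
            self.value: Optional[List[int]] = None   # KV indices; None once evicted
            self.child_key: Optional[int] = None     # key under which we sit in parent.children
            self.lock_ref = 0
            self.last_access_time = time.monotonic()

        def __lt__(self, other: "TreeNode") -> bool:
            # heapq orders eviction candidates LRU-first.
            return self.last_access_time < other.last_access_time

    class RadixCacheSketch:
        def __init__(self) -> None:
            self.root_node = TreeNode()
            self.evict_heap: List[TreeNode] = []  # persistent min-heap of candidates
            self.evictable_size_ = 0

        def _free_kv(self, value: List[int]) -> None:
            pass  # stand-in for token_to_kv_pool_allocator.free(value)

        def _delete_leaf(self, node: TreeNode) -> None:
            # O(1) removal via the stored child_key (see section 2 below).
            del node.parent.children[node.child_key]

        def evict(self, num_tokens: int) -> None:
            num_evicted = 0
            while self.evict_heap and num_evicted < num_tokens:
                node = heapq.heappop(self.evict_heap)
                # Guard against stale entries: the node may have been locked,
                # evicted, or given children since it was pushed.
                if node.lock_ref > 0 or node.value is None or node.children:
                    continue
                self._free_kv(node.value)
                num_evicted += len(node.value)
                self.evictable_size_ -= len(node.value)
                node.value = None
                self._delete_leaf(node)
                # Parent promotion: a parent that just became an evictable
                # leaf enters the candidate heap.
                parent = node.parent
                if parent is not self.root_node and not parent.children and parent.lock_ref == 0:
                    heapq.heappush(self.evict_heap, parent)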

2. Faster leaf deletion via key tracking
Previously, _delete_leaf() required a linear search in the parent’s children dict to find the key corresponding to the child node.
✅ Now: Each TreeNode stores its child_key upon insertion, eliminating the need to search during deletion.
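
For illustration, the before/after lookup in isolation (free functions with hypothetical names; the PR's actual method is _delete_leaf):

    def delete_leaf_old(node):
        # O(k) in the number of siblings: scan the dict to recover the key.
        for k, child in node.parent.children.items():
            if child is node:
                break
        del node.parent.children[k]

    def delete_leaf_new(node):
        # O(1): the key was recorded on the node at insertion time.
        del node.parent.children[node.child_key]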

Optimizations in schedule_batch.py

Before: Average latency over 12 requests: 0.495s
After: Average latency over 12 requests: 0.475s

Improvements (a code sketch follows this list):

  1. Flattening input IDs efficiently
    ❌ Before: sum(input_ids, []) (O(n²) operation due to repeated list copying).
    ✅ Now: list(chain.from_iterable(input_ids)) for linear-time flattening.

  2. Avoid redundant tensor construction
    ❌ Before: torch.tensor(seq_lens, ...) may create unnecessary copies.
    ✅ Now: torch.as_tensor(seq_lens, dtype=torch.int64, device=self.device) avoids extra copy if input is already a tensor-like object.
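
A self-contained illustration of both micro-optimizations (toy inputs; the device argument used in schedule_batch.py is omitted so the snippet runs on CPU):

    from itertools import chain
    import torch

    input_ids = [[1, 2, 3], [4, 5], [6]]
    seq_lens = [3, 2, 1]

    # O(n^2): sum() re-copies the accumulated list on every concatenation.
    flat_slow = sum(input_ids, [])
    # O(n): chain.from_iterable walks each sublist exactly once.
    flat_fast = list(chain.from_iterable(input_ids))
    assert flat_slow == flat_fast

    # torch.tensor always copies; torch.as_tensor returns the input
    # unchanged when it is already a tensor with matching dtype/device.
    seq_lens_tensor = torch.as_tensor(seq_lens, dtype=torch.int64)
    same = torch.as_tensor(seq_lens_tensor, dtype=torch.int64)
    assert same.data_ptr() == seq_lens_tensor.data_ptr()  # no copy made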

These optimizations contribute to overall reduced latency and improved batch throughput, especially under high request volume and large prompt size settings.

Profiling Screenshots:
Radix Cache:
BEFORE: [Radix Cache profiler screenshot]
AFTER: [Radix Cache profiler screenshot]

Schedule Batch Logic:
BEFORE: [Schedule Batch profiler screenshot]
AFTER: [Schedule Batch profiler screenshot]

@gemini-code-assist (bot, Contributor) left a comment:

Hello @sushah13, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request by @sushah13.

This PR focuses on introducing performance optimizations to two critical components of the SGLang serving stack: radix_cache.py and schedule_batch.py. The main goal is to reduce overall latency and improve throughput, particularly under high request loads and with large prompts.

The changes target inefficiencies in the KV cache eviction logic within the RadixCache and optimize data handling and tensor creation within the schedule_batch process. The author reports significant speedups in the evict method (~60%) and overall latency reductions after applying these changes.

Highlights

  • Radix Cache Eviction Optimization: The eviction logic in radix_cache.py has been significantly optimized. Instead of rebuilding the eviction heap (heapq.heapify) and traversing the tree (_collect_leaves) on every eviction call, a persistent min-heap (self.evict_heap) is now maintained. Nodes are added to this heap incrementally when they become evictable leaves (e.g., when lock references drop to zero or during insertion). The evict method now pops from this persistent heap, skipping locked or already-evicted nodes, leading to substantial performance gains.
  • Faster Radix Cache Leaf Deletion: To speed up the _delete_leaf operation in radix_cache.py, each TreeNode now stores its child_key. This eliminates the need for a linear search through the parent's children dictionary to find the key corresponding to the node being deleted, allowing for direct deletion.
  • Schedule Batch Input Flattening Optimization: In schedule_batch.py, the inefficient sum(input_ids, []) method for flattening a list of lists of input IDs has been replaced with list(chain.from_iterable(input_ids)). This changes the operation from an O(n²) complexity (due to repeated list copying) to a more efficient linear-time operation.
  • Schedule Batch Tensor Creation Optimization: The creation of seq_lens_tensor and prefix_lens_tensor in schedule_batch.py now uses torch.as_tensor instead of torch.tensor. This can avoid unnecessary data copies if the input seq_lens or prefix_lens are already tensor-like objects, potentially offering minor performance improvements.

Changelog

  • python/sglang/srt/managers/schedule_batch.py
    • Import chain from itertools (line 41).
    • Replace torch.tensor(sum(input_ids, [])...) with torch.tensor(list(chain.from_iterable(input_ids))...) for efficient input ID flattening (lines 1139-1141).
    • Replace torch.tensor(seq_lens...) with torch.as_tensor(seq_lens...) (line 1142).
    • Replace torch.tensor(prefix_lens...) with torch.as_tensor(prefix_lens...) (lines 1145-1146).
  • python/sglang/srt/mem_cache/radix_cache.py
    • Add self.child_key = None attribute to TreeNode (line 55).
    • Add self.evict_heap = [] attribute to RadixCache (line 114).
    • Clear self.evict_heap in the reset method (line 139).
    • Rewrite the evict method to use the persistent self.evict_heap, removing the need for _collect_leaves and heapq.heapify on each call. The loop now pops from the heap and checks node validity/locks (lines 270-292).
    • Push the node to self.evict_heap in dec_lock_ref if it becomes an evictable leaf (lines 318-319).
    • Assign child_key to the new node and the split child node in _split_node (lines 378, 382).
    • Use a single now variable for updating last_access_time in _insert_helper (lines 393, 403).
    • Assign child_key and push the new node to self.evict_heap in _insert_helper if it is an evictable leaf (lines 421, 425-426).
    • Modify _delete_leaf to use the stored node.child_key for direct deletion from the parent's children dictionary, removing the linear search (lines 448-449).

@gemini-code-assist (bot, Contributor) left a comment:

Code Review

This pull request introduces several well-motivated performance optimizations in radix_cache.py and schedule_batch.py. The changes in schedule_batch.py for list flattening and tensor creation are clear improvements. The modifications to radix_cache.py, particularly the eviction logic and leaf deletion, are more substantial and aim to significantly reduce latency.

Overall, the optimizations are promising. However, there's a potential issue with how evictable_size_ is updated in _delete_leaf, which could impact cache size accounting. Additionally, given the significant changes to the caching logic, adding unit tests for RadixCache would be highly beneficial to ensure correctness and prevent regressions. Updating documentation for the new members and modified logic in RadixCache would also improve maintainability.

Summary of Findings

  • Cache Size Accounting: There's a potential issue in radix_cache.py where evictable_size_ might be incorrectly updated during node eviction. Specifically, in _delete_leaf, it's decremented by len(node.key), while it should reflect changes in len(node.value). This should likely be handled in the evict method by decrementing evictable_size_ by node_value_len after the node's value is freed.
  • Unit Testing: The changes to RadixCache involve significant modifications to core caching logic (eviction, node splitting, deletion). It's highly recommended to add comprehensive unit tests for these functionalities to ensure correctness, cover edge cases, and prevent future regressions.
  • Documentation: Consider updating docstrings for the modified methods in RadixCache (e.g., evict, _split_node, _delete_leaf) and for new class members like evict_heap and child_key to reflect the new logic and their roles. This will help with code clarity and maintainability.
  • Redundant Code: In python/sglang/srt/mem_cache/radix_cache.py, within the _split_node method, there's a redundant assignment child.parent = new_node at line 384 (RIGHT side). This was already set at line 380. This is a minor issue with no functional impact but could be cleaned up.

Merge Readiness

The pull request introduces valuable performance optimizations. However, there is a high-severity concern regarding the evictable_size_ accounting in radix_cache.py that needs to be addressed. Additionally, due to the critical nature of caching logic, adding unit tests for the RadixCache modifications is strongly recommended before merging. Updating relevant documentation would also be beneficial.

I am unable to approve this pull request directly. Please ensure these points are considered and have other reviewers approve the changes before merging.

@xiezhq-hermann (Collaborator) commented:

Thank you for this PR; this is indeed something we wanted in order to accelerate eviction. Would you mind making the same changes to hiradixtree as well? It's of course ok if you don't have the bandwidth; then I can incorporate it on that side.

@sushah13 (Author) commented:

> Thank you for this PR; this is indeed something we wanted in order to accelerate eviction. Would you mind making the same changes to hiradixtree as well? It's of course ok if you don't have the bandwidth; then I can incorporate it on that side.

Sure thing - let me update the PR tonight with changes in Hiradix cache as well. Thank you!

@sushah13 (Author) commented:

> Thank you for this PR; this is indeed something we wanted in order to accelerate eviction. Would you mind making the same changes to hiradixtree as well? It's of course ok if you don't have the bandwidth; then I can incorporate it on that side.

Let me do the hiradixtree optimizations in a different PR. I want to do some more detailed analysis and comparison before I publish those changes (I can do that right after this is merged). For now, I updated hiradix_cache to work with the radix_cache optimizations.

self.token_to_kv_pool_allocator.free(node.value)
node.value = None # Mark node as evicted
# Update evictable size here since value is freed;
# do not update again in _delete_leaf()
Collaborator:

No need to update node.value to None as it is to be deleted

Author:

Agreed. If the node is immediately deleted and garbage-collected, setting node.value = None is redundant. However, if _delete_leaf() doesn’t fully nullify the node (or deletion is conditional), it may be safer to clear explicitly to avoid stale references.

node.value = None # Mark node as evicted
# Update evictable size here since value is freed;
# do not update again in _delete_leaf()
self.evictable_size_ -= node_value_len
Collaborator:

I think we can still keep this in _delete_leaf, which has clearer semantics.

Author:

I would disagree. The evictable_size_ tracks memory currently freeable. Adjusting it before _delete_leaf() prevents double-decrement bugs if _delete_leaf() also triggers eviction logic or tracking.

I would suggest leaving the decrement where it is (immediately after freeing). Moving it to _delete_leaf() risks coupling unrelated concerns (memory vs. tree structure).
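
For context, a fragment sketching the ordering the author argues for inside evict() (names follow the diff hunks above; this is not the final merged code):

    # Inside evict(), after a node is popped and validated:
    node_value_len = len(node.value)
    self.token_to_kv_pool_allocator.free(node.value)
    node.value = None  # mark node as evicted
    # Settle the accounting here, where the memory is actually freed;
    # _delete_leaf() then only mutates tree structure and must not
    # decrement evictable_size_ a second time.
    self.evictable_size_ -= node_value_len
    self._delete_leaf(node)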

@@ -82,6 +82,9 @@ def get_height(self, node: TreeNode):
return height

def write_backup(self, node: TreeNode, write_back=False):
# Nothing to copy if the node is already evicted.
if node.value is None:
Collaborator:

I think there is no need for this.

Author:

This check is necessary to prevent redundant or incorrect backup logic:

  • If node.value is None, it’s already evicted. There's literally nothing on device to copy to host.
  • Continuing without this check would likely lead to:
    • Unnecessary allocation on host
    • Potential crash in self.cache_controller.write() expecting valid device indices

node.value = None # mark as evicted

# Remove the node only if it has no (even evicted) children
if len(node.children) == 0:
Collaborator:

This should be determined already.

Author:

I agree, conditionally: if evict() guarantees a node is a leaf before calling _delete_leaf, the check is redundant. But if eviction can happen preemptively on non-leaf nodes, the check is necessary. I am not confident that evict() ensures leaf-ness, so I would prefer keeping the check.

@xiezhq-hermann (Collaborator) commented:

@sushah13 thanks again for the PR. I think we should revert the hicache change for now; the rest, on the base radix_cache, looks good to me.

@sushah13 (Author) commented:

> @sushah13 thanks again for the PR. I think we should revert the hicache change for now; the rest, on the base radix_cache, looks good to me.

The hiradix_cache changes aren't optimizations; they complement the radix_cache changes so that functionality doesn't break.

if x.lock_ref > 0:
while self.evict_heap and num_evicted < num_tokens:
node = heapq.heappop(self.evict_heap)
if node.lock_ref > 0 or node == self.root_node or node.evicted:
@hanming-lu (Collaborator) commented Jul 23, 2025:

Suggest changing this to:

if node.lock_ref > 0:
  continue
assert node != self.root_node

node.evicted is undefined.

Collaborator:

After looking at the logic more, I think you are referring to node.value is None here instead of node.evicted.

@hanming-lu (Collaborator) commented Jul 23, 2025:

IIUC, there's an implicit assumption here that a child's last_access_time is always older than its parent's, because the heap also contains internal nodes. That assumption doesn't seem to hold. The loop should also check whether the node is a leaf and, if not, continue.
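
Combining the three review points (lock guard, node.value is None in place of the undefined node.evicted, and an explicit leaf check), the pop loop would look roughly like this sketch:

    while self.evict_heap and num_evicted < num_tokens:
        node = heapq.heappop(self.evict_heap)
        if node.lock_ref > 0:
            continue
        assert node != self.root_node
        if node.value is None:      # stale entry: already evicted
            continue
        if len(node.children) > 0:  # internal node: only leaves are evictable
            continue
        # ... free node.value and delete the leaf as before ...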

break
del node.parent.children[k]
self.evictable_size_ -= len(node.key)
if node.child_key in node.parent.children:
Collaborator:

This if is not needed; instead it should be an assert, after properly checking node.value is None.

@hanming-lu (Collaborator) commented Jul 24, 2025:

Could you please provide details on how you evaluated the performance gain for the radix cache, so we can repro? Thanks.

@merrymercy (Contributor) commented Aug 3, 2025:

Thanks for the contribution, but we will close this one. Feel free to reopen if you think it is still necessary.
