Minor Optimizations in Radix Cache and Schedule Batch #6907
Conversation
Hello @sushah13, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request by @sushah13.
This PR focuses on introducing performance optimizations to two critical components of the SGLang serving stack: `radix_cache.py` and `schedule_batch.py`. The main goal is to reduce overall latency and improve throughput, particularly under high request loads and with large prompts. The changes target inefficiencies in the KV cache eviction logic within the `RadixCache` and optimize data handling and tensor creation within the `schedule_batch` process. The author reports significant speedups in the `evict` method (~60%) and overall latency reductions after applying these changes.
Highlights
- **Radix Cache Eviction Optimization:** The eviction logic in `radix_cache.py` has been significantly optimized. Instead of rebuilding the eviction heap (`heapq.heapify`) and traversing the tree (`_collect_leaves`) on every eviction call, a persistent min-heap (`self.evict_heap`) is now maintained. Nodes are added to this heap incrementally when they become evictable leaves (e.g., when lock references drop to zero, or during insertion). The `evict` method now pops from this persistent heap, skipping locked or already-evicted nodes, leading to substantial performance gains.
- **Faster Radix Cache Leaf Deletion:** To speed up the `_delete_leaf` operation in `radix_cache.py`, each `TreeNode` now stores its `child_key`. This eliminates the need for a linear search through the parent's children dictionary to find the key corresponding to the node being deleted, allowing for direct deletion.
- **Schedule Batch Input Flattening Optimization:** In `schedule_batch.py`, the inefficient `sum(input_ids, [])` method for flattening a list of lists of input IDs has been replaced with `list(chain.from_iterable(input_ids))`. This changes the operation from O(n²) complexity (due to repeated list copying) to linear time.
- **Schedule Batch Tensor Creation Optimization:** The creation of `seq_lens_tensor` and `prefix_lens_tensor` in `schedule_batch.py` now uses `torch.as_tensor` instead of `torch.tensor`. This can avoid unnecessary data copies when the inputs `seq_lens` or `prefix_lens` are already tensor-like objects, potentially offering minor performance improvements.
Changelog
- `python/sglang/srt/managers/schedule_batch.py`
  - Import `chain` from `itertools` (line 41).
  - Replace `torch.tensor(sum(input_ids, [])...)` with `torch.tensor(list(chain.from_iterable(input_ids))...)` for efficient input ID flattening (lines 1139-1141).
  - Replace `torch.tensor(seq_lens...)` with `torch.as_tensor(seq_lens...)` (line 1142).
  - Replace `torch.tensor(prefix_lens...)` with `torch.as_tensor(prefix_lens...)` (lines 1145-1146).
- `python/sglang/srt/mem_cache/radix_cache.py`
  - Add `self.child_key = None` attribute to `TreeNode` (line 55).
  - Add `self.evict_heap = []` attribute to `RadixCache` (line 114).
  - Clear `self.evict_heap` in the `reset` method (line 139).
  - Rewrite the `evict` method to use the persistent `self.evict_heap`, removing the need for `_collect_leaves` and `heapq.heapify` on each call. The loop now pops from the heap and checks node validity/locks (lines 270-292).
  - Push the node to `self.evict_heap` in `dec_lock_ref` if it becomes an evictable leaf (lines 318-319).
  - Assign `child_key` to the new node and the split child node in `_split_node` (lines 378, 382).
  - Use a single `now` variable for updating `last_access_time` in `_insert_helper` (lines 393, 403).
  - Assign `child_key` and push the new node to `self.evict_heap` in `_insert_helper` if it is an evictable leaf (lines 421, 425-426).
  - Modify `_delete_leaf` to use the stored `node.child_key` for direct deletion from the parent's children dictionary, removing the linear search (lines 448-449).
Code Review
This pull request introduces several well-motivated performance optimizations in `radix_cache.py` and `schedule_batch.py`. The changes in `schedule_batch.py` for list flattening and tensor creation are clear improvements. The modifications to `radix_cache.py`, particularly the eviction logic and leaf deletion, are more substantial and aim to significantly reduce latency.

Overall, the optimizations are promising. However, there's a potential issue with how `evictable_size_` is updated in `_delete_leaf`, which could impact cache size accounting. Additionally, given the significant changes to the caching logic, adding unit tests for `RadixCache` would be highly beneficial to ensure correctness and prevent regressions. Updating documentation for the new members and modified logic in `RadixCache` would also improve maintainability.
Summary of Findings
- **Cache Size Accounting:** There's a potential issue in `radix_cache.py` where `evictable_size_` might be incorrectly updated during node eviction. Specifically, in `_delete_leaf`, it's decremented by `len(node.key)`, while it should reflect changes in `len(node.value)`. This should likely be handled in the `evict` method by decrementing `evictable_size_` by `node_value_len` after the node's value is freed (see the sketch after this list).
- **Unit Testing:** The changes to `RadixCache` involve significant modifications to core caching logic (eviction, node splitting, deletion). It's highly recommended to add comprehensive unit tests for these functionalities to ensure correctness, cover edge cases, and prevent future regressions.
- **Documentation:** Consider updating docstrings for the modified methods in `RadixCache` (e.g., `evict`, `_split_node`, `_delete_leaf`) and for new class members like `evict_heap` and `child_key` to reflect the new logic and their roles. This will help with code clarity and maintainability.
- **Redundant Code:** In `python/sglang/srt/mem_cache/radix_cache.py`, within the `_split_node` method, there's a redundant assignment `child.parent = new_node` at line 384 (RIGHT side); it was already set at line 380. This is a minor issue with no functional impact but could be cleaned up.
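A minimal sketch of the accounting fix suggested above, assuming the eviction path frees `node.value` before deleting the leaf; `node_value_len` follows the naming in the PR diff, and the surrounding method body is illustrative rather than the repository's exact code:

```python
# Illustrative eviction step: account for the freed KV entries at the
# moment of freeing, so _delete_leaf() stays purely about tree structure.
node_value_len = len(node.value)
self.token_to_kv_pool_allocator.free(node.value)
node.value = None                        # mark the node as evicted
self.evictable_size_ -= node_value_len   # decrement by the value length
if len(node.children) == 0:
    self._delete_leaf(node)              # no size bookkeeping in here
```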
Merge Readiness
The pull request introduces valuable performance optimizations. However, there is a high-severity concern regarding the `evictable_size_` accounting in `radix_cache.py` that needs to be addressed. Additionally, due to the critical nature of caching logic, adding unit tests for the `RadixCache` modifications is strongly recommended before merging. Updating relevant documentation would also be beneficial.
I am unable to approve this pull request directly. Please ensure these points are considered and have other reviewers approve the changes before merging.
Thank you for this PR; accelerating eviction is indeed something we wanted. Would you mind making the same changes to the hiradix tree as well? It's of course ok if you don't have the bandwidth; in that case I can incorporate it on that side.
Sure thing - let me update the PR tonight with changes in the HiRadix cache as well. Thank you!
Let me do the hiradix tree optimizations in a different PR. I want to do some more detailed analysis and comparison before publishing those changes (I can do that right after this is merged). For now, I updated hiradix_cache to work with the radix_cache optimizations.
```python
self.token_to_kv_pool_allocator.free(node.value)
node.value = None  # Mark node as evicted
# Update evictable size here since value is freed;
# do not update again in _delete_leaf()
```
No need to set `node.value` to None, as the node is about to be deleted.
Agreed. If the node is immediately deleted and garbage-collected, setting `node.value = None` is redundant. However, if `_delete_leaf()` doesn't fully nullify the node (or deletion is conditional), it may be safer to clear it explicitly to avoid stale references.
```python
node.value = None  # Mark node as evicted
# Update evictable size here since value is freed;
# do not update again in _delete_leaf()
self.evictable_size_ -= node_value_len
```
I think we can still keep this in `_delete_leaf`, which has clearer semantics.
I would disagree. `evictable_size_` tracks memory that is currently freeable. Adjusting it before `_delete_leaf()` prevents double-decrement bugs if `_delete_leaf()` also triggers eviction logic or tracking. I would suggest leaving the decrement where it is (immediately after freeing); moving it to `_delete_leaf()` risks coupling unrelated concerns (memory accounting vs. tree structure).
```python
@@ -82,6 +82,9 @@ def get_height(self, node: TreeNode):
        return height

    def write_backup(self, node: TreeNode, write_back=False):
        # Nothing to copy if the node is already evicted.
        if node.value is None:
```
I think there is no need for this.
This check is necessary to prevent redundant or incorrect backup logic:
- If `node.value` is None, the node is already evicted; there is literally nothing on device to copy to host.
- Continuing without this check would likely lead to:
  - unnecessary allocation on the host, and
  - a potential crash in `self.cache_controller.write()`, which expects valid device indices.
```python
node.value = None  # mark as evicted

# Remove the node only if it has no (even evicted) children
if len(node.children) == 0:
```
This should already be determined by this point.
I agree, conditionally: if `evict()` guarantees a node is a leaf before calling `_delete_leaf`, the check is redundant. But if eviction can happen preemptively on non-leaf nodes, the check is necessary. I am not confident that `evict()` ensures leaf-ness, so I would prefer keeping the check.
@sushah13 thanks again for the PR. I think we should revert the hicache change for now; the rest, on the base radix_cache, looks good to me.
The hiradix_cache changes aren't optimizations; they complement the changes in the radix cache so that the functionality doesn't break.
```python
if x.lock_ref > 0:
while self.evict_heap and num_evicted < num_tokens:
    node = heapq.heappop(self.evict_heap)
    if node.lock_ref > 0 or node == self.root_node or node.evicted:
```
Suggest changing this to:

```python
if node.lock_ref > 0:
    continue
assert node != self.root_node
```

Also, `node.evicted` is undefined.
After looking at the logic more, I think you are referring to `node.value is None` here instead of `node.evicted`.
IIUC, there's an implicit assumption here that a child's `last_access_time` is always older than its parent's, because the heap also contains internal nodes; that assumption doesn't seem to hold. The loop should also check whether the node is a leaf and, if not, continue (see the sketch below).
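For reference, a minimal sketch of the pop loop with both suggestions folded in (skip locked nodes, skip non-leaves, assert on the root); `num_evicted`, `num_tokens`, and `self.evict_heap` are assumed from the diff context above, and the body is illustrative rather than the exact PR code:

```python
import heapq

while self.evict_heap and num_evicted < num_tokens:
    node = heapq.heappop(self.evict_heap)
    if node.lock_ref > 0:
        continue                   # still referenced; stale heap entry
    if node.value is None:
        continue                   # already evicted; stale heap entry
    if len(node.children) > 0:
        continue                   # internal node; only leaves are evictable
    assert node != self.root_node  # the root should never reach the heap
    num_evicted += len(node.value)
    # ... free node.value, update evictable_size_, delete the leaf ...
```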
```python
break
del node.parent.children[k]
self.evictable_size_ -= len(node.key)
if node.child_key in node.parent.children:
```
This `if` is not needed; it should instead be an assert, after checking `node.value is None` properly.
Could you please provide details on how you evaluated the performance gain for the radix cache so we can reproduce it? Thanks.
Thanks for the contribution, but we will close this one. Feel free to reopen if you think it is still necessary.
Motivation
This PR introduces performance optimizations to two key components of the SGLang serving stack: `radix_cache.py` and `schedule_batch.py`. The primary goal is to reduce latency and improve throughput by eliminating inefficient data structures and unnecessary computations during request processing and KV cache eviction.

Modifications
Optimizations in `radix_cache.py`
1. Reduced latency via optimized eviction logic
Before: Average latency over 12 requests: 0.577s
After: Average latency over 12 requests: 0.495s
Evict Method Improvements (~60% speedup, from 31ms → 12ms):
- Avoid rebuilding the heap with `heapq.heapify()` on every eviction
- Avoid full-tree traversal in `_collect_leaves()`
- Avoid repeated heap rebalancing
Implementation Details:
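A minimal sketch of the push side of this design, assuming `TreeNode` defines `__lt__` on `last_access_time` so `heapq` can order nodes; the method names follow the changelog, but `_on_insert` is a hypothetical stand-in for the tail of `_insert_helper` and the bodies are illustrative:

```python
import heapq

def dec_lock_ref(self, node: TreeNode):
    node.lock_ref -= 1
    # The node just became an unlocked leaf: make it visible to evict()
    # without rebuilding the whole heap.
    if node.lock_ref == 0 and len(node.children) == 0:
        heapq.heappush(self.evict_heap, node)

def _on_insert(self, parent: TreeNode, child_key, new_node: TreeNode):
    # Hypothetical helper standing in for the tail of _insert_helper:
    # remember the key this node is stored under (for O(1) deletion)
    # and push it onto the heap if it is immediately evictable.
    new_node.child_key = child_key
    parent.children[child_key] = new_node
    if new_node.lock_ref == 0 and len(new_node.children) == 0:
        heapq.heappush(self.evict_heap, new_node)
```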
2. Faster leaf deletion via key tracking
Previously, `_delete_leaf()` required a linear search in the parent's children dict to find the key corresponding to the child node.
✅ Now: Each `TreeNode` stores its `child_key` upon insertion, eliminating the need to search during deletion.
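A before/after sketch of the deletion path, assuming `parent.children` is a dict keyed by the child's token-prefix key; illustrative, not the exact diff:

```python
# Before: linear scan over the parent's children to find this node's key.
def _delete_leaf_old(self, node):
    for k, v in node.parent.children.items():
        if v == node:
            break
    del node.parent.children[k]

# After: the node remembers the key it was inserted under, so deletion is O(1).
def _delete_leaf_new(self, node):
    del node.parent.children[node.child_key]
```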
Optimizations in `schedule_batch.py`
Before: Average latency over 12 requests: 0.495s
After: Average latency over 12 requests: 0.475s
Improvements:
Flattening input IDs efficiently
❌ Before: `sum(input_ids, [])` (an O(n²) operation due to repeated list copying).
✅ Now: `list(chain.from_iterable(input_ids))` for linear-time flattening.

Avoid redundant tensor construction
❌ Before: `torch.tensor(seq_lens, ...)` may create unnecessary copies.
✅ Now: `torch.as_tensor(seq_lens, dtype=torch.int64, device=self.device)` avoids an extra copy if the input is already a tensor-like object.

These optimizations contribute to overall reduced latency and improved batch throughput, especially under high request volume and large prompt sizes.
Checklist
Profiling Screenshots:


Radix Cache:
BEFORE:
AFTER:
Schedule Batch Logic:


BEFORE:
AFTER: