Conversation

@sushah13 commented Jun 5, 2025

Motivation

This PR introduces performance optimizations to two key components of the SGLang serving stack: radix_cache.py and schedule_batch.py. The primary goal is to reduce latency and improve throughput by eliminating inefficient data structures and unnecessary computations during request processing and KV cache eviction.

Modifications

Optimizations in radix_cache.py

1. Reduced latency via optimized eviction logic
Before: Average latency over 12 requests: 0.577s
After: Average latency over 12 requests: 0.495s

Evict Method Improvements (~60% speedup, from 31ms → 12ms):

  • Avoid rebuilding the heap with heapq.heapify() on every eviction

    • Previously, the eviction heap was rebuilt from scratch each time (O(n) cost).
    • ✅ Now: Maintain a persistent min-heap (self.evict_heap) and incrementally update it during insertions, lock/unlock updates, and deletions.
  • Avoid full-tree traversal in _collect_leaves()

    • Tree traversal for collecting evictable leaves was costly and unnecessary on each eviction.
    • ✅ Now: The persistent heap contains all evictable leaves, eliminating the need for traversal.
  • Avoid repeated heap rebalancing

    • Each heapq.heappop() call triggered rebalancing.
    • ✅ Now: evict() loop pops only unlocked and valid nodes using a guard, minimizing operations.

Implementation Details (a code sketch follows this list):

  • Added self.evict_heap: List[Tuple[int, TreeNode]] to RadixCache.
  • Heap is updated incrementally:
    • insert(): Add leaf if evictable.
    • inc_lock_ref(): Remove node from heap if it becomes locked.
    • dec_lock_ref(): Re-add node if it becomes unlockable.
    • _delete_leaf(): No-op for heap; eviction logic handles parent-child promotion.
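
A minimal sketch of this scheme, assuming the PR's names (evict_heap, child_key, lock_ref, last_access_time); the class RadixCacheSketch and the _free_kv stub are illustrative stand-ins (the real code frees through token_to_kv_pool_allocator), and for brevity the heap holds nodes directly, ordered via __lt__, rather than the (time, node) tuples described above:

    import heapq
    import time
    from typing import Dict, List, Optional

    class TreeNode:
        def __init__(self) -> None:
            self.children: Dict[int, "TreeNode"] = {}
            self.parent: Optional["TreeNode"] = None
            self.key: List[int] = []
            self.value: Optional[List[int]] = None   # KV indices; None once evicted
            self.child_key: Optional[int] = None     # key under which we sit in parent.children
            self.lock_ref = 0
            self.last_access_time = time.monotonic()

        def __lt__(self, other: "TreeNode") -> bool:
            # heapq orders eviction candidates LRU-first.
            return self.last_access_time < other.last_access_time

    class RadixCacheSketch:
        def __init__(self) -> None:
            self.root_node = TreeNode()
            self.evict_heap: List[TreeNode] = []  # persistent min-heap of candidates
            self.evictable_size_ = 0

        def _free_kv(self, value: List[int]) -> None:
            pass  # stand-in for token_to_kv_pool_allocator.free(value)

        def _delete_leaf(self, node: TreeNode) -> None:
            # O(1) removal via the stored child_key (see section 2 below).
            del node.parent.children[node.child_key]

        def evict(self, num_tokens: int) -> None:
            num_evicted = 0
            while self.evict_heap and num_evicted < num_tokens:
                node = heapq.heappop(self.evict_heap)
                # Guard against stale entries: the node may have been locked,
                # evicted, or given children since it was pushed.
                if node.lock_ref > 0 or node.value is None or node.children:
                    continue
                self._free_kv(node.value)
                num_evicted += len(node.value)
                self.evictable_size_ -= len(node.value)
                node.value = None
                self._delete_leaf(node)
                # Parent promotion: a parent that just became an evictable
                # leaf enters the candidate heap.
                parent = node.parent
                if parent is not self.root_node and not parent.children and parent.lock_ref == 0:
                    heapq.heappush(self.evict_heap, parent)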

2. Faster leaf deletion via key tracking
Previously, _delete_leaf() required a linear search in the parent’s children dict to find the key corresponding to the child node.
✅ Now: Each TreeNode stores its child_key upon insertion, eliminating the need to search during deletion.
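
For illustration, the before/after lookup in isolation (free functions with hypothetical names; the PR's actual method is _delete_leaf):

    def delete_leaf_old(node):
        # O(k) in the number of siblings: scan the dict to recover the key.
        for k, child in node.parent.children.items():
            if child is node:
                break
        del node.parent.children[k]

    def delete_leaf_new(node):
        # O(1): the key was recorded on the node at insertion time.
        del node.parent.children[node.child_key]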

Optimizations in schedule_batch.py

Before: Average latency over 12 requests: 0.495s
After: Average latency over 12 requests: 0.475s

Improvements (a code sketch follows this list):

  1. Flattening input IDs efficiently
    ❌ Before: sum(input_ids, []) (O(n²) operation due to repeated list copying).
    ✅ Now: list(chain.from_iterable(input_ids)) for linear-time flattening.

  2. Avoid redundant tensor construction
    ❌ Before: torch.tensor(seq_lens, ...) may create unnecessary copies.
    ✅ Now: torch.as_tensor(seq_lens, dtype=torch.int64, device=self.device) avoids extra copy if input is already a tensor-like object.
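
A self-contained illustration of both micro-optimizations (toy inputs; the device argument used in schedule_batch.py is omitted so the snippet runs on CPU):

    from itertools import chain
    import torch

    input_ids = [[1, 2, 3], [4, 5], [6]]
    seq_lens = [3, 2, 1]

    # O(n^2): sum() re-copies the accumulated list on every concatenation.
    flat_slow = sum(input_ids, [])
    # O(n): chain.from_iterable walks each sublist exactly once.
    flat_fast = list(chain.from_iterable(input_ids))
    assert flat_slow == flat_fast

    # torch.tensor always copies; torch.as_tensor returns the input
    # unchanged when it is already a tensor with matching dtype/device.
    seq_lens_tensor = torch.as_tensor(seq_lens, dtype=torch.int64)
    same = torch.as_tensor(seq_lens_tensor, dtype=torch.int64)
    assert same.data_ptr() == seq_lens_tensor.data_ptr()  # no copy made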

These optimizations contribute to overall reduced latency and improved batch throughput, especially under high request volume and large prompt size settings.

Profiling Screenshots:
Radix Cache:
BEFORE: [Radix Cache profiler screenshot]
AFTER: [Radix Cache profiler screenshot]

Schedule Batch Logic:
BEFORE: [Schedule Batch profiler screenshot]
AFTER: [Schedule Batch profiler screenshot]

@gemini-code-assist (bot, Contributor) left a comment:

Hello @sushah13, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request by @sushah13.

This PR focuses on introducing performance optimizations to two critical components of the SGLang serving stack: radix_cache.py and schedule_batch.py. The main goal is to reduce overall latency and improve throughput, particularly under high request loads and with large prompts.

The changes target inefficiencies in the KV cache eviction logic within the RadixCache and optimize data handling and tensor creation within the schedule_batch process. The author reports significant speedups in the evict method (~60%) and overall latency reductions after applying these changes.

Highlights

  • Radix Cache Eviction Optimization: The eviction logic in radix_cache.py has been significantly optimized. Instead of rebuilding the eviction heap (heapq.heapify) and traversing the tree (_collect_leaves) on every eviction call, a persistent min-heap (self.evict_heap) is now maintained. Nodes are added to this heap incrementally when they become evictable leaves (e.g., when lock references drop to zero or during insertion). The evict method now pops from this persistent heap, skipping locked or already-evicted nodes, leading to substantial performance gains.
  • Faster Radix Cache Leaf Deletion: To speed up the _delete_leaf operation in radix_cache.py, each TreeNode now stores its child_key. This eliminates the need for a linear search through the parent's children dictionary to find the key corresponding to the node being deleted, allowing for direct deletion.
  • Schedule Batch Input Flattening Optimization: In schedule_batch.py, the inefficient sum(input_ids, []) method for flattening a list of lists of input IDs has been replaced with list(chain.from_iterable(input_ids)). This changes the operation from an O(n²) complexity (due to repeated list copying) to a more efficient linear-time operation.
  • Schedule Batch Tensor Creation Optimization: The creation of seq_lens_tensor and prefix_lens_tensor in schedule_batch.py now uses torch.as_tensor instead of torch.tensor. This can avoid unnecessary data copies if the input seq_lens or prefix_lens are already tensor-like objects, potentially offering minor performance improvements.

Changelog

  • python/sglang/srt/managers/schedule_batch.py
    • Import chain from itertools (line 41).
    • Replace torch.tensor(sum(input_ids, [])...) with torch.tensor(list(chain.from_iterable(input_ids))...) for efficient input ID flattening (lines 1139-1141).
    • Replace torch.tensor(seq_lens...) with torch.as_tensor(seq_lens...) (line 1142).
    • Replace torch.tensor(prefix_lens...) with torch.as_tensor(prefix_lens...) (lines 1145-1146).
  • python/sglang/srt/mem_cache/radix_cache.py
    • Add self.child_key = None attribute to TreeNode (line 55).
    • Add self.evict_heap = [] attribute to RadixCache (line 114).
    • Clear self.evict_heap in the reset method (line 139).
    • Rewrite the evict method to use the persistent self.evict_heap, removing the need for _collect_leaves and heapq.heapify on each call. The loop now pops from the heap and checks node validity/locks (lines 270-292).
    • Push the node to self.evict_heap in dec_lock_ref if it becomes an evictable leaf (lines 318-319).
    • Assign child_key to the new node and the split child node in _split_node (lines 378, 382).
    • Use a single now variable for updating last_access_time in _insert_helper (lines 393, 403).
    • Assign child_key and push the new node to self.evict_heap in _insert_helper if it is an evictable leaf (lines 421, 425-426).
    • Modify _delete_leaf to use the stored node.child_key for direct deletion from the parent's children dictionary, removing the linear search (lines 448-449).

@gemini-code-assist (bot, Contributor) left a comment:

Code Review

This pull request introduces several well-motivated performance optimizations in radix_cache.py and schedule_batch.py. The changes in schedule_batch.py for list flattening and tensor creation are clear improvements. The modifications to radix_cache.py, particularly the eviction logic and leaf deletion, are more substantial and aim to significantly reduce latency.

Overall, the optimizations are promising. However, there's a potential issue with how evictable_size_ is updated in _delete_leaf, which could impact cache size accounting. Additionally, given the significant changes to the caching logic, adding unit tests for RadixCache would be highly beneficial to ensure correctness and prevent regressions. Updating documentation for the new members and modified logic in RadixCache would also improve maintainability.

Summary of Findings

  • Cache Size Accounting: There's a potential issue in radix_cache.py where evictable_size_ might be incorrectly updated during node eviction. Specifically, in _delete_leaf, it's decremented by len(node.key), while it should reflect changes in len(node.value). This should likely be handled in the evict method by decrementing evictable_size_ by node_value_len after the node's value is freed.
  • Unit Testing: The changes to RadixCache involve significant modifications to core caching logic (eviction, node splitting, deletion). It's highly recommended to add comprehensive unit tests for these functionalities to ensure correctness, cover edge cases, and prevent future regressions.
  • Documentation: Consider updating docstrings for the modified methods in RadixCache (e.g., evict, _split_node, _delete_leaf) and for new class members like evict_heap and child_key to reflect the new logic and their roles. This will help with code clarity and maintainability.
  • Redundant Code: In python/sglang/srt/mem_cache/radix_cache.py, within the _split_node method, there's a redundant assignment child.parent = new_node at line 384 (RIGHT side). This was already set at line 380. This is a minor issue with no functional impact but could be cleaned up.

Merge Readiness

The pull request introduces valuable performance optimizations. However, there is a high-severity concern regarding the evictable_size_ accounting in radix_cache.py that needs to be addressed. Additionally, due to the critical nature of caching logic, adding unit tests for the RadixCache modifications is strongly recommended before merging. Updating relevant documentation would also be beneficial.

I am unable to approve this pull request directly. Please ensure these points are considered and have other reviewers approve the changes before merging.

@xiezhq-hermann (Collaborator) commented:

Thank you for this PR; this is indeed something we wanted in order to accelerate eviction. Would you mind making the same changes to hiradixtree as well? It's of course ok if you don't have the bandwidth; then I can incorporate it on that side.

@sushah13 (Author) commented:

> Thank you for this PR; this is indeed something we wanted in order to accelerate eviction. Would you mind making the same changes to hiradixtree as well? It's of course ok if you don't have the bandwidth; then I can incorporate it on that side.

Sure thing - let me update the PR tonight with changes in Hiradix cache as well. Thank you!

@sushah13 (Author) commented:

> Thank you for this PR; this is indeed something we wanted in order to accelerate eviction. Would you mind making the same changes to hiradixtree as well? It's of course ok if you don't have the bandwidth; then I can incorporate it on that side.

Let me do the hiradixtree optimizations in a different PR. I want to do some more detailed analysis and comparison before I publish those changes (I can do that right after this is merged). For now, I updated hiradix_cache to work with the radix_cache optimizations.

self.token_to_kv_pool_allocator.free(node.value)
node.value = None # Mark node as evicted
# Update evictable size here since value is freed;
# do not update again in _delete_leaf()
Collaborator:

No need to update node.value to None as it is to be deleted

Author:

Agreed. If the node is immediately deleted and garbage-collected, setting node.value = None is redundant. However, if _delete_leaf() doesn’t fully nullify the node (or deletion is conditional), it may be safer to clear explicitly to avoid stale references.

node.value = None # Mark node as evicted
# Update evictable size here since value is freed;
# do not update again in _delete_leaf()
self.evictable_size_ -= node_value_len
Collaborator:

I think we can still keep this in _delete_leaf, which has clearer semantics.

Author:

I would disagree. The evictable_size_ tracks memory currently freeable. Adjusting it before _delete_leaf() prevents double-decrement bugs if _delete_leaf() also triggers eviction logic or tracking.

I would suggest leaving the decrement where it is (immediately after freeing). Moving it to _delete_leaf() risks coupling unrelated concerns (memory vs. tree structure).
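
For context, a fragment sketching the ordering the author argues for inside evict() (names follow the diff hunks above; this is not the final merged code):

    # Inside evict(), after a node is popped and validated:
    node_value_len = len(node.value)
    self.token_to_kv_pool_allocator.free(node.value)
    node.value = None  # mark node as evicted
    # Settle the accounting here, where the memory is actually freed;
    # _delete_leaf() then only mutates tree structure and must not
    # decrement evictable_size_ a second time.
    self.evictable_size_ -= node_value_len
    self._delete_leaf(node)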

@@ -82,6 +82,9 @@ def get_height(self, node: TreeNode):
return height

def write_backup(self, node: TreeNode, write_back=False):
# Nothing to copy if the node is already evicted.
if node.value is None:
Collaborator:

I think there is no need for this.

Author:

This check is necessary to prevent redundant or incorrect backup logic:

  • If node.value is None, it’s already evicted. There's literally nothing on device to copy to host.
  • Continuing without this check would likely lead to:
    • Unnecessary allocation on host
    • Potential crash in self.cache_controller.write() expecting valid device indices

node.value = None # mark as evicted

# Remove the node only if it has no (even evicted) children
if len(node.children) == 0:
Collaborator:

This should be determined already.

Author:

I agree, conditionally: if evict() guarantees a node is a leaf before calling _delete_leaf, the check is redundant. But if eviction can happen preemptively on non-leaf nodes, the check is necessary. I am not confident that evict() ensures leaf-ness, so I would prefer keeping the check.

@xiezhq-hermann (Collaborator) commented:

@sushah13 thanks again for the PR. I think we should revert the hicache change for now; the rest, on the base radix_cache, looks good to me.

@sushah13 (Author) commented:

> @sushah13 thanks again for the PR. I think we should revert the hicache change for now; the rest, on the base radix_cache, looks good to me.

The hiradix_cache changes aren't optimizations; they complement the radix_cache changes so that functionality doesn't break.

if x.lock_ref > 0:
while self.evict_heap and num_evicted < num_tokens:
node = heapq.heappop(self.evict_heap)
if node.lock_ref > 0 or node == self.root_node or node.evicted:
@hanming-lu (Collaborator) commented Jul 23, 2025:

Suggest changing this to:

if node.lock_ref > 0:
  continue
assert node != self.root_node

node.evicted is undefined.

Collaborator:

After looking at the logic more, I think you are referring to node.value is None here instead of node.evicted.

@hanming-lu (Collaborator) commented Jul 23, 2025:

IIUC, there's an implicit assumption here that a child's last_access_time is always older than its parent's, because the heap also contains internal nodes. That assumption doesn't seem to hold. The loop should also check whether the node is a leaf and, if not, continue.
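
Combining the three review points (lock guard, node.value is None in place of the undefined node.evicted, and an explicit leaf check), the pop loop would look roughly like this sketch:

    while self.evict_heap and num_evicted < num_tokens:
        node = heapq.heappop(self.evict_heap)
        if node.lock_ref > 0:
            continue
        assert node != self.root_node
        if node.value is None:      # stale entry: already evicted
            continue
        if len(node.children) > 0:  # internal node: only leaves are evictable
            continue
        # ... free node.value and delete the leaf as before ...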

break
del node.parent.children[k]
self.evictable_size_ -= len(node.key)
if node.child_key in node.parent.children:
Collaborator:

This if is not needed; instead it should be an assert, after properly checking node.value is None.

@hanming-lu (Collaborator) commented Jul 24, 2025:

Could you please provide details on how you evaluated the performance gain for the radix cache, so we can repro? Thanks.

@merrymercy (Contributor) commented Aug 3, 2025:

Thanks for the contribution, but we will close this one. Feel free to reopen if you think it is still necessary.
