@xiezhq-hermann (Collaborator) commented Jan 1, 2025

Motivation

While RadixTree-based context caching provides significant performance benefits, these gains are not always fully realized. A key bottleneck is the capacity limit of GPU memory. Currently, SGLang stores historical KV caches exclusively in GPU memory; whenever more memory is required for batch execution, existing caches are discarded.

To address this issue, we propose a hierarchical caching mechanism for LLM serving, treating GPU memory as an L1 cache, host memory as an L2 cache, and disk as an L3 cache (future). This PR introduces such a mechanism in SGLang through a separate host memory pool that backs up KV caches, allowing them to be reloaded into GPU memory when needed.
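As a toy illustration of the L1/L2 tiering described above (all class and method names here are invented for this sketch, not SGLang's actual implementation), an LRU cache that spills evictions from a small "GPU" tier into a larger "host" tier might look like:

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: a small L1 tier (stands in for GPU memory)
    backed by a larger L2 tier (stands in for host memory).
    Hypothetical names; not part of the SGLang codebase."""

    def __init__(self, l1_capacity: int, l2_capacity: int):
        self.l1 = OrderedDict()  # small, fast tier
        self.l2 = OrderedDict()  # larger backup tier
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity

    def get(self, key):
        if key in self.l1:                  # L1 hit
            self.l1.move_to_end(key)
            return self.l1[key]
        if key in self.l2:                  # L2 hit: reload into L1
            value = self.l2.pop(key)
            self.put(key, value)
            return value
        return None                         # miss: KV must be recomputed

    def put(self, key, value):
        self.l1[key] = value
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            # Evict the least-recently-used entry from L1 into L2
            # instead of discarding it outright.
            old_key, old_value = self.l1.popitem(last=False)
            self.l2[old_key] = old_value
            while len(self.l2) > self.l2_capacity:
                self.l2.popitem(last=False)  # final eviction: data is lost
```

The point of the sketch is the `get` path: an L2 hit avoids recomputation by reloading the entry into L1, which is exactly the benefit the host memory pool provides.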

Modifications

  • A HiRadixCache that extends RadixCache with host memory addresses and synchronization mechanisms.
  • A host memory pool that synchronizes with the device memory pool of KV caches.
  • A memory controller that implements efficient data transfer between host and device, and handles various cache write policies for hierarchical caching.

Todo:

  • Update benchmark results.
  • Remove deprecated design and implementation.

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Jan 1, 2025

It's amazing! Happy new year!

@zhyncs added the enhancement (New feature or request) label Jan 1, 2025
@Ying1123 merged commit 6c7a152 into main Feb 24, 2025
17 of 21 checks passed
@Ying1123 deleted the xiezhq-hierarchical branch Feb 24, 2025
@lambert0312 (Contributor)

DeepSeek MLA is not supported yet, and an error will be reported when starting the model:

  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1849, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 305, in __init__
    HiRadixCache(
  File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/hiradix_cache.py", line 26, in __init__
    self.token_to_kv_pool_host = MLATokenToKVPoolHost(token_to_kv_pool)
  File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 461, in __init__
    self.head_num = device_pool.head_num
AttributeError: 'MLATokenToKVPool' object has no attribute 'head_num'

@xiezhq-hermann (Collaborator, Author)

> DeepSeek MLA is not supported yet, and an error will be reported when starting the model: […]

Thank you @lambert0312 for pointing this out. Yes, this feature is still at an early stage and currently supports only MHA- and GQA-style memory pools. I will keep you posted once MLA is supported, which should be soon.
For further questions about this feature, feel free to reach out to me on the SGLang Slack for a prompter reply.
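The AttributeError in the report above stems from the host pool assuming an MHA/GQA-style layout (per-head dimensions), while an MLA pool stores a compressed latent instead. A minimal sketch of type-based dispatch that would avoid it, using simplified stand-in classes (not SGLang's real pools):

```python
class MHATokenToKVPoolStandIn:
    """Stand-in for an MHA/GQA device pool exposing a per-head layout."""
    head_num = 8
    head_dim = 128

class MLATokenToKVPoolStandIn:
    """Stand-in for an MLA device pool: one compressed latent per token,
    so there is no `head_num` attribute to read."""
    kv_lora_rank = 512

def make_host_pool(device_pool):
    """Dispatch on the pool's layout instead of assuming `head_num` exists."""
    if hasattr(device_pool, "head_num"):
        per_token_width = device_pool.head_num * device_pool.head_dim
        return ("mha_host_pool", per_token_width)
    if hasattr(device_pool, "kv_lora_rank"):
        return ("mla_host_pool", device_pool.kv_lora_rank)
    raise TypeError(f"unsupported device pool: {type(device_pool).__name__}")
```

This is only an illustration of the shape mismatch; the actual fix for MLA support landed later (see #4009 below in this thread).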

@lambert0312 (Contributor) commented Feb 25, 2025

> Thank you @lambert0312 for pointing out, yes, this feature is still under meta stage and currently only supported MHA and GQA style memory pool. […]

Thanks @xiezhq-hermann

@zhaochenyang20 mentioned this pull request Mar 3, 2025
@zhyncs mentioned this pull request Mar 4, 2025
@xiezhq-hermann (Collaborator, Author) commented Mar 4, 2025

> Thanks @xiezhq-hermann

@lambert0312 just FYI, there is a PR from the community supporting MLA with hierarchical caching. It will be merged soon, but feel free to check it out already: #4009

@lambert0312 (Contributor)

> @lambert0312 just FYI, there is a PR from the community supporting MLA with hierarchical caching […]: #4009

@xiezhq-hermann Thanks, but I've run into a problem. I just experimented with #4009 and found that there is indeed a concurrency issue when TP>1: the program enters a deadlocked state. Please follow up, thank you!

@shensimeteor

> After code cleaning and basic performance benchmark, this PR is ready to merge. You can add --enable-hierarchical-cache option when starting a SGLang server to turn on this feature. This feature will still be under active development in the future months, your feedback will be greatly welcomed : ) Following is a throughput vs. median TTFT curve that demonstrates the benefit of hierarchical caching using a synthetic multi-turn benchmark, and you can reproduce it with Qwen/Qwen2.5-14B-Instruct on an A100-80G GPU as explained here:

> [figure: throughput vs. median TTFT curve]

Besides --enable-hierarchical-cache, do we also need to set cpu_offload_gb?

@xiezhq-hermann (Collaborator, Author)

> Besides --enable-hierarchical-cache, do we also need to set cpu_offload_gb?

Right now it allocates a host memory pool that is 4x the size of the device memory pool by default, so there is no need to set anything else, but more options will be added.

aoshen524 pushed a commit to aoshen524/sglang that referenced this pull request Mar 10, 2025
Co-authored-by: Wenxuan Tan <wenxuan.tan@wisc.edu>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
@wangyibin-gh

Hi, I'm wondering when you are planning to support the L3 cache? I think it's reasonable to support pluggable L3 caches, which would encourage storage providers to implement their own L3 caches according to their product features. What you would need to do is define a set of KV cache APIs for getting/putting/evicting KV cache chunks/items and provide a demo implementation using something like a local SSD.

@msharmavikram

This is in the works, @wangyibin-gh!

@wangyibin-gh

> This is in the works @wangyibin-gh!

When do you expect this feature to be merged? And by the way, is there any documentation about it, especially w.r.t. the APIs?
