Hierarchical Caching for SGLang #2693
Conversation
It's amazing! Happy new year!
DeepSeek MLA is not supported yet, and an error is reported when starting the model:
Thank you @lambert0312 for pointing that out. Yes, this feature is still at the meta stage and currently supports only MHA- and GQA-style memory pools. I will keep you posted once MLA is supported, which should be soon.
Thanks @xiezhq-hermann
@lambert0312 just FYI, there is a PR from the community supporting MLA with hierarchical caching; it will be merged soon, but feel free to check it out: #4009
@xiezhq-hermann Thanks, but I've run into a problem. I just experimented with #4009 and found that there is indeed a concurrency issue when TP>1: the program enters a deadlocked state. Please follow up. Thank you!
Besides
Right now it allocates a host memory pool that is 4 times the size of the device memory pool by default, so nothing else needs to be set; more options will be added.
Co-authored-by: Wenxuan Tan <wenxuan.tan@wisc.edu>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Hi, I'm wondering: when are you planning to support the L3 cache? I think it's reasonable to support pluggable L3 caches, which would encourage storage providers to implement their L3 caches according to their product features. What you need to do is define a set of KV-cache APIs for getting/putting/evicting KV-cache chunks/items and provide a demo implementation using something like a local SSD.
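The pluggable interface suggested in this comment could look roughly like the following. This is purely an illustrative sketch, not SGLang's actual API: `L3KVBackend`, `LocalDiskBackend`, and all method names are hypothetical, and a local directory stands in for the SSD demo implementation.

```python
import os
from abc import ABC, abstractmethod
from typing import Optional


class L3KVBackend(ABC):
    """Hypothetical storage-provider interface for an L3 (SSD/remote) KV cache."""

    @abstractmethod
    def put(self, key: str, kv_chunk: bytes) -> None:
        """Persist a serialized KV-cache chunk under `key`."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        """Return the chunk for `key`, or None on a cache miss."""

    @abstractmethod
    def evict(self, key: str) -> None:
        """Drop the chunk for `key` to reclaim space."""


class LocalDiskBackend(L3KVBackend):
    """Demo implementation backed by a local directory (stand-in for an SSD)."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, key)

    def put(self, key: str, kv_chunk: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(kv_chunk)

    def get(self, key: str) -> Optional[bytes]:
        path = self._path(key)
        if not os.path.exists(path):
            return None  # L3 miss: caller falls back to recomputation
        with open(path, "rb") as f:
            return f.read()

    def evict(self, key: str) -> None:
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            pass  # evicting an absent key is a no-op
```

A real backend would add batching, async I/O, and integrity checks, but the three-method surface is the part a storage provider would plug into.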
This is in the works, @wangyibin-gh!
When do you expect this feature to be merged? And by the way, is there any documentation about it, especially w.r.t. the APIs?
Motivation
While RadixTree-based context caching provides significant performance benefits, these gains are not always fully realized. A key bottleneck is the capacity limit of GPU memory. Currently, SGLang stores historical KV caches exclusively in GPU memory; whenever more memory is required for batch execution, existing caches are discarded.
To address this issue, we propose a hierarchical caching mechanism for LLM serving, treating GPU memory as an L1 cache, host memory as an L2 cache, and disk as an L3 cache (future). This PR introduces such a mechanism in SGLang through a separate host memory pool that backs up KV caches, allowing them to be reloaded into GPU memory when needed.
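The L1/L2 behavior described above, backing up evicted entries to a larger host pool and reloading them on demand instead of discarding them, can be sketched very loosely with plain dictionaries standing in for the GPU and host memory pools. Everything here is illustrative; it is not SGLang's actual data structure or eviction policy.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy L1 (GPU) / L2 (host) cache with LRU eviction at each tier."""

    def __init__(self, gpu_capacity: int, host_capacity: int):
        self.gpu = OrderedDict()   # L1: small, fast (stands in for GPU memory)
        self.host = OrderedDict()  # L2: larger backup pool (host memory)
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            victim_key, victim_val = self.gpu.popitem(last=False)  # LRU victim
            self.host[victim_key] = victim_val  # back up to L2, don't discard
            self.host.move_to_end(victim_key)
            while len(self.host) > self.host_capacity:
                self.host.popitem(last=False)   # eviction from L2 is final

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)  # refresh LRU position
            return self.gpu[key]
        if key in self.host:
            value = self.host.pop(key)  # "reload" from host back into GPU
            self.put(key, value)
            return value
        return None  # true miss: the KV cache must be recomputed
```

The default 4x host-to-device sizing mentioned later in the thread corresponds to `host_capacity = 4 * gpu_capacity` in this toy model.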
Modifications
Introduce HiRadixCache that extends RadixCache with host memory addresses and synchronization mechanisms.

Todo:
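To illustrate the shape of this extension, here is a heavily simplified sketch: a radix-tree node subclass that carries a host-side copy and a sync flag, plus write-through and load-back steps. The class and function names are stand-ins invented for this sketch, not SGLang's real `HiRadixCache`/`RadixCache` internals.

```python
class RadixNode:
    """Toy stand-in for a radix-tree node holding device-resident KV data."""

    def __init__(self, tokens=()):
        self.tokens = tuple(tokens)
        self.children = {}
        self.device_value = None  # KV data resident in (simulated) GPU memory


class HiRadixNode(RadixNode):
    """Extends the node with host-memory bookkeeping, as the PR does for the tree."""

    def __init__(self, tokens=()):
        super().__init__(tokens)
        self.host_value = None  # backup copy in (simulated) host memory
        self.synced = False     # True once device and host copies match


def write_through(node: HiRadixNode) -> None:
    """Back up a node's device KV data to host and mark it synchronized."""
    if node.device_value is not None and not node.synced:
        node.host_value = node.device_value
        node.synced = True


def load_back(node: HiRadixNode) -> None:
    """Restore device data from the host backup after a GPU-side eviction."""
    if node.device_value is None and node.host_value is not None:
        node.device_value = node.host_value
```

In the real system the "values" are tensor slots in device and host memory pools and the copies are asynchronous transfers, which is where the synchronization mechanisms come in.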
Checklist