
Conversation

xiezhq-hermann
Collaborator

@xiezhq-hermann xiezhq-hermann commented Jul 2, 2025

Motivation

This PR aims to introduce a standard storage interface for hierarchical KV caching (first introduced in #2693), so that the community can plug in different storage backends such as Mooncake.

Recently, there have been PRs and proposals to integrate a storage layer into the hierarchical KV caching ecosystem:
#7211
#7280
#7896
#7920
#7576
#7746 (comment)
#7761 (comment)
and some more non-public inquiries.

After extensive discussion, we decided to move forward with the following plan:

  1. Keep iterating on and maintaining a high-performance Radix and HiRadix memory management backbone to prevent performance regressions.
  2. Set up a standard storage interface so contributors from the community can easily develop and integrate their own performant storage backends (a rough sketch of what such an interface could look like is included below). For now, a minimal HiCacheFile backend is tested for demonstration purposes, and Mooncake integration is under active development. Please let us know if you plan to integrate another backend and any enhancements you might need for the interfaces.
  3. Set up a few necessary hooks for standard scheduling policy development. For now, only best-effort prefetching and hot-spot write-through backup are supported (see the prefetching sketch further below). Please do let us know your needs so we can clean up the interfaces accordingly.
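
To make the interface idea in item 2 concrete, here is a rough sketch of what such a storage backend could look like. This is an illustration only; the class and method names below are assumptions, not the exact interface shipped in this PR:

```python
# Illustrative sketch only: a get/set/exists-style storage backend keyed by a
# prefix hash, with a file-per-key implementation in the spirit of HiCacheFile.
# All names here (HiCacheStorage, FileBackend, ...) are hypothetical.
import os
from abc import ABC, abstractmethod
from typing import Optional

import torch


class HiCacheStorage(ABC):
    @abstractmethod
    def get(self, key: str, dst: Optional[torch.Tensor] = None) -> Optional[torch.Tensor]:
        """Load the KV block for `key`, optionally into a caller-provided host buffer."""

    @abstractmethod
    def set(self, key: str, value: torch.Tensor) -> bool:
        """Persist one KV block; return False on failure so callers can degrade gracefully."""

    @abstractmethod
    def exists(self, key: str) -> bool:
        """Cheap existence check used when deciding how much prefix to prefetch."""


class FileBackend(HiCacheStorage):
    """File-per-key backend: each KV block is stored as one file under `root`."""

    def __init__(self, root: str = "/tmp/hicache"):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, f"{key}.bin")

    def get(self, key, dst=None):
        if not self.exists(key):
            return None
        data = torch.load(self._path(key), map_location="cpu")
        if dst is not None:
            dst.copy_(data)
            return dst
        return data

    def set(self, key, value):
        torch.save(value.cpu(), self._path(key))
        return True

    def exists(self, key):
        return os.path.exists(self._path(key))
```

A remote backend such as Mooncake would implement the same small set of methods against its own transport, which is the point of keeping the interface minimal.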

Some other follow-up work:

  • Integrating full hicache functionality with PD disaggregation and router level scheduling.
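
To illustrate the best-effort prefetching hook from plan item 3 above, here is a rough sketch of how a prefetch worker might pull contiguous prefix blocks from a storage backend into host buffers and stop early once the scheduler moves on. The names and the exact termination policy are assumptions for illustration, not the controller's actual API:

```python
# Hypothetical best-effort prefetch worker (illustrative names, not the real API).
import threading
from typing import List

import torch


def prefetch_best_effort(
    storage,                            # any object exposing get(key, dst=...), e.g. the sketch above
    keys: List[str],                    # hashes of the prefix blocks we hope to reuse
    host_buffers: List[torch.Tensor],   # pre-allocated (ideally pinned) CPU destinations
    stop_event: threading.Event,        # set by the scheduler when it stops waiting
) -> int:
    """Load as many contiguous prefix blocks as possible; return how many were loaded."""
    loaded = 0
    for key, buf in zip(keys, host_buffers):
        if stop_event.is_set():         # best-effort: the request got scheduled, stop here
            break
        if storage.get(key, dst=buf) is None:   # a storage miss ends the usable prefix
            break
        loaded += 1
    return loaded


# Usage sketch: run in the background and signal termination at scheduling time.
# stop = threading.Event()
# worker = threading.Thread(target=prefetch_best_effort, args=(backend, keys, bufs, stop))
# worker.start(); ...; stop.set(); worker.join()
```

Whatever the worker managed to load would then be treated as a cache hit, with the rest of the prefix recomputed as usual.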

Modifications

Checklist

@xiezhq-hermann xiezhq-hermann changed the base branch from main to xiezhq-hicache-upstream July 2, 2025 02:55
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @xiezhq-hermann, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a prototype for a hierarchical cache (Hicache) storage layer, enabling the persistence and prefetching of KV cache data. It defines a generic storage interface and provides a file-based implementation, integrating these capabilities into the existing cache management system. The changes also involve a significant refactoring of KV cache input/output operations between GPU and CPU, utilizing optimized kernel functions for improved efficiency.

Highlights

  • New Hicache Storage Layer: Introduced a persistent storage layer for KV cache, starting with a file-based implementation (HiCacheFile). This allows KV cache data to be stored on disk, enabling more efficient memory management and potentially larger context windows.
  • KV Cache Prefetching: Implemented a prefetching mechanism that can load KV cache data from the new storage layer into host memory asynchronously, anticipating future needs and reducing latency during model inference.
  • Refactored KV Cache I/O: Significantly refactored how KV cache data is transferred between device (GPU) and host (CPU) memory. This leverages new, optimized sgl-kernel functions for more efficient and unified load_from_host_per_layer and backup_to_host_all_layer operations, and standardizes buffer structures.
  • Hierarchical Cache Integration: The new storage and prefetching capabilities are deeply integrated into the existing HiCacheController and HiRadixCache components, enhancing the overall hierarchical caching system with disk-backed storage.
  • Dependency and Configuration Updates: Updated the sgl-kernel dependency to version 0.2.0 to support the new KV cache I/O functionalities and added new hicache_io_backend and hicache_storage_backend configuration options.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a prototype for a hierarchical cache (HiCache) storage layer, which is a significant feature. The changes add the capability to prefetch KV cache data from a persistent storage backend, which can improve performance for requests with long prefixes.

The overall structure is well-thought-out, with a clear separation of concerns between the storage backend, the cache controller, and the radix tree cache implementation. However, there are several critical issues related to correctness and thread safety that must be addressed before this can be merged. These include incorrect API usage, race conditions, and bugs in the prefetching logic.

I've left detailed comments on specific lines of code to address these issues. Once these are resolved, this will be a great addition to the project.

@xiezhq-hermann xiezhq-hermann changed the title Hicache Storage Layer Prototype [WIP] Hicache Storage Layer Prototype Jul 3, 2025
@xiezhq-hermann xiezhq-hermann force-pushed the xiezhq-hicache-upstream branch from 96e733c to fd74ed0 Compare July 4, 2025 23:22
@xiezhq-hermann xiezhq-hermann force-pushed the xiezhq-hicache-storage branch from 842bab6 to f82cc02 Compare July 5, 2025 00:11
@hnyls2002 hnyls2002 self-assigned this Jul 5, 2025
@xiezhq-hermann xiezhq-hermann self-assigned this Jul 6, 2025
@ispobock ispobock merged commit 9d33fcf into main Jul 18, 2025
100 of 114 checks passed
@ispobock ispobock deleted the xiezhq-hicache-storage branch July 18, 2025 07:20
@didoteebin

@xiezhq-hermann Hi xiezhq, why do you think a three-layer GPU / CPU / file-based KV cache storage mechanism is better? To my best understanding, removing the CPU layer and doing direct GDS transfer between GPU and file would be faster.

@xiezhq-hermann
Collaborator Author

> @xiezhq-hermann Hi xiezhq, why do you think a three-layer GPU / CPU / file-based KV cache storage mechanism is better? To my best understanding, removing the CPU layer and doing direct GDS transfer between GPU and file would be faster.

Good point @didoteebin. The rationale lies in different hardware characteristics. For most systems, PCIe bandwidth lets us do layer-wise overlapping of KV cache transfer and forward computation (i.e., concurrently execute the forward pass of layer 1 while loading KV caches for layer 2), since PCIe 5.0 can reach roughly 50 GB/s and PCIe 4.0 roughly 20 GB/s, and the latency is more predictable. However, for most local disks and remote storage, the latency is much higher and the throughput lower. As a result, we implemented different policies across the layers: layer-wise zero-overhead overlapping between GPU and CPU memory, and best-effort prefetching for storage devices (see the sketch below). That said, I can certainly see different hardware platforms affecting the design choice, e.g., a fast RDMA remote memory pool could make direct access more beneficial, and we will keep evolving the system to fit more users' needs.
Thanks again for your PR as well; we would love to integrate direct access as an alternative in the near future once we complete the integration of popular backends like Mooncake and 3FS.
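
For readers curious what the layer-wise overlap described above could look like, here is a minimal sketch using two CUDA streams; the buffer layout and the layer call signature are assumptions for illustration, not SGLang's actual HiCacheController implementation:

```python
# Minimal sketch of layer-wise overlap between H2D KV cache loading and forward
# computation, as described above. Names are illustrative.
import torch


def forward_with_layerwise_kv_load(layers, kv_host, kv_device, hidden):
    """Run layer i while the KV cache for layer i+1 is copied host -> device.

    `kv_host[i]` is assumed to be a pinned CPU tensor holding layer i's KV cache,
    and `kv_device[i]` its pre-allocated GPU destination.
    """
    load_stream = torch.cuda.Stream()                 # dedicated copy stream
    ready = [torch.cuda.Event() for _ in layers]      # per-layer "KV is on GPU" events

    with torch.cuda.stream(load_stream):              # start copying layer 0 up front
        kv_device[0].copy_(kv_host[0], non_blocking=True)
        ready[0].record(load_stream)

    for i, layer in enumerate(layers):
        if i + 1 < len(layers):                       # prefetch the next layer's KV cache
            with torch.cuda.stream(load_stream):
                kv_device[i + 1].copy_(kv_host[i + 1], non_blocking=True)
                ready[i + 1].record(load_stream)
        torch.cuda.current_stream().wait_event(ready[i])  # wait only for what layer i needs
        hidden = layer(hidden, kv_device[i])
    return hidden
```

On PCIe-class bandwidth this keeps the copies off the critical path; slower storage cannot hide its latency this way, which is why the storage tier only gets best-effort prefetching.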

@soyail

soyail commented Jul 22, 2025

I tested the performance of the hierarchical cache with benchmark_multiturn.py, and found that its performance actually degraded somewhat compared to the original implementation. As I understand it, at the beginning of the program, using the hierarchical cache requires storing the KV Cache, which introduces additional overhead. Then, I continued profiling with nsys and noticed that the D2H bandwidth was too low in the logs. After checking the source code, I found that when attention_backend="fa3", it performs a direct copy rather than zero-copy. What is the rationale behind this design choice?

@xiezhq-hermann
Collaborator Author

> I tested the performance of the hierarchical cache with benchmark_multiturn.py, and found that its performance actually degraded somewhat compared to the original implementation. As I understand it, at the beginning of the program, using the hierarchical cache requires storing the KV Cache, which introduces additional overhead. Then, I continued profiling with nsys and noticed that the D2H bandwidth was too low in the logs. After checking the source code, I found that when attention_backend="fa3", it performs a direct copy rather than zero-copy. What is the rationale behind this design choice?

@soyail there is a bug associated with co-running the fa3 backend and the KV cache IO loading kernel, which we are still investigating. For now, we would recommend using flashinfer as the attention backend for hicache.

@Charles-L-Chen
Contributor

> I tested the performance of the hierarchical cache with benchmark_multiturn.py, and found that its performance actually degraded somewhat compared to the original implementation. As I understand it, at the beginning of the program, using the hierarchical cache requires storing the KV Cache, which introduces additional overhead. Then, I continued profiling with nsys and noticed that the D2H bandwidth was too low in the logs. After checking the source code, I found that when attention_backend="fa3", it performs a direct copy rather than zero-copy. What is the rationale behind this design choice?
>
> @soyail there is a bug associated with co-running the fa3 backend and the KV cache IO loading kernel, which we are still investigating. For now, we would recommend using flashinfer as the attention backend for hicache.

@xiezhq-hermann Is there any update on this issue?

xiezhq-hermann added a commit that referenced this pull request Jul 31, 2025
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Aug 1, 2025
TianQiLin666666 pushed a commit to TianQiLin666666/sglang that referenced this pull request Aug 1, 2025
lifuhuang pushed a commit that referenced this pull request Aug 3, 2025
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
@Alisehen

Some questions I need assistance in answering, thanks a lot.
I've conducted benchmarks on this feature using the DeepSeek R1 model with 16 H20 GPUs, where HiCacheFile is configured to use tmpfs. During the tests, I noticed that eviction performance might be causing watchdog timeouts. Are there any specific metrics or logging mechanisms that could help me analyze this issue in more detail? This set of test loads performs well in the version with only enable-hierarchical-cache enabled.
SGLang error log as below:

params.SamplingParams object at 0x7f33ca0fe6e0>)], available_size=1160, evictable_size=361907,
2025-07-15 17:47:33 - ERROR - Pyspy failed to dump PID 566194. Error: /bin/dash: 1: py-spy: not found

2025-07-15 17:47:33 - ERROR - Watchdog timeout (self.watchdog_timeout=300)

Filesystem info:

tmpfs           512G   36G  477G   7% /tmp/hicache

The startup script for one of the sglang nodes is as follows

GLOO_SOCKET_IFNAME=eth0 \
NCCL_SOCKET_IFNAME=eth0 \
NCCL_IB_GID_INDEX=3 \
NCCL_IB_HCA=mlx5_ \
NCCL_IB_DISABLE=0 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
MODEL_LENGTH=131072  \
TORCHINDUCTOR_FX_GRAPH_CACHE=1 \
TORCHINDUCTOR_AUTOGRAD_CACHE=1 \
TORCHINDUCTOR_CACHE_DIR="/data00/torch_compile/" \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
EPMOE_USE_DEEPGEMM=1 \
python3 -m sglang.launch_server \
  --cuda-graph-bs 1 2 4 8 10 16 20 24 28 32 40 48 56 64 72 76 78 80 82 \
  --cuda-graph-max-bs 82 \
  --attention-backend fa3 \
  --speculative-algo NEXTN \
  --speculative-draft /data00/DeepSeek-R1-NextN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 2 \
  --speculative-num-draft-tokens 4 \
  --model-path /data00/DeepSeek-R1 \
  --tp 16 \
  --dist-init-addr 192.168.0.29:20000 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code \
  --mem-fraction-static 0.8 \
  --enable-ep-moe \
  --max-running-requests 82  \
  --disable-chunked-prefix-cache \
  --enable-hierarchical-cache \
  --host 0.0.0.0 \
  --port 8080 \
  --hicache-storage-backend file

Hi @yapple, TP is not fully supported yet as there are still some design details about the configuration to be figured out. Will fix it ASAP :)

Hello, has this problem been fixed?

@xiezhq-hermann
Collaborator Author

> I tested the performance of the hierarchical cache with benchmark_multiturn.py, and found that its performance actually degraded somewhat compared to the original implementation. As I understand it, at the beginning of the program, using the hierarchical cache requires storing the KV Cache, which introduces additional overhead. Then, I continued profiling with nsys and noticed that the D2H bandwidth was too low in the logs. After checking the source code, I found that when attention_backend="fa3", it performs a direct copy rather than zero-copy. What is the rationale behind this design choice?
>
> @soyail there is a bug associated with co-running the fa3 backend and the KV cache IO loading kernel, which we are still investigating. For now, we would recommend using flashinfer as the attention backend for hicache.
>
> @xiezhq-hermann Is there any update on this issue?

@Charles-L-Chen the latest main has adopted a new backend selection mechanism; by default it uses fa3 for prefill and flashinfer for decoding to avoid this problem.

@xiezhq-hermann
Collaborator Author

> Some questions I need assistance in answering, thanks a lot.
> I've conducted benchmarks on this feature using the DeepSeek R1 model with 16 H20 GPUs, where HiCacheFile is configured to use tmpfs. During the tests, I noticed that eviction performance might be causing watchdog timeouts. Are there any specific metrics or logging mechanisms that could help me analyze this issue in more detail? This set of test loads performs well in the version with only enable-hierarchical-cache enabled.
> [error log and startup script quoted in the original comment above]
>
> Hi @yapple, TP is not fully supported yet as there are still some design details about the configuration to be figured out. Will fix it ASAP :)
>
> Hello, has this problem been fixed?

@Alisehen the TP issues should have been resolved already in the latest main for the file backend; would you mind taking another try?

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025
@wqlxx

wqlxx commented Aug 19, 2025

@xiezhq-hermann Do you have any plan to support hicache on the decode node in PD disaggregation?

@xiezhq-hermann
Collaborator Author

> @xiezhq-hermann Do you have any plan to support hicache on the decode node in PD disaggregation?

Yes, @ShangmingCai is pushing for a solution that fetches the cache on the P node and writes to the cache pool on the D node.
