
Conversation

xiezhq-hermann
Collaborator

@xiezhq-hermann xiezhq-hermann commented Jul 2, 2025

Motivation

This PR aims to introduce a standard storage interface for hierarchical KV caching (first introduced in #2693), so that the community can plug in different storage backends such as Mooncake.

Recently, there have been PRs and proposals to integrate a storage layer into the hierarchical KV caching ecosystem:
#7211
#7280
#7896
#7920
#7576
#7746 (comment)
#7761 (comment)
and some more non-public inquiries.

After extensive discussion, we decided to move forward with the following plan:

  1. Keep iterating on and maintaining a high-performance Radix and HiRadix memory management backbone to prevent performance regressions.
  2. Set up a standard storage interface so contributors from the community can easily develop and integrate their own performant storage backends (a rough sketch of what such an interface could look like is included below). For now, a minimal HiCacheFile backend is tested for demonstration purposes, and Mooncake integration is under active development. Please let us know if you plan to integrate another backend and any enhancements you might need for the interfaces.
  3. Set up a few necessary hooks for standard scheduling policy development. For now, only best-effort prefetching and hot-spot write-through backup are supported (see the prefetching sketch further below). Please do let us know your needs so we can clean up the interfaces accordingly.
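
To make the interface idea in item 2 concrete, here is a rough sketch of what such a storage backend could look like. This is an illustration only; the class and method names below are assumptions, not the exact interface shipped in this PR:

```python
# Illustrative sketch only: a get/set/exists-style storage backend keyed by a
# prefix hash, with a file-per-key implementation in the spirit of HiCacheFile.
# All names here (HiCacheStorage, FileBackend, ...) are hypothetical.
import os
from abc import ABC, abstractmethod
from typing import Optional

import torch


class HiCacheStorage(ABC):
    @abstractmethod
    def get(self, key: str, dst: Optional[torch.Tensor] = None) -> Optional[torch.Tensor]:
        """Load the KV block for `key`, optionally into a caller-provided host buffer."""

    @abstractmethod
    def set(self, key: str, value: torch.Tensor) -> bool:
        """Persist one KV block; return False on failure so callers can degrade gracefully."""

    @abstractmethod
    def exists(self, key: str) -> bool:
        """Cheap existence check used when deciding how much prefix to prefetch."""


class FileBackend(HiCacheStorage):
    """File-per-key backend: each KV block is stored as one file under `root`."""

    def __init__(self, root: str = "/tmp/hicache"):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, f"{key}.bin")

    def get(self, key, dst=None):
        if not self.exists(key):
            return None
        data = torch.load(self._path(key), map_location="cpu")
        if dst is not None:
            dst.copy_(data)
            return dst
        return data

    def set(self, key, value):
        torch.save(value.cpu(), self._path(key))
        return True

    def exists(self, key):
        return os.path.exists(self._path(key))
```

A remote backend such as Mooncake would implement the same small set of methods against its own transport, which is the point of keeping the interface minimal.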

Some other follow-up work:

  • Integrating full hicache functionality with PD disaggregation and router level scheduling.
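
To illustrate the best-effort prefetching hook from plan item 3 above, here is a rough sketch of how a prefetch worker might pull contiguous prefix blocks from a storage backend into host buffers and stop early once the scheduler moves on. The names and the exact termination policy are assumptions for illustration, not the controller's actual API:

```python
# Hypothetical best-effort prefetch worker (illustrative names, not the real API).
import threading
from typing import List

import torch


def prefetch_best_effort(
    storage,                            # any object exposing get(key, dst=...), e.g. the sketch above
    keys: List[str],                    # hashes of the prefix blocks we hope to reuse
    host_buffers: List[torch.Tensor],   # pre-allocated (ideally pinned) CPU destinations
    stop_event: threading.Event,        # set by the scheduler when it stops waiting
) -> int:
    """Load as many contiguous prefix blocks as possible; return how many were loaded."""
    loaded = 0
    for key, buf in zip(keys, host_buffers):
        if stop_event.is_set():         # best-effort: the request got scheduled, stop here
            break
        if storage.get(key, dst=buf) is None:   # a storage miss ends the usable prefix
            break
        loaded += 1
    return loaded


# Usage sketch: run in the background and signal termination at scheduling time.
# stop = threading.Event()
# worker = threading.Thread(target=prefetch_best_effort, args=(backend, keys, bufs, stop))
# worker.start(); ...; stop.set(); worker.join()
```

Whatever the worker managed to load would then be treated as a cache hit, with the rest of the prefix recomputed as usual.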

Modifications

Checklist

@xiezhq-hermann xiezhq-hermann changed the base branch from main to xiezhq-hicache-upstream July 2, 2025 02:55
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @xiezhq-hermann, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a prototype for a hierarchical cache (Hicache) storage layer, enabling the persistence and prefetching of KV cache data. It defines a generic storage interface and provides a file-based implementation, integrating these capabilities into the existing cache management system. The changes also involve a significant refactoring of KV cache input/output operations between GPU and CPU, utilizing optimized kernel functions for improved efficiency.

Highlights

  • New Hicache Storage Layer: Introduced a persistent storage layer for KV cache, starting with a file-based implementation (HiCacheFile). This allows KV cache data to be stored on disk, enabling more efficient memory management and potentially larger context windows.
  • KV Cache Prefetching: Implemented a prefetching mechanism that can load KV cache data from the new storage layer into host memory asynchronously, anticipating future needs and reducing latency during model inference.
  • Refactored KV Cache I/O: Significantly refactored how KV cache data is transferred between device (GPU) and host (CPU) memory. This leverages new, optimized sgl-kernel functions for more efficient and unified load_from_host_per_layer and backup_to_host_all_layer operations, and standardizes buffer structures.
  • Hierarchical Cache Integration: The new storage and prefetching capabilities are deeply integrated into the existing HiCacheController and HiRadixCache components, enhancing the overall hierarchical caching system with disk-backed storage.
  • Dependency and Configuration Updates: Updated the sgl-kernel dependency to version 0.2.0 to support the new KV cache I/O functionalities and added new hicache_io_backend and hicache_storage_backend configuration options.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a prototype for a hierarchical cache (HiCache) storage layer, which is a significant feature. The changes add the capability to prefetch KV cache data from a persistent storage backend, which can improve performance for requests with long prefixes.

The overall structure is well-thought-out, with a clear separation of concerns between the storage backend, the cache controller, and the radix tree cache implementation. However, there are several critical issues related to correctness and thread safety that must be addressed before this can be merged. These include incorrect API usage, race conditions, and bugs in the prefetching logic.

I've left detailed comments on specific lines of code to address these issues. Once these are resolved, this will be a great addition to the project.

@xiezhq-hermann xiezhq-hermann changed the title Hicache Storage Layer Prototype [WIP] Hicache Storage Layer Prototype Jul 3, 2025
@xiezhq-hermann xiezhq-hermann force-pushed the xiezhq-hicache-upstream branch from 96e733c to fd74ed0 Compare July 4, 2025 23:22
@xiezhq-hermann xiezhq-hermann force-pushed the xiezhq-hicache-storage branch from 842bab6 to f82cc02 Compare July 5, 2025 00:11
@hnyls2002 hnyls2002 self-assigned this Jul 5, 2025
@xiezhq-hermann xiezhq-hermann self-assigned this Jul 6, 2025
@ispobock ispobock merged commit 9d33fcf into main Jul 18, 2025
100 of 114 checks passed
@ispobock ispobock deleted the xiezhq-hicache-storage branch July 18, 2025 07:20
@didoteebin

@xiezhq-hermann Hi xiezhq, why do you think a three-layer GPU / CPU / file-based KV cache storage mechanism is better? To my best understanding, removing the CPU layer and doing direct GDS transfer between GPU and file would be faster.

@xiezhq-hermann
Collaborator Author

> @xiezhq-hermann Hi xiezhq, why do you think a three-layer GPU / CPU / file-based KV cache storage mechanism is better? To my best understanding, removing the CPU layer and doing direct GDS transfer between GPU and file would be faster.

Good point @didoteebin. The rationale lies in different hardware characteristics. For most systems, PCIe bandwidth lets us do layer-wise overlapping of KV cache transfer and forward computation (i.e., concurrently execute the forward pass of layer 1 while loading KV caches for layer 2), since PCIe 5.0 can reach roughly 50 GB/s and PCIe 4.0 roughly 20 GB/s, and the latency is more predictable. However, for most local disks and remote storage, the latency is much higher and the throughput lower. As a result, we implemented different policies across the layers: layer-wise zero-overhead overlapping between GPU and CPU memory, and best-effort prefetching for storage devices (see the sketch below). That said, I can certainly see different hardware platforms affecting the design choice, e.g., a fast RDMA remote memory pool could make direct access more beneficial, and we will keep evolving the system to fit more users' needs.
Thanks again for your PR as well; we would love to integrate direct access as an alternative in the near future once we complete the integration of popular backends like Mooncake and 3FS.
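
For readers curious what the layer-wise overlap described above could look like, here is a minimal sketch using two CUDA streams; the buffer layout and the layer call signature are assumptions for illustration, not SGLang's actual HiCacheController implementation:

```python
# Minimal sketch of layer-wise overlap between H2D KV cache loading and forward
# computation, as described above. Names are illustrative.
import torch


def forward_with_layerwise_kv_load(layers, kv_host, kv_device, hidden):
    """Run layer i while the KV cache for layer i+1 is copied host -> device.

    `kv_host[i]` is assumed to be a pinned CPU tensor holding layer i's KV cache,
    and `kv_device[i]` its pre-allocated GPU destination.
    """
    load_stream = torch.cuda.Stream()                 # dedicated copy stream
    ready = [torch.cuda.Event() for _ in layers]      # per-layer "KV is on GPU" events

    with torch.cuda.stream(load_stream):              # start copying layer 0 up front
        kv_device[0].copy_(kv_host[0], non_blocking=True)
        ready[0].record(load_stream)

    for i, layer in enumerate(layers):
        if i + 1 < len(layers):                       # prefetch the next layer's KV cache
            with torch.cuda.stream(load_stream):
                kv_device[i + 1].copy_(kv_host[i + 1], non_blocking=True)
                ready[i + 1].record(load_stream)
        torch.cuda.current_stream().wait_event(ready[i])  # wait only for what layer i needs
        hidden = layer(hidden, kv_device[i])
    return hidden
```

On PCIe-class bandwidth this keeps the copies off the critical path; slower storage cannot hide its latency this way, which is why the storage tier only gets best-effort prefetching.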

@soyail

soyail commented Jul 22, 2025

I tested the performance of the hierarchical cache with benchmark_multiturn.py, and found that its performance actually degraded somewhat compared to the original implementation. As I understand it, at the beginning of the program, using the hierarchical cache requires storing the KV Cache, which introduces additional overhead. Then, I continued profiling with nsys and noticed that the D2H bandwidth was too low in the logs. After checking the source code, I found that when attention_backend="fa3", it performs a direct copy rather than zero-copy. What is the rationale behind this design choice?

@xiezhq-hermann
Collaborator Author

> I tested the performance of the hierarchical cache with benchmark_multiturn.py, and found that its performance actually degraded somewhat compared to the original implementation. As I understand it, at the beginning of the program, using the hierarchical cache requires storing the KV Cache, which introduces additional overhead. Then, I continued profiling with nsys and noticed that the D2H bandwidth was too low in the logs. After checking the source code, I found that when attention_backend="fa3", it performs a direct copy rather than zero-copy. What is the rationale behind this design choice?

@soyail there is a bug associated with co-running the fa3 backend and the KV cache IO loading kernel, which we are still investigating. For now, we would recommend using flashinfer as the attention backend for hicache.

@Charles-L-Chen
Contributor

> I tested the performance of the hierarchical cache with benchmark_multiturn.py, and found that its performance actually degraded somewhat compared to the original implementation. As I understand it, at the beginning of the program, using the hierarchical cache requires storing the KV Cache, which introduces additional overhead. Then, I continued profiling with nsys and noticed that the D2H bandwidth was too low in the logs. After checking the source code, I found that when attention_backend="fa3", it performs a direct copy rather than zero-copy. What is the rationale behind this design choice?
>
> @soyail there is a bug associated with co-running the fa3 backend and the KV cache IO loading kernel, which we are still investigating. For now, we would recommend using flashinfer as the attention backend for hicache.

@xiezhq-hermann Is there any update on this issue?

xiezhq-hermann added a commit that referenced this pull request Jul 31, 2025
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Aug 1, 2025
TianQiLin666666 pushed a commit to TianQiLin666666/sglang that referenced this pull request Aug 1, 2025
lifuhuang pushed a commit that referenced this pull request Aug 3, 2025
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
@Alisehen

Some questions I need assistance in answering, thanks a lot.
I've conducted benchmarks on this feature using the DeepSeek R1 model with 16 H20 GPUs, where HiCacheFile is configured to use tmpfs. During the tests, I noticed that eviction performance might be causing watchdog timeouts. Are there any specific metrics or logging mechanisms that could help me analyze this issue in more detail? This set of test loads performs well in the version with only enable-hierarchical-cache enabled.
SGLang error log as below:

params.SamplingParams object at 0x7f33ca0fe6e0>)], available_size=1160, evictable_size=361907,
2025-07-15 17:47:33 - ERROR - Pyspy failed to dump PID 566194. Error: /bin/dash: 1: py-spy: not found

2025-07-15 17:47:33 - ERROR - Watchdog timeout (self.watchdog_timeout=300)

Filesystem info:

tmpfs           512G   36G  477G   7% /tmp/hicache

The startup script for one of the sglang nodes is as follows

GLOO_SOCKET_IFNAME=eth0 \
NCCL_SOCKET_IFNAME=eth0 \
NCCL_IB_GID_INDEX=3 \
NCCL_IB_HCA=mlx5_ \
NCCL_IB_DISABLE=0 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
MODEL_LENGTH=131072  \
TORCHINDUCTOR_FX_GRAPH_CACHE=1 \
TORCHINDUCTOR_AUTOGRAD_CACHE=1 \
TORCHINDUCTOR_CACHE_DIR="/data00/torch_compile/" \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
EPMOE_USE_DEEPGEMM=1 \
python3 -m sglang.launch_server \
  --cuda-graph-bs 1 2 4 8 10 16 20 24 28 32 40 48 56 64 72 76 78 80 82 \
  --cuda-graph-max-bs 82 \
  --attention-backend fa3 \
  --speculative-algo NEXTN \
  --speculative-draft /data00/DeepSeek-R1-NextN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 2 \
  --speculative-num-draft-tokens 4 \
  --model-path /data00/DeepSeek-R1 \
  --tp 16 \
  --dist-init-addr 192.168.0.29:20000 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code \
  --mem-fraction-static 0.8 \
  --enable-ep-moe \
  --max-running-requests 82  \
  --disable-chunked-prefix-cache \
  --enable-hierarchical-cache \
  --host 0.0.0.0 \
  --port 8080 \
  --hicache-storage-backend file

Hi @yapple, TP is not fully supported yet as there are still some design details about the configuration to be figured out. Will fix it ASAP :)

Hello, has this problem been fixed?

@xiezhq-hermann
Collaborator Author

> I tested the performance of the hierarchical cache with benchmark_multiturn.py, and found that its performance actually degraded somewhat compared to the original implementation. As I understand it, at the beginning of the program, using the hierarchical cache requires storing the KV Cache, which introduces additional overhead. Then, I continued profiling with nsys and noticed that the D2H bandwidth was too low in the logs. After checking the source code, I found that when attention_backend="fa3", it performs a direct copy rather than zero-copy. What is the rationale behind this design choice?
>
> @soyail there is a bug associated with co-running the fa3 backend and the KV cache IO loading kernel, which we are still investigating. For now, we would recommend using flashinfer as the attention backend for hicache.
>
> @xiezhq-hermann Is there any update on this issue?

@Charles-L-Chen the latest main has adopted a new backend selection mechanism; by default it uses fa3 for prefill and flashinfer for decoding to avoid this problem.

@xiezhq-hermann
Collaborator Author

> Some questions I need assistance in answering, thanks a lot.
> I've conducted benchmarks on this feature using the DeepSeek R1 model with 16 H20 GPUs, where HiCacheFile is configured to use tmpfs. During the tests, I noticed that eviction performance might be causing watchdog timeouts. Are there any specific metrics or logging mechanisms that could help me analyze this issue in more detail? This set of test loads performs well in the version with only enable-hierarchical-cache enabled.
> [error log and startup script quoted in the original comment above]
>
> Hi @yapple, TP is not fully supported yet as there are still some design details about the configuration to be figured out. Will fix it ASAP :)
>
> Hello, has this problem been fixed?

@Alisehen the TP issues should have been resolved already in the latest main for the file backend; would you mind taking another try?

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025
@wqlxx

wqlxx commented Aug 19, 2025

@xiezhq-hermann Do you have any plan to support hicache on the decode node in PD disaggregation?

@xiezhq-hermann
Collaborator Author

> @xiezhq-hermann Do you have any plan to support hicache on the decode node in PD disaggregation?

Yes, @ShangmingCai is pushing for a solution that fetches the cache on the P node and writes to the cache pool on the D node.
