Hicache Storage Layer Prototype #7704
Summary of Changes
Hello @xiezhq-hermann, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a prototype for a hierarchical cache (Hicache) storage layer, enabling the persistence and prefetching of KV cache data. It defines a generic storage interface and provides a file-based implementation, integrating these capabilities into the existing cache management system. The changes also involve a significant refactoring of KV cache input/output operations between GPU and CPU, utilizing optimized kernel functions for improved efficiency.
Highlights
- New Hicache Storage Layer: Introduced a persistent storage layer for KV cache, starting with a file-based implementation (`HiCacheFile`). This allows KV cache data to be stored on disk, enabling more efficient memory management and potentially larger context windows (a rough sketch of such an interface follows this list).
- KV Cache Prefetching: Implemented a prefetching mechanism that can load KV cache data from the new storage layer into host memory asynchronously, anticipating future needs and reducing latency during model inference.
- Refactored KV Cache I/O: Significantly refactored how KV cache data is transferred between device (GPU) and host (CPU) memory. This leverages new, optimized `sgl-kernel` functions for more efficient and unified `load_from_host_per_layer` and `backup_to_host_all_layer` operations, and standardizes buffer structures.
- Hierarchical Cache Integration: The new storage and prefetching capabilities are deeply integrated into the existing `HiCacheController` and `HiRadixCache` components, enhancing the overall hierarchical caching system with disk-backed storage.
- Dependency and Configuration Updates: Updated the `sgl-kernel` dependency to version `0.2.0` to support the new KV cache I/O functionalities and added new `hicache_io_backend` and `hicache_storage_backend` configuration options.
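To make the shape of the new layer concrete, here is a minimal sketch of what a generic storage interface plus a file-based backend could look like. It is illustrative only: the actual class and method names in this PR, and whether values are tensors or raw bytes, may differ.

```python
# Illustrative sketch only -- the real interface in this PR may differ.
# Keys are assumed to be content hashes of token blocks; values are the
# serialized KV tensors for those blocks.
import os
from abc import ABC, abstractmethod
from typing import Optional


class StorageBackend(ABC):
    """Minimal pluggable backend for persisted KV cache blocks."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        """Return the stored value, or None on a cache miss."""

    @abstractmethod
    def set(self, key: str, value: bytes) -> bool:
        """Persist a value; return True on success."""

    @abstractmethod
    def exists(self, key: str) -> bool:
        """Cheap existence probe used when planning a prefetch."""


class FileBackendSketch(StorageBackend):
    """One-file-per-block backend, in the spirit of HiCacheFile."""

    def __init__(self, root: str = "/tmp/hicache") -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, key)

    def get(self, key: str) -> Optional[bytes]:
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def set(self, key: str, value: bytes) -> bool:
        with open(self._path(key), "wb") as f:
            f.write(value)
        return True

    def exists(self, key: str) -> bool:
        return os.path.exists(self._path(key))
```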
Code Review
This pull request introduces a prototype for a hierarchical cache (HiCache) storage layer, which is a significant feature. The changes add the capability to prefetch KV cache data from a persistent storage backend, which can improve performance for requests with long prefixes.
The overall structure is well-thought-out, with a clear separation of concerns between the storage backend, the cache controller, and the radix tree cache implementation. However, there are several critical issues related to correctness and thread safety that must be addressed before this can be merged. These include incorrect API usage, race conditions, and bugs in the prefetching logic.
I've left detailed comments on specific lines of code to address these issues. Once these are resolved, this will be a great addition to the project.
@xiezhq-hermann Hi xiezhq, why do you think a three-layer GPU / CPU / file-based KV cache storage mechanism is better? To my best understanding, removing the CPU layer and doing direct GDS transfer between GPU and file would be faster.
Good point @didoteebin, the rationale lies in different hardware characteristics. For most systems, the PCIe bandwidth enables layer-wise overlapping of KV cache transfer and forward computation (i.e., concurrently executing the forward pass of layer 1 while loading KV caches for layer 2): PCIe 5.0 can achieve about 50 GB/s, PCIe 4.0 about 20 GB/s, and the latency is more predictable. For most local disks and remote storage, however, the latency is much higher and the throughput lower. As a result, we implemented different policies across the layers: layer-wise zero-overhead overlapping between GPU and CPU memory, and best-effort prefetching for storage devices. That said, different hardware platforms can certainly affect the design choice; e.g., a fast RDMA remote memory pool could make direct access more beneficial, and we will keep evolving our systems to fit more users' needs.
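For intuition, here is a rough PyTorch sketch (not the PR's implementation) of that layer-wise overlap: a dedicated CUDA copy stream loads the KV cache for layer `i + 1` while the default stream computes layer `i`. Buffer names, shapes, and the `forward_layer` stub are all invented for illustration.

```python
import torch

num_layers = 4
copy_stream = torch.cuda.Stream()

# Pinned host buffers allow truly asynchronous H2D copies; one per layer.
host_kv = [torch.randn(256, 128, pin_memory=True) for _ in range(num_layers)]
device_kv = [torch.empty(256, 128, device="cuda") for _ in range(num_layers)]
events = [torch.cuda.Event() for _ in range(num_layers)]


def forward_layer(i: int) -> None:
    # Placeholder for the attention/MLP computation of layer i.
    device_kv[i].mul_(1.0)


# Kick off the copy for layer 0 before entering the loop.
with torch.cuda.stream(copy_stream):
    device_kv[0].copy_(host_kv[0], non_blocking=True)
    events[0].record(copy_stream)

for i in range(num_layers):
    # Start loading KV for layer i+1 while layer i is being computed.
    if i + 1 < num_layers:
        with torch.cuda.stream(copy_stream):
            device_kv[i + 1].copy_(host_kv[i + 1], non_blocking=True)
            events[i + 1].record(copy_stream)
    # The compute stream waits only for layer i's copy, not all copies.
    torch.cuda.current_stream().wait_event(events[i])
    forward_layer(i)
```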
I tested the performance of the hierarchical cache with …
@soyail there is a bug associated with co-running the fa3 backend and the KV cache IO loading kernel, which we are still investigating. For now we recommend using flashinfer as the attention backend for HiCache.
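For anyone hitting this, a hedged example of pinning the attention backend via sglang's offline engine API. `enable_hierarchical_cache` and `attention_backend` are existing sglang server arguments; `hicache_storage_backend` is the new option from this PR, so its exact name and values are tentative.

```python
# Workaround sketch: prefer flashinfer over fa3 while HiCache is enabled.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # any supported model
    enable_hierarchical_cache=True,
    attention_backend="flashinfer",  # avoids the fa3 + KV IO kernel bug
    hicache_storage_backend="file",  # tentative: the HiCacheFile backend
)
print(llm.generate("Hello", {"max_new_tokens": 8}))
llm.shutdown()
```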
@xiezhq-hermann Is there any update on this issue?
Hello, has this problem been fixed?
@Charles-L-Chen the latest main has adopted a new backend selection mechanism; by default it will use fa3 for prefill and flashinfer for decoding to avoid this problem.
@Alisehen the TP issues should already be resolved in the latest main for the file backend; would you mind taking another try?
@xiezhq-hermann Do you have any plans to support hicache on the decode node in PD disaggregation?
Yes, @ShangmingCai is pushing for a solution that fetches the cache on the P node and writes to the cache pool on the D node.
Motivation
This PR aims to introduce a standard storage interface for hierarchical KV caching (first introduced in #2693), so that the community can plug in different storage backends such as Mooncake; a rough illustration follows.
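As a concrete illustration of the plug-in goal, a third-party backend would only need to implement the storage interface. The sketch below reuses the illustrative `StorageBackend` from the summary earlier in this thread; `RemoteKVClient` and its methods are invented stand-ins for a real client such as Mooncake's.

```python
# Hypothetical third-party backend implementing the illustrative
# StorageBackend interface sketched earlier; the client API is invented.
from typing import Optional


class RemoteBackendSketch(StorageBackend):
    def __init__(self, client: "RemoteKVClient") -> None:
        self.client = client  # e.g., a distributed KV store client

    def get(self, key: str) -> Optional[bytes]:
        return self.client.get(key)  # assumed to return None on a miss

    def set(self, key: str, value: bytes) -> bool:
        return self.client.put(key, value)

    def exists(self, key: str) -> bool:
        return self.client.contains(key)
```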
Recently, there have been PRs and proposals to integrate a storage layer into the hierarchical KV caching ecosystem:
#7211
#7280
#7896
#7920
#7576
#7746 (comment)
#7761 (comment)
and some more non-public inquiries.
After extensive discussion, we decided to move forward with the following plan: the `HiCacheFile` backend is tested for demonstration purposes, and Mooncake integration is under active development. Please let us know if you plan to integrate another backend, and about any enhancements you might need for the interfaces.

Some other follow-up work:
Modifications
Checklist