[Bug] hierarchical_cache oom

### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.

### Describe the bug

Hi @xiezhq-hermann, you are the main contributor to hierarchical cache, so thank you for your great work! I have a few questions about hierarchical cache that confuse me, and I would appreciate your help.

We recently started trying out hierarchical cache. Before that, our online deployment of DeepSeek R1 ran with `--mem-fraction-static` set to 0.95.

- With `--mem-fraction-static=0.95`, `--enable-hierarchical-cache`, and `--hicache-ratio=2`, OOM occurs before the server finishes starting.
- With `--mem-fraction-static=0.94`, `--enable-hierarchical-cache`, and `--hicache-ratio=2`, the sglang server starts successfully, but OOM occurs while running the benchmark.
- With `--mem-fraction-static=0.93`, `--enable-hierarchical-cache`, and `--hicache-ratio=10`, OOM occurs while running the benchmark.

So I had to reduce `--mem-fraction-static` to 0.93. With 0.93, I tried different `--hicache-ratio` values and found that the memory usage of GPU0 was significantly higher than that of the other devices, which is the main cause of the OOM. The following screenshot was taken after the sglang server had started successfully and before running the benchmark.

<img width="879" alt="Image" src="https://github.com/user-attachments/assets/13fdbd60-7b9d-4432-961c-16b0986ba56b" />
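For anyone who wants to check the same skew without a screenshot, here is a minimal sketch that parses the CSV output of `nvidia-smi` and reports devices using noticeably more memory than the median. The helper names, the 1024 MiB threshold, and the sample numbers in the usage note are hypothetical and are not taken from my measurements.

```python
import statistics
import subprocess


def parse_memory_used(csv_text):
    """Parse 'index, memory.used' CSV rows (no header, no units) into {gpu_index: MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def memory_outliers(usage, threshold_mib=1024):
    """Return {gpu_index: excess MiB} for GPUs above the median by more than threshold_mib."""
    median = statistics.median(usage.values())
    return {
        gpu: used - median
        for gpu, used in usage.items()
        if used - median > threshold_mib
    }


def query_nvidia_smi():
    """Run nvidia-smi and return the per-GPU memory usage in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_memory_used(result.stdout)
```

On a machine with GPUs, `memory_outliers(query_nvidia_smi())` lists the skewed devices. With made-up input such as `"0, 9000\n1, 5000\n2, 5000\n3, 5100"`, only GPU 0 is reported.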

I debugged the hierarchical cache code and found that HiCacheController uses multiple streams to read and write the CPU cache. When I disabled multi-stream with [this change](https://github.com/sgl-project/sglang/compare/main...AniZpZ:sglang:disable_hicache_multi_stream), GPU0 no longer used more memory than the other devices. With that change I can even launch the sglang server with `--mem-fraction-static=0.95`, `--enable-hierarchical-cache`, and `--hicache-ratio 13`: the server starts successfully, but it hits OOM when running the benchmark. The OOM probably occurs when the cache is loaded from CPU to GPU for the first time; I will confirm this.

I have a few questions:

- Why does multi-stream cause GPU0 to occupy more GPU memory?
- Why does GPU0 need more memory while the other GPUs don't?
- Why does the additional memory usage of GPU0 vary with `--hicache-ratio`? As far as I understand, `--hicache-ratio` is just the ratio of CPU cache to GPU cache and has no direct connection to GPU memory.


For DeepSeek R1, reducing `--mem-fraction-static` to 0.93 has a significant impact on throughput. Because the number of GPU tokens is reduced by 23%, there will be no benefit at all if the CPU cache hit rate is not high. The `max_total_num_tokens` data are as follows:

<img width="479" alt="Image" src="https://github.com/user-attachments/assets/6daaf010-6161-4b12-bdb8-ba7b878d1549" />
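To make the trade-off concrete, here is a small sketch of the arithmetic, assuming (as I understand it) that `--hicache-ratio` is simply the CPU:GPU cache ratio, so the host pool holds roughly `ratio` times the GPU token count. The function names and the token counts in the usage note are hypothetical placeholders, not the values in the screenshot.

```python
def relative_reduction(tokens_before, tokens_after):
    """Fraction of GPU KV-cache tokens lost when lowering --mem-fraction-static."""
    return (tokens_before - tokens_after) / tokens_before


def host_pool_tokens(gpu_tokens, hicache_ratio):
    """Host (CPU) cache capacity in tokens, assuming host = ratio * GPU tokens."""
    return int(gpu_tokens * hicache_ratio)


def total_cached_tokens(gpu_tokens, hicache_ratio):
    """Combined GPU + CPU capacity under the same assumption."""
    return gpu_tokens + host_pool_tokens(gpu_tokens, hicache_ratio)
```

For example, with placeholder values of 100,000 tokens before and 77,000 after, `relative_reduction` gives 0.23, and with `--hicache-ratio 2` the host pool would hold about 154,000 tokens. Extra CPU capacity only helps when requests actually hit it, which is why the smaller GPU pool can still hurt throughput.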

I wonder whether it is possible to turn on hierarchical cache without affecting throughput too much.

Thank you very much!

### Reproduction


python -m sglang.launch_server --host 0.0.0.0 --dtype auto --mem-fraction-static 0.93 --tp-size 8 --chat-template /path/to/r1.jinja --max-running-requests 48 --trust-remote-code --enable-cache-report --log-level info --chunked-prefill-size 4096 --context-length 65536 --quantization fp8 --enable-torch-compile --cuda-graph-max-bs 64 --torch-compile-max-bs 36 --enable-flashinfer-mla --enable-mixed-chunk --model-path /path/to/DeepSeek-R1 --port 8188 --enable-hierarchical-cache --hicache-ratio 2


### Environment

Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sglang: 0.4.5
sgl_kernel: 0.0.8
flashinfer: Module Not Found
triton: 3.1.0
transformers: 4.51.0
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.10
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.30.2
interegular: 0.3.3
modelscope: 1.23.1
orjson: 3.10.12
outlines: 0.1.11
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.6
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.7.2
xgrammar: 0.1.17
openai: 1.65.2
tiktoken: 0.9.0
anthropic: 0.49.0
litellm: 1.62.1
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2   NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS   SYS     0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    SYS   SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS   SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    PIX     SYS   SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     PIX   NODE    48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     NODE   NODE    48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     NODE   PIX     48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     NODE   NODE    48-95,144-191   1               N/A
NIC0    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    SYS   SYS
NIC1    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE     X      SYS   SYS
NIC2    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS      X    NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     NODE    X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3


ulimit soft: 1048576
