[Bug] hierarchical_cache oom #5372

@huangtingwei9988

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hi @xiezhq-hermann, you are the main contributor to the hierarchical cache, and thank you for your great work! I have a few questions about it that I have not been able to resolve on my own, so I'm asking for your help.

We recently started trying out the hierarchical cache. Until now, our production setting for DeepSeek R1 has been --mem-fraction-static 0.95.

  • With --mem-fraction-static=0.95, --enable-hierarchical-cache, --hicache-ratio=2, OOM occurs before the server finishes starting.
  • With --mem-fraction-static=0.94, --enable-hierarchical-cache, --hicache-ratio=2, the server starts successfully, but OOM occurs while running the benchmark.
  • With --mem-fraction-static=0.93, --enable-hierarchical-cache, --hicache-ratio=10, OOM occurs while running the benchmark.

So I had to reduce --mem-fraction-static to 0.93. With 0.93 fixed, I tried different --hicache-ratio values and found that the memory usage of GPU0 was significantly higher than that of the other devices, which is the main cause of the OOM. The following screenshot was taken after the server had started successfully and before running the benchmark.
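The per-device imbalance described above can also be captured as text rather than a screenshot. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names and the 1024 MiB threshold are illustrative choices, not part of SGLang.

```python
import subprocess


def parse_memory_used(csv_text):
    """Parse 'index, memory.used' rows (MiB) into {gpu_index: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def outliers(usage, threshold_mib=1024):
    """Return {gpu_index: excess_mib} for GPUs that exceed the (upper) median
    usage by more than threshold_mib. The threshold is an arbitrary default."""
    ordered = sorted(usage.values())
    median = ordered[len(ordered) // 2]
    return {
        gpu: used - median
        for gpu, used in usage.items()
        if used - median > threshold_mib
    }


def query_memory_used():
    """Ask nvidia-smi for per-GPU used memory in MiB (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_memory_used(result.stdout)


# Usage on a machine with GPUs:
#   print(outliers(query_memory_used()))
```

Running this once after startup and again during the benchmark would show whether the extra usage on GPU0 is present from the beginning or grows under load.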

Image

I debugged the hierarchical cache code and found that HiCacheController uses multiple streams to read and write the CPU cache. When I disabled multi-stream in the code, GPU0 no longer used more memory than the other devices. In that configuration I could even launch the server with --mem-fraction-static=0.95, --enable-hierarchical-cache, --hicache-ratio 13: the server starts successfully, but it still hits OOM when running the benchmark. The OOM probably occurs when the cache is loaded from CPU to GPU for the first time; I will confirm this.

I have a few questions:

  • Why does multi-stream cause GPU0 to occupy more GPU memory?
  • Why does GPU0 need more memory while the other GPUs do not?
  • Why does the additional memory usage on GPU0 vary with --hicache-ratio? As far as I understand, --hicache-ratio is just the ratio of CPU cache to GPU cache and has no direct connection with GPU memory.
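One way to narrow down the first two questions is to check which processes hold memory on each GPU. A possibility, which this report does not confirm, is that processes belonging to other tensor-parallel ranks also allocate on device 0. The sketch below groups the per-process rows reported by `nvidia-smi` by GPU index; the function names are illustrative.

```python
import subprocess
from collections import defaultdict


def run_query(query_flag):
    """Run one nvidia-smi CSV query and return its raw stdout."""
    result = subprocess.run(
        ["nvidia-smi", query_flag, "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def group_processes_by_gpu(gpu_csv, apps_csv):
    """Map GPU index -> [(pid, used_mib or None)].

    gpu_csv  : output of --query-gpu=index,uuid
    apps_csv : output of --query-compute-apps=pid,gpu_uuid,used_memory
    """
    uuid_to_index = {}
    for line in gpu_csv.strip().splitlines():
        index, uuid = (field.strip() for field in line.split(","))
        uuid_to_index[uuid] = int(index)

    per_gpu = defaultdict(list)
    for line in apps_csv.strip().splitlines():
        pid, uuid, used = (field.strip() for field in line.split(","))
        # used_memory may be reported as a non-numeric placeholder.
        used_mib = int(used) if used.isdigit() else None
        per_gpu[uuid_to_index[uuid]].append((int(pid), used_mib))
    return dict(per_gpu)


# Usage on a machine with GPUs:
#   gpus = run_query("--query-gpu=index,uuid")
#   apps = run_query("--query-compute-apps=pid,gpu_uuid,used_memory")
#   for gpu, procs in sorted(group_processes_by_gpu(gpus, apps).items()):
#       print(gpu, procs)
```

If GPU0 lists more PIDs than the other devices, the extra usage comes from additional processes allocating there rather than from a larger allocation in a single process.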

For DeepSeek R1, reducing --mem-fraction-static to 0.93 has a significant impact on throughput: the number of GPU tokens drops by 23%, so unless the CPU cache hit rate is high, there is no benefit at all. The max_total_num_tokens data are as follows:
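The reason a two-point change in --mem-fraction-static can remove such a large share of GPU tokens can be sketched with a simplified model, in which the KV-cache budget is whatever remains of the static fraction after the model weights. This model and the parameter names are assumptions for illustration; the memory values passed in the usage comment are placeholders, not measurements from this setup.

```python
def kv_pool_gib(total_gib, weights_gib, mem_fraction_static):
    """KV-cache budget per GPU under the simplified model:
    static budget (fraction of total memory) minus model weights."""
    return total_gib * mem_fraction_static - weights_gib


def relative_reduction(total_gib, weights_gib, old_fraction, new_fraction):
    """Fraction of the KV-cache budget lost when lowering mem_fraction_static."""
    old_pool = kv_pool_gib(total_gib, weights_gib, old_fraction)
    new_pool = kv_pool_gib(total_gib, weights_gib, new_fraction)
    return (old_pool - new_pool) / old_pool


# Usage with placeholder values (hypothetical per-GPU total and weight sizes):
#   print(relative_reduction(total_gib=96.0, weights_gib=84.0,
#                            old_fraction=0.95, new_fraction=0.93))
```

Under this model, lowering the fraction by 0.02 always removes 0.02 × total_gib from the pool, so the relative loss grows as the weights take up more of each device.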

Image

I wonder whether it is possible to enable the hierarchical cache without hurting throughput too much.

Thank you very much!

Reproduction

python -m sglang.launch_server --host 0.0.0.0 --dtype auto --mem-fraction-static 0.93 --tp-size 8 --chat-template /path/to/r1.jinja --max-running-requests 48 --trust-remote-code --enable-cache-report --log-level info --chunked-prefill-size 4096 --context-length 65536 --quantization fp8 --enable-torch-compile --cuda-graph-max-bs 64 --torch-compile-max-bs 36 --enable-flashinfer-mla --enable-mixed-chunk --model-path /path/to/DeepSeek-R1 --port 8188 --enable-hierarchical-cache --hicache-ratio 2

Environment

Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sglang: 0.4.5
sgl_kernel: 0.0.8
flashinfer: Module Not Found
triton: 3.1.0
transformers: 4.51.0
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.10
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.30.2
interegular: 0.3.3
modelscope: 1.23.1
orjson: 3.10.12
outlines: 0.1.11
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.6
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.7.2
xgrammar: 0.1.17
openai: 1.65.2
tiktoken: 0.9.0
anthropic: 0.49.0
litellm: 1.62.1
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143     0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  SYS   SYS   0-47,96-143     0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143     0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  PIX   SYS   SYS   0-47,96-143     0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   PIX   NODE  48-95,144-191   1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   NODE  NODE  48-95,144-191   1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   NODE  PIX   48-95,144-191   1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   NODE  NODE  48-95,144-191   1              N/A
NIC0  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  SYS   SYS
NIC1  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  X     SYS   SYS
NIC2  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   X     NODE
NIC3  SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   NODE  X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3

ulimit soft: 1048576

    Labels: hicache (Hierarchical Caching for SGLang)
