
[Bug] No performance gain after using hierarchical cache #7059

@YJHMITWEB


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hi @xiezhq-hermann, thanks for your great work on the hierarchical cache. However, in my experiments I do not observe any performance gain when this feature is enabled.

I am using the Llama 3.1 8B model with various sequence lengths on a 4x H100 96GB NVLink node.

The results shown in this PR indicate almost 3x faster TTFT and 30% faster TPOT.

My understanding of why the hierarchical cache is faster is that it moves KV cache that would otherwise spill out of GPU memory to host memory instead of simply throwing it away. When the evicted requests come back later, they can reuse the KV cache from host memory instead of recomputing it.
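
To make sure we are on the same page, here is a minimal sketch of the behavior I expect (a toy model of the offload/reload idea, not sglang's actual HiRadixCache code; the pool names and prefix keys are made up for illustration):

# Toy model of the offload-on-evict / load-on-hit behavior I expect from
# the hierarchical cache. This is NOT sglang's implementation, just the
# mental model I am asking about.
from typing import Dict, Optional

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

device_pool: Dict[str, torch.Tensor] = {}  # KV blocks resident in GPU memory
host_pool: Dict[str, torch.Tensor] = {}    # KV blocks offloaded to host memory


def evict(prefix_key: str) -> None:
    """Offload a KV block to host memory instead of throwing it away."""
    host_pool[prefix_key] = device_pool.pop(prefix_key).to("cpu")


def lookup(prefix_key: str) -> Optional[torch.Tensor]:
    """Return the KV block for a prefix, reloading from host memory if needed."""
    if prefix_key in device_pool:
        return device_pool[prefix_key]        # hit in GPU memory
    if prefix_key in host_pool:
        kv = host_pool.pop(prefix_key).to(device)
        device_pool[prefix_key] = kv          # hit in host memory: load back, no recompute
        return kv
    return None                               # miss: prefill has to recompute


# Cache a block, evict it, then reuse it without recomputation.
device_pool["prefix-A"] = torch.randn(2, 8, 128, device=device)
evict("prefix-A")
assert lookup("prefix-A") is not None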

However, in all of my experiments, the hierarchical cache either brings no performance gain or slows down both TTFT and TPOT.

Reproduction

For long-context generation, e.g. input len = 2k, output len = 100k:

with Hi-cache:

python -m sglang.launch_server --model-path /hub/Llama-3.1-8B-Instruct/ --attention-backend flashinfer --port 30001 --trust-remote-code --enable-hierarchical-cache --mem-fraction-static 0.4 --tp 4

python bench_serving.py --port 30001 --dataset-name random --random-input-len 2048 --random-output-len 100000 --num-prompts 15 --max-concurrency 15 --disable-shuffle --random-range-ratio 1.0

[2025-06-09 22:16:48 TP0] Registering 1495 cuda graph addresses
[2025-06-09 22:16:48 TP3] Registering 1495 cuda graph addresses
[2025-06-09 22:16:48 TP1] Registering 1495 cuda graph addresses
[2025-06-09 22:16:48 TP2] Registering 1495 cuda graph addresses
[2025-06-09 22:16:48 TP0] Capture cuda graph end. Time elapsed: 7.01 s. mem usage=1.34 GB. avail mem=52.97 GB.
[2025-06-09 22:16:48 TP2] Allocating 65.90 GB host memory for hierarchical KV cache.
[2025-06-09 22:16:48 TP3] Allocating 65.90 GB host memory for hierarchical KV cache.
[2025-06-09 22:16:48 TP1] Allocating 65.90 GB host memory for hierarchical KV cache.
[2025-06-09 22:16:48 TP0] max_total_num_tokens=1005629, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=3929, context_len=131072
[2025-06-09 22:16:48 TP0] Allocating 65.90 GB host memory for hierarchical KV cache.

#Input tokens: 30720
#Output tokens: 1500000
Num of shared prefixes or conversations: 15
Num of total requests: 15
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|███████████████████████████████████████████████████████████████████████████| 15/15 [28:41<00:00, 114.77s/it]
Total outputs: 15

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 15        
Successful requests:                     15        
Benchmark duration (s):                  1721.58   
Total input tokens:                      31234     
Total generated tokens:                  1500000   
Total generated tokens (retokenized):    1456003   
Request throughput (req/s):              0.01      
Input token throughput (tok/s):          18.14     
Output token throughput (tok/s):         871.29    
Total token throughput (tok/s):          889.44    
Concurrency:                             13.03     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1495499.64
Median E2E Latency (ms):                 1416772.19
---------------Time to First Token----------------
Mean TTFT (ms):                          680.10    
Median TTFT (ms):                        272.39    
P90 TTFT (ms):                           2352.16   
P99 TTFT (ms):                           2352.30   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.95     
Median TPOT (ms):                        14.17     
P90 TPOT (ms):                           16.87     
P99 TPOT (ms):                           17.17     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.39     
Median ITL (ms):                         14.24     
P90 ITL (ms):                            18.78     
P99 ITL (ms):                            28.14     
==================================================

without Hi-cache:

python -m sglang.launch_server --model-path /hub/Llama-3.1-8B-Instruct/ --attention-backend flashinfer --port 30001 --trust-remote-code --mem-fraction-static 0.4 --tp 4

python bench_serving.py --port 30001 --dataset-name random --random-input-len 2048 --random-output-len 100000 --num-prompts 15 --max-concurrency 15 --disable-shuffle --random-range-ratio 1.0

#Input tokens: 30720
#Output tokens: 1500000
Num of shared prefixes or conversations: 15
Num of total requests: 15
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|███████████████████████████████████████████████████████████████████████████| 15/15 [28:39<00:00, 114.64s/it]
Total outputs: 15

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 15        
Successful requests:                     15        
Benchmark duration (s):                  1719.61   
Total input tokens:                      31234     
Total generated tokens:                  1500000   
Total generated tokens (retokenized):    1437831   
Request throughput (req/s):              0.01      
Input token throughput (tok/s):          18.16     
Output token throughput (tok/s):         872.29    
Total token throughput (tok/s):          890.45    
Concurrency:                             13.03     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1493536.43
Median E2E Latency (ms):                 1414419.30
---------------Time to First Token----------------
Mean TTFT (ms):                          688.62    
Median TTFT (ms):                        253.27    
P90 TTFT (ms):                           2469.10   
P99 TTFT (ms):                           2469.29   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.93     
Median TPOT (ms):                        14.14     
P90 TPOT (ms):                           16.85     
P99 TPOT (ms):                           17.15     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.50     
Median ITL (ms):                         14.27     
P90 ITL (ms):                            18.79     
P99 ITL (ms):                            33.73     
==================================================

I also ran another set of experiments with input len = 32k and output len = 32k; the results are:

| num requests | Hi-cache | RPS  | E2E (ms)  | TTFT (ms) | TPOT (ms) | ITL (ms) |
|--------------|----------|------|-----------|-----------|-----------|----------|
| 64           |          | 0.12 | 306654.91 | 10549.14  | 33.94     | 20.68    |
| 64           |          | 0.12 | 310333.71 | 12848.84  | 37.47     | 20.80    |

I added some profiling logs to the code, and I can see that hi-cache does write nodes to host memory and later loads them back. But since the benchmarks show almost the same results, I am wondering whether the cached nodes are somehow being recomputed instead of reused.
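
As a rough sanity check on whether a repeated prefix is actually served from cache rather than recomputed, I also tried something like the sketch below against the running server: send the same long prompt twice and compare latency and the returned meta_info. The prompt is arbitrary, the port matches my launch command, and the "cached_tokens" field is only read defensively since it may not be reported on every sglang version.

# Send the same prompt twice to the local sglang server and compare latency.
# If the prefix is reused (radix cache or hi-cache), the second request
# should be noticeably faster; recent versions may also report cached_tokens.
import time

import requests

URL = "http://localhost:30001/generate"
prompt = "test " * 4000  # long prefix so a cache hit is measurable

payload = {
    "text": prompt,
    "sampling_params": {"max_new_tokens": 1, "temperature": 0.0},
}

for attempt in ("cold", "warm"):
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=600).json()
    elapsed = time.perf_counter() - start
    meta = resp.get("meta_info", {})
    # "cached_tokens" may be absent on some versions; .get() keeps this safe.
    print(f"{attempt}: {elapsed:.3f}s, cached_tokens={meta.get('cached_tokens')}")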

Could you also clarify whether my understanding of how hi-cache improves performance (i.e. by avoiding recomputation) is correct?

Environment

CUDA 12.6
sglang 0.4.6.post5
PyTorch 2.6

4x H100 98GB NVLink
PCIe Gen5
1 TB host memory
