Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
Hi @xiezhq-hermann, thanks for your great work on the hierarchical cache. However, in my experiments I do not observe any performance gain when this feature is enabled.
I am using the Llama 3.1 8B model with various sequence lengths on a 4xH100 96GB NVLink node.
The results shown in this PR indicate almost 3x faster TTFT and 30% faster TPOT.
My understanding is that the hierarchical cache is faster because it moves spilled KV cache to host memory instead of just throwing it away, so when evicted requests come back later, their KV cache can be loaded from host memory instead of being recomputed.
However, in all of my experiments, hierarchical cache either brings no performance gain or slows down both TTFT and TPOT.
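To make sure we are on the same page, here is a minimal sketch of how I assume hi-cache is supposed to help. All names here are hypothetical and only illustrate my mental model, not sglang's actual implementation:

```python
# Mental model only: evicted device KV pages are written back to a host pool
# instead of being freed, and a later prefix hit loads them back over PCIe
# instead of recomputing the prefill. Names are placeholders, not sglang APIs.

class HiCacheMentalModel:
    def __init__(self):
        self.device_kv = {}  # prefix id -> KV pages in HBM
        self.host_kv = {}    # prefix id -> KV pages in pinned host memory

    def evict(self, prefix_id):
        # Without hi-cache these pages would simply be freed here.
        self.host_kv[prefix_id] = self.device_kv.pop(prefix_id)

    def match_prefix(self, prefix_id):
        if prefix_id in self.device_kv:
            return "reuse KV already in HBM"
        if prefix_id in self.host_kv:
            # Expected win: an H2D copy should be much cheaper than
            # recomputing the prefill, which is where the TTFT gain
            # should come from.
            self.device_kv[prefix_id] = self.host_kv.pop(prefix_id)
            return "load KV back from host memory"
        return "recompute the prefill from scratch"
```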
Reproduction
For long-context generation, e.g. input len = 2k, output len = 100k:
with Hi-cache:
python -m sglang.launch_server --model-path /hub/Llama-3.1-8B-Instruct/ --attention-backend flashinfer --port 30001 --trust-remote-code --enable-hierarchical-cache --mem-fraction-static 0.4 --tp 4
python bench_serving.py --port 30001 --dataset-name random --random-input-len 2048 --random-output-len 100000 --num-prompts 15 --max-concurrency 15 --disable-shuffle --random-range-ratio 1.0
[2025-06-09 22:16:48 TP0] Registering 1495 cuda graph addresses
[2025-06-09 22:16:48 TP3] Registering 1495 cuda graph addresses
[2025-06-09 22:16:48 TP1] Registering 1495 cuda graph addresses
[2025-06-09 22:16:48 TP2] Registering 1495 cuda graph addresses
[2025-06-09 22:16:48 TP0] Capture cuda graph end. Time elapsed: 7.01 s. mem usage=1.34 GB. avail mem=52.97 GB.
[2025-06-09 22:16:48 TP2] Allocating 65.90 GB host memory for hierarchical KV cache.
[2025-06-09 22:16:48 TP3] Allocating 65.90 GB host memory for hierarchical KV cache.
[2025-06-09 22:16:48 TP1] Allocating 65.90 GB host memory for hierarchical KV cache.
[2025-06-09 22:16:48 TP0] max_total_num_tokens=1005629, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=3929, context_len=131072
[2025-06-09 22:16:48 TP0] Allocating 65.90 GB host memory for hierarchical KV cache.
#Input tokens: 30720
#Output tokens: 1500000
Num of shared prefixes or conversations: 15
Num of total requests: 15
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|███████████████████████████████████████████████████████████████████████████| 15/15 [28:41<00:00, 114.77s/it]
Total outputs: 15
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 15
Successful requests: 15
Benchmark duration (s): 1721.58
Total input tokens: 31234
Total generated tokens: 1500000
Total generated tokens (retokenized): 1456003
Request throughput (req/s): 0.01
Input token throughput (tok/s): 18.14
Output token throughput (tok/s): 871.29
Total token throughput (tok/s): 889.44
Concurrency: 13.03
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 1495499.64
Median E2E Latency (ms): 1416772.19
---------------Time to First Token----------------
Mean TTFT (ms): 680.10
Median TTFT (ms): 272.39
P90 TTFT (ms): 2352.16
P99 TTFT (ms): 2352.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.95
Median TPOT (ms): 14.17
P90 TPOT (ms): 16.87
P99 TPOT (ms): 17.17
---------------Inter-token Latency----------------
Mean ITL (ms): 15.39
Median ITL (ms): 14.24
P90 ITL (ms): 18.78
P99 ITL (ms): 28.14
==================================================
without Hi-cache:
python -m sglang.launch_server --model-path /hub/Llama-3.1-8B-Instruct/ --attention-backend flashinfer --port 30001 --trust-remote-code --mem-fraction-static 0.4 --tp 4
python bench_serving.py --port 30001 --dataset-name random --random-input-len 2048 --random-output-len 100000 --num-prompts 15 --max-concurrency 15 --disable-shuffle --random-range-ratio 1.0
#Input tokens: 30720
#Output tokens: 1500000
Num of shared prefixes or conversations: 15
Num of total requests: 15
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|███████████████████████████████████████████████████████████████████████████| 15/15 [28:39<00:00, 114.64s/it]
Total outputs: 15
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 15
Successful requests: 15
Benchmark duration (s): 1719.61
Total input tokens: 31234
Total generated tokens: 1500000
Total generated tokens (retokenized): 1437831
Request throughput (req/s): 0.01
Input token throughput (tok/s): 18.16
Output token throughput (tok/s): 872.29
Total token throughput (tok/s): 890.45
Concurrency: 13.03
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 1493536.43
Median E2E Latency (ms): 1414419.30
---------------Time to First Token----------------
Mean TTFT (ms): 688.62
Median TTFT (ms): 253.27
P90 TTFT (ms): 2469.10
P99 TTFT (ms): 2469.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.93
Median TPOT (ms): 14.14
P90 TPOT (ms): 16.85
P99 TPOT (ms): 17.15
---------------Inter-token Latency----------------
Mean ITL (ms): 15.50
Median ITL (ms): 14.27
P90 ITL (ms): 18.79
P99 ITL (ms): 33.73
==================================================
I also ran another set of experiments with input len = 32k and output len = 32k; the results are:
Num requests | Hi-cache | RPS | E2E (ms) | TTFT (ms) | TPOT (ms) | ITL (ms)
---|---|---|---|---|---|---
64 | | 0.12 | 306654.91 | 10549.14 | 33.94 | 20.68
64 | ✔ | 0.12 | 310333.71 | 12848.84 | 37.47 | 20.80
I added some profiling logs to the code (roughly the instrumentation sketched below), and I do see hi-cache writing nodes back to host memory and later loading them back. But since the benchmarks show almost identical results, I am wondering whether the cached nodes are somehow being recomputed anyway instead of being reused.
Could you also clarify whether my understanding of how hi-cache improves performance (i.e. by avoiding recomputation) is correct?
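For reference, the profiling I added was roughly the timing decorator below. This is illustrative only; the wrapped object/method names in the usage comment are placeholders for wherever the host write-back and device load-back happen, not the exact sglang internals:

```python
import functools
import logging
import time

def timed(tag):
    """Log the wall-clock time of a wrapped call under a tag."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            t0 = time.perf_counter()
            out = fn(*args, **kwargs)
            ms = (time.perf_counter() - t0) * 1e3
            logging.info("[hicache] %s took %.3f ms", tag, ms)
            return out
        return inner
    return wrap

# Example usage (placeholder names, not the actual sglang API):
# cache_controller.write_back = timed("host write-back")(cache_controller.write_back)
# cache_controller.load_back = timed("device load-back")(cache_controller.load_back)
```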
Environment
CUDA 12.6
sglang 0.4.6.post5
PyTorch 2.6
4x H100 98GB NVLink
PCIe Gen5
1 TB host memory