Add gen-shared-prefix dataset in bench_serving #1990
Merged
Motivation
Add a new dataset, gen-shared-prefix, in bench_serving to evaluate engine performance on the shared-prefix generation task. Each request has the format <system prompt> <question>. This benchmark is useful for evaluating cache-aware optimizations such as cache-aware DP.
Modifications
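The request format described in the motivation can be illustrated with a minimal sketch. This is not the code added by this PR: the function names, parameters, random-word generator, and the final shuffle are all assumptions made for illustration. It only shows the idea of several groups of prompts, where every prompt in a group starts with the same system prompt and ends with its own question.

```python
import random
import string


def random_text(rng, num_words):
    """Return num_words pseudo-random lowercase words joined by spaces."""
    return " ".join(
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 8)))
        for _ in range(num_words)
    )


def gen_shared_prefix_prompts(num_groups, prompts_per_group,
                              system_words, question_words, seed=0):
    """Hypothetical helper: build '<system prompt> <question>' prompts.

    Each group shares one system prompt; each prompt gets its own question.
    """
    rng = random.Random(seed)
    prompts = []
    for _ in range(num_groups):
        system_prompt = random_text(rng, system_words)
        for _ in range(prompts_per_group):
            question = random_text(rng, question_words)
            prompts.append(f"{system_prompt} {question}")
    # Assumption: interleave the groups so requests that share a prefix
    # do not arrive back to back.
    rng.shuffle(prompts)
    return prompts


prompts = gen_shared_prefix_prompts(
    num_groups=4, prompts_per_group=8, system_words=50, question_words=10
)
```

With a workload shaped like this, a router that sends requests with the same system prompt to the same worker can reuse that worker's cached prefix, which is the effect the benchmark is meant to expose.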
Tests
Python --dp 4 (round robin)
DP = 4
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1024
Benchmark duration (s): 54.54
Total input tokens: 2326152
Total generated tokens: 262144
Total generated tokens (retokenized): 253772
Request throughput (req/s): 18.78
Input token throughput (tok/s): 42651.75
Output token throughput (tok/s): 4806.61
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 53560.65
Median E2E Latency (ms): 53362.96
---------------Time to First Token----------------
Mean TTFT (ms): 33504.17
Median TTFT (ms): 35524.99
P99 TTFT (ms): 36763.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 78.65
Median TPOT (ms): 70.23
P99 TPOT (ms): 180.28
---------------Inter-token Latency----------------
Mean ITL (ms): 78.92
Median ITL (ms): 68.45
P99 ITL (ms): 527.79
==================================================
[2024-11-11 00:06:15 DP3 TP0] Prefill batch. #new-seq: 2, #new-token: 699, #cached-token: 3827, cache hit rate: 21.75%, token usage: 0.40, #running-req: 254, #queue-req: 1
Rust Approx Tree --dp 4
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1024
Benchmark duration (s): 28.27
Total input tokens: 2327209
Total generated tokens: 262144
Total generated tokens (retokenized): 256144
Request throughput (req/s): 36.23
Input token throughput (tok/s): 82328.91
Output token throughput (tok/s): 9273.78
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 26073.70
Median E2E Latency (ms): 27724.49
---------------Time to First Token----------------
Mean TTFT (ms): 7005.06
Median TTFT (ms): 7061.64
P99 TTFT (ms): 7995.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 74.78
Median TPOT (ms): 79.34
P99 TPOT (ms): 99.97
---------------Inter-token Latency----------------
Mean ITL (ms): 75.60
Median ITL (ms): 74.43
P99 ITL (ms): 94.90
==================================================
[2024-11-11 00:12:58 TP0] Prefill batch. #new-seq: 27, #new-token: 3517, #cached-token: 57525, cache hit rate: 84.86%, token usage: 0.16, #running-req: 245, #queue-req: 1
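To put the two runs side by side, the short script below computes ratios from the figures reported above. The numbers are copied from the two benchmark reports; the dictionary keys and variable names are only illustrative.

```python
# Figures copied from the two benchmark reports above.
round_robin = {
    "duration_s": 54.54,
    "req_per_s": 18.78,
    "mean_ttft_ms": 33504.17,
    "p99_itl_ms": 527.79,
}
approx_tree = {
    "duration_s": 28.27,
    "req_per_s": 36.23,
    "mean_ttft_ms": 7005.06,
    "p99_itl_ms": 94.90,
}

# Higher is better for throughput, so divide new by old.
throughput_gain = approx_tree["req_per_s"] / round_robin["req_per_s"]

# Lower is better for latency and duration, so divide old by new.
ttft_reduction = round_robin["mean_ttft_ms"] / approx_tree["mean_ttft_ms"]
duration_reduction = round_robin["duration_s"] / approx_tree["duration_s"]
p99_itl_reduction = round_robin["p99_itl_ms"] / approx_tree["p99_itl_ms"]

print(f"Request throughput gain: {throughput_gain:.2f}x")
print(f"Mean TTFT reduction:     {ttft_reduction:.2f}x")
print(f"Duration reduction:      {duration_reduction:.2f}x")
print(f"P99 ITL reduction:       {p99_itl_reduction:.2f}x")
```

On these numbers the Rust approx tree run has roughly 1.9x the request throughput and about 4.8x lower mean TTFT than round robin. The cache hit rates in the two log lines (21.75% vs. 84.86%) each come from a single logged prefill batch, so they are indicative rather than run-wide averages.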
Checklist