
Conversation

@ByronHsu (Collaborator) commented on Nov 11, 2024

Motivation

Add a new dataset, gen-shared-prefix, to bench_serving to evaluate engine performance on the shared-prefix generation task. Each request has the format <system prompt> <question>. This benchmark is useful for evaluating cache-aware optimizations such as cache-aware DP.

Modifications

  1. Add a function to generate shared-prefix requests (a rough sketch of the idea follows this list)
  2. Add the related command-line arguments
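
As a rough sketch of the idea (function and argument names here are hypothetical, not the exact code added by this PR; it reuses the gen_prompt helper shown in the diff further down), the generator builds a few long system prompts and pairs each with many short questions, so requests in the same group share a long prefix:

import random

def gen_shared_prefix_requests(
    tokenizer,
    num_groups=8,            # number of distinct system prompts (illustrative default)
    prompts_per_group=128,   # questions that share each system prompt
    system_prompt_len=2048,  # token length of the shared prefix
    question_len=128,        # token length of each unique question
    output_len=256,          # requested generation length
):
    """Build <system prompt> <question> requests where all requests in a group
    share a long prefix. Sketch only; parameter names are hypothetical."""
    requests = []
    for _ in range(num_groups):
        system_prompt = gen_prompt(tokenizer, system_prompt_len)
        for _ in range(prompts_per_group):
            question = gen_prompt(tokenizer, question_len)
            prompt = system_prompt + " " + question
            prompt_len = len(tokenizer.encode(prompt))
            requests.append((prompt, prompt_len, output_len))
    random.shuffle(requests)  # interleave groups so cache-aware routing is exercised
    return requests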

Tests

Python --dp 4 (round robin)

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     1024      
Benchmark duration (s):                  54.54     
Total input tokens:                      2326152   
Total generated tokens:                  262144    
Total generated tokens (retokenized):    253772    
Request throughput (req/s):              18.78     
Input token throughput (tok/s):          42651.75  
Output token throughput (tok/s):         4806.61   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   53560.65  
Median E2E Latency (ms):                 53362.96  
---------------Time to First Token----------------
Mean TTFT (ms):                          33504.17  
Median TTFT (ms):                        35524.99  
P99 TTFT (ms):                           36763.44  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          78.65     
Median TPOT (ms):                        70.23     
P99 TPOT (ms):                           180.28    
---------------Inter-token Latency----------------
Mean ITL (ms):                           78.92     
Median ITL (ms):                         68.45     
P99 ITL (ms):                            527.79    
==================================================

[2024-11-11 00:06:15 DP3 TP0] Prefill batch. #new-seq: 2, #new-token: 699, #cached-token: 3827, cache hit rate: 21.75%, token usage: 0.40, #running-req: 254, #queue-req: 1

Rust Approx Tree --dp 4

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     1024      
Benchmark duration (s):                  28.27     
Total input tokens:                      2327209   
Total generated tokens:                  262144    
Total generated tokens (retokenized):    256144    
Request throughput (req/s):              36.23     
Input token throughput (tok/s):          82328.91  
Output token throughput (tok/s):         9273.78   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   26073.70  
Median E2E Latency (ms):                 27724.49  
---------------Time to First Token----------------
Mean TTFT (ms):                          7005.06   
Median TTFT (ms):                        7061.64   
P99 TTFT (ms):                           7995.75   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          74.78     
Median TPOT (ms):                        79.34     
P99 TPOT (ms):                           99.97     
---------------Inter-token Latency----------------
Mean ITL (ms):                           75.60     
Median ITL (ms):                         74.43     
P99 ITL (ms):                            94.90     
==================================================

[2024-11-11 00:12:58 TP0] Prefill batch. #new-seq: 27, #new-token: 3517, #cached-token: 57525, cache hit rate: 84.86%, token usage: 0.16, #running-req: 245, #queue-req: 1

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) reviewed the changes and left a comment on the following diff:

@@ -627,6 +627,61 @@ def sample_random_requests(
return input_requests


def gen_prompt(tokenizer, token_num):
    """Generate a random prompt of specified token length using tokenizer vocabulary."""
    all_available_tokens = list(tokenizer.get_vocab().values())

Nit: Using tokenizer.get_vocab().values() may include special tokens, which could cause issues with the generated text.
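
One way this nit could be addressed (a sketch assuming a Hugging Face-style tokenizer that exposes all_special_ids; the function name below is hypothetical and not code from this PR) is to drop the special token IDs before sampling:

import random

def gen_prompt_without_special_tokens(tokenizer, token_num):
    """Variant of gen_prompt (name is hypothetical) that samples only from
    non-special vocabulary entries of a Hugging Face-style tokenizer."""
    special_ids = set(tokenizer.all_special_ids)
    candidate_tokens = [
        tok_id
        for tok_id in tokenizer.get_vocab().values()
        if tok_id not in special_ids
    ]
    selected_tokens = random.choices(candidate_tokens, k=token_num)
    return tokenizer.decode(selected_tokens)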

@zhyncs (Member) commented on Nov 11, 2024:

The new benchmark dataset is very interesting. I think it can be merged soon, and perhaps Yichuan @yichuan520030910320 will also be interested.

@zhyncs (Member) commented on Nov 11, 2024:

This change does not affect other modules, so there is no need to run the full perf/accuracy/unit-test suite.

@zhyncs merged commit 8169c6f into sgl-project:main on Nov 11, 2024 (1 of 12 checks passed).
@ByronHsu (Collaborator, Author) commented:

Thanks @zhyncs! Note that the Total input tokens count may fluctuate a bit because len(tokenize(prompt + question)) != len(tokenize(prompt)) + len(tokenize(question)).
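
A minimal illustration of that non-additivity (the model name and strings below are arbitrary examples, not taken from the benchmark):

from transformers import AutoTokenizer

# Any subword tokenizer works for the illustration; "gpt2" is just an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

system_prompt = "You are a helpful assistant."
question = "What is the capital of France?"

joined = tokenizer.encode(system_prompt + " " + question)
separate = tokenizer.encode(system_prompt) + tokenizer.encode(" " + question)

# Subword merges across the boundary (and any special-token handling) mean the
# two counts are not guaranteed to match.
print(len(joined), len(separate))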
