strgrb (Collaborator) commented Mar 19, 2025

Motivation

I profiled DeepSeek and observed bubbles (idle gaps) in the timeline, and eventually found the cause:
[image]
The bubbles are caused by a device-to-host (D2H) copy in DeepSeekV2AttentionMLA.

Modifications

Use forward_batch.extend_prefix_lens_cpu directly instead of forward_batch.extend_prefix_lens. This avoids the D2H copy and decreases TTFT.
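
As a rough illustration of why the host-side field helps, here is a toy sketch in pure Python with no real GPU involved. Everything except the two field names `extend_prefix_lens` and `extend_prefix_lens_cpu` is hypothetical: `FakeDeviceTensor`, `ToyForwardBatch`, and the "no prefix" check are assumptions made for the example, not the actual sglang code. The counter stands in for the blocking D2H copy that shows up as a bubble in the profile.

```python
from dataclasses import dataclass
from typing import List


class FakeDeviceTensor:
    """Hypothetical stand-in for a GPU tensor; counts simulated D2H copies."""

    d2h_copies = 0  # class-level counter, for illustration only

    def __init__(self, values: List[int]):
        self._values = list(values)

    def sum_to_host(self) -> int:
        # On a real GPU, reading a scalar on the host blocks until the
        # queued kernels finish; here we only count the event.
        FakeDeviceTensor.d2h_copies += 1
        return sum(self._values)


@dataclass
class ToyForwardBatch:
    """Hypothetical batch holding the same lengths on device and on host."""

    extend_prefix_lens: FakeDeviceTensor  # device-side values
    extend_prefix_lens_cpu: List[int]     # host-side copy of the same values


def no_prefix_via_device(batch: ToyForwardBatch) -> bool:
    # Needs a (simulated) device-to-host copy to read the result.
    return batch.extend_prefix_lens.sum_to_host() == 0


def no_prefix_via_host(batch: ToyForwardBatch) -> bool:
    # Plain Python arithmetic on a host list; no copy, no synchronization.
    return sum(batch.extend_prefix_lens_cpu) == 0


lens = [0, 0, 0]
batch = ToyForwardBatch(FakeDeviceTensor(lens), lens)

device_result = no_prefix_via_device(batch)
copies_after_device = FakeDeviceTensor.d2h_copies

host_result = no_prefix_via_host(batch)
copies_after_host = FakeDeviceTensor.d2h_copies

print(device_result, host_result, copies_after_device, copies_after_host)
```

Both checks give the same answer, but only the device-side one increments the copy counter. In a real model the equivalent read would stall the host until the GPU catches up.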

Checklist

strgrb commented Mar 19, 2025

I benchmarked it with CUDA 12.8 and DeepGEMM.

  • before the optimization
{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 1, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 288.8952702959068, "completed": 10, "total_input_tokens": 40000, "total_output_tokens": 10000, "total_output_tokens_retokenized": 9955, "request_throughput": 0.034614619996226656, "input_throughput": 138.4584799849066, "output_throughput": 34.61461999622665, "mean_e2e_latency_ms": 28885.89523830451, "median_e2e_latency_ms": 28867.54753405694, "std_e2e_latency_ms": 43.76537623950034, "p99_e2e_latency_ms": 28971.827788078226,"mean_ttft_ms": 914.2480821348727, "median_ttft_ms": 894.8089674813673, "std_ttft_ms": 42.84710320299341, "p99_ttft_ms": 999.2891409643926, "mean_tpot_ms": 27.999646802972617, "median_tpot_ms": 28.00029096845258, "std_tpot_ms": 0.004316766073899561, "p99_tpot_ms": 28.005626542935126, "mean_itl_ms": 27.999637626112442, "median_itl_ms": 27.986736968159676, "std_itl_ms": 0.5853971754709604, "p95_itl_ms": 28.546237177215517, "p99_itl_ms": 29.95921622263268, "concurrency": 0.9998742869247236, "accept_length": null}
  • after the optimization
{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 1, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 288.23054097685963, "completed": 10, "total_input_tokens": 40000, "total_output_tokens": 10000, "total_output_tokens_retokenized": 9955, "request_throughput": 0.03469444967944199, "input_throughput": 138.77779871776798, "output_throughput": 34.694449679441995, "mean_e2e_latency_ms": 28819.560090778396, "median_e2e_latency_ms": 28802.58282844443, "std_e2e_latency_ms": 37.77366937690407, "p99_e2e_latency_ms": 28897.36494206125, "mean_ttft_ms": 827.0063975825906, "median_ttft_ms": 808.830501860939, "std_ttft_ms": 36.31374057753745, "p99_ttft_ms": 901.3628057297319, "mean_tpot_ms": 28.020574267463275, "median_tpot_ms": 28.022242382423276,"std_tpot_ms": 0.00486772951692425, "p99_tpot_ms": 28.02489307463156, "mean_itl_ms": 28.020567437950838, "median_itl_ms": 28.001354075968266, "std_itl_ms": 0.9080998589299655, "p95_itl_ms": 28.64757542265579, "p99_itl_ms": 30.345987735781822, "concurrency": 0.9998787773531658, "accept_length": null}

Mean TTFT drops from 914 ms to 827 ms (about 9.5%).
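
To make the comparison easier to read, the relative changes can be computed from the two result records. The sketch below copies a handful of fields verbatim from the JSON above; the helper `percent_change` and the choice of fields are mine, not part of the benchmark tool.

```python
# Selected metrics copied from the two benchmark records above (all in ms).
before = {
    "mean_ttft_ms": 914.2480821348727,
    "median_ttft_ms": 894.8089674813673,
    "p99_ttft_ms": 999.2891409643926,
    "mean_tpot_ms": 27.999646802972617,
    "mean_e2e_latency_ms": 28885.89523830451,
}
after = {
    "mean_ttft_ms": 827.0063975825906,
    "median_ttft_ms": 808.830501860939,
    "p99_ttft_ms": 901.3628057297319,
    "mean_tpot_ms": 28.020574267463275,
    "mean_e2e_latency_ms": 28819.560090778396,
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative = faster)."""
    return (new - old) / old * 100.0


changes = {key: percent_change(before[key], after[key]) for key in before}

for key, pct in changes.items():
    print(f"{key}: {before[key]:.1f} -> {after[key]:.1f} ({pct:+.2f}%)")
```

The TTFT metrics improve by roughly 9 to 10 percent, while TPOT is essentially unchanged. That is consistent with the change affecting only the prefill path.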

zhyncs merged commit df7014a into sgl-project:main on Mar 19, 2025
17 of 21 checks passed