Description
I have been testing and recording the output throughput of SGLang on 2*8 H100 GPUs, and I've observed a significant regression in output throughput for long outputs in `enable-dp-attention` scenarios following this PR. Through debugging and profiling with Nsight Systems, I confirmed that the performance degradation is caused by the CUDA graph not being launched.
See the code:

```python
if self.enable_dp_attention:
    total_global_tokens = sum(forward_batch.global_num_tokens_cpu)
    is_bs_supported = forward_batch.can_run_dp_cuda_graph and (
        total_global_tokens in self.graphs
        if self.disable_padding
        else total_global_tokens <= self.max_bs
    )
```
With `enable-dp-attention`, `total_global_tokens` equals the sum of tokens across all DP ranks. For example, during the decode phase with DP = TP = 16 and a per-rank batch size of 32, `total_global_tokens` would be 32 * 16 = 512. However, the maximum batch size allowed for CUDA graph capture defaults to 160. As a result, the `can_run` check returns `False` throughout the decode phase, and CUDA graph execution is consequently skipped.
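To make the failure mode concrete, here is a minimal standalone sketch of the eligibility check under the example numbers (variable names mirror the snippet above; the 160 cap is the default mentioned earlier):

```python
# Standalone sketch of the check from the snippet above, plugged with the
# DP = TP = 16 decode example.
global_num_tokens_cpu = [32] * 16                  # 32 decode tokens on each DP rank
total_global_tokens = sum(global_num_tokens_cpu)   # 32 * 16 = 512
max_bs = 160                                       # default CUDA graph capture limit

# disable_padding == False branch of the original condition
is_bs_supported = total_global_tokens <= max_bs    # 512 <= 160 -> False
print(is_bs_supported)  # False: the CUDA graph path is skipped on every decode step
```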
In fact, according to the design logic of the PR, I don't consider this a bug: it consistently uses `total_global_tokens` across all ranks as the batch size captured by the CUDA graph. A straightforward workaround is to set a sufficiently large `--cuda-graph-max-bs` when launching the server, though this may consume a significant amount of additional memory.
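For reference, the workaround would look roughly like this (a sketch only; the model path is a placeholder and multi-node flags are omitted, so adapt it to your actual launch command):

```bash
python -m sglang.launch_server \
  --model-path <your-model> \
  --tp-size 16 --dp-size 16 \
  --enable-dp-attention \
  --cuda-graph-max-bs 512   # >= DP size * per-rank decode batch size
```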
I believe using the `num_tokens` per DP rank as the CUDA graph batch size might be a more reasonable approach, similar to the code prior to this PR. It would only require reserving adequate space for the `gathered_buffer`.
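A rough sketch of what I have in mind, as a variant of the snippet quoted above (hypothetical: `local_num_tokens` and the attribute carrying the per-rank token count are my assumptions, not the actual SGLang API):

```python
if self.enable_dp_attention:
    total_global_tokens = sum(forward_batch.global_num_tokens_cpu)
    # assumption: per-rank decode token count (e.g. 32 in the example above);
    # the real attribute name may differ
    local_num_tokens = forward_batch.batch_size
    # key CUDA graph eligibility off the per-rank count, not the DP-wide sum
    is_bs_supported = forward_batch.can_run_dp_cuda_graph and (
        local_num_tokens in self.graphs
        if self.disable_padding
        else local_num_tokens <= self.max_bs
    )
    # gathered_buffer must still reserve room for total_global_tokens,
    # since the all-gather output spans all DP ranks
```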
Below is my measured output throughput before and after this PR.
2*8 H100, input_len=output_len=1000, DP=TP=16
| Concurrency | Before PR | After PR | After PR + fix |
|---|---|---|---|
| 1024 | 5115.13 | 2581.07 | 5469.53 |
| 512 | 3897.78 | 1527.95 | 4509.96 |