
Cuda graph supported bs in DP attention #5527

@Cydia2018


I have been benchmarking SGLang output throughput on 2*8 H100 GPUs and observed a significant regression for long outputs when enable-dp-attention is set, following this PR. Debugging and profiling with Nsight Systems confirmed that the regression is caused by the CUDA graph not being launched.

See the code

if self.enable_dp_attention:
    total_global_tokens = sum(forward_batch.global_num_tokens_cpu)

    is_bs_supported = forward_batch.can_run_dp_cuda_graph and (
        total_global_tokens in self.graphs
        if self.disable_padding
        else total_global_tokens <= self.max_bs
    )

With enable-dp-attention set, total_global_tokens is the sum of tokens across all DP ranks. For example, during the decode phase with DP = TP = 16 and a per-rank batch size of 32, total_global_tokens is 32 * 16 = 512. However, the maximum batch size allowed for CUDA graph capture defaults to 160, so the check total_global_tokens <= self.max_bs fails. As a result, the can_run function returns False during decode, and CUDA graph execution is skipped.
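The arithmetic above can be reproduced with a small standalone model of the check. This is only a sketch: the function `is_bs_supported` and its parameters are hypothetical stand-ins that mirror the snippet from the issue, not SGLang's actual API, and the default of 160 is taken from the description above.

```python
def is_bs_supported(global_num_tokens, max_bs, captured_sizes=(),
                    disable_padding=False, can_run_dp_cuda_graph=True):
    """Hypothetical stand-in for the DP-attention branch of the check."""
    # The check uses the token count summed over all DP ranks.
    total_global_tokens = sum(global_num_tokens)
    if disable_padding:
        # Without padding, the exact size must have been captured.
        fits = total_global_tokens in captured_sizes
    else:
        # With padding, any size up to max_bs can reuse a captured graph.
        fits = total_global_tokens <= max_bs
    return can_run_dp_cuda_graph and fits


dp_size = 16       # DP = TP = 16, as in the example above
per_rank_bs = 32   # decode batch size on each DP rank
global_num_tokens = [per_rank_bs] * dp_size

total = sum(global_num_tokens)
supported = is_bs_supported(global_num_tokens, max_bs=160)
print(total, supported)
```

With these inputs the summed token count is 512, which exceeds 160, so the model reports that the CUDA graph cannot be used.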

In fact, given the design logic of the PR, I don't consider this a bug: it consistently uses the total_global_tokens across all ranks as the batch size to be captured by the CUDA graph. A straightforward workaround would be to set a sufficiently large cuda-graph-max-bs when launching the server, though this might consume a significant amount of additional memory.

I believe using the num_tokens per DP rank as the CUDA graph batch size might be a more reasonable approach, similar to the code prior to this PR. It would only require reserving adequate space for the gathered_buffer.
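A minimal sketch of what a per-rank check could look like is shown below. It is an illustration under stated assumptions, not the code before the PR and not an actual fix: the helper names are hypothetical, and taking the maximum per-rank token count as the graph batch size (so every rank selects the same graph) is an assumption of this sketch.

```python
def per_rank_graph_bs(global_num_tokens):
    """Hypothetical: batch size used to pick a CUDA graph, per DP rank."""
    # Assumption: all ranks pick the same graph, so use the largest count.
    return max(global_num_tokens)


def is_bs_supported_per_rank(global_num_tokens, max_bs):
    """Hypothetical per-rank variant of the padded check."""
    return per_rank_graph_bs(global_num_tokens) <= max_bs


def gathered_buffer_rows(max_bs, dp_size):
    """Rows to reserve so the gathered buffer holds tokens from every rank."""
    return max_bs * dp_size


global_num_tokens = [32] * 16  # DP = 16, per-rank batch size 32
supported = is_bs_supported_per_rank(global_num_tokens, max_bs=160)
rows = gathered_buffer_rows(max_bs=160, dp_size=16)
print(supported, rows)
```

Under this scheme the example from the issue (32 tokens per rank) fits within the default limit of 160, while the gathered buffer is sized for the worst case of max_bs tokens on each of the 16 ranks.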

Below are my measured output throughput results before and after this PR.
2*8 H100, input_len=output_len=1000, DP=TP=16

| Concurrency | Before PR | After PR | After PR-fix |
|---|---|---|---|
| 1024 | 5115.13 | 2581.07 | 5469.53 |
| 512 | 3897.78 | 1527.95 | 4509.96 |
