Description
I have been testing and recording the output throughput of SGLang on 2*8 H100 GPUs, and I've observed a significant regression in output throughput for long outputs in `enable-dp-attention` scenarios following this PR. Through debugging and profiling with Nsight Systems, I confirmed that the performance degradation is caused by the CUDA graph not being launched.
See the code:

```python
if self.enable_dp_attention:
    total_global_tokens = sum(forward_batch.global_num_tokens_cpu)
    is_bs_supported = forward_batch.can_run_dp_cuda_graph and (
        total_global_tokens in self.graphs
        if self.disable_padding
        else total_global_tokens <= self.max_bs
    )
```
With `enable-dp-attention`, `total_global_tokens` equals the sum of tokens across all DP ranks. For example, during the decode phase with DP = TP = 16 and a per-rank batch size of 32, `total_global_tokens` would be 32 * 16 = 512. However, the maximum batch size allowed for CUDA graph capture defaults to 160. As a result, the `can_run` check returns `False` throughout the decode phase, and CUDA graph execution is consequently skipped.
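To make the failure mode concrete, here is a minimal standalone sketch of the eligibility check under the example numbers (variable names mirror the snippet above; the 160 cap is the default mentioned earlier):

```python
# Standalone sketch of the check from the snippet above, plugged with the
# DP = TP = 16 decode example.
global_num_tokens_cpu = [32] * 16                  # 32 decode tokens on each DP rank
total_global_tokens = sum(global_num_tokens_cpu)   # 32 * 16 = 512
max_bs = 160                                       # default CUDA graph capture limit

# disable_padding == False branch of the original condition
is_bs_supported = total_global_tokens <= max_bs    # 512 <= 160 -> False
print(is_bs_supported)  # False: the CUDA graph path is skipped on every decode step
```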
In fact, according to the design logic of the PR, I don't consider this a bug: it consistently uses `total_global_tokens` across all ranks as the batch size captured by the CUDA graph. A straightforward workaround is to set a sufficiently large `--cuda-graph-max-bs` when launching the server, though this may consume a significant amount of additional memory.
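For reference, the workaround would look roughly like this (a sketch only; the model path is a placeholder and multi-node flags are omitted, so adapt it to your actual launch command):

```bash
python -m sglang.launch_server \
  --model-path <your-model> \
  --tp-size 16 --dp-size 16 \
  --enable-dp-attention \
  --cuda-graph-max-bs 512   # >= DP size * per-rank decode batch size
```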
I believe using the `num_tokens` per DP rank as the CUDA graph batch size might be a more reasonable approach, similar to the code prior to this PR. It would only require reserving adequate space for the `gathered_buffer`.
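A rough sketch of what I have in mind, as a variant of the snippet quoted above (hypothetical: `local_num_tokens` and the attribute carrying the per-rank token count are my assumptions, not the actual SGLang API):

```python
if self.enable_dp_attention:
    total_global_tokens = sum(forward_batch.global_num_tokens_cpu)
    # assumption: per-rank decode token count (e.g. 32 in the example above);
    # the real attribute name may differ
    local_num_tokens = forward_batch.batch_size
    # key CUDA graph eligibility off the per-rank count, not the DP-wide sum
    is_bs_supported = forward_batch.can_run_dp_cuda_graph and (
        local_num_tokens in self.graphs
        if self.disable_padding
        else local_num_tokens <= self.max_bs
    )
    # gathered_buffer must still reserve room for total_global_tokens,
    # since the all-gather output spans all DP ranks
```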
Below is my measured output throughput before and after this PR.
2*8 H100, input_len=output_len=1000, DP=TP=16
| Concurrency | Before PR | After PR | After PR + fix |
|---|---|---|---|
| 1024 | 5115.13 | 2581.07 | 5469.53 |
| 512 | 3897.78 | 1527.95 | 4509.96 |