
Conversation

hebiao064
Collaborator

@hebiao064 hebiao064 commented Mar 25, 2025

Motivation

  • Remove Unnecessary Device Sync from this comment
  • We found that when extend_no_prefix is false, we used seqlens_in_batch.max().item() as max_seq_len_q, which is not quite right; we should use extend_seq_lens instead
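For anyone unfamiliar with the distinction: tensor.max() returns a 0-dim tensor, which stays on-device for CUDA tensors, while .item() converts it to a Python scalar and therefore forces a blocking device-to-host copy. A minimal CPU sketch of the difference (hypothetical values, names following this PR):

```python
import torch

# Hypothetical sequence lengths, standing in for seqlens_in_batch.
seqlens_in_batch = torch.tensor([3, 7, 5])

# .max() returns a 0-dim tensor; for a CUDA tensor this result
# stays on the device, so no host sync is required.
max_len_tensor = seqlens_in_batch.max()

# .item() converts to a Python scalar; for a CUDA tensor this
# triggers a blocking device-to-host copy.
max_len_scalar = seqlens_in_batch.max().item()
```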

Thanks @Fridge003 for helping out by providing profiling tips.

Before: 39us
(profiler screenshot)

After: 18us
(profiler screenshot)

Benchmark

(venv) jobuser [ ~/sglang ]$ python benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|████████████████████████████████████████████████| 1319/1319 [00:19<00:00, 68.15it/s]
Accuracy: 0.793
Invalid: 0.002
Latency: 17.409 s
Output throughput: 6893.539 token/s
python -m sglang.bench_one_batch --model /shared/public/models/Meta-Llama-3-8B-Instruct --batch-size 16 --input 1024 --output 512 --attention-backend fa3

# CUDA Graph disabled
Prefill. latency: 0.37211 s, throughput:  44029.64 token/s
Decode.  median latency: 0.01244 s, median throughput:   1286.33 token/s
Total. latency:  6.756 s, throughput:   3637.85 token/s

# CUDA Graph enabled
Prefill. latency: 0.37050 s, throughput:  44221.03 token/s
Decode.  median latency: 0.00814 s, median throughput:   1964.89 token/s
Total. latency:  4.527 s, throughput:   5428.93 token/s

Modifications

Change seqlens_in_batch.max().item() to seqlens_in_batch.max()
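A sketch of the two changes described above, with hypothetical values (tensor names follow the ones used in this PR):

```python
import torch

# Hypothetical per-request lengths for a prefill batch.
seqlens_in_batch = torch.tensor([10, 12, 9])  # prefix + extend lengths
extend_seq_lens = torch.tensor([4, 6, 3])     # extend-only lengths

# Before: device-to-host sync via .item(), and the full sequence
# lengths were used for prefill's max_seq_len_q.
max_seq_len_q_before = seqlens_in_batch.max().item()

# After: max over the extend lengths, kept as a 0-dim tensor on-device.
max_seq_len_q_after = extend_seq_lens.max()
```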

Checklist

@hebiao064 hebiao064 changed the title Remove Unnecessary Device Sync [FA3 Attn Backend] Remove Unnecessary Device Sync Mar 25, 2025
@zhyncs
Member

zhyncs commented Mar 25, 2025

ref #4577

@hebiao064
Collaborator Author

ref #4577

Interesting, let me do some profiling.

@zhyncs
Member

zhyncs commented Mar 25, 2025

@hebiao064 hebiao064 changed the title [FA3 Attn Backend] Remove Unnecessary Device Sync [Do not merge][FA3 Attn Backend] Remove Unnecessary Device Sync Mar 25, 2025
@hebiao064 hebiao064 marked this pull request as draft March 25, 2025 07:06
@hebiao064 hebiao064 force-pushed the bhe/remove_device_sync branch from c29625c to 160c312 Compare March 25, 2025 07:08
hebiao064 and others added 2 commits March 25, 2025 20:54
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
@hebiao064 hebiao064 changed the title [Do not merge][FA3 Attn Backend] Remove Unnecessary Device Sync [FA3 Attn Backend] Use extend_seq_lens instead of seqlens_in_batch for prefill Mar 26, 2025
@hebiao064 hebiao064 marked this pull request as ready for review March 26, 2025 03:58
@hebiao064
Collaborator Author

@merrymercy @zhyncs after investigation, we decided to keep the item() call; please review. We put the details in this PR's description.

@yubofredwang
Contributor

yubofredwang commented Mar 26, 2025

Adding some context here.

According to profiling, item() is implicitly called whenever a tensor is used to slice/index another tensor. That is why there are 4 additional Device-to-Host copies.

Due to the operations after computing max_seq_len_k, there are two additional kernel launches to slice the page table.

metadata.max_seq_len_k = seqlens_in_batch.max()
metadata.page_table[:, metadata.max_seq_len_k :].fill_(0)
metadata.page_table[:, : metadata.max_seq_len_k].copy_(
    self.req_to_token[req_pool_indices[:bs], : metadata.max_seq_len_k]
)
(profiler screenshot)

These operations could potentially be fused into a single where operation with a mask. Benchmarking that now.
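A sketch of that fused alternative (shapes and values hypothetical, and ignoring the req_pool_indices gather for brevity): a boolean column mask lets one torch.where produce the zero-filled-and-copied page table without slicing by a device scalar.

```python
import torch

bs, max_pages = 3, 8
seqlens_in_batch = torch.tensor([2, 5, 3])
req_to_token = torch.arange(bs * max_pages).reshape(bs, max_pages)

max_seq_len_k = seqlens_in_batch.max()  # 0-dim tensor; no .item() sync

# Boolean mask over columns: True inside [:max_seq_len_k].
mask = torch.arange(max_pages) < max_seq_len_k

# One fused op: copy the first max_seq_len_k columns from req_to_token
# and zero-fill the rest, replacing the fill_ + sliced copy_ pair.
page_table = torch.where(mask, req_to_token, torch.zeros_like(req_to_token))
```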

@hebiao064
Collaborator Author

The test failures are mostly like this one:

huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/models/meta-llama/Llama-2-7b-chat-hf

They are not related to my change.

@hebiao064 hebiao064 changed the title [FA3 Attn Backend] Use extend_seq_lens instead of seqlens_in_batch for prefill [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 Mar 27, 2025
@zhyncs zhyncs merged commit 1b9175c into sgl-project:main Mar 27, 2025
0 of 19 checks passed
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025