
Conversation

hebiao064
Collaborator

@hebiao064 hebiao064 commented Mar 25, 2025

Motivation

  • Remove Unnecessary Device Sync from this comment
  • We found that when extend_no_prefix is false, we used seqlens_in_batch.max().item() as max_seq_len_q, which is not quite right; we should use extend_seq_lens instead
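For anyone unfamiliar with the distinction: tensor.max() returns a 0-dim tensor, which stays on-device for CUDA tensors, while .item() converts it to a Python scalar and therefore forces a blocking device-to-host copy. A minimal CPU sketch of the difference (hypothetical values, names following this PR):

```python
import torch

# Hypothetical sequence lengths, standing in for seqlens_in_batch.
seqlens_in_batch = torch.tensor([3, 7, 5])

# .max() returns a 0-dim tensor; for a CUDA tensor this result
# stays on the device, so no host sync is required.
max_len_tensor = seqlens_in_batch.max()

# .item() converts to a Python scalar; for a CUDA tensor this
# triggers a blocking device-to-host copy.
max_len_scalar = seqlens_in_batch.max().item()
```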

Thanks @Fridge003 for helping out by providing profiling tips.

Before: 39us
(profiler screenshot)

After: 18us
(profiler screenshot)

Benchmark

(venv) jobuser [ ~/sglang ]$ python benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|████████████████████████████████████████████████| 1319/1319 [00:19<00:00, 68.15it/s]
Accuracy: 0.793
Invalid: 0.002
Latency: 17.409 s
Output throughput: 6893.539 token/s
python -m sglang.bench_one_batch --model /shared/public/models/Meta-Llama-3-8B-Instruct --batch-size 16 --input 1024 --output 512 --attention-backend fa3

# CUDA Graph disabled
Prefill. latency: 0.37211 s, throughput:  44029.64 token/s
Decode.  median latency: 0.01244 s, median throughput:   1286.33 token/s
Total. latency:  6.756 s, throughput:   3637.85 token/s

# CUDA Graph enabled
Prefill. latency: 0.37050 s, throughput:  44221.03 token/s
Decode.  median latency: 0.00814 s, median throughput:   1964.89 token/s
Total. latency:  4.527 s, throughput:   5428.93 token/s

Modifications

Change seqlens_in_batch.max().item() to seqlens_in_batch.max()
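A sketch of the two changes described above, with hypothetical values (tensor names follow the ones used in this PR):

```python
import torch

# Hypothetical per-request lengths for a prefill batch.
seqlens_in_batch = torch.tensor([10, 12, 9])  # prefix + extend lengths
extend_seq_lens = torch.tensor([4, 6, 3])     # extend-only lengths

# Before: device-to-host sync via .item(), and the full sequence
# lengths were used for prefill's max_seq_len_q.
max_seq_len_q_before = seqlens_in_batch.max().item()

# After: max over the extend lengths, kept as a 0-dim tensor on-device.
max_seq_len_q_after = extend_seq_lens.max()
```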

Checklist

@hebiao064 hebiao064 changed the title Remove Unnecessary Device Sync [FA3 Attn Backend] Remove Unnecessary Device Sync Mar 25, 2025
@zhyncs
Member

zhyncs commented Mar 25, 2025

ref #4577

@hebiao064
Collaborator Author

ref #4577

Interesting, let me do some profiling.

@zhyncs
Member

zhyncs commented Mar 25, 2025

@hebiao064 hebiao064 changed the title [FA3 Attn Backend] Remove Unnecessary Device Sync [Do not merge][FA3 Attn Backend] Remove Unnecessary Device Sync Mar 25, 2025
@hebiao064 hebiao064 marked this pull request as draft March 25, 2025 07:06
@hebiao064 hebiao064 force-pushed the bhe/remove_device_sync branch from c29625c to 160c312 Compare March 25, 2025 07:08
hebiao064 and others added 2 commits March 25, 2025 20:54
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
@hebiao064 hebiao064 changed the title [Do not merge][FA3 Attn Backend] Remove Unnecessary Device Sync [FA3 Attn Backend] Use extend_seq_lens instead of seqlens_in_batch for prefill Mar 26, 2025
@hebiao064 hebiao064 marked this pull request as ready for review March 26, 2025 03:58
@hebiao064
Collaborator Author

@merrymercy @zhyncs after investigation, we decided to keep the item() call; please review. We put the details in this PR's description.

@yubofredwang
Contributor

yubofredwang commented Mar 26, 2025

Adding some context here.

According to profiling, item() is implicitly called whenever a tensor is used to slice/index another tensor. That is why there are 4 additional Device-to-Host copies.

Due to the operations after computing max_seq_len_k, there are two additional kernel launches to slice the page table.

metadata.max_seq_len_k = seqlens_in_batch.max()
metadata.page_table[:, metadata.max_seq_len_k :].fill_(0)
metadata.page_table[:, : metadata.max_seq_len_k].copy_(
    self.req_to_token[req_pool_indices[:bs], : metadata.max_seq_len_k]
)
(profiler screenshot)

These operations could potentially be fused into a single where operation with a mask. Benchmarking that now.
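A sketch of that fused alternative (shapes and values hypothetical, and ignoring the req_pool_indices gather for brevity): a boolean column mask lets one torch.where produce the zero-filled-and-copied page table without slicing by a device scalar.

```python
import torch

bs, max_pages = 3, 8
seqlens_in_batch = torch.tensor([2, 5, 3])
req_to_token = torch.arange(bs * max_pages).reshape(bs, max_pages)

max_seq_len_k = seqlens_in_batch.max()  # 0-dim tensor; no .item() sync

# Boolean mask over columns: True inside [:max_seq_len_k].
mask = torch.arange(max_pages) < max_seq_len_k

# One fused op: copy the first max_seq_len_k columns from req_to_token
# and zero-fill the rest, replacing the fill_ + sliced copy_ pair.
page_table = torch.where(mask, req_to_token, torch.zeros_like(req_to_token))
```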

@hebiao064
Collaborator Author

The test failures are mostly like this one:

huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/models/meta-llama/Llama-2-7b-chat-hf

They are not related to my change.

@hebiao064 hebiao064 changed the title [FA3 Attn Backend] Use extend_seq_lens instead of seqlens_in_batch for prefill [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 Mar 27, 2025
@zhyncs zhyncs merged commit 1b9175c into sgl-project:main Mar 27, 2025
0 of 19 checks passed
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025