
hebiao064
Collaborator

hebiao064 commented Aug 12, 2025

Motivation

Enable the Triton FP8 GEMM kernel to handle a hidden_dim that is not divisible by 16.

Without this change, we can only serve GLM-4.5-Air with TP 4, not TP 8.
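For context, a common way for a tiled GEMM kernel to support a reduction dimension that is not a multiple of its tile size (for example, after tensor parallelism splits a dimension into smaller per-rank shards) is to mask the final partial tile so that out-of-range elements load as zero, as Triton's `tl.load(ptr, mask=..., other=0.0)` allows. The NumPy sketch below emulates that general pattern on the CPU. It is an illustration under stated assumptions, not the code changed in this PR: `BLOCK_K = 16` and `blocked_matmul_masked` are hypothetical names, and FP8 quantization and scaling are omitted.

```python
import numpy as np

BLOCK_K = 16  # hypothetical tile size along the reduction (hidden) dimension


def blocked_matmul_masked(a: np.ndarray, b: np.ndarray, block_k: int = BLOCK_K) -> np.ndarray:
    """Compute a @ b by walking K in tiles of block_k, masking the ragged tail.

    a has shape (M, K) and b has shape (K, N); K need not be a multiple of block_k.
    """
    m, k = a.shape
    k_b, n = b.shape
    assert k == k_b, "inner dimensions must match"
    acc = np.zeros((m, n), dtype=np.float32)  # float32 accumulator, as a GEMM kernel would use
    for k_start in range(0, k, block_k):
        offs = k_start + np.arange(block_k)
        mask = offs < k  # False for lanes that fall past the end of K
        safe = np.where(mask, offs, 0)  # clamp indices so the gather never goes out of bounds
        # Emulate a masked load: out-of-range lanes contribute zero to the dot product.
        a_tile = np.where(mask[None, :], a[:, safe], 0.0).astype(np.float32)
        b_tile = np.where(mask[:, None], b[safe, :], 0.0).astype(np.float32)
        acc += a_tile @ b_tile
    return acc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((3, 44)).astype(np.float32)  # K = 44 is not divisible by 16
    b = rng.standard_normal((44, 5)).astype(np.float32)
    print(np.allclose(blocked_matmul_masked(a, b), a @ b, atol=1e-4))
```

Because the masked lanes contribute exactly zero, the result matches a plain `a @ b` for any K, whereas a kernel that assumes K is a multiple of the tile size would read past the end of the shard or require padding.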

Modifications

Accuracy Tests

python -m sglang.launch_server --model-path zai-org/GLM-4.5-Air-FP8/ --tp 8


python3 bench_sglang.py --num-questions 200


Accuracy: 0.930
Invalid: 0.000
Latency: 13.808 s
Output throughput: 1456.212 token/s

Benchmarking and Profiling

Checklist


@BBuf
Collaborator

BBuf commented Aug 12, 2025

Do we need a unit test for the kernel?

@hebiao064
Collaborator Author

@BBuf @zhyncs fyi

@hebiao064
Collaborator Author

Do we need a unit test for the kernel?

Added.

@BBuf BBuf enabled auto-merge (squash) August 13, 2025 03:18
@BBuf BBuf disabled auto-merge August 13, 2025 03:18
@hebiao064 hebiao064 self-assigned this Aug 13, 2025
@hebiao064 hebiao064 added the quant (LLM Quantization) and ready-to-merge (The PR is ready to merge after the CI is green.) labels
@hebiao064 hebiao064 merged commit 930fe46 into main Aug 13, 2025
52 of 67 checks passed
@hebiao064 hebiao064 deleted the support_triton_gemm branch August 13, 2025 04:21
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
…l-project#9093)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>