Add sgl_per_token_quant_fp8 #4089

hebiao064 · 2025-03-05T08:28:21Z

Motivation

Speed up the FP8 per-token quantization kernel to accelerate FP8 model inference.
This is part of the initiative tracked in: [Feature] remove vllm _custom_ops #2965

This PR follows the approach of BBuf's performance improvement PR for the per-tensor quantization kernel: #3786

Achievement:

The SGL kernel is on average approximately 5.5% faster than the vLLM kernel across the tested configurations. The gain grows with workload size and reaches about 9% for the largest case (batch_size=128, seq_len=4096). For the smallest shapes the two kernels are roughly on par, and at batch_size=16, seq_len=64 the SGL kernel is slower.

Benchmark Result:

Scale difference: 2.3283064365386963e-10
Output difference: 0.0
✅ All implementations match
per-token-dynamic-quant-fp8-performance:
    batch_size  seq_len         VLLM   SGL Kernel
0         16.0     64.0    26.624000    30.560000
1         16.0    128.0    44.767998    42.016000
2         16.0    256.0    83.935998    84.063999
3         16.0    512.0   155.328006   156.864002
4         16.0   1024.0   297.567993   290.304005
5         16.0   2048.0   581.856012   550.303996
6         16.0   4096.0  1154.432058  1074.591994
7         32.0     64.0    44.672001    42.048000
8         32.0    128.0    84.927998    85.088000
9         32.0    256.0   156.208009   157.823995
10        32.0    512.0   297.567993   291.391999
11        32.0   1024.0   581.695974   550.368011
12        32.0   2048.0  1153.120041  1074.192047
13        32.0   4096.0  2290.767908  2109.632015
14        64.0     64.0    84.959999    83.903998
15        64.0    128.0   156.287998   158.207998
16        64.0    256.0   297.311991   290.784001
17        64.0    512.0   581.215978   551.648021
18        64.0   1024.0  1153.983951  1074.751973
19        64.0   2048.0  2289.776087  2110.111952
20        64.0   4096.0  4563.488007  4179.967880
21       128.0     64.0   156.128004   157.951996
22       128.0    128.0   297.311991   290.784001
23       128.0    256.0   581.215978   551.455975
24       128.0    512.0  1153.856039  1075.040102
25       128.0   1024.0  2289.727926  2109.247923
26       128.0   2048.0  4564.032078  4179.776192
27       128.0   4096.0  9105.535507  8318.592072
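
The per-configuration percentages can be recomputed from the table above. The sketch below assumes the two columns are latencies where lower is better (the benchmark output does not state a unit), and `improvement_pct` is a hypothetical helper name, not part of the benchmark script. Only three rows are copied here for illustration.

```python
def improvement_pct(baseline, candidate):
    """Percent reduction of candidate relative to baseline (positive = faster)."""
    return (baseline - candidate) / baseline * 100.0

# (batch_size, seq_len, VLLM, SGL Kernel) copied from the benchmark table.
rows = [
    (16, 64, 26.624000, 30.560000),
    (16, 4096, 1154.432058, 1074.591994),
    (128, 4096, 9105.535507, 8318.592072),
]

for batch, seq, vllm, sgl in rows:
    pct = improvement_pct(vllm, sgl)
    print(f"batch={batch:>3} seq={seq:>4} improvement={pct:+.2f}%")
```

A negative value means the SGL kernel was slower than the vLLM kernel for that shape.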

Modifications

Add sgl_per_token_quant_fp8, following the structure of the sgl_per_tensor_quant_fp8 PR: https://github.com/sgl-project/sglang/pull/3786/files#diff-38cb3d5ce48e039d80415c9934da1c819853855e4a2421a6b8439c357c18c5df
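
For readers unfamiliar with the two schemes, here is a minimal pure-Python sketch of how per-token scaling differs from per-tensor scaling. It is not the CUDA kernel from this PR. It assumes the common dynamic scheme where each scale is the absolute maximum divided by the FP8 E4M3 maximum of 448, and it leaves out the final rounding to FP8. The function names and the `eps` guard are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed target format)

def per_tensor_scale(rows):
    """One scale shared by every element of the tensor."""
    return max(abs(v) for row in rows for v in row) / FP8_E4M3_MAX

def per_token_quant(rows, eps=1e-10):
    """One scale per row (token); returns (scaled_rows, scales)."""
    scaled_rows, scales = [], []
    for row in rows:
        # eps guards against division by zero for an all-zero row.
        scale = max(max(abs(v) for v in row), eps) / FP8_E4M3_MAX
        # Values now lie in [-448, 448]; a real kernel would round to FP8 here.
        scaled_rows.append(
            [min(max(v / scale, -FP8_E4M3_MAX), FP8_E4M3_MAX) for v in row]
        )
        scales.append(scale)
    return scaled_rows, scales

# Two tokens with very different magnitudes.
activations = [[0.01, -0.02, 0.015], [50.0, -80.0, 20.0]]
scaled, token_scales = per_token_quant(activations)
tensor_scale = per_tensor_scale(activations)

print("per-tensor scale:", tensor_scale)
print("per-token scales:", token_scales)
```

With a single per-tensor scale, the small-magnitude first token would use only a tiny part of the FP8 range. A per-token scale lets each row span the full range, at the cost of one reduction per row.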

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

sgl-kernel/src/sgl-kernel/csrc/gemm/per_token_quant_fp8.cu

sgl-kernel/benchmark/bench_per_token_quant_fp8.py

sgl-kernel/tests/test_per_token_quant_fp8.py

hebiao064 added 2 commits March 5, 2025 00:28

sgl_per_token_quant_fp8

c61f290

Add Benchmark and Test

69fd91d

hebiao064 marked this pull request as ready for review March 6, 2025 05:43

hebiao064 requested review from zhyncs, ispobock, HandH1998, BBuf, yizhang2077 and merrymercy as code owners March 6, 2025 05:43

Merge branch 'main' into sgl_per_token_quant_fp8

b367964

BBuf mentioned this pull request Mar 5, 2025

[Feature] remove vllm _custom_ops #2965

Closed

18 tasks

BBuf reviewed Mar 6, 2025

sgl-kernel/src/sgl-kernel/csrc/gemm/per_token_quant_fp8.cu (outdated, resolved)

hebiao064 added 2 commits March 5, 2025 23:16

Speed up SGLang Kernel to beat vLLM Kernel

5b713f8

Merge branch 'main' into sgl_per_token_quant_fp8

f4914a5

BBuf reviewed Mar 6, 2025

sgl-kernel/benchmark/bench_per_token_quant_fp8.py (resolved)

Stefan He added 2 commits March 7, 2025 00:31

Remove Static Kernel

5f8a19e

remove unnecessary code and add larger context size

a1f1ef7

BBuf reviewed Mar 7, 2025

sgl-kernel/tests/test_per_token_quant_fp8.py (outdated, resolved)

Stefan He and others added 4 commits March 7, 2025 01:47

fix perf

89a17fa

remove print log

9bb7e15

Merge branch 'main' into sgl_per_token_quant_fp8

8606fb9

Merge branch 'main' into sgl_per_token_quant_fp8

0fb59b2

zhyncs merged commit 63ee26d into sgl-project:main Mar 7, 2025
8 checks passed

HandH1998 mentioned this pull request Mar 7, 2025

Apply sgl w8a8 fp8 kernel #3148

Merged

hebiao064 mentioned this pull request Mar 9, 2025

Accelerate FP8 CUDA Kernel by 20-28% #4215

Merged

6 tasks

aoshen524 pushed a commit to aoshen524/sglang that referenced this pull request Mar 10, 2025

Add sgl_per_token_quant_fp8 (sgl-project#4089)

b0c9874
