@hebiao064 hebiao064 commented Mar 5, 2025

Motivation

  1. Speed up the FP8 per-token quant kernel to accelerate FP8 model inference
  2. Part of the initiative: [Feature] remove vllm _custom_ops #2965

Referenced BBuf's performance-improvement PR for the per-tensor quant kernel: #3786

Achievement:

The SGL kernel achieves an average performance improvement of roughly 5.5% over the vLLM kernel across all tested configurations. For larger workloads (e.g., seq_len=4096, batch_size=128), the improvement reaches about 9%, showing that the kernel accelerates FP8 quantization for the shapes typical of model inference.

Benchmark Results:

Scale difference: 2.3283064365386963e-10
Output difference: 0.0
✅ All implementations match
per-token-dynamic-quant-fp8-performance (lower is better):
    batch_size  seq_len         VLLM   SGL Kernel
0         16.0     64.0    26.624000    30.560000
1         16.0    128.0    44.767998    42.016000
2         16.0    256.0    83.935998    84.063999
3         16.0    512.0   155.328006   156.864002
4         16.0   1024.0   297.567993   290.304005
5         16.0   2048.0   581.856012   550.303996
6         16.0   4096.0  1154.432058  1074.591994
7         32.0     64.0    44.672001    42.048000
8         32.0    128.0    84.927998    85.088000
9         32.0    256.0   156.208009   157.823995
10        32.0    512.0   297.567993   291.391999
11        32.0   1024.0   581.695974   550.368011
12        32.0   2048.0  1153.120041  1074.192047
13        32.0   4096.0  2290.767908  2109.632015
14        64.0     64.0    84.959999    83.903998
15        64.0    128.0   156.287998   158.207998
16        64.0    256.0   297.311991   290.784001
17        64.0    512.0   581.215978   551.648021
18        64.0   1024.0  1153.983951  1074.751973
19        64.0   2048.0  2289.776087  2110.111952
20        64.0   4096.0  4563.488007  4179.967880
21       128.0     64.0   156.128004   157.951996
22       128.0    128.0   297.311991   290.784001
23       128.0    256.0   581.215978   551.455975
24       128.0    512.0  1153.856039  1075.040102
25       128.0   1024.0  2289.727926  2109.247923
26       128.0   2048.0  4564.032078  4179.776192
27       128.0   4096.0  9105.535507  8318.592072
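
As a rough illustration of how such a sweep can be produced, the sketch below times one implementation per (batch_size, seq_len) configuration with Triton's do_bench. It is not the script used for the numbers above; the hidden dimension, dtype, and the trivial stand-in `quant_fn` are assumptions, and in the real comparison the vLLM op and the new sgl_per_token_quant_fp8 would be benchmarked side by side.

```python
# Hedged benchmark-sweep sketch; not the script that produced the table above.
# `quant_fn` stands in for the implementation under test (vLLM's per-token quant op
# or the new sgl_per_token_quant_fp8); hidden_dim and dtype are assumptions.
import itertools

import torch
import triton

def run_sweep(quant_fn, hidden_dim=4096, dtype=torch.float16):
    results = []
    for batch_size, seq_len in itertools.product(
        [16, 32, 64, 128], [64, 128, 256, 512, 1024, 2048, 4096]
    ):
        x = torch.randn(batch_size * seq_len, hidden_dim, device="cuda", dtype=dtype)
        # do_bench returns the measured runtime in milliseconds.
        ms = triton.testing.do_bench(lambda: quant_fn(x))
        results.append((batch_size, seq_len, ms))
        print(f"batch_size={batch_size:4d}  seq_len={seq_len:5d}  {ms * 1000:12.3f} us")
    return results

if __name__ == "__main__":
    # Trivial stand-in so the sketch runs end to end; swap in the real kernels to compare.
    run_sweep(lambda x: x.to(torch.float8_e4m3fn))
```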

Modifications

Added sgl_per_token_quant_fp8, following the sgl_per_tensor_quant_fp8 PR: https://github.com/sgl-project/sglang/pull/3786/files#diff-38cb3d5ce48e039d80415c9934da1c819853855e4a2421a6b8439c357c18c5df
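
For reference, a minimal pure-PyTorch sketch of what per-token dynamic FP8 quantization computes is shown below. It mirrors the operation the new kernel accelerates (one dynamic scale per token row, E4M3 output), but the exact signature, scale convention, and clamping behavior of sgl_per_token_quant_fp8 are assumptions here, not taken from the PR.

```python
# Pure-PyTorch reference for per-token dynamic FP8 quantization (sketch, not the CUDA kernel).
# Assumptions: float8_e4m3fn output (max representable magnitude 448) and one scale per row.
import torch

FP8_E4M3_MAX = 448.0

def per_token_quant_fp8_ref(x: torch.Tensor):
    """x: [num_tokens, hidden_dim] in fp16/bf16/fp32.
    Returns (q, scales): q in float8_e4m3fn, scales with shape [num_tokens, 1]."""
    # One dynamic scale per token: row-wise amax divided by the FP8 E4M3 maximum.
    amax = x.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    scales = (amax / FP8_E4M3_MAX).clamp(min=1e-12)
    q = (x.to(torch.float32) / scales).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scales

if __name__ == "__main__":
    x = torch.randn(4, 8, dtype=torch.float16)
    q, scales = per_token_quant_fp8_ref(x)
    # Round-trip error check on the reference itself.
    print((q.to(torch.float32) * scales - x.to(torch.float32)).abs().max())
```

A kernel-level test would instead compare the q and scales produced by the CUDA kernel against such a reference, which is presumably what the "Scale difference" / "Output difference" lines above report.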

Checklist

@hebiao064 hebiao064 marked this pull request as ready for review March 6, 2025 05:43
@BBuf BBuf mentioned this pull request Mar 5, 2025
@zhyncs zhyncs merged commit 63ee26d into sgl-project:main Mar 7, 2025
8 checks passed
aoshen524 pushed a commit to aoshen524/sglang that referenced this pull request Mar 10, 2025