
Conversation

hebiao064 (Collaborator) commented Mar 17, 2025

Motivation

Closes: #4434

We found that online w8a8 quantization was not supported: as reported in #4434, directly loading an fp16 model with `--quantization w8a8_fp8` produced a model with all weights set to zero, so the GSM8K benchmark accuracy was nearly 0.

This PR adds that support and shows reasonable benchmark results across several scenarios.

Benchmark and Testing

GSM8K

Online Quantization

  1. Online quantize a non-quantized model with `--quantization w8a8_fp8`: 0.789

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --quantization w8a8_fp8
     python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
     ```

     ```
     Accuracy: 0.789
     Invalid: 0.002
     Latency: 17.456 s
     Output throughput: 7721.450 token/s

     Accuracy: 0.789
     Invalid: 0.002
     Latency: 17.199 s
     Output throughput: 7876.126 token/s
     ```

  2. Online quantize a non-quantized model with `--quantization fp8`: 0.789

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/Meta-Llama-3-8B-Instruct --quantization fp8 --trust-remote-code
     python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
     ```

     ```
     Accuracy: 0.789
     Invalid: 0.001
     Latency: 17.623 s
     Output throughput: 7737.012 token/s

     Accuracy: 0.790
     Invalid: 0.002
     Latency: 17.356 s
     Output throughput: 7759.684 token/s
     ```

  3. No online quantization, directly serving the non-quantized model as the baseline: 0.791

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code
     python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
     ```

     ```
     Accuracy: 0.791
     Invalid: 0.001
     Latency: 19.865 s
     Output throughput: 6720.907 token/s

     Accuracy: 0.787
     Invalid: 0.001
     Latency: 19.489 s
     Output throughput: 6838.216 token/s
     ```

Offline Quantization (as a sanity check)

  1. Directly serve a pre-quantized model from Neural Magic with the dynamic quantization recipe: 0.784

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --trust-remote-code
     ```

     ```
     Accuracy: 0.784
     Invalid: 0.002
     Latency: 17.558 s
     Output throughput: 7646.241 token/s
     ```

     In this case, online serving uses the vLLM kernels.

  2. Serve a pre-quantized model from Neural Magic with the dynamic quantization recipe, plus the online quantization override `--quantization w8a8_fp8`: 0.795

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --trust-remote-code --quantization w8a8_fp8
     ```

     ```
     Accuracy: 0.795
     Invalid: 0.001
     Latency: 17.199 s
     Output throughput: 7856.940 token/s
     ```

     In this case, online serving uses the SGL kernels, which were optimized in recent releases.

MMLU

Online Quantization

  1. Unquantized model with `--quantization w8a8_fp8`: 0.682

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --quantization w8a8_fp8
     ```

     ```
     Total latency: 41.558
     Average accuracy: 0.682
     ```

  2. Unquantized model with `--quantization fp8`: 0.682

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --quantization fp8
     ```

     ```
     Total latency: 42.350
     Average accuracy: 0.682
     ```

  3. Unquantized model without any online quantization (baseline): 0.683

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code
     ```

     ```
     Total latency: 52.971
     Average accuracy: 0.683
     ```

Offline Quantization (as a sanity check)

  1. Directly serve a pre-quantized model from Neural Magic with the dynamic quantization recipe: 0.683

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --trust-remote-code
     python /home/jobuser/sglang/benchmark/mmlu/bench_sglang.py --nsub 64
     ```

     ```
     Total latency: 41.670
     Average accuracy: 0.683
     ```

     In this case, online serving uses the vLLM kernels.

  2. Serve a pre-quantized model from Neural Magic with the dynamic quantization recipe, plus the online quantization override `--quantization w8a8_fp8`: 0.682

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --trust-remote-code --quantization w8a8_fp8
     python /home/jobuser/sglang/benchmark/mmlu/bench_sglang.py --nsub 64
     ```

     ```
     Total latency: 40.486
     Average accuracy: 0.682
     ```

     In this case, online serving uses the SGL kernels, which were optimized in recent releases.

Modifications

  • `sglang/python/sglang/srt/layers/quantization/w8a8_fp8.py`: support online quantization while making sure offline quantization keeps working as expected (a sketch of the idea follows this list).
  • Added test cases to ensure the online-quantization-only path reaches a certain accuracy level. It is worth mentioning that its accuracy is slightly higher than pure offline quantization; we believe the SGL kernels for quantization and matmul deliver better precision.
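
For readers unfamiliar with the technique, here is a minimal, hypothetical sketch of online per-channel FP8 weight quantization at load time. It is not the code in `w8a8_fp8.py`; the function and variable names are illustrative only, and activation scaling is handled separately at runtime.

```python
# Hypothetical sketch: quantize an fp16/bf16 linear weight to FP8 with one scale
# per output channel at load time, plus a dequantized reference matmul to sanity-
# check the scales. Names are illustrative, not the PR's actual implementation.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_weight_per_channel_fp8(weight: torch.Tensor):
    """weight: [out_features, in_features]; returns FP8 weight and per-channel scales."""
    # Pick each channel's scale so that its max |w| maps to FP8_MAX.
    amax = weight.abs().amax(dim=1, keepdim=True).float().clamp(min=1e-12)
    scale = amax / FP8_MAX                                   # [out_features, 1]
    qweight = (weight.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return qweight, scale

def reference_linear(x: torch.Tensor, qweight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Reference path: dequantize and use a plain matmul. A real serving kernel would
    # instead feed the FP8 weight and its scales directly to an FP8 GEMM.
    w = qweight.float() * scale
    return x.float() @ w.t()

if __name__ == "__main__":
    w = torch.randn(128, 256, dtype=torch.float16)
    x = torch.randn(4, 256, dtype=torch.float16)
    qw, s = quantize_weight_per_channel_fp8(w)
    err = (reference_linear(x, qw, s) - x.float() @ w.float().t()).abs().max()
    print(f"max abs error vs. fp16 matmul: {err.item():.4f}")
```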

Checklist

hebiao064 marked this pull request as ready for review on March 17, 2025, 01:17.
hebiao064 (Collaborator, Author)

Ready for review @zhyncs @ispobock @HandH1998

zhyncs merged commit ef3c2dd into sgl-project:main on Mar 17, 2025 (1 of 4 checks passed).
zhyncs (Member) commented Mar 17, 2025

@hebiao064 @HandH1998 @ispobock Let's replace vLLM's FP8 with TensorRT-LLM's W8A8 FP8, as it can be used in all cases. What do you think?

hebiao064 (Collaborator, Author)

> @hebiao064 @HandH1998 @ispobock Let's replace vLLM's FP8 with TensorRT-LLM's W8A8 FP8, as it can be used in all cases. What do you think?

Could you please elaborate a bit more, e.g., which function needs to be replaced? It would be nice if you could share a code pointer.

yiakwy-xpu-ml-framework-team (Contributor) commented Mar 17, 2025

A fix for the datatype issue is in progress (WIP) @hebiao064:

`RuntimeError: false INTERNAL ASSERT FAILED at "/app/pytorch/aten/src/ATen/hip/HIPDataType.h":102, please report a bug to PyTorch. Cannot convert ScalarType Float8_e4m3fn to hipDataType.`

Swipe4057 (Contributor)

Do you recommend static AutoFP8 quantization, or does dynamic quantization currently have higher throughput?

hebiao064 (Collaborator, Author)

> Do you recommend static AutoFP8 quantization, or does dynamic quantization currently have higher throughput?

I am not familiar with AutoFP8, but in general I don't recommend static quantization: compared to dynamic quantization it offers limited performance improvement and a significant accuracy drop.
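
For context, here is a hedged, illustrative sketch of the difference between the two recipes (not AutoFP8's or SGLang's actual API): static quantization reuses one calibration-derived activation scale for every batch, while dynamic quantization recomputes a per-token scale from the live activations.

```python
# Illustrative contrast of static vs. dynamic FP8 activation scaling; function
# names are hypothetical, not SGLang or AutoFP8 APIs.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_activation_static(x: torch.Tensor, static_scale: torch.Tensor):
    # Static recipe: a single scale precomputed from offline calibration data.
    # Cheap at runtime, but tokens with activations outside the calibration range get clipped.
    q = (x.float() / static_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, static_scale

def quantize_activation_dynamic(x: torch.Tensor):
    # Dynamic recipe: recompute one scale per token (row) from the current batch.
    # Costs an extra reduction per forward pass but tracks activation ranges exactly.
    amax = x.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale
```

In either case the quantized activations and their scales are then passed, together with the FP8 weights, to an FP8 GEMM kernel.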

Linked issue: #4434 — [Accuracy] [Online Quantization] Llama 1B FP16/FP8/W8A8_FP8 accuracy