Support Online Quantization for W8A8 #4485
Conversation
Ready for review @zhyncs @ispobock @HandH1998
@hebiao064 @HandH1998 @ispobock Let's replace vLLM's FP8 with TensorRT-LLM's W8A8 FP8, as it can be used in all cases. What do you think?
Could you please elaborate a little bit more? For example, which functions need to be replaced?
Fixing the datatype issue [WIP] @hebiao064
Do you recommend static AutoFP8 quantization, or does dynamic quantization currently have higher throughput?
I am not familiar with AutoFP8, but in general I don't recommend static quantization: it gives limited perf improvement and a significant accuracy drop compared to dynamic quantization.
Motivation
Closes: #4434
We found that online quantization for W8A8 is not supported. As reported in #4434, directly loading an FP16 model with `--quantization w8a8_fp8` loads a model whose weights are all zero, so the GSM8K benchmark result is nearly 0. This PR adds that support and shows reasonable benchmark results across different scenarios.
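For reference, a minimal launch sketch of the online-quantization path exercised here; the model path and port are placeholders, not necessarily the exact setup used for the numbers below:

```bash
# Serve an unquantized FP16/BF16 checkpoint and quantize it to W8A8 FP8 at load time.
# Model path and port are placeholders (assumptions), not the configuration benchmarked below.
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization w8a8_fp8 \
  --port 30000
```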
Benchmark and Testing
GSM8K
Online Quantization
- Online quantize a non-quantized model with `--quantization w8a8_fp8`: 0.789
  - Accuracy: 0.789, Invalid: 0.002, Latency: 17.456 s, Output throughput: 7721.450 token/s
  - Accuracy: 0.789, Invalid: 0.002, Latency: 17.199 s, Output throughput: 7876.126 token/s
- Online quantize a non-quantized model with `--quantization fp8`: 0.789
  - Accuracy: 0.789, Invalid: 0.001, Latency: 17.623 s, Output throughput: 7737.012 token/s
  - Accuracy: 0.790, Invalid: 0.002, Latency: 17.356 s, Output throughput: 7759.684 token/s
- No online quantization: directly serve a non-quantized model as baseline: 0.791
  - Accuracy: 0.791, Invalid: 0.001, Latency: 19.865 s, Output throughput: 6720.907 token/s
  - Accuracy: 0.787, Invalid: 0.001, Latency: 19.489 s, Output throughput: 6838.216 token/s
Offline Quantization (As sanity check)
- Directly serve a pre-quantized model from Neural Magic with the dynamic quantization recipe: 0.784
  - Accuracy: 0.784, Invalid: 0.002, Latency: 17.558 s, Output throughput: 7646.241 token/s
- Serve a pre-quantized model from Neural Magic (dynamic quantization recipe) with the online quantization override `--quantization w8a8_fp8`: 0.795
  - Accuracy: 0.795, Invalid: 0.001, Latency: 17.199 s, Output throughput: 7856.940 token/s
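The GSM8K numbers above can be reproduced against a server launched as in the sketch under Motivation, using the repository's GSM8K benchmark script; the question count and parallelism below are illustrative assumptions, not the exact settings behind these results:

```bash
# Run the GSM8K few-shot benchmark against the locally running SGLang server.
# --num-questions and --parallel are illustrative values, not the settings used for the numbers above.
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 128
```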
MMLU
Online Quantization
- Unquantized model with `--quantization w8a8_fp8`: 0.682
- Unquantized model with `--quantization fp8`: 0.682
  - Total latency: 42.350 s, Average accuracy: 0.682
- Unquantized model without any online quantization (baseline): 0.683
  - Total latency: 52.971 s, Average accuracy: 0.683
Offline Quantization (As sanity check)
- Directly serve a pre-quantized model from Neural Magic with the dynamic quantization recipe: 0.683
  - Total latency: 41.670 s, Average accuracy: 0.683
- Serve a pre-quantized model from Neural Magic (dynamic quantization recipe) with the online quantization override `--quantization w8a8_fp8`: 0.682
  - Total latency: 40.486 s, Average accuracy: 0.682
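The MMLU results were collected the same way using the repository's MMLU benchmark script; a sketch against a running server, treating the flag value as an assumption:

```bash
# Run the MMLU benchmark against the locally running SGLang server.
# --nsub (number of subjects evaluated) is an illustrative value; check the script's --help for exact options.
python3 benchmark/mmlu/bench_sglang.py --nsub 57
```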
Modifications
Checklist