
Conversation

hebiao064 (Collaborator) commented Mar 17, 2025

Motivation

Closes: #4434

We found that online w8a8 quantization was not supported: as reported in #4434, directly loading an fp16 model with `--quantization w8a8_fp8` produced a model with all weights set to zero, so the GSM8K benchmark accuracy was nearly 0.

This PR adds that support and shows reasonable benchmark results across several scenarios.

Benchmark and Testing

GSM8K

Online Quantization

  1. Online quantize a non-quantized model with `--quantization w8a8_fp8`: 0.789

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --quantization w8a8_fp8
     python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
     ```

     ```
     Accuracy: 0.789
     Invalid: 0.002
     Latency: 17.456 s
     Output throughput: 7721.450 token/s

     Accuracy: 0.789
     Invalid: 0.002
     Latency: 17.199 s
     Output throughput: 7876.126 token/s
     ```

  2. Online quantize a non-quantized model with `--quantization fp8`: 0.789

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/Meta-Llama-3-8B-Instruct --quantization fp8 --trust-remote-code
     python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
     ```

     ```
     Accuracy: 0.789
     Invalid: 0.001
     Latency: 17.623 s
     Output throughput: 7737.012 token/s

     Accuracy: 0.790
     Invalid: 0.002
     Latency: 17.356 s
     Output throughput: 7759.684 token/s
     ```

  3. No online quantization, directly serving the non-quantized model as the baseline: 0.791

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code
     python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
     ```

     ```
     Accuracy: 0.791
     Invalid: 0.001
     Latency: 19.865 s
     Output throughput: 6720.907 token/s

     Accuracy: 0.787
     Invalid: 0.001
     Latency: 19.489 s
     Output throughput: 6838.216 token/s
     ```

Offline Quantization (as a sanity check)

  1. Directly serve a pre-quantized model from Neural Magic with the dynamic quantization recipe: 0.784

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --trust-remote-code
     ```

     ```
     Accuracy: 0.784
     Invalid: 0.002
     Latency: 17.558 s
     Output throughput: 7646.241 token/s
     ```

     In this case, online serving uses the vLLM kernels.

  2. Serve a pre-quantized model from Neural Magic with the dynamic quantization recipe, plus the online quantization override `--quantization w8a8_fp8`: 0.795

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --trust-remote-code --quantization w8a8_fp8
     ```

     ```
     Accuracy: 0.795
     Invalid: 0.001
     Latency: 17.199 s
     Output throughput: 7856.940 token/s
     ```

     In this case, online serving uses the SGL kernels, which were optimized in recent releases.

MMLU

Online Quantization

  1. Unquantized model with `--quantization w8a8_fp8`: 0.682

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --quantization w8a8_fp8
     ```

     ```
     Total latency: 41.558
     Average accuracy: 0.682
     ```

  2. Unquantized model with `--quantization fp8`: 0.682

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code --quantization fp8
     ```

     ```
     Total latency: 42.350
     Average accuracy: 0.682
     ```

  3. Unquantized model without any online quantization (baseline): 0.683

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Meta-Llama-3.1-8B-Instruct --trust-remote-code
     ```

     ```
     Total latency: 52.971
     Average accuracy: 0.683
     ```

Offline Quantization (as a sanity check)

  1. Directly serve a pre-quantized model from Neural Magic with the dynamic quantization recipe: 0.683

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --trust-remote-code
     python /home/jobuser/sglang/benchmark/mmlu/bench_sglang.py --nsub 64
     ```

     ```
     Total latency: 41.670
     Average accuracy: 0.683
     ```

     In this case, online serving uses the vLLM kernels.

  2. Serve a pre-quantized model from Neural Magic with the dynamic quantization recipe, plus the online quantization override `--quantization w8a8_fp8`: 0.682

     ```bash
     python3 -m sglang.launch_server --model /shared/public/models/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic --trust-remote-code --quantization w8a8_fp8
     python /home/jobuser/sglang/benchmark/mmlu/bench_sglang.py --nsub 64
     ```

     ```
     Total latency: 40.486
     Average accuracy: 0.682
     ```

     In this case, online serving uses the SGL kernels, which were optimized in recent releases.

Modifications

  • `sglang/python/sglang/srt/layers/quantization/w8a8_fp8.py`: support online quantization while making sure offline quantization keeps working as expected (a sketch of the idea follows this list).
  • Added test cases to ensure the online-quantization-only path reaches a certain accuracy level. It is worth mentioning that its accuracy is slightly higher than pure offline quantization; we believe the SGL kernels for quantization and matmul deliver better precision.
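
For readers unfamiliar with the technique, here is a minimal, hypothetical sketch of online per-channel FP8 weight quantization at load time. It is not the code in `w8a8_fp8.py`; the function and variable names are illustrative only, and activation scaling is handled separately at runtime.

```python
# Hypothetical sketch: quantize an fp16/bf16 linear weight to FP8 with one scale
# per output channel at load time, plus a dequantized reference matmul to sanity-
# check the scales. Names are illustrative, not the PR's actual implementation.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_weight_per_channel_fp8(weight: torch.Tensor):
    """weight: [out_features, in_features]; returns FP8 weight and per-channel scales."""
    # Pick each channel's scale so that its max |w| maps to FP8_MAX.
    amax = weight.abs().amax(dim=1, keepdim=True).float().clamp(min=1e-12)
    scale = amax / FP8_MAX                                   # [out_features, 1]
    qweight = (weight.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return qweight, scale

def reference_linear(x: torch.Tensor, qweight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Reference path: dequantize and use a plain matmul. A real serving kernel would
    # instead feed the FP8 weight and its scales directly to an FP8 GEMM.
    w = qweight.float() * scale
    return x.float() @ w.t()

if __name__ == "__main__":
    w = torch.randn(128, 256, dtype=torch.float16)
    x = torch.randn(4, 256, dtype=torch.float16)
    qw, s = quantize_weight_per_channel_fp8(w)
    err = (reference_linear(x, qw, s) - x.float() @ w.float().t()).abs().max()
    print(f"max abs error vs. fp16 matmul: {err.item():.4f}")
```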

Checklist

hebiao064 marked this pull request as ready for review on March 17, 2025, 01:17.
hebiao064 (Collaborator, Author)

Ready for review @zhyncs @ispobock @HandH1998

zhyncs merged commit ef3c2dd into sgl-project:main on Mar 17, 2025 (1 of 4 checks passed).
zhyncs (Member) commented Mar 17, 2025

@hebiao064 @HandH1998 @ispobock Let's replace vLLM's FP8 with TensorRT-LLM's W8A8 FP8, as it can be used in all cases. What do you think?

hebiao064 (Collaborator, Author)

> @hebiao064 @HandH1998 @ispobock Let's replace vLLM's FP8 with TensorRT-LLM's W8A8 FP8, as it can be used in all cases. What do you think?

Could you please elaborate a bit more, e.g., which function needs to be replaced? It would be nice if you could share a code pointer.

yiakwy-xpu-ml-framework-team (Contributor) commented Mar 17, 2025

A fix for the datatype issue is in progress (WIP) @hebiao064:

`RuntimeError: false INTERNAL ASSERT FAILED at "/app/pytorch/aten/src/ATen/hip/HIPDataType.h":102, please report a bug to PyTorch. Cannot convert ScalarType Float8_e4m3fn to hipDataType.`

Swipe4057 (Contributor)

Do you recommend static AutoFP8 quantization, or does dynamic quantization currently have higher throughput?

hebiao064 (Collaborator, Author)

> Do you recommend static AutoFP8 quantization, or does dynamic quantization currently have higher throughput?

I am not familiar with AutoFP8, but in general I don't recommend static quantization: compared to dynamic quantization it offers limited performance improvement and a significant accuracy drop.
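
For context, here is a hedged, illustrative sketch of the difference between the two recipes (not AutoFP8's or SGLang's actual API): static quantization reuses one calibration-derived activation scale for every batch, while dynamic quantization recomputes a per-token scale from the live activations.

```python
# Illustrative contrast of static vs. dynamic FP8 activation scaling; function
# names are hypothetical, not SGLang or AutoFP8 APIs.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_activation_static(x: torch.Tensor, static_scale: torch.Tensor):
    # Static recipe: a single scale precomputed from offline calibration data.
    # Cheap at runtime, but tokens with activations outside the calibration range get clipped.
    q = (x.float() / static_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, static_scale

def quantize_activation_dynamic(x: torch.Tensor):
    # Dynamic recipe: recompute one scale per token (row) from the current batch.
    # Costs an extra reduction per forward pass but tracks activation ranges exactly.
    amax = x.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale
```

In either case the quantized activations and their scales are then passed, together with the FP8 weights, to an FP8 GEMM kernel.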

Linked issue: #4434 — [Accuracy] [Online Quantization] Llama 1B FP16/FP8/W8A8_FP8 accuracy