Conclusion
W8A8_FP8 quantization does not support online quantization: quantizing the FP16 Llama-3.2-1B-Instruct checkpoint on the fly collapses GSM8K accuracy (see the results below), while regular FP8 only loses a small amount of accuracy relative to the FP16 baseline.
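For context, "online quantization" here means deriving the FP8 scales from the unquantized weights at load time instead of reading them from a pre-quantized checkpoint. A minimal sketch of per-tensor FP8 weight quantization, for illustration only (the helper name is made up and this is not SGLang's actual code path):

```python
import torch

def fp8_quantize_per_tensor(w: torch.Tensor):
    # Online (dynamic) per-tensor FP8 quantization: compute the scale from the
    # FP16 weight at load time rather than loading it from the checkpoint.
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = w.abs().max().clamp(min=1e-12) / finfo.max        # per-tensor scale
    w_fp8 = (w / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return w_fp8, scale  # dequantize with w_fp8.to(torch.float16) * scale
```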
GSM8K
Preparation
curl -o test.jsonl https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
kubectl cp /Users/bhe/Desktop/oss/data/gsm8k/test.jsonl nfs_host:/shared/public/data/gsm8k/test.jsonl
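A quick way to sanity-check the downloaded file, assuming the standard GSM8K schema (1319 examples, each with a "question" and an "answer" that ends in "#### <number>"):

```python
import json

with open("test.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(len(examples))                                     # expect 1319
print(examples[0]["question"][:80])
print(examples[0]["answer"].split("####")[-1].strip())   # ground-truth number
```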
FP16 Baseline:
python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Llama-3.2-1B-Instruct --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|████████████████████████████████████| 1319/1319 [00:10<00:00, 121.39it/s]
Accuracy: 0.396
Invalid: 0.003
Latency: 10.905 s
Output throughput: 11035.006 token/s
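For reference, GSM8K accuracy is usually scored by comparing the last number in the model's completion against the number after "####" in the reference, and a sample with no parsable number counts as invalid. A rough sketch of that scoring scheme (an assumption for illustration, not necessarily the exact logic in benchmark/gsm8k/bench_sglang.py):

```python
import re

INVALID = -9999

def extract_last_number(text: str) -> int:
    # Take the last integer in the text; if none is found, mark it invalid.
    matches = re.findall(r"-?\d+", text.replace(",", ""))
    return int(matches[-1]) if matches else INVALID

def score(predictions, references):
    preds = [extract_last_number(p) for p in predictions]
    refs = [extract_last_number(r.split("####")[-1]) for r in references]
    accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)
    invalid = sum(p == INVALID for p in preds) / len(preds)
    return accuracy, invalid
```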
FP8
python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Llama-3.2-1B-Instruct --quantization fp8 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|████████████████████████████████████| 1319/1319 [00:10<00:00, 129.00it/s]
Accuracy: 0.376
Invalid: 0.001
Latency: 10.270 s
Output throughput: 11708.710 token/s
W8A8_FP8
python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Llama-3.2-1B-Instruct --quantization w8a8_fp8 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|█████████████████████████████████████| 1319/1319 [00:38<00:00, 34.36it/s]
Accuracy: 0.003
Invalid: 0.284
Latency: 38.425 s
Output throughput: 17575.022 token/s
MMLU results to be added tonight.