[Accuracy] [Online Quantization] Llama 1B FP16/FP8/W8A8_FP8 accuracy #4434

@hebiao064


Conclusion

W8A8_FP8 quantization does not support online quantization: with an unquantized checkpoint, GSM8K accuracy collapses from 0.396 (FP16 baseline) to 0.003, while plain FP8 online quantization only drops to 0.376.

GSM8K

Preparation

curl -o test.jsonl https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl

kubectl cp /Users/bhe/Desktop/oss/data/gsm8k/test.jsonl nfs_host:/shared/public/data/gsm8k/test.jsonl
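As a quick sanity check on the downloaded file (a hedged sketch, not part of the original run), each line of the GSM8K `test.jsonl` should parse as JSON with `question` and `answer` fields, and every `answer` ends with the gold numeric answer after a `#### ` marker:

```python
import json

def validate_gsm8k_record(line: str) -> str:
    """Parse one GSM8K JSONL line and return its gold numeric answer.

    GSM8K answers end with a line of the form '#### <number>'.
    """
    record = json.loads(line)
    assert "question" in record and "answer" in record
    # The gold answer follows the '#### ' marker at the end of the answer.
    return record["answer"].rsplit("#### ", 1)[1].strip()

# Example with an inline record (the real file has 1319 such lines):
sample = json.dumps({
    "question": "Natalia sold clips to 48 of her friends...",
    "answer": "She sold 48/2 = <<48/2=24>>24 clips.\n#### 24",
})
print(validate_gsm8k_record(sample))  # -> 24
```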

FP16 Baseline:

python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Llama-3.2-1B-Instruct --trust-remote-code

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|████████████████████████████████████| 1319/1319 [00:10<00:00, 121.39it/s]
Accuracy: 0.396
Invalid: 0.003
Latency: 10.905 s
Output throughput: 11035.006 token/s

FP8

python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Llama-3.2-1B-Instruct --quantization fp8 --trust-remote-code


python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|████████████████████████████████████| 1319/1319 [00:10<00:00, 129.00it/s]
Accuracy: 0.376
Invalid: 0.001
Latency: 10.270 s
Output throughput: 11708.710 token/s

W8A8_FP8

python3 -m sglang.launch_server --model /shared/public/models/meta-llama/Llama-3.2-1B-Instruct --quantization w8a8_fp8 --trust-remote-code

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|█████████████████████████████████████| 1319/1319 [00:38<00:00, 34.36it/s]
Accuracy: 0.003
Invalid: 0.284
Latency: 38.425 s
Output throughput: 17575.022 token/s
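Summarizing the three runs (accuracy and invalid rates copied from the logs above), a small sketch computing each mode's accuracy drop relative to the FP16 baseline makes the W8A8_FP8 collapse explicit:

```python
# Accuracy/invalid numbers copied from the GSM8K benchmark logs above.
results = {
    "FP16":     {"accuracy": 0.396, "invalid": 0.003},
    "FP8":      {"accuracy": 0.376, "invalid": 0.001},
    "W8A8_FP8": {"accuracy": 0.003, "invalid": 0.284},
}

baseline = results["FP16"]["accuracy"]
for mode, r in results.items():
    # Relative accuracy loss versus the FP16 baseline.
    drop = (baseline - r["accuracy"]) / baseline
    print(f"{mode:9s} acc={r['accuracy']:.3f} "
          f"invalid={r['invalid']:.3f} drop_vs_fp16={drop:+.1%}")
```

FP8 loses about 5% relative accuracy, consistent with working online quantization, while W8A8_FP8 loses over 99%, i.e. the model is producing garbage (the high 17575 token/s "throughput" reflects degenerate output, not a speedup).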

MMLU results to be added tonight.
