
[Feature] support torchao for qwen2 models #2219


Description

@tricky61

I ran Qwen2-7B-Instruct on a single A30 card; the decode speed with torchao quantization seems no different from the unquantized baseline:

python3 -m sglang.bench_latency --model ../Qwen2-7B-Instruct --batch-size 1 --input-len 200 --output-len 100
Benchmark ...
Prefill. latency: 0.03508 s, throughput: 5700.84 token/s
Decode. latency: 0.01952 s, throughput: 51.23 token/s
Decode. latency: 0.01947 s, throughput: 51.37 token/s
Decode. latency: 0.01939 s, throughput: 51.58 token/s
Decode. latency: 0.01933 s, throughput: 51.74 token/s
Decode. latency: 0.01928 s, throughput: 51.87 token/s
Decode. median latency: 0.01924 s, median throughput: 51.98 token/s
Total. latency: 1.942 s, throughput: 154.52 token/s

python3 -m sglang.bench_latency --model ../Qwen2-7B-Instruct --batch-size 1 --input-len 200 --output-len 100 --enable-torch-compile
Benchmark ...
Prefill. latency: 0.03655 s, throughput: 5471.84 token/s
Decode. latency: 0.01852 s, throughput: 54.00 token/s
Decode. latency: 0.01847 s, throughput: 54.14 token/s
Decode. latency: 0.01845 s, throughput: 54.21 token/s
Decode. latency: 0.01843 s, throughput: 54.26 token/s
Decode. latency: 0.01838 s, throughput: 54.39 token/s
Decode. median latency: 0.01836 s, median throughput: 54.46 token/s
Total. latency: 1.855 s, throughput: 161.71 token/s

python3 -m sglang.bench_latency --model ../Qwen2-7B-Instruct --batch-size 1 --input-len 200 --output-len 100 --enable-torch-compile --torchao-config int8wo
Benchmark ...
Prefill. latency: 0.04469 s, throughput: 4475.31 token/s
Decode. latency: 0.01860 s, throughput: 53.77 token/s
Decode. latency: 0.01849 s, throughput: 54.09 token/s
Decode. latency: 0.01844 s, throughput: 54.24 token/s
Decode. latency: 0.01841 s, throughput: 54.32 token/s
Decode. latency: 0.01837 s, throughput: 54.45 token/s
Decode. median latency: 0.01836 s, median throughput: 54.46 token/s
Total. latency: 1.863 s, throughput: 160.99 token/s

python3 -m sglang.bench_latency --model ../Qwen2-7B-Instruct --batch-size 1 --input-len 200 --output-len 100 --enable-torch-compile --torchao-config int4wo
Benchmark ...
Prefill. latency: 0.03558 s, throughput: 5621.52 token/s
Decode. latency: 0.01855 s, throughput: 53.91 token/s
Decode. latency: 0.01852 s, throughput: 54.01 token/s
Decode. latency: 0.01845 s, throughput: 54.20 token/s
Decode. latency: 0.01842 s, throughput: 54.28 token/s
Decode. latency: 0.01841 s, throughput: 54.33 token/s
Decode. median latency: 0.01837 s, median throughput: 54.44 token/s
Total. latency: 1.855 s, throughput: 161.72 token/s
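To check whether int8 weight-only quantization gives any kernel-level speedup on this GPU at all (independent of SGLang), one option is a small standalone microbenchmark of a single decode-shaped matmul. The sketch below assumes torchao's `quantize_` / `int8_weight_only` API; the layer shape (3584 → 18944) is an assumption roughly matching a Qwen2-7B MLP projection, not an exact value, and the script is illustrative only.

```python
# Minimal standalone sketch (not part of SGLang): compare a bf16 linear
# against a torchao int8 weight-only quantized copy on a batch-1 matvec.
import copy

import torch
from torchao.quantization import quantize_, int8_weight_only  # assumed API

torch.manual_seed(0)
device = "cuda"

# Decode step: batch size 1; hidden/intermediate sizes are assumed values.
linear = torch.nn.Linear(3584, 18944, bias=False,
                         dtype=torch.bfloat16, device=device)
x = torch.randn(1, 3584, dtype=torch.bfloat16, device=device)

quantized = copy.deepcopy(linear)
quantize_(quantized, int8_weight_only())  # swap weight for an int8 tensor subclass

def bench(mod, iters=200):
    # Warm up, then time with CUDA events.
    for _ in range(10):
        mod(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        mod(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

print(f"bf16 linear:   {bench(linear):.4f} ms")
print(f"int8wo linear: {bench(quantized):.4f} ms")
```

If the quantized layer is not faster in this isolated test either, the flat end-to-end numbers above are expected; if it is clearly faster, the gap would point at how SGLang applies `--torchao-config` for Qwen2 models.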
