
[Feature] support torchao for qwen2 models #2219


Description

@tricky61

I ran Qwen2-7B-Instruct on a single A30 card; the decode speed with torchao quantization seems no different from the unquantized baseline:

python3 -m sglang.bench_latency --model ../Qwen2-7B-Instruct --batch-size 1 --input-len 200 --output-len 100
Benchmark ...
Prefill. latency: 0.03508 s, throughput: 5700.84 token/s
Decode. latency: 0.01952 s, throughput: 51.23 token/s
Decode. latency: 0.01947 s, throughput: 51.37 token/s
Decode. latency: 0.01939 s, throughput: 51.58 token/s
Decode. latency: 0.01933 s, throughput: 51.74 token/s
Decode. latency: 0.01928 s, throughput: 51.87 token/s
Decode. median latency: 0.01924 s, median throughput: 51.98 token/s
Total. latency: 1.942 s, throughput: 154.52 token/s

python3 -m sglang.bench_latency --model ../Qwen2-7B-Instruct --batch-size 1 --input-len 200 --output-len 100 --enable-torch-compile
Benchmark ...
Prefill. latency: 0.03655 s, throughput: 5471.84 token/s
Decode. latency: 0.01852 s, throughput: 54.00 token/s
Decode. latency: 0.01847 s, throughput: 54.14 token/s
Decode. latency: 0.01845 s, throughput: 54.21 token/s
Decode. latency: 0.01843 s, throughput: 54.26 token/s
Decode. latency: 0.01838 s, throughput: 54.39 token/s
Decode. median latency: 0.01836 s, median throughput: 54.46 token/s
Total. latency: 1.855 s, throughput: 161.71 token/s

python3 -m sglang.bench_latency --model ../Qwen2-7B-Instruct --batch-size 1 --input-len 200 --output-len 100 --enable-torch-compile --torchao-config int8wo
Benchmark ...
Prefill. latency: 0.04469 s, throughput: 4475.31 token/s
Decode. latency: 0.01860 s, throughput: 53.77 token/s
Decode. latency: 0.01849 s, throughput: 54.09 token/s
Decode. latency: 0.01844 s, throughput: 54.24 token/s
Decode. latency: 0.01841 s, throughput: 54.32 token/s
Decode. latency: 0.01837 s, throughput: 54.45 token/s
Decode. median latency: 0.01836 s, median throughput: 54.46 token/s
Total. latency: 1.863 s, throughput: 160.99 token/s

python3 -m sglang.bench_latency --model ../Qwen2-7B-Instruct --batch-size 1 --input-len 200 --output-len 100 --enable-torch-compile --torchao-config int4wo
Benchmark ...
Prefill. latency: 0.03558 s, throughput: 5621.52 token/s
Decode. latency: 0.01855 s, throughput: 53.91 token/s
Decode. latency: 0.01852 s, throughput: 54.01 token/s
Decode. latency: 0.01845 s, throughput: 54.20 token/s
Decode. latency: 0.01842 s, throughput: 54.28 token/s
Decode. latency: 0.01841 s, throughput: 54.33 token/s
Decode. median latency: 0.01837 s, median throughput: 54.44 token/s
Total. latency: 1.855 s, throughput: 161.72 token/s
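To check whether int8 weight-only quantization gives any kernel-level speedup on this GPU at all (independent of SGLang), one option is a small standalone microbenchmark of a single decode-shaped matmul. The sketch below assumes torchao's `quantize_` / `int8_weight_only` API; the layer shape (3584 → 18944) is an assumption roughly matching a Qwen2-7B MLP projection, not an exact value, and the script is illustrative only.

```python
# Minimal standalone sketch (not part of SGLang): compare a bf16 linear
# against a torchao int8 weight-only quantized copy on a batch-1 matvec.
import copy

import torch
from torchao.quantization import quantize_, int8_weight_only  # assumed API

torch.manual_seed(0)
device = "cuda"

# Decode step: batch size 1; hidden/intermediate sizes are assumed values.
linear = torch.nn.Linear(3584, 18944, bias=False,
                         dtype=torch.bfloat16, device=device)
x = torch.randn(1, 3584, dtype=torch.bfloat16, device=device)

quantized = copy.deepcopy(linear)
quantize_(quantized, int8_weight_only())  # swap weight for an int8 tensor subclass

def bench(mod, iters=200):
    # Warm up, then time with CUDA events.
    for _ in range(10):
        mod(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        mod(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

print(f"bf16 linear:   {bench(linear):.4f} ms")
print(f"int8wo linear: {bench(quantized):.4f} ms")
```

If the quantized layer is not faster in this isolated test either, the flat end-to-end numbers above are expected; if it is clearly faster, the gap would point at how SGLang applies `--torchao-config` for Qwen2 models.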
