简体中文 | English
📖 Documentation | 🤗 Hugging Face | 🤖 ModelScope | 💬 WeChat (微信) | 🫨 Discord
- [25/09/01] Added FP8 quantization for the open-source Hunyuan-MT-7B translation model; added Torch inference and a benchmark evaluation pipeline for Eagle3; added quantization and cache support for FLUX; added quantization and compression for the Seed-OSS model.
- [25/08/06] Added FP8 and INT4 quantization for Hunyuan 0.5B/1.8B/4B/7B and Qwen2.5VL 3B/7B/32B/72B, plus FP8-Static and W4A8-FP8 quantization for DeepSeek-R1/V3 and Kimi-K2. We also released Eagle3 weights for the Hunyuan 1.8B/4B/7B series.
- [25/07/04] Added quantization for Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen and other models, covering INT8, FP8, and INT4 algorithms. We also released Eagle3 weights for the Qwen3 series.
Coming soon:
- Support for Diffusion model compression
- Release of new speculative sampling algorithms
- Highly integrated: mainstream compression algorithms are built into the toolkit so developers can invoke them with one click, making it very easy to use.
- Continuous algorithm innovation: in addition to the most widely used algorithms in industry, the toolkit keeps developing better in-house compression algorithms, which will be open-sourced over time.
- Pursuit of extreme performance: the compression pipeline and the deployment of compressed models are continuously optimized end to end; for example, Qwen3-235B and DeepSeek-R1 can be quantized on a single GPU.
The main text-generation model families are currently supported, including Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1-distilled Qwen models, and QwQ:
Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
---|---|---|---|---|---|
Hunyuan-Dense | ✅ | ✅ | ✅ | ✅ | ✅ |
Hunyuan-MoE | ✅ | ✅ | ✅ | ✅ | ✅ |
Qwen3-Dense | ✅ | ✅ | ✅ | ✅ | ✅ |
Qwen3-MoE | ✅ | ✅ | ✅ | ✅ | ✅ |
Qwen2.5 | ✅ | ✅ | ✅ | ✅ | ✅ |
DeepSeek-R1-Distill-Qwen | ✅ | ✅ | ✅ | ✅ | ✅ |
QwQ | ✅ | ✅ | ✅ | ✅ | ✅ |
Eagle3 weights for the Qwen3 and Hunyuan series have been open-sourced:
Qwen3 Models | Hunyuan Models |
---|---|
✅ Qwen3-1.7B | ✅ Hunyuan-1.8B-Instruct |
✅ Qwen3-4B | ✅ Hunyuan-4B-Instruct |
✅ Qwen3-8B | ✅ Hunyuan-7B-Instruct |
✅ Qwen3-14B | |
✅ Qwen3-32B | |
✅ Qwen3-30B-A3B | |
We recommend installing the latest stable release of AngelSlim directly with pip:
pip install angelslim
You can also clone the repository and install from source:
cd AngelSlim && python setup.py install
See the installation documentation for more detailed instructions.
After installing AngelSlim, you can get started quickly. For example, the following completes static FP8 quantization of the Qwen3-1.7B model:

- One-click launch
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
This example loads the Hugging Face model, calibrates activations on the dataset specified in the config, and produces the quantized model weights.

- Launch from source code

Apply dynamic FP8 quantization to Qwen3-1.7B:
```python
from angelslim.engine import Engine

slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```
See the Quick Start documentation for details.
After installing AngelSlim, you can run the following script to benchmark Eagle3 speculative decoding with the PyTorch backend:
python3 tools/spec_benchmark.py \
--base-model-path /path/to/base/model \
--eagle-model-path /path/to/eagle/model \
--model-id your_model_id \
--mode both
See the Quick Start documentation for details.
To load a quantized model with transformers, set deploy_backend: huggingface in the global section of the quantization config, or manually rename the key ignored_layers to ignore in the config.json of the quantized model output directory.
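If you prefer to patch config.json programmatically, a minimal sketch is shown below; where exactly ignored_layers sits in the exported config can vary, so the snippet renames the key wherever it appears rather than assuming a fixed nesting, and the model path is a placeholder:

```python
import json
from pathlib import Path

def rename_key(obj, old="ignored_layers", new="ignore"):
    """Recursively rename `old` to `new` wherever it appears in the config."""
    if isinstance(obj, dict):
        return {(new if k == old else k): rename_key(v, old, new) for k, v in obj.items()}
    if isinstance(obj, list):
        return [rename_key(v, old, new) for v in obj]
    return obj

config_path = Path("/path/to/quantized/model") / "config.json"  # quantized model output dir (placeholder)
config = json.loads(config_path.read_text())
config_path.write_text(json.dumps(rename_key(config), indent=2, ensure_ascii=False))
```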
Run offline inference with transformers to test the quantized model:
python deploy/offline.py $MODEL_PATH
where MODEL_PATH is the path to the quantized model output.
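If you'd rather not use the helper script, a minimal sketch of an equivalent offline check might look like this, assuming the checkpoint was exported for the huggingface deploy backend and your installed transformers build supports its quantization format (the model path is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/quantized/model"  # same as $MODEL_PATH above (placeholder)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)

# Build a chat-formatted prompt and generate a short reply.
messages = [{"role": "user", "content": "Briefly introduce large language model quantization."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```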
OpenAI-compatible API services can be deployed with the following inference frameworks:
vLLM
Script for starting the vLLM service. vllm>=0.8.5.post1 is recommended; deploying INT8-quantized MoE models requires vllm>=0.9.2.
bash deploy/run_vllm.sh $MODEL_PATH
SGLang
Script for starting the SGLang service. sglang>=0.4.6.post1 is recommended:
bash deploy/run_sglang.sh $MODEL_PATH
Send a request through the OpenAI-format API:
bash deploy/openai.sh $MODEL_PATH
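Equivalently, you can call the service from Python with the openai client. A minimal sketch follows; the base URL, port, and served model name are assumptions that depend on how the server was launched:

```python
from openai import OpenAI

# Point the client at the locally deployed vLLM/SGLang service (URL/port are assumptions).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/path/to/quantized/model",  # the served model name, often the model path
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```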
Evaluate the accuracy of the quantized model with lm-evaluation-harness; lm-eval>=0.4.8 is recommended:
bash deploy/lm_eval.sh $MODEL_PATH
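The same evaluation can also be driven from Python via lm-eval's simple_evaluate API. A rough sketch is below; the task list and batch size are illustrative only, not the settings used for the numbers reported further down:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # or "vllm" if lm-eval's vllm extra is installed
    model_args="pretrained=/path/to/quantized/model,dtype=auto",  # placeholder path
    tasks=["gsm8k", "mmlu"],
    batch_size=8,
)
print(results["results"])
```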
See the deployment documentation for detailed instructions.
Only results for a subset of models are shown below; see the Benchmark documentation for the complete benchmark.
The evaluation results of Hunyuan-Instruct with BF16, FP8, INT4-GPTQ, and INT4-AWQ on OlympiadBench, AIME 2024, DROP, and GPQA-Diamond are as follows:
Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
---|---|---|---|---|---|
Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |
The evaluation results of the Qwen3 series with BF16, FP8-Static, FP8-Dynamic, INT8-Dynamic, INT4-GPTQ, and INT4-AWQ on CEVAL, MMLU, GSM8K, and HUMANEVAL are as follows:
Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
---|---|---|---|---|---|
Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
QwQ-32B | BF16 | 85.74 | 82.03 | 73.31 | 42.68 |
| | FP8-Static | 85.44 | 81.91 | 75.36 | 42.68 |
| | FP8-Dynamic | 85.07 | 81.93 | 75.66 | 42.07 |
| | INT4-GPTQ | 84.03 | 81.26 | 68.23 | 45.73 |
| | INT4-AWQ | 83.58 | 81.01 | 68.69 | 43.29 |
The evaluation results of the Qwen2.5VL series with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ on MMMU_VAL, DocVQA_VAL, and ChartQA_TEST are as follows:
Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
---|---|---|---|---|
Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |
The evaluation results of DeepSeek-R1-0528 with FP8-Block-Wise and W4A8-FP8 on GPQA Diamond, AIME 2024, SimpleQA, and LiveCodeBench are as follows:
Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
---|---|---|---|---|---|
DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |
Notes:
- The results above were measured by deploying with the TRT-LLM framework and averaging over 5 evaluation runs.
- The following hyperparameters were used during evaluation:
{ "top_k": 20, "top_p": 0.6, "temperature": 0.7, "output_seq_len": 32768, "max_input_seq_len": 16384 }
The evaluation results of other models with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ on CEVAL, MMLU, and GSM8K are as follows:
Model | Quantization | CEVAL | MMLU | GSM8K |
---|---|---|---|---|
Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |
The speedup results of the Qwen3-series Eagle3 models on MT-bench, HumanEval, GSM8K, and Alpaca are as follows:
Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
---|---|---|---|---|---|---|---|---|---|---|---|
T=0 | Qwen3-1.7B | 2.05x | 2.81 | 2.07x | 2.93 | 2.11x | 2.98 | 1.93x | 2.69 | 2.04x | 2.85 |
| | Qwen3-4B | 2.21x | 3.01 | 2.36x | 3.24 | 2.42x | 3.13 | 2.32x | 2.75 | 2.33x | 3.03 |
| | Qwen3-8B | 2.63x | 3.65 | 2.76x | 3.85 | 2.82x | 3.90 | 2.62x | 3.48 | 2.70x | 3.72 |
| | Qwen3-14B | 2.23x | 3.30 | 2.53x | 3.74 | 2.56x | 3.79 | 2.16x | 3.13 | 2.37x | 3.49 |
| | Qwen3-32B | 2.39x | 2.78 | 2.37x | 2.81 | 2.47x | 2.92 | 2.42x | 2.53 | 2.41x | 2.76 |
| | Qwen3-30B-A3B | 2.84x | 3.63 | 2.27x | 3.09 | 2.64x | 3.42 | 2.83x | 3.56 | 2.64x | 3.42 |
T=1 | Qwen3-1.7B | 1.74x | 2.53 | 1.86x | 2.70 | 1.82x | 2.69 | 1.72x | 2.46 | 1.93x | 2.60 |
| | Qwen3-4B | 1.93x | 2.60 | 2.00x | 2.84 | 2.11x | 2.82 | 2.34x | 2.50 | 1.75x | 2.69 |
| | Qwen3-8B | 1.98x | 2.75 | 2.25x | 3.11 | 2.31x | 3.15 | 2.10x | 2.76 | 2.90x | 2.94 |
| | Qwen3-14B | 1.71x | 2.61 | 1.95x | 2.87 | 2.04x | 3.08 | 1.68x | 2.55 | 2.90x | 2.78 |
| | Qwen3-32B | 1.62x | 1.91 | 1.71x | 2.05 | 1.78x | 2.10 | 1.80x | 1.95 | 1.62x | 2.00 |
| | Qwen3-30B-A3B | 1.91x | 2.46 | 2.00x | 2.64 | 1.90x | 2.53 | 1.80x | 2.32 | 1.90x | 2.48 |
The speedup results of the Hunyuan-series Eagle3 models on MT-bench, HumanEval, GSM8K, and Alpaca are as follows:
Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
---|---|---|---|---|---|---|---|---|---|---|---|
T=0 | Hunyuan-1.8B-Instruct | 1.97x | 2.90 | 2.58x | 3.73 | 2.61x | 3.71 | 1.71x | 2.43 | 2.22x | 3.19 |
| | Hunyuan-4B-Instruct | 1.77x | 2.60 | 2.64x | 3.35 | 2.14x | 3.17 | 1.72x | 2.57 | 2.07x | 2.92 |
| | Hunyuan-7B-Instruct | 2.22x | 3.58 | 3.59x | 5.47 | 2.96x | 4.68 | 1.64x | 2.56 | 2.60x | 4.07 |
T=1 | Hunyuan-1.8B-Instruct | 1.58x | 2.36 | 2.35x | 3.56 | 2.23x | 3.38 | 1.26x | 1.87 | 1.86x | 2.79 |
| | Hunyuan-4B-Instruct | 1.36x | 2.05 | 1.97x | 2.86 | 1.72x | 2.68 | 1.14x | 1.76 | 1.55x | 2.34 |
| | Hunyuan-7B-Instruct | 1.90x | 3.11 | 3.12x | 5.09 | 2.74x | 4.34 | 1.47x | 2.39 | 2.31x | 3.73 |
The code of this project is released under the License for AngelSlim.
@software{AngelSlim2025,
title={{AngelSlim}},
author={Tencent AngelSlim Project Contributors},
year={2025},
month={7},
url={https://github.com/Tencent/AngelSlim},
}
- AngelSlim is under rapid iteration and more features are on the way. If you have questions or suggestions, please open an issue on GitHub Issues, or join our WeChat technical discussion group.