
AngelSlim

A toolkit dedicated to making large-model compression easier to use, more comprehensive, and more efficient.

📖 Documentation   |   🤗 Hugging Face   |   🤖 ModelScope   |   💬 WeChat (微信) |   🫨 Discord


📣Latest News

  • [25/09/01] Added FP8 quantization for the open-source Hunyuan-MT-7B translation model; Torch inference and a benchmark pipeline for Eagle3; quantization and cache support for FLUX; and quantized compression for the Seed-OSS model.
  • [25/08/06] Added FP8 and INT4 quantization for Hunyuan 0.5B/1.8B/4B/7B and Qwen2.5VL 3B/7B/32B/72B, plus FP8-Static and W4A8-FP8 quantization for DeepSeek-R1/V3 and Kimi-K2. We also open-sourced Eagle3 weights for the Hunyuan 1.8B/4B/7B model series.
  • [25/07/04] Added quantization for Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen and other models, covering INT8, FP8, and INT4 algorithms. We also open-sourced Eagle3 weights for the Qwen3 model series.

Coming soon:

  • Diffusion model compression support
  • Release of new speculative decoding algorithms

🌟Key Features

  • Highly integrated: mainstream compression algorithms are built into the toolkit and can be invoked with a single command, making it very easy to use.
  • Continuous algorithm innovation: beyond the most widely used industrial algorithms, we keep developing better compression algorithms in-house and will open-source them over time.
  • Engineered for performance: the compression pipeline and the deployment of compressed models are optimized end to end; for example, Qwen3-235B and DeepSeek-R1 can be quantized on a single GPU.

💼Supported Models

Quantization

Quantization of text-generation models is currently supported for the main models of the Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1-distilled Qwen, and QwQ series:

Supported quantization methods: FP8-Dynamic, FP8-Static, INT8-Dynamic, INT4-GPTQ, INT4-AWQ

Supported model series: Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1-Distill-Qwen, QwQ

Speculative Decoding

Eagle3

Eagle3 weights for the Qwen3 and Hunyuan model series are now open-sourced.

| Qwen3 Models | Hunyuan Models |
| --- | --- |
| Qwen3-1.7B | Hunyuan-1.8B-Instruct |
| Qwen3-4B | Hunyuan-4B-Instruct |
| Qwen3-8B | Hunyuan-7B-Instruct |
| Qwen3-14B | |
| Qwen3-32B | |
| Qwen3-30B-A3B | |

🛎️How to Use

Install AngelSlim

We recommend installing the latest stable release of AngelSlim with pip:

pip install angelslim

Alternatively, clone the repository and install from source:

cd AngelSlim && python setup.py install

See the installation documentation for more detailed instructions.

Quick Start

Quantization

After installing AngelSlim, you can get started quickly with the following script, which performs static FP8 quantization of the Qwen3-1.7B model:

  • One-click launch

    python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml

    This example loads the Hugging Face model, calibrates activations on the dataset specified in the config, and saves the quantized model weights.

  • Launch from source

    Dynamic FP8 quantization of Qwen3-1.7B:

    from angelslim.engine import Engine
    
    slim_engine = Engine()
    # Prepare model
    slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B",)
    # Initialize compressor
    slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
    # Compress model
    slim_engine.run()
    # Save compressed model
    slim_engine.save("./output")

See the quick-start documentation for details.
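For reference, the YAML passed to tools/run.py above bundles the model, calibration data, and quantization settings. The sketch below is illustrative only: every field name is an assumption rather than AngelSlim's actual schema (the only key taken from this document is `deploy_backend` under `global`); consult the shipped files under configs/ for the real layout.

```yaml
# Illustrative sketch only -- field names are assumptions, not the real
# AngelSlim schema; see configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml.
global:
  deploy_backend: vllm        # or huggingface, to load the output with transformers
model:
  name: Qwen
  model_path: Qwen/Qwen3-1.7B
compression:
  name: PTQ
  quantization:
    name: fp8_static          # static FP8 requires activation calibration
dataset:
  data_path: ./dataset/calib.jsonl   # calibration data for activation scales
```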

Speculative Decoding

After installing AngelSlim, you can quickly run a PyTorch performance benchmark for Eagle3 with the following script:

python3 tools/spec_benchmark.py \
    --base-model-path /path/to/base/model \
    --eagle-model-path /path/to/eagle/model \
    --model-id your_model_id \
    --mode both

See the quick-start documentation for details.

Deployment and Testing

1. Offline Inference

To load a quantized model with transformers, set `deploy_backend: huggingface` in the `global` section of the quantization config, or manually rename the key `ignored_layers` to `ignore` in the `config.json` under the quantized model's output directory.
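The manual `config.json` edit can also be scripted. A minimal sketch: since where exactly `ignored_layers` nests inside `config.json` may vary, the helper below renames the key wherever it appears.

```python
import json
from pathlib import Path


def rename_ignored_layers(config_path: str) -> bool:
    """Rename 'ignored_layers' to 'ignore' anywhere in config.json.

    Returns True if the file was modified. The nesting of the key is not
    assumed; the whole JSON tree is walked.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())

    def rename(node) -> bool:
        changed = False
        if isinstance(node, dict):
            if "ignored_layers" in node:
                node["ignore"] = node.pop("ignored_layers")
                changed = True
            for value in node.values():
                changed |= rename(value)
        elif isinstance(node, list):
            for value in node:
                changed |= rename(value)
        return changed

    changed = rename(config)
    if changed:
        path.write_text(json.dumps(config, indent=2))
    return changed
```

Run it as `rename_ignored_layers("$MODEL_PATH/config.json")` before loading the model with transformers.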

Test offline inference with the quantized model loaded via transformers:

python deploy/offline.py $MODEL_PATH

where MODEL_PATH is the path to the quantized model output.

2. Service Deployment

OpenAI-compatible API services can be deployed with the following inference frameworks:

vLLM

vLLM server launch script; vllm>=0.8.5.post1 is recommended, and deploying MoE INT8 quantized models requires vllm>=0.9.2:

bash deploy/run_vllm.sh $MODEL_PATH

SGLang

SGLang server launch script; sglang>=0.4.6.post1 is recommended:

bash deploy/run_sglang.sh $MODEL_PATH

3. Calling the Service

Send requests through the OpenAI-format API:

bash deploy/openai.sh $MODEL_PATH
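The deployed service can also be called from Python. A sketch that only builds the OpenAI-format chat request; the base URL and model name are assumptions (vLLM's OpenAI-compatible server listens on port 8000 by default), and actually sending it requires a running server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-format /chat/completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# With a server running (base URL and model name are deployment-specific):
# req = build_chat_request("http://localhost:8000/v1", "qwen3-1_7b", "Hello")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```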

4. Accuracy Evaluation

Evaluate the accuracy of quantized models with lm-evaluation-harness; lm-eval>=0.4.8 is recommended:

bash deploy/lm_eval.sh $MODEL_PATH

See the deployment documentation for a detailed guide.

📈Benchmark

(1) Quantization

Only a subset of models is shown below; see the Benchmark documentation for the complete results.

Hunyuan Series Models

Evaluation results for Hunyuan-Instruct models under BF16, FP8, INT4-GPTQ, and INT4-AWQ on OlympiadBench, AIME 2024, DROP, and GPQA-Diamond:

| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
| --- | --- | --- | --- | --- | --- |
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |

Qwen3 Series Models

Evaluation results for Qwen3 series models under BF16, FP8-Static, FP8-Dynamic, INT8-Dynamic, INT4-GPTQ, and INT4-AWQ on CEVAL, MMLU, GSM8K, and HUMANEVAL:

| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
| QwQ-32B | BF16 | 85.74 | 82.03 | 73.31 | 42.68 |
| | FP8-Static | 85.44 | 81.91 | 75.36 | 42.68 |
| | FP8-Dynamic | 85.07 | 81.93 | 75.66 | 42.07 |
| | INT4-GPTQ | 84.03 | 81.26 | 68.23 | 45.73 |
| | INT4-AWQ | 83.58 | 81.01 | 68.69 | 43.29 |

Qwen2.5VL Series Models

Evaluation results for Qwen2.5VL series models under BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ on MMMU_VAL, DocVQA_VAL, and ChartQA_TEST:

| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
| --- | --- | --- | --- | --- |
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |

DeepSeek Series Models

Evaluation results for DeepSeek-R1-0528 under FP8-Block-Wise and W4A8-FP8 on GPQA Diamond, AIME 2024, SimpleQA, and LiveCodeBench:

| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |

Notes

  • The results above were obtained by deploying with the TRT-LLM framework and averaging over 5 test runs.
  • The evaluation hyperparameters were:
{
 "top_k": 20,
 "top_p": 0.6,
 "temperature": 0.7,
 "output_seq_len": 32768,
 "max_input_seq_len": 16384
}

Other Models

Evaluation results for other models under BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ on CEVAL, MMLU, and GSM8K:

| Model | Quantization | CEVAL | MMLU | GSM8K |
| --- | --- | --- | --- | --- |
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |

(2) Speculative Decoding

Qwen3 Series Models

Speedup results for Qwen3-series Eagle3 models on MT-bench, HumanEval, GSM8K, and Alpaca:

Each cell reports Speedup / τ.

| Temperature | Model | MT-bench | HumanEval | GSM8K | Alpaca | Mean |
| --- | --- | --- | --- | --- | --- | --- |
| T=0 | Qwen3-1.7B | 2.05x / 2.81 | 2.07x / 2.93 | 2.11x / 2.98 | 1.93x / 2.69 | 2.04x / 2.85 |
| | Qwen3-4B | 2.21x / 3.01 | 2.36x / 3.24 | 2.42x / 3.13 | 2.32x / 2.75 | 2.33x / 3.03 |
| | Qwen3-8B | 2.63x / 3.65 | 2.76x / 3.85 | 2.82x / 3.90 | 2.62x / 3.48 | 2.70x / 3.72 |
| | Qwen3-14B | 2.23x / 3.30 | 2.53x / 3.74 | 2.56x / 3.79 | 2.16x / 3.13 | 2.37x / 3.49 |
| | Qwen3-32B | 2.39x / 2.78 | 2.37x / 2.81 | 2.47x / 2.92 | 2.42x / 2.53 | 2.41x / 2.76 |
| | Qwen3-30B-A3B | 2.84x / 3.63 | 2.27x / 3.09 | 2.64x / 3.42 | 2.83x / 3.56 | 2.64x / 3.42 |
| T=1 | Qwen3-1.7B | 1.74x / 2.53 | 1.86x / 2.70 | 1.82x / 2.69 | 1.72x / 2.46 | 1.93x / 2.60 |
| | Qwen3-4B | 1.93x / 2.60 | 2.00x / 2.84 | 2.11x / 2.82 | 2.34x / 2.50 | 1.75x / 2.69 |
| | Qwen3-8B | 1.98x / 2.75 | 2.25x / 3.11 | 2.31x / 3.15 | 2.10x / 2.76 | 2.90x / 2.94 |
| | Qwen3-14B | 1.71x / 2.61 | 1.95x / 2.87 | 2.04x / 3.08 | 1.68x / 2.55 | 2.90x / 2.78 |
| | Qwen3-32B | 1.62x / 1.91 | 1.71x / 2.05 | 1.78x / 2.10 | 1.80x / 1.95 | 1.62x / 2.00 |
| | Qwen3-30B-A3B | 1.91x / 2.46 | 2.00x / 2.64 | 1.90x / 2.53 | 1.80x / 2.32 | 1.90x / 2.48 |

Hunyuan Series Models

Speedup results for Hunyuan-series Eagle3 models on MT-bench, HumanEval, GSM8K, and Alpaca:

Each cell reports Speedup / τ.

| Temperature | Model | MT-bench | HumanEval | GSM8K | Alpaca | Mean |
| --- | --- | --- | --- | --- | --- | --- |
| T=0 | Hunyuan-1.8B-Instruct | 1.97x / 2.90 | 2.58x / 3.73 | 2.61x / 3.71 | 1.71x / 2.43 | 2.22x / 3.19 |
| | Hunyuan-4B-Instruct | 1.77x / 2.60 | 2.64x / 3.35 | 2.14x / 3.17 | 1.72x / 2.57 | 2.07x / 2.92 |
| | Hunyuan-7B-Instruct | 2.22x / 3.58 | 3.59x / 5.47 | 2.96x / 4.68 | 1.64x / 2.56 | 2.60x / 4.07 |
| T=1 | Hunyuan-1.8B-Instruct | 1.58x / 2.36 | 2.35x / 3.56 | 2.23x / 3.38 | 1.26x / 1.87 | 1.86x / 2.79 |
| | Hunyuan-4B-Instruct | 1.36x / 2.05 | 1.97x / 2.86 | 1.72x / 2.68 | 1.14x / 1.76 | 1.55x / 2.34 |
| | Hunyuan-7B-Instruct | 1.90x / 3.11 | 3.12x / 5.09 | 2.74x / 4.34 | 1.47x / 2.39 | 2.31x / 3.73 |

📝License

The code in this project is open-sourced under the License for AngelSlim.

🔗Citation

@software{AngelSlim2025,
    title={{AngelSlim}},
    author={Tencent AngelSlim Project Contributors},
    year={2025},
    month={7},
    url={https://github.com/Tencent/AngelSlim},
}

💬Community

  • AngelSlim is iterating rapidly, with more features on the way. If you have questions or suggestions, please file an issue via GitHub Issues or join our WeChat technical discussion group.
