apply wna16marlin kernel in moe weight only quantization #5639
Conversation
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: HandH1998 <1335248067@qq.com>
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* fix: remove part of vllm dependency and import error
* fix: modify the include package of gptq_marlin_repack
* fix: remove vllm dependency of fused_moe
* fix: delay the try_get_optimal_moe_config import
* fix: replace namespace of gptq_marlin_repack
* fix: update the header file of gptq_marlin_repack.cu
* update CMakeLists.txt
* modify the repack.cu
* modify namespace
* update header files
* update namespace marlin -> marlin_moe_wna16
* update namespace
* add namespace
* remove invalid parentheses
* remove nested define
* add parentheses
* update
* move cuh to cu
* move identifier from namespace to outside
* add namespace scope
* remove condition of compile define
* add null implementation for host
* remove namespace scope
* remove sm75
* remove define conditions & add gptq_marlin_repack_meta impl
* remove repack_meta
* add register namespace
* add namespace in sgl_kernel_ops
* delay the moe_align_block_size import
* modify the import of moe_align_block_size
* add scalar_type.py & modify import of fused_moe.py
* add compilation condition & remove VLLM_AVAILABLE
* remove VLLM_AVAILABLE
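Several commits above apply the same deferred-import fix ("delay the try_get_optimal_moe_config import", "delay the moe_align_block_size import"). A minimal sketch of that pattern, assuming an illustrative module path rather than the PR's exact file layout:

```python
# Deferred-import sketch: resolve the heavy kernel symbol at call time
# instead of at module import time, so importing this module no longer
# hard-requires the compiled extension (or vLLM) to be installed.

def get_moe_config(*args, **kwargs):
    # Only resolved when a MoE forward actually runs; the module path
    # below is an assumption for illustration.
    from sglang.srt.layers.moe.fused_moe_triton.fused_moe import (
        try_get_optimal_moe_config,  # name taken from the commit log
    )
    return try_get_optimal_moe_config(*args, **kwargs)
```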
update accuracy (subject: abstract_algebra, #q: 100, acc: 0.730)

We are working to completely remove the vLLM dependency and will let you know once that is complete.
Merge branch 'wyt/fix_vllm_dependency' of git@code.alipay.com:Theta/SGLang.git into w4a16_marlin (https://code.alipay.com/Theta/SGLang/pull_requests/90)
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* fix: remove vllm dependency in compressed_tensors_moe & add gptq_marlin_moe_repack in gptq
* update gptq_marlin_repack in gptq
* add condition of fp8_config in radixattention
* chore: remove vllm dependency & add kernel
* chore: remove gptq_marlin_gemm kernel
* add unit test
* add copyright & add unit test

Merge branch 'wyt/fix_vllm_dependency' of git@code.alipay.com:Theta/SGLang.git into w4a16_marlin (https://code.alipay.com/Theta/SGLang/pull_requests/91)
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* same commit series as pull_requests/90
* chore: fix PytestWarning & update quant_utils
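The "add unit test" commits suggest a numerical check of the fused MoE path against a plain reference. A hedged sketch of what such a test can look like; the reference loop and every name below are illustrative, not copied from the PR's test file:

```python
import torch

def reference_moe(x, experts, topk_weights, topk_ids):
    # Naive per-token expert dispatch used as a numerical baseline for a
    # fused (e.g. marlin wna16) MoE kernel. experts[e] has shape (N, K).
    out = x.new_zeros(x.shape[0], experts[0].shape[0])
    num_tokens, top_k = topk_ids.shape
    for t in range(num_tokens):
        for k in range(top_k):
            e = int(topk_ids[t, k])
            out[t] += topk_weights[t, k] * (experts[e] @ x[t])
    return out

def test_reference_moe_shapes():
    # Smoke test: 4 tokens, 8 experts, top-2 routing, K=16, N=32.
    x = torch.randn(4, 16)
    experts = [torch.randn(32, 16) for _ in range(8)]
    topk_weights = torch.rand(4, 2)
    topk_ids = torch.randint(0, 8, (4, 2))
    out = reference_moe(x, experts, topk_weights, topk_ids)
    assert out.shape == (4, 32)
    assert torch.isfinite(out).all()
```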
Motivation
Apply the wna16 marlin kernel to speed up the weight-only quantized dsv3 model.

We observe a decoding speed of over 100 TPS at bs=1 on an 8×H20 platform.

Performance (based on 0.4.6):
Modifications
Update the routing of compressed_tensors_moe and fix some bugs (a dispatch sketch follows the notes below).
Co-authored-by: @huangtingwei9988

Note: PR553 by @yych0745, @HandH1998, and @sleepcoo adds wna16 marlin MoE to sglang, removing the dependency on vLLM.
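As a rough illustration of the routing change above, a hedged sketch of how a compressed-tensors MoE layer might be dispatched to the marlin wNa16 path; the field names and kernel labels are assumptions, not the PR's exact code:

```python
# Hypothetical dispatch for compressed_tensors_moe routing. "wNa16" means
# N-bit weights with 16-bit activations; marlin kernels typically cover
# 4-bit and 8-bit weights on supported NVIDIA GPUs.

MARLIN_SUPPORTED_BITS = (4, 8)  # assumption for illustration

def select_moe_kernel(scheme: str, num_bits: int, marlin_available: bool) -> str:
    if marlin_available and scheme == "wNa16" and num_bits in MARLIN_SUPPORTED_BITS:
        # Route weight-only quantized experts to the fused marlin kernel
        # (the marlin_moe_wna16 namespace added in the commits above).
        return "marlin_moe_wna16"
    # Otherwise fall back to the unquantized fused MoE path.
    return "fused_moe_triton"
```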
Checklist