AniZpZ
Collaborator

@AniZpZ AniZpZ commented Apr 22, 2025

Motivation

Apply the wna16 Marlin kernel to speed up weight-only quantization for the DSv3 (DeepSeek-V3) model.
We observe a decoding speed of over 100 tps at bs=1 on an 8*H20 platform.
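For readers unfamiliar with the term, "wna16" denotes n-bit weights with 16-bit activations. The snippet below is only a minimal pure-Python sketch of symmetric group-wise 4-bit weight quantization, to illustrate what "weight-only quantization" stores and reconstructs. It is not the Marlin kernel and not the code in this PR; the function names, the group size, and the sample weights are all assumptions made for illustration.

```python
# Illustrative only: symmetric group-wise weight quantization in plain Python.
# This is NOT the Marlin kernel; names and parameters here are hypothetical.

def quantize_group(weights, bits=4):
    """Quantize one group of weights to signed integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit signed codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize(weights, group_size=4, bits=4):
    """Split the weights into groups and quantize each group independently."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate weights; activations stay in 16-bit in wna16."""
    restored = []
    for codes, scale in groups:
        restored.extend(c * scale for c in codes)
    return restored


weights = [0.12, -0.50, 0.33, 0.07, 1.40, -0.90, 0.25, -1.10]
packed = quantize(weights, group_size=4, bits=4)
restored = dequantize(packed)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(packed)
print(max_err)
```

The reconstruction error per weight is bounded by about half of that group's scale, which is why smaller groups generally trade extra scale storage for accuracy.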

Performance (based on 0.4.6)

| Concurrency | DS R1 FP8 Throughput | DS R1 FP8 TTFT | DS R1 FP8 ITL | DS R1 FP8 E2E | This PR Throughput | This PR TTFT | This PR ITL | This PR E2E | Lift |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 66.36 | 151.63 | 14.94 | 15430.27 | 108.09 | 160.30 | 9.11 | 9472.19 | 62.88% |
| 2 | 112.02 | 239.62 | 17.64 | 18279.83 | 150.22 | 238.39 | 13.10 | 13631.87 | 34.10% |
| 4 | 196.29 | 186.09 | 20.22 | 20862.50 | 249.53 | 98.81 | 15.95 | 16409.83 | 27.12% |
| 8 | 329.92 | 340.68 | 23.94 | 24819.04 | 386.28 | 112.75 | 20.62 | 21192.90 | 17.08% |
| 16 | 507.90 | 431.63 | 30.13 | 31238.33 | 554.05 | 130.42 | 27.93 | 28691.76 | 9.08% |
| 32 | 742.78 | 566.84 | 39.50 | 40956.04 | 762.38 | 153.72 | 39.22 | 40260.41 | 2.64% |
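The Lift column appears to be the relative throughput gain of this PR over the DS R1 FP8 baseline. A small sketch, assuming that interpretation, recomputes it from the two Throughput columns; the variable and function names are made up for illustration, and the results match the reported column only to within rounding of the published figures.

```python
# Recompute throughput lift from the table's Throughput columns.
# Assumption: Lift = (This PR throughput / baseline throughput - 1) * 100.

rows = [
    # (concurrency, DS R1 FP8 throughput, This PR throughput)
    (1, 66.36, 108.09),
    (2, 112.02, 150.22),
    (4, 196.29, 249.53),
    (8, 329.92, 386.28),
    (16, 507.90, 554.05),
    (32, 742.78, 762.38),
]


def lift_percent(baseline, candidate):
    """Relative gain of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


lifts = {conc: round(lift_percent(base, pr), 2) for conc, base, pr in rows}

for conc, value in lifts.items():
    print(f"concurrency={conc:>2}  lift={value:.2f}%")
```

The gain shrinks as concurrency grows, which is consistent with the table: the largest benefit is at batch size 1.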

Modifications

Update the routing of compressed_tensors_moe and fix some bugs.
Co-author: @huangtingwei9988

Note: PR553 by @yych0745 @HandH1998 @sleepcoo adds the wna16 Marlin MoE kernel to sglang, which removes the dependency on vLLM.

Checklist

@AniZpZ AniZpZ changed the title from "[WIP] apply wna16marlin kernel in moe weight only quantization" to "apply wna16marlin kernel in moe weight only quantization" Apr 25, 2025
@AniZpZ AniZpZ requested a review from ch-wan as a code owner May 6, 2025 07:44
AniZpZ and others added 3 commits May 30, 2025 17:26
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>


* fix: remove part of vllm dependency and import error
* fix: modify the include package of gptq_marlin_repack
* fix: remove vllm dependency of fused_moe
* fix: delay the try_get_optimal_moe_config import
* fix: replace namespace of gptq_marlin_repack
* fix: update the head file of gptq_marlin_repack.cu
* update cmakelists
* modify the repack.cu
* modify namespace
* update head files
* update namespace marlin->marlin_moe_wna16
* update namespace
* add namespace
* update headfile
* remove invalid parentheses
* update CMakeLists.txt
* update headfiles
* update headsfile
* remove nested define
* add parentheses
* update
* move cuh to cu
* move identifier from namespace to outside
* update
* add namespace scope
* remove condition of compile define
* add null implementation for host
* remove namespace scope
* remove sm75
* remove define conditions & add gptq_marlin_repack_meta impl
* remove repack_meta
* add register namespace
* add namespace in sgl_kernel_ops
* add register namespace
* delay the moe_align_block_size import
* modify the import of moe_align_block_size
* add scalar_type.py & modify import of fused_moe.py
* add compilation condition & remove VLLM_AVAILABLE
* remove VLLM_AVAILABLE
@AniZpZ AniZpZ requested a review from zhaochenyang20 as a code owner June 17, 2025 07:33
@AniZpZ
Collaborator Author

AniZpZ commented Jun 18, 2025

Accuracy update

MMLU accuracy by subject:

subject: abstract_algebra, #q:100, acc: 0.730
subject: anatomy, #q:135, acc: 0.844
subject: astronomy, #q:152, acc: 0.947
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.932
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.620
subject: college_computer_science, #q:100, acc: 0.860
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.867
subject: college_physics, #q:102, acc: 0.843
subject: computer_security, #q:100, acc: 0.860
subject: conceptual_physics, #q:235, acc: 0.936
subject: econometrics, #q:114, acc: 0.737
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.939
subject: formal_logic, #q:126, acc: 0.810
subject: global_facts, #q:100, acc: 0.710
subject: high_school_biology, #q:310, acc: 0.958
subject: high_school_chemistry, #q:203, acc: 0.847
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.879
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.984
subject: high_school_macroeconomics, #q:390, acc: 0.926
subject: high_school_mathematics, #q:270, acc: 0.770
subject: high_school_microeconomics, #q:238, acc: 0.966
subject: high_school_physics, #q:151, acc: 0.841
subject: high_school_psychology, #q:545, acc: 0.971
subject: high_school_statistics, #q:216, acc: 0.856
subject: high_school_us_history, #q:204, acc: 0.961
subject: high_school_world_history, #q:237, acc: 0.941
subject: human_aging, #q:223, acc: 0.843
subject: human_sexuality, #q:131, acc: 0.931
subject: international_law, #q:121, acc: 0.942
subject: jurisprudence, #q:108, acc: 0.917
subject: logical_fallacies, #q:163, acc: 0.926
subject: machine_learning, #q:112, acc: 0.786
subject: management, #q:103, acc: 0.932
subject: marketing, #q:234, acc: 0.949
subject: medical_genetics, #q:100, acc: 0.950
subject: miscellaneous, #q:783, acc: 0.955
subject: moral_disputes, #q:346, acc: 0.873
subject: moral_scenarios, #q:895, acc: 0.761
subject: nutrition, #q:306, acc: 0.905
subject: philosophy, #q:311, acc: 0.897
subject: prehistory, #q:324, acc: 0.941
subject: professional_accounting, #q:282, acc: 0.872
subject: professional_law, #q:1534, acc: 0.699
subject: professional_medicine, #q:272, acc: 0.952
subject: professional_psychology, #q:612, acc: 0.913
subject: public_relations, #q:110, acc: 0.836
subject: security_studies, #q:245, acc: 0.898
subject: sociology, #q:201, acc: 0.960
subject: us_foreign_policy, #q:100, acc: 0.920
subject: virology, #q:166, acc: 0.584
subject: world_religions, #q:171, acc: 0.936
Total latency: 177.803
Average accuracy: 0.870
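The log above does not say whether "Average accuracy" is a plain mean over subjects or a mean weighted by question count. The following sketch, using hypothetical helper names and only the first three log lines as sample input, parses lines in this format and computes both variants so the two can be compared against the reported figure when the full log is pasted in.

```python
import re

# Matches lines such as: "subject: anatomy, #q:135, acc: 0.844"
LINE_RE = re.compile(r"subject:\s*(\w+),\s*#q:(\d+),\s*acc:\s*([\d.]+)")


def parse(log_text):
    """Return a list of (subject, question_count, accuracy) tuples."""
    return [
        (m.group(1), int(m.group(2)), float(m.group(3)))
        for m in LINE_RE.finditer(log_text)
    ]


def macro_average(results):
    """Unweighted mean of per-subject accuracy."""
    return sum(acc for _, _, acc in results) / len(results)


def weighted_average(results):
    """Mean accuracy weighted by the number of questions per subject."""
    total_q = sum(n for _, n, _ in results)
    return sum(n * acc for _, n, acc in results) / total_q


# Sample input: the first three lines of the log (not the full result set).
sample = """\
subject: abstract_algebra, #q:100, acc: 0.730
subject: anatomy, #q:135, acc: 0.844
subject: astronomy, #q:152, acc: 0.947
"""

results = parse(sample)
macro = macro_average(results)
weighted = weighted_average(results)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```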

@zhyncs
Member

zhyncs commented Jun 18, 2025

@AniZpZ

  • Please fix the conflicts.
  • Could we submit the sgl-kernel related PR first?

@AniZpZ
Collaborator Author

AniZpZ commented Jun 19, 2025

> @AniZpZ
>
>   • Please fix the conflicts.
>   • Could we submit the sgl-kernel related PR first?

We are working on removing the vLLM dependency completely, and I will let you know once that is done.

弋云 and others added 3 commits June 24, 2025 11:42
Merge branch 'wyt/fix_vllm_dependency of git@code.alipay.com:Theta/SGLang.git into w4a16_marlin

https://code.alipay.com/Theta/SGLang/pull_requests/90


Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>


* fix: remove vllm dependency in compressed_tensors_moe & add gptq_marlin_moe_repack in gptq
* update gptq_marlin_repack in gptq
* add condition of fp8_config in radixattention
* chore: remove vllm dependency & add kernel
* chore: remove vllm dependency & add kernel
* chore: remove vllm dependency & add kernel
* chore: remove gptq_marlin_gemm kernel
* add unit test
* add copyright & add unit test
Merge branch 'wyt/fix_vllm_dependency of git@code.alipay.com:Theta/SGLang.git into w4a16_marlin

https://code.alipay.com/Theta/SGLang/pull_requests/91

Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* fix: remove vllm dependency in compressed_tensors_moe & add gptq_marlin_moe_repack in gptq
* update gptq_marlin_repack in gptq
* add condition of fp8_config in radixattention
* chore: remove vllm dependency & add kernel
* chore: remove vllm dependency & add kernel
* chore: remove vllm dependency & add kernel
* chore: remove gptq_marlin_gemm kernel
* add unit test
* add copyright & add unit test
* chore: fix PytestWarning & update quant_utils
@AniZpZ
Collaborator Author

AniZpZ commented Jul 1, 2025

> @AniZpZ
>
>   • Please fix the conflicts.
>   • Could we submit the sgl-kernel related PR first?

@zhyncs
I have submitted a new PR that contains only the sgl-kernel changes: #7683

@AniZpZ AniZpZ closed this Jul 2, 2025
8 participants