apply wna16marlin kernel in moe weight only quantization #5639
Conversation
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: HandH1998 <1335248067@qq.com>
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* fix: remove part of vllm dependency and import error
* fix: modify the include package of gptq_marlin_repack
* fix: remove vllm dependency of fused_moe
* fix: delay the try_get_optimal_moe_config import
* fix: replace namespace of gptq_marlin_repack
* fix: update the header file of gptq_marlin_repack.cu
* update CMakeLists.txt
* modify the repack.cu
* modify namespace
* update header files
* update namespace marlin -> marlin_moe_wna16
* update namespace
* add namespace
* remove invalid parentheses
* remove nested define
* add parentheses
* update
* move cuh to cu
* move identifier from namespace to outside
* add namespace scope
* remove condition of compile define
* add null implementation for host
* remove namespace scope
* remove sm75
* remove define conditions & add gptq_marlin_repack_meta impl
* remove repack_meta
* add register namespace
* add namespace in sgl_kernel_ops
* delay the moe_align_block_size import
* modify the import of moe_align_block_size
* add scalar_type.py & modify import of fused_moe.py
* add compilation condition & remove VLLM_AVAILABLE
* remove VLLM_AVAILABLE
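Several commits above apply the same deferred-import fix ("delay the try_get_optimal_moe_config import", "delay the moe_align_block_size import"). A minimal sketch of that pattern, assuming an illustrative module path rather than the PR's exact file layout:

```python
# Deferred-import sketch: resolve the heavy kernel symbol at call time
# instead of at module import time, so importing this module no longer
# hard-requires the compiled extension (or vLLM) to be installed.

def get_moe_config(*args, **kwargs):
    # Only resolved when a MoE forward actually runs; the module path
    # below is an assumption for illustration.
    from sglang.srt.layers.moe.fused_moe_triton.fused_moe import (
        try_get_optimal_moe_config,  # name taken from the commit log
    )
    return try_get_optimal_moe_config(*args, **kwargs)
```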
update accuracy (subject: abstract_algebra, #q: 100, acc: 0.730)

We are working to completely remove the vLLM dependency and will let you know once that is complete.
Merge branch 'wyt/fix_vllm_dependency' of git@code.alipay.com:Theta/SGLang.git into w4a16_marlin (https://code.alipay.com/Theta/SGLang/pull_requests/90)
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* fix: remove vllm dependency in compressed_tensors_moe & add gptq_marlin_moe_repack in gptq
* update gptq_marlin_repack in gptq
* add condition of fp8_config in radixattention
* chore: remove vllm dependency & add kernel
* chore: remove gptq_marlin_gemm kernel
* add unit test
* add copyright & add unit test

Merge branch 'wyt/fix_vllm_dependency' of git@code.alipay.com:Theta/SGLang.git into w4a16_marlin (https://code.alipay.com/Theta/SGLang/pull_requests/91)
Reviewed-by: 庄森 <zhuangsen.zp@antgroup.com>

* same commit series as pull_requests/90
* chore: fix PytestWarning & update quant_utils
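The "add unit test" commits suggest a numerical check of the fused MoE path against a plain reference. A hedged sketch of what such a test can look like; the reference loop and every name below are illustrative, not copied from the PR's test file:

```python
import torch

def reference_moe(x, experts, topk_weights, topk_ids):
    # Naive per-token expert dispatch used as a numerical baseline for a
    # fused (e.g. marlin wna16) MoE kernel. experts[e] has shape (N, K).
    out = x.new_zeros(x.shape[0], experts[0].shape[0])
    num_tokens, top_k = topk_ids.shape
    for t in range(num_tokens):
        for k in range(top_k):
            e = int(topk_ids[t, k])
            out[t] += topk_weights[t, k] * (experts[e] @ x[t])
    return out

def test_reference_moe_shapes():
    # Smoke test: 4 tokens, 8 experts, top-2 routing, K=16, N=32.
    x = torch.randn(4, 16)
    experts = [torch.randn(32, 16) for _ in range(8)]
    topk_weights = torch.rand(4, 2)
    topk_ids = torch.randint(0, 8, (4, 2))
    out = reference_moe(x, experts, topk_weights, topk_ids)
    assert out.shape == (4, 32)
    assert torch.isfinite(out).all()
```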
Motivation
Apply the wna16 marlin kernel to speed up the weight-only quantized dsv3 model.

We observe a decoding speed of over 100 TPS at bs=1 on an 8×H20 platform.

Performance (based on 0.4.6):
Modifications
Update the routing of compressed_tensors_moe and fix some bugs (a dispatch sketch follows the notes below).
Co-authored-by: @huangtingwei9988

Note: PR553 by @yych0745, @HandH1998, and @sleepcoo adds wna16 marlin MoE to sglang, removing the dependency on vLLM.
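As a rough illustration of the routing change above, a hedged sketch of how a compressed-tensors MoE layer might be dispatched to the marlin wNa16 path; the field names and kernel labels are assumptions, not the PR's exact code:

```python
# Hypothetical dispatch for compressed_tensors_moe routing. "wNa16" means
# N-bit weights with 16-bit activations; marlin kernels typically cover
# 4-bit and 8-bit weights on supported NVIDIA GPUs.

MARLIN_SUPPORTED_BITS = (4, 8)  # assumption for illustration

def select_moe_kernel(scheme: str, num_bits: int, marlin_available: bool) -> str:
    if marlin_available and scheme == "wNa16" and num_bits in MARLIN_SUPPORTED_BITS:
        # Route weight-only quantized experts to the fused marlin kernel
        # (the marlin_moe_wna16 namespace added in the commits above).
        return "marlin_moe_wna16"
    # Otherwise fall back to the unquantized fused MoE path.
    return "fused_moe_triton"
```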
Checklist