Gather qmm batched kernel and refactoring of quantized #2078

angeloskath · 2025-04-16T07:41:09Z

This is more or less part 2 of #2040. It implements the same kernel, but for quantized matmul, and refactors quantized.cpp. It also sets up quantized.h for a later refactor of the batched kernel (merging qmm_n and qmm_t, among other things).
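As a rough picture of what a gather matmul over sorted indices buys you, here is a CPU reference sketch. It is not the Metal kernel from this PR: it uses plain `float` weights instead of quantized ones, and the function name `gather_mm_sorted` and the row-major layout are assumptions made for illustration. The point is only that when rows routed to the same expert are adjacent, each expert's weight matrix is picked up once per run of rows rather than once per row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference gather matmul: y[i] = x[i] * W[idx[i]].
//   x   : M x K, row-major
//   w   : E stacked matrices, each K x N, row-major
//   idx : M expert indices (ideally sorted so equal indices are adjacent)
// Hypothetical helper for illustration; not part of MLX.
std::vector<float> gather_mm_sorted(
    const std::vector<float>& x,
    const std::vector<float>& w,
    const std::vector<int>& idx,
    std::size_t K,
    std::size_t N) {
  const std::size_t M = idx.size();
  assert(x.size() == M * K);
  assert(K * N > 0 && w.size() % (K * N) == 0);

  std::vector<float> y(M * N, 0.0f);
  std::size_t start = 0;
  while (start < M) {
    // Find the run [start, end) of consecutive rows that use the same expert.
    std::size_t end = start + 1;
    while (end < M && idx[end] == idx[start]) {
      ++end;
    }
    assert(idx[start] >= 0);
    const std::size_t e = static_cast<std::size_t>(idx[start]);
    assert((e + 1) * K * N <= w.size());
    const float* we = w.data() + e * K * N;

    // One small dense matmul per run; `we` is reused for every row in it.
    for (std::size_t i = start; i < end; ++i) {
      for (std::size_t k = 0; k < K; ++k) {
        const float xv = x[i * K + k];
        for (std::size_t n = 0; n < N; ++n) {
          y[i * N + n] += xv * we[k * N + n];
        }
      }
    }
    start = end;
  }
  return y;
}
```

The function still gives correct results for unsorted indices, since every run then has length one, but it loses the weight reuse. That roughly mirrors the gap between the unsorted and sorted cases.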

In the process of refactoring, it fixes the following two bugs:

- routing gather_qmv to qmv_quad
- assuming N >= 64 for qvm
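To illustrate the second class of bug (a kernel that assumes the output width is at least one full tile), here is a CPU sketch of a vector-matrix product that walks the output in column tiles and clamps the last tile. This is not the actual qvm kernel or its fix. The tile width of 64 is taken from the bug description; the name `vm_tiled` and the unquantized `float` weights are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed tile width, matching the "N >= 64" in the bug description.
constexpr std::size_t kTile = 64;

// y = x * W, with x of length K and W of shape K x N (row-major).
// Hypothetical helper for illustration; not part of MLX.
std::vector<float> vm_tiled(
    const std::vector<float>& x,
    const std::vector<float>& w,
    std::size_t K,
    std::size_t N) {
  assert(x.size() == K);
  assert(w.size() == K * N);

  std::vector<float> y(N, 0.0f);
  for (std::size_t n0 = 0; n0 < N; n0 += kTile) {
    // Clamp the tile so that N < kTile, or N not a multiple of kTile,
    // never reads or writes past column N - 1.
    const std::size_t width = std::min(kTile, N - n0);
    for (std::size_t k = 0; k < K; ++k) {
      for (std::size_t j = 0; j < width; ++j) {
        y[n0 + j] += x[k] * w[k * N + n0 + j];
      }
    }
  }
  return y;
}
```

Without the `std::min` clamp, a call with N = 3 would touch 64 columns per row and run out of bounds. A test with a small N is the kind of case that catches this.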

In the microbenchmark we see speedups similar to those in #2040. They are a little less impressive because qmv is faster and there is less data to skip when moving from matrix to matrix. I am adding some real-world MoE prompt processing speedups below.

Model	Prompt size	Before unsorted	Before sorted	Now sorted
Mixtral 8x7B	~500	171 tps	189 tps	590 tps
Mixtral 8x7B	~6000	179 tps	196 tps	681 tps
Qwen 1.5 2.7B	~500	1071 tps	1239 tps	2213 tps
Qwen 1.5 2.7B	~6000	1333 tps	1532 tps	3360 tps
Llama 4 Scout	~500	244 tps	253 tps	429 tps
Llama 4 Scout	~6000	258 tps	264 tps	526 tps

mlx/backend/metal/kernels/quantized.h

barronalex

Awesome!! I think the refactor turned out really well and the gather_qmm perf looks great!

barronalex · 2025-04-16T17:15:34Z

mlx/backend/metal/quantized.cpp

+  int N = out.shape(-1);
+
+  int vector_limit = transpose_ ? get_qmv_batch_limit(K, N, d) : 4;
+


This is so much easier to read. Nice job!
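For readers skimming the thread, a minimal sketch of how a `vector_limit` like the one in the hunk can drive kernel selection. Everything here besides the `transpose ? limit : 4` expression is an assumption: `qmv_batch_limit_stub` is a made-up stand-in for `get_qmv_batch_limit` (which also takes the device), the thresholds inside it are invented, and the real routing in quantized.cpp may have more branches.

```cpp
#include <cassert>

// Kernel families named in this PR; the enum itself is hypothetical.
enum class Kernel { QMV, QVM, QMM_T, QMM_N };

// Made-up stand-in for get_qmv_batch_limit(K, N, d). The real function
// takes the Metal device into account and its values may differ.
int qmv_batch_limit_stub(int K, int N) {
  return (K <= 2048 && N <= 2048) ? 32 : 8;
}

// Choose between the vector kernels and the matrix kernels based on
// the number of rows M in the left-hand side.
Kernel choose_kernel(int M, int K, int N, bool transpose) {
  assert(M > 0 && K > 0 && N > 0);
  const int vector_limit = transpose ? qmv_batch_limit_stub(K, N) : 4;

  if (M < vector_limit) {
    // Few rows: treat the product as repeated vector products.
    return transpose ? Kernel::QMV : Kernel::QVM;
  }
  // Enough rows to make a tiled matrix-matrix kernel worthwhile.
  return transpose ? Kernel::QMM_T : Kernel::QMM_N;
}
```

With this stub, `choose_kernel(16, 1024, 1024, true)` picks the vector path while `choose_kernel(16, 4096, 4096, true)` picks the matrix path. Keeping the decision in a single limit value is presumably what makes the refactored code easy to follow.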

awni

🚀

jagrit06

🚀
Just fix the typos 🦔

mlx/backend/metal/kernels/quantized.h

angeloskath · 2025-04-17T20:52:29Z

For DeepSeek, with its 256 experts, we get a smaller speedup for small prompts, and we're back at 2x when there are enough tokens to fill the experts.

Model	Context	Before sorted	Now sorted
DeepSeek V3	~450	112 tps	154 tps
DeepSeek V3	~2000	114 tps	210 tps

As expected, I see no regression in non-MoE models, so I think this is good to merge.
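A back-of-the-envelope way to read "enough tokens to fill the experts": under uniformly random routing, the average number of rows each expert sees is tokens × top_k / num_experts. The helper name is hypothetical, and the top-k values used in the note after the code (8 routed experts per token for DeepSeek V3, 2 for Mixtral 8x7B) come from the public model configurations, not from this PR.

```cpp
#include <cassert>

// Average rows per expert, assuming tokens are routed uniformly.
// Hypothetical helper for illustration only.
double avg_rows_per_expert(int tokens, int top_k, int num_experts) {
  assert(tokens >= 0 && top_k > 0 && num_experts > 0);
  return static_cast<double>(tokens) * top_k / num_experts;
}
```

Under those assumptions, `avg_rows_per_expert(450, 8, 256)` is about 14 rows and `avg_rows_per_expert(2000, 8, 256)` is 62.5, whereas Mixtral at ~500 tokens gets `avg_rows_per_expert(500, 2, 8)` = 125 rows. Fewer rows per expert means less weight reuse per matrix, which is consistent with the smaller speedup at short contexts.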

angeloskath added 18 commits April 14, 2025 16:39

Start the gather qmm

5dd1667

tmp

7f96120

Start porting the matmul to the new routing

16b1fa9

Add qmv to the new routing

ce6bccc

Add all qmvs

d53f621

Add the qvm

edb7039

Add qmm

d2f74af

Merge qmvs

5f0ae20

Fix qmm_n kernel name

ac07b46

Add gather qmm variants

e37b880

Tests for the bug fixes

79659da

Fix the gather kernels

d02964b

Rename to gather_ and clean up

4cbf2e2

Add a gather_qmm benchmark

157a866

Gather qmm rhs working for aligned

6997cd3

Add the unaligned cases for gather_qmm_rhs

9caa95b

Add test and fix various bugs

12ba5a2

Add group size and bits to jit

78d3e88

angeloskath requested review from awni, barronalex and jagrit06 April 16, 2025 07:41

jagrit06 reviewed Apr 16, 2025


barronalex approved these changes Apr 16, 2025

awni approved these changes Apr 17, 2025

jagrit06 approved these changes Apr 17, 2025

Add a barrier before finalize

6da1112

angeloskath force-pushed the gather-qmm branch from 076cec4 to 6da1112 Compare April 17, 2025 20:15

angeloskath merged commit 5de6d94 into main Apr 17, 2025
4 checks passed

angeloskath deleted the gather-qmm branch April 17, 2025 20:53

angeloskath mentioned this pull request Apr 17, 2025

Inform the gather mm regarding the sorted indices ml-explore/mlx-lm#100

Merged
