
Conversation

angeloskath (Member)

This is, more or less, part 2 of #2040. It implements the same kernel but for quantized mm and refactors quantized.cpp. It also sets up quantized.h for a refactor of the batched kernels (merging qmm_n and qmm_t, and more).

In the process of refactoring, it fixes the following two bugs (see the sketch after the list):

  • routing gather_qmv to qmv_quad
  • assuming N >= 64 for qvm
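
A rough, hedged sketch of the kind of routing these two fixes touch; the helper name, thresholds, and conditions below are illustrative assumptions, not the actual quantized.cpp logic:

```cpp
// Illustrative only: the kernel names mirror the bugs above, but the
// conditions and thresholds are placeholders, not the real MLX dispatch.
enum class Kernel { QmvQuad, Qmv, Qvm, Qmm };

Kernel pick_vector_kernel(bool is_gather, bool transpose, int N, int K) {
  if (transpose) {
    // Fix 1: only the plain qmv path may take the quad kernel; gather_qmv
    // must not be routed to qmv_quad.
    if (!is_gather && K <= 128 && N <= 64) {
      return Kernel::QmvQuad;
    }
    return Kernel::Qmv;
  }
  // Fix 2: qvm used to assume N >= 64; for narrower outputs, fall back to
  // the general matmul kernel instead.
  return (N >= 64) ? Kernel::Qvm : Kernel::Qmm;
}
```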

In the microbenchmark we see speedups similar to those in #2040. They are a little less impressive because qmv is faster and we skip less data when moving from matrix to matrix. I am adding some real-world MoE prompt-processing speedups below (with a conceptual sketch of why sorting helps after the table).

| Model | Prompt size | Before unsorted | Before sorted | Now sorted |
|---|---|---|---|---|
| Mixtral 8x7B | ~500 | 171 tps | 189 tps | 590 tps |
| Mixtral 8x7B | ~6000 | 179 tps | 196 tps | 681 tps |
| Qwen 1.5 2.7B | ~500 | 1071 tps | 1239 tps | 2213 tps |
| Qwen 1.5 2.7B | ~6000 | 1333 tps | 1532 tps | 3360 tps |
| Llama 4 Scout | ~500 | 244 tps | 253 tps | 429 tps |
| Llama 4 Scout | ~6000 | 258 tps | 264 tps | 526 tps |
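
For intuition on where the sorted-path gains come from, here is a hedged CPU reference (not MLX code; the Expert type and the function are made up for illustration). With indices sorted by expert, tokens that share an expert form contiguous runs, so each expert's weights are read once per run rather than once per token:

```cpp
#include <vector>

// Dequantized stand-in for an expert's [K x N] weight matrix; the real
// kernel of course works on the quantized representation directly.
struct Expert {
  std::vector<float> w;
};

// Reference gather-matmul over sorted expert indices: one run of tokens per
// expert maps to one matrix-matrix product.
void gather_mm_sorted_reference(const std::vector<float>& x,          // [T x K]
                                const std::vector<Expert>& experts,
                                const std::vector<int>& rhs_indices,  // [T], sorted
                                int T, int K, int N,
                                std::vector<float>& out) {            // [T x N]
  int start = 0;
  while (start < T) {
    int e = rhs_indices[start];
    int end = start;
    while (end < T && rhs_indices[end] == e) ++end;  // contiguous run
    const std::vector<float>& w = experts[e].w;
    // The whole run of (end - start) tokens reuses the same expert weights.
    for (int t = start; t < end; ++t) {
      for (int n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
          acc += x[t * K + k] * w[k * N + n];
        }
        out[t * N + n] = acc;
      }
    }
    start = end;
  }
}
```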

@barronalex (Contributor) left a comment

Awesome!! I think the refactor turned out really well and the gather_qmm perf looks great!

int N = out.shape(-1);

int vector_limit = transpose_ ? get_qmv_batch_limit(K, N, d) : 4;

A Contributor commented on the snippet above:

This is so much easier to read. Nice job!
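
As context for the snippet above, a hedged sketch of how a limit like vector_limit can gate the kernel choice; apart from get_qmv_batch_limit, which appears in the snippet, the names here are hypothetical stand-ins:

```cpp
// Hypothetical stand-ins for the real kernel launches and limit helper.
int get_qmv_batch_limit(int K, int N, int d);
void run_qmv(int B, int K, int N);
void run_qvm(int B, int K, int N);
void run_qmm(int B, int K, int N, bool transpose_);

void route_quantized_matmul(int B, int K, int N, int d, bool transpose_) {
  // Batches below the limit use the matrix-vector kernels (qmv / qvm);
  // larger batches use the tiled matrix-matrix kernel (qmm).
  int vector_limit = transpose_ ? get_qmv_batch_limit(K, N, d) : 4;
  if (B < vector_limit) {
    if (transpose_) {
      run_qmv(B, K, N);
    } else {
      run_qvm(B, K, N);
    }
  } else {
    run_qmm(B, K, N, transpose_);
  }
}
```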

@awni (Member) left a comment

🚀

@jagrit06 (Member) left a comment

🚀
Just fix the typos 🦔

@angeloskath (Member, Author)

For DeepSeek, with its 256 experts, we get a smaller speedup for small prompts and we're back at 2x when there are enough tokens to fill the experts.

| Model | Context | Before sorted | Now sorted |
|---|---|---|---|
| Deepseek V3 | ~450 | 112 tps | 154 tps |
| Deepseek V3 | ~2000 | 114 tps | 210 tps |

As expected, I see no regression in non-MoE models, so I think this is good to merge.

@angeloskath merged commit 5de6d94 into main on Apr 17, 2025
4 checks passed
@angeloskath deleted the gather-qmm branch on April 17, 2025 at 20:53