Gather qmm batched kernel and refactoring of quantized #2078
Awesome!! I think the refactor turned out really well and the gather_qmm perf looks great!
int N = out.shape(-1);

int vector_limit = transpose_ ? get_qmv_batch_limit(K, N, d) : 4;
This is so much easier to read. Nice job!
🚀
🚀
Just fix the typos 🦔
For DeepSeek, with its 256 experts, we have a smaller speedup for small prompts, and we're back at 2x when there are enough tokens to fill the experts.
As expected, I see no regression in non-MoE models, so I think this is good to merge.
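A rough back-of-the-envelope sketch of why small prompts benefit less: with many experts, each expert receives only a few rows until the prompt is long enough. The helper below is hypothetical, assumes uniform routing, and the figure of 8 active experts per token is an assumption for illustration that is not stated in this thread.

```cpp
// Hypothetical helper: average number of rows routed to each expert,
// assuming every token picks `experts_per_token` experts uniformly at random.
double avg_rows_per_expert(int tokens, int experts_per_token, int num_experts) {
  return static_cast<double>(tokens) * experts_per_token / num_experts;
}
```

Under these assumptions, a 32-token prompt over 256 experts gives about one row per expert, so each expert still does what is effectively a matrix-vector product. A 1024-token prompt gives about 32 rows per expert, which is enough for a matrix-matrix kernel to pay off.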
This is kind of part 2 of #2040. It implements the same kernel but for quantized mm and refactors `quantized.cpp`. It also sets up `quantized.h` for a refactor of the batched kernel (merge `qmm_n` and `qmm_t` and more).

In the process of refactoring it fixes the following 2 bugs:

- `gather_qmv` to `qmv_quad`
- `qvm`
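For readers new to the gather matmul, here is a minimal CPU sketch of the idea, not the Metal kernel from this PR. It assumes dense (already dequantized) float weights and expert indices that are sorted. The `Matrix` struct and both function names are hypothetical. The per-row version does one matrix-vector product per input row, while the sorted version handles each run of rows that share an expert as a single matrix-matrix product, so every weight row is read once per run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix; stands in for one dequantized expert weight.
struct Matrix {
  int rows;
  int cols;
  std::vector<float> data;
  float at(int r, int c) const {
    return data[static_cast<std::size_t>(r) * cols + c];
  }
};

// Reference semantics: out[i] = x[i] * experts[idx[i]], one row at a time.
std::vector<float> gather_mm_per_row(
    const Matrix& x, const std::vector<Matrix>& experts,
    const std::vector<int>& idx) {
  assert(static_cast<int>(idx.size()) == x.rows);
  const int K = x.cols;
  const int N = experts[0].cols;
  std::vector<float> out(static_cast<std::size_t>(x.rows) * N, 0.0f);
  for (int i = 0; i < x.rows; ++i) {
    const Matrix& w = experts[idx[i]];
    for (int k = 0; k < K; ++k)
      for (int n = 0; n < N; ++n)
        out[static_cast<std::size_t>(i) * N + n] += x.at(i, k) * w.at(k, n);
  }
  return out;
}

// Batched variant: assumes idx is sorted, so rows that use the same expert
// are contiguous. Each run is processed as one matrix-matrix product.
std::vector<float> gather_mm_sorted(
    const Matrix& x, const std::vector<Matrix>& experts,
    const std::vector<int>& idx, int* num_runs) {
  assert(static_cast<int>(idx.size()) == x.rows);
  const int K = x.cols;
  const int N = experts[0].cols;
  std::vector<float> out(static_cast<std::size_t>(x.rows) * N, 0.0f);
  *num_runs = 0;
  int start = 0;
  while (start < x.rows) {
    int end = start;
    while (end < x.rows && idx[end] == idx[start]) ++end;  // find the run
    const Matrix& w = experts[idx[start]];
    for (int k = 0; k < K; ++k)          // weight row k is visited once per run
      for (int i = start; i < end; ++i)  // and reused by every row in the run
        for (int n = 0; n < N; ++n)
          out[static_cast<std::size_t>(i) * N + n] += x.at(i, k) * w.at(k, n);
    ++(*num_runs);
    start = end;
  }
  return out;
}
```

Both functions accumulate over `k` in the same order for each output element, so they return the same values. The difference is only in how often each expert's weights are traversed.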
In the microbenchmark we see similar speedups as for #2040. They are a little less impressive because `qmv` is faster and we have to skip less data to move from matrix to matrix. I am adding some real-world MoE prompt processing speedups below.