@awni (Member) commented Feb 13, 2025

Speeds up small-batch qvm and qmv by swapping the batch and block dimensions in the kernel.
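The kernel diff itself is not shown in this conversation, so the following is only a rough, pure-Python illustration of why the order in which batch rows and weight blocks are visited can matter when the batch is small. It is not the Metal kernel and does not model quantization; the function names, the `block` parameter, and the `block_loads` counter (a loose stand-in for how often a weight block has to be fetched) are all assumptions made for this sketch.

```python
def matvec_batch_outer(x, w, block):
    """Batch row is the outer loop: every weight block is fetched once per batch row."""
    batch, out_dim = len(x), len(w)
    y = [[0.0] * out_dim for _ in range(batch)]
    block_loads = 0
    for b in range(batch):
        for start in range(0, out_dim, block):
            rows = w[start:start + block]  # stands in for fetching one weight block
            block_loads += 1
            for i, row in enumerate(rows):
                y[b][start + i] = sum(wi * xi for wi, xi in zip(row, x[b]))
    return y, block_loads


def matvec_block_outer(x, w, block):
    """Weight block is the outer loop: each block is fetched once and reused for all batch rows."""
    batch, out_dim = len(x), len(w)
    y = [[0.0] * out_dim for _ in range(batch)]
    block_loads = 0
    for start in range(0, out_dim, block):
        rows = w[start:start + block]  # fetched once, then shared across the batch
        block_loads += 1
        for b in range(batch):
            for i, row in enumerate(rows):
                y[b][start + i] = sum(wi * xi for wi, xi in zip(row, x[b]))
    return y, block_loads


# Hypothetical toy data: batch of 2 input vectors, 4 output rows, blocks of 2 rows.
x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

y_a, loads_a = matvec_batch_outer(x, w, block=2)
y_b, loads_b = matvec_block_outer(x, w, block=2)
print(y_a == y_b, loads_a, loads_b)
```

Both orderings produce the same result; the second simply touches each weight block once instead of once per batch row. On a GPU the ordering comes from how threadgroups are laid out on the grid rather than from explicit loops, so the counter is only an analogy for weight reuse.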

Speculative generation benchmark on M2 Ultra:

mlx_lm.generate --model mlx-community/Qwen2.5-32B-Instruct-4bit --prompt "Write a quicksort algorithm" --draft-model mlx-community/Qwen2.5-0.5B-Instruct-4bit -m 1000 --temp 0

Pre: Generation: 390 tokens, 31.786 tokens-per-sec
Post: Generation: 390 tokens, 37.843 tokens-per-sec
No draft model: Generation: 390 tokens, 31.765 tokens-per-sec
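To put the three numbers in relation, here is a small sketch that turns them into throughput ratios. The `speedup` helper is a hypothetical name introduced for illustration; the values are copied from the results above.

```python
def speedup(baseline_tps: float, new_tps: float) -> float:
    """Ratio of new throughput to baseline throughput (tokens per second)."""
    return new_tps / baseline_tps


# Tokens-per-second figures reported in the benchmark above.
pre, post, no_draft = 31.786, 37.843, 31.765

print(f"post vs pre:      {speedup(pre, post):.3f}x")
print(f"post vs no draft: {speedup(no_draft, post):.3f}x")
print(f"pre vs no draft:  {speedup(no_draft, pre):.3f}x")
```

The change works out to roughly a 1.19x throughput gain over both the previous kernel and the run without a draft model. Before the change, speculative generation was essentially on par with running without a draft model on this prompt.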

@angeloskath (Member) left a comment

Let's go! 🚀🚀🚀

@barronalex (Contributor)

Nice!! 🚀

@awni (Member, Author) commented Feb 13, 2025

Helps with ml-explore/mlx-examples#1281

@awni awni merged commit e425dc0 into main Feb 13, 2025
5 checks passed
@awni awni deleted the faster_small_batch_qmv branch February 13, 2025 06:02