@awni awni commented Feb 5, 2025

Speed up Metal sort, particularly when sorting over a small dimension.

  • Add a couple more specializations for the block size
  • Use work_per_thread = 4 instead of 8, which seems to work better in benchmarks on M2
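
To illustrate why extra block-size specializations and a smaller work_per_thread could help small sort dimensions, here is a minimal sketch. It assumes a block of `bn` threads sorts up to `bn * work_per_thread` elements and that the dispatcher picks the smallest specialization that fits. The candidate list and both function names are hypothetical and are not taken from the MLX source; the real kernel dispatch may differ.

```python
def pick_block_threads(sort_size, work_per_thread=4,
                       candidates=(32, 64, 128, 256, 512)):
    """Return the smallest candidate thread count whose block covers sort_size.

    Assumption: a block of `bn` threads sorts up to bn * work_per_thread
    elements. The candidate list is illustrative, not the one used by MLX.
    Returns None when no single block is large enough.
    """
    for bn in candidates:
        if bn * work_per_thread >= sort_size:
            return bn
    return None


def idle_fraction(sort_size, bn, work_per_thread=4):
    """Fraction of the block's element slots that hold padding, not data."""
    capacity = bn * work_per_thread
    return 1.0 - sort_size / capacity


if __name__ == "__main__":
    for size in (4, 160, 256, 1024):
        bn = pick_block_threads(size)
        print(size, bn, round(idle_fraction(size, bn), 4))
```

Under these assumptions, a sort over 4 elements in a 32-thread block with work_per_thread = 4 still leaves most slots as padding, but far fewer than a single large block with work_per_thread = 8 would. That is one plausible reading of the gains on shapes such as (1024, 4).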

| Shape | Pre (ms) | Post (ms) |
| --- | --- | --- |
| (160,) | 0.687 | 0.467 |
| (1024, 4) | 1.308 | 0.457 |
| (16384,) | 1.819 | 1.589 |
| (1024, 1024) | 1.740 | 1.510 |
| (4096, 1024) | 5.277 | 4.540 |
| (4096, 256) | 4.085 | 1.231 |
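
The speedup per shape follows directly from the timings above as pre / post. A small sketch that computes it, using only the numbers reported in this PR (the variable and function names are illustrative):

```python
# (shape, pre_ms, post_ms) copied from the benchmark timings in this PR
timings = [
    ((160,), 0.687, 0.467),
    ((1024, 4), 1.308, 0.457),
    ((16384,), 1.819, 1.589),
    ((1024, 1024), 1.740, 1.510),
    ((4096, 1024), 5.277, 4.540),
    ((4096, 256), 4.085, 1.231),
]


def speedups(rows):
    """Map each shape to its speedup factor (pre time divided by post time)."""
    return {shape: pre / post for shape, pre, post in rows}


if __name__ == "__main__":
    for shape, factor in speedups(timings).items():
        print(f"{shape}: {factor:.2f}x")
```

The largest gains are on the shapes with a short sort dimension, (1024, 4) and (4096, 256), at roughly 2.9x and 3.3x. The remaining shapes improve by about 1.1x to 1.5x.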

Also speeds up DSR1 generation by ~0.8 tok/s. Throughput for the 4-bit model on 3 M2 Ultras generating 128 tokens:
Pre: 17.783 tokens-per-sec
Post: 18.556 tokens-per-sec
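
For reference, the throughput numbers above can be converted into an absolute gain, a relative gain, and an approximate generation time for the 128 tokens. This is a rough sketch that assumes steady throughput over the whole run and ignores prompt processing; the variable names are illustrative.

```python
pre_tps = 17.783   # tokens per second before this change
post_tps = 18.556  # tokens per second after this change
tokens = 128       # tokens generated in the benchmark

gain_tps = post_tps - pre_tps    # absolute throughput gain
relative_gain = gain_tps / pre_tps

# Approximate generation time, assuming constant throughput
pre_seconds = tokens / pre_tps
post_seconds = tokens / post_tps

if __name__ == "__main__":
    print(f"gain: {gain_tps:.3f} tok/s ({relative_gain:.1%})")
    print(f"time: {pre_seconds:.2f} s -> {post_seconds:.2f} s")
```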

@angeloskath angeloskath left a comment

Pretty neat speed up!!!

@awni awni merged commit fe5987b into main Feb 5, 2025
5 checks passed
@awni awni deleted the faster_sort branch February 5, 2025 14:10