faster sort #1831

awni · 2025-02-05T04:19:35Z

Speed up Metal sort, particularly when sorting over a small dimension.

Add a couple more specializations for the block size
Use work_per_thread = 4 instead of 8, which seems to work better in benchmarks on M2
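
A rough way to picture why a smaller work_per_thread could help small sort dimensions, assuming each threadgroup sorts a tile of block_threads * work_per_thread elements and the block size is chosen from a fixed set of specializations. The candidate sizes and both helper names below are hypothetical and are not taken from the MLX kernel; this is a toy model, not the actual dispatch logic.

```python
# Illustrative model only: the candidate sizes and helper names are assumptions,
# not the values or functions used in the MLX Metal sort kernel.
CANDIDATE_BLOCK_THREADS = (32, 64, 128, 256, 512)


def pick_block_threads(sort_size, work_per_thread):
    """Smallest candidate whose tile covers sort_size, else the largest candidate."""
    for threads in CANDIDATE_BLOCK_THREADS:
        if threads * work_per_thread >= sort_size:
            return threads
    return CANDIDATE_BLOCK_THREADS[-1]


def padded_slots(sort_size, work_per_thread):
    """Number of tile slots that hold padding rather than real elements."""
    threads = pick_block_threads(sort_size, work_per_thread)
    tile = threads * work_per_thread
    return max(tile - sort_size, 0)


if __name__ == "__main__":
    for size in (4, 160, 256):
        print(size, padded_slots(size, 8), padded_slots(size, 4))
```

In this toy model a 4-element sort pads 252 slots with work_per_thread = 8 and 124 slots with work_per_thread = 4. The measured gains in this PR come from the benchmarks, not from this model.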

Shape	Pre (ms)	Post (ms)
(160,)	0.687	0.467
(1024, 4)	1.308	0.457
(16384,)	1.819	1.589
(1024, 1024)	1.740	1.510
(4096, 1024)	5.277	4.540
(4096, 256)	4.085	1.231
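
For readers who want the table as speedup ratios, a small sketch that divides the Pre time by the Post time for each shape. The timings are copied from the table above; the variable names are only for illustration.

```python
# Timings in milliseconds, copied from the benchmark table: shape -> (pre, post).
timings_ms = {
    "(160,)": (0.687, 0.467),
    "(1024, 4)": (1.308, 0.457),
    "(16384,)": (1.819, 1.589),
    "(1024, 1024)": (1.740, 1.510),
    "(4096, 1024)": (5.277, 4.540),
    "(4096, 256)": (4.085, 1.231),
}

# Speedup is the old time divided by the new time (values above 1 are faster).
speedups = {shape: pre / post for shape, (pre, post) in timings_ms.items()}

if __name__ == "__main__":
    for shape, ratio in speedups.items():
        print(f"{shape:>14}  {ratio:.2f}x")
```

The largest ratios are for (1024, 4) and (4096, 256), at roughly 2.9x and 3.3x, which matches the note about small sort dimensions. The remaining shapes improve by about 1.1x to 1.5x.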

Also speeds up DSR1 generation by ~0.8 tok/s. Throughput for 4-bit on 3 M2 Ultras generating 128 tokens:
Pre: 17.783 tokens-per-sec
Post: 18.556 tokens-per-sec
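
The ~0.8 tok/s figure can be checked directly from the two numbers above; a minimal sketch, with variable names chosen only for illustration:

```python
# Throughput in tokens per second, copied from the Pre/Post lines above.
pre_tps = 17.783
post_tps = 18.556

# Absolute and relative change in generation throughput.
delta_tps = post_tps - pre_tps
relative_gain = delta_tps / pre_tps

if __name__ == "__main__":
    print(f"delta: {delta_tps:.3f} tok/s, relative: {relative_gain * 100:.1f}%")
```

That is a difference of 0.773 tok/s, or about 4.3% over the Pre number.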

angeloskath

Pretty neat speed up!!!

faster sort

58ff7ab

awni requested review from angeloskath and barronalex February 5, 2025 04:23

angeloskath approved these changes Feb 5, 2025

awni merged commit fe5987b into main Feb 5, 2025
5 checks passed

awni deleted the faster_sort branch February 5, 2025 14:10

BrewTestBot mentioned this pull request Feb 14, 2025

mlx 0.23.0 Homebrew/homebrew-core#207747

Merged

1 task
