Speedup int4mm_kernel with NEON #124257
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124257
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit aa84d50 with merge base 0f6ce45.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thank you!
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot merge -f "Lint + MacOS builds are green"
The merge job was canceled. If you believe this is a mistake, then you can retrigger it through pytorch-bot.
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Thank you!
Hi @malfet, how can I test this PR on aarch64 Linux?
@snadampal here is the fix: #124511. But we really need some sort of CI to be able to spot those earlier than the nightly. Right now it's tested on M1, which is the same CPU arch but uses a different default compiler, one that is less stringent about type conversions.
My top priority is to get my CI PR merged ASAP.
By unrolling middle loop by 16 elements and using NEON to decode packed int4 to float32.

Unrolling the entire `n` loop actually makes it a tad slower, probably because ARM has a smaller register file than x86.

Before/after performance running stories110M on M2 Pro:

| eager (before) | eager (after) | compile (before) | compile (after) |
| ---- | ---- | ---- | ---- |
| 28 | 57 | 31 | 104 |

Pull Request resolved: pytorch#124257
Approved by: https://github.com/mikekgfb
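By unrolling middle loop by 16 elements and using NEON to decode packed int4 to float32.

Unrolling the entire `n` loop actually makes it a tad slower, probably because ARM has a smaller register file than x86.

Before/after performance running stories110M on M2 Pro:

| eager (before) | eager (after) | compile (before) | compile (after) |
| ---- | ---- | ---- | ---- |
| 28 | 57 | 31 | 104 |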
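For readers unfamiliar with the decode step, here is a minimal sketch of turning 16 packed int4 values into float32 with NEON, then consuming them in a loop that advances 16 elements at a time. It is not the code from this PR: the nibble order (low nibble first), the fixed offset of 8, the single `scale` factor, and the names `decode_int4x16` and `dot_int4` are all assumptions made for illustration. A scalar fallback is included so the sketch also builds on non-ARM hosts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Decode 8 bytes (16 packed int4 values) into 16 floats.
// Assumed layout (hypothetical): byte i stores element 2*i in its low nibble
// and element 2*i+1 in its high nibble; stored nibbles are offset by 8.
inline void decode_int4x16(const std::uint8_t* packed, float* out) {
#if defined(__ARM_NEON)
  const uint8x8_t bytes = vld1_u8(packed);
  const uint8x8_t lo = vand_u8(bytes, vdup_n_u8(0x0F));  // low nibbles
  const uint8x8_t hi = vshr_n_u8(bytes, 4);              // high nibbles
  // Interleave so the lanes read lo0, hi0, lo1, hi1, ... (element order).
  const uint8x8x2_t z = vzip_u8(lo, hi);
  const int8x16_t q = vsubq_s8(
      vreinterpretq_s8_u8(vcombine_u8(z.val[0], z.val[1])), vdupq_n_s8(8));
  // Widen int8 -> int16 -> int32, then convert to float32.
  const int16x8_t w0 = vmovl_s8(vget_low_s8(q));
  const int16x8_t w1 = vmovl_s8(vget_high_s8(q));
  vst1q_f32(out + 0, vcvtq_f32_s32(vmovl_s16(vget_low_s16(w0))));
  vst1q_f32(out + 4, vcvtq_f32_s32(vmovl_s16(vget_high_s16(w0))));
  vst1q_f32(out + 8, vcvtq_f32_s32(vmovl_s16(vget_low_s16(w1))));
  vst1q_f32(out + 12, vcvtq_f32_s32(vmovl_s16(vget_high_s16(w1))));
#else
  // Scalar fallback with the same assumed layout.
  for (int i = 0; i < 8; ++i) {
    out[2 * i] = static_cast<float>(static_cast<int>(packed[i] & 0x0F) - 8);
    out[2 * i + 1] = static_cast<float>(static_cast<int>(packed[i] >> 4) - 8);
  }
#endif
}

// Dot product of k packed int4 weights with k floats, 16 elements per step.
// A single scale for the whole row is a simplification for this sketch.
inline float dot_int4(const std::uint8_t* packed, const float* x,
                      std::size_t k, float scale) {
  assert(k % 16 == 0);
  float acc = 0.0f;
  float w[16];
  for (std::size_t i = 0; i < k; i += 16) {
    decode_int4x16(packed + i / 2, w);
    for (int j = 0; j < 16; ++j) {
      acc += w[j] * x[i + j];
    }
  }
  return acc * scale;
}
```

Decoding a full group of 16 at once keeps the nibble unpacking in vector registers; whether wider unrolling helps depends on register pressure, as the description notes.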
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10