Speedup int4mm_kernel with NEON #124257

malfet · 2024-04-17T04:10:47Z

Unrolls the middle loop by 16 elements and uses NEON to decode packed int4 to float32.
Unrolling the entire `n` loop actually makes it a tad slower, probably because ARM has a smaller register file than x86.
Before/after performance running stories110M on an M2 Pro:

| eager (before) | eager (after) | compile (before) | compile (after) |
| --- | --- | --- | --- |
| 28 | 57 | 31 | 104 |
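
For readers unfamiliar with the technique, the sketch below illustrates decoding 16 packed int4 values to float32 in one step, with a NEON path and a scalar fallback. It is not the code from this PR. The nibble order (low nibble first), the bias of 8, the single per-call `scale` and `zero`, and the names `decode16_int4` and `dot_int4` are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical layout (an assumption, not taken from the PR): each byte holds
// two 4-bit values, low nibble first, stored with a bias of 8.
// Decodes 8 bytes into 16 floats: out[i] = (q[i] - 8) * scale + zero.
inline void decode16_int4(const uint8_t* packed, float scale, float zero,
                          float* out) {
#if defined(__ARM_NEON)
  const uint8x8_t bytes = vld1_u8(packed);
  const uint8x8_t lo = vand_u8(bytes, vdup_n_u8(0x0F));
  const uint8x8_t hi = vshr_n_u8(bytes, 4);
  // Interleave so the element order is lo0, hi0, lo1, hi1, ...
  const uint8x8x2_t z = vzip_u8(lo, hi);
  const uint8x16_t q = vcombine_u8(z.val[0], z.val[1]);
  // Explicit reinterpret: no implicit unsigned-to-signed vector conversion.
  const int8x16_t s = vsubq_s8(vreinterpretq_s8_u8(q), vdupq_n_s8(8));
  const int16x8_t halves[2] = {vmovl_s8(vget_low_s8(s)),
                               vmovl_s8(vget_high_s8(s))};
  const float32x4_t vscale = vdupq_n_f32(scale);
  const float32x4_t vzero = vdupq_n_f32(zero);
  for (int h = 0; h < 2; ++h) {
    // Widen int16 -> int32, convert to float, then apply zero + value * scale.
    const float32x4_t f0 = vcvtq_f32_s32(vmovl_s16(vget_low_s16(halves[h])));
    const float32x4_t f1 = vcvtq_f32_s32(vmovl_s16(vget_high_s16(halves[h])));
    vst1q_f32(out + 8 * h, vmlaq_f32(vzero, f0, vscale));
    vst1q_f32(out + 8 * h + 4, vmlaq_f32(vzero, f1, vscale));
  }
#else
  // Scalar fallback with the same semantics, used on non-NEON targets.
  for (int i = 0; i < 8; ++i) {
    const int lo = packed[i] & 0x0F;
    const int hi = packed[i] >> 4;
    out[2 * i] = static_cast<float>(lo - 8) * scale + zero;
    out[2 * i + 1] = static_cast<float>(hi - 8) * scale + zero;
  }
#endif
}

// Dot product of k packed int4 weights with k floats, consumed 16 at a time.
// Assumes k is a multiple of 16.
inline float dot_int4(const uint8_t* packed, const float* x, std::size_t k,
                      float scale, float zero) {
  float acc = 0.0f;
  float w[16];
  for (std::size_t i = 0; i < k; i += 16) {
    decode16_int4(packed + i / 2, scale, zero, w);
    for (int j = 0; j < 16; ++j) {
      acc += w[j] * x[i + j];
    }
  }
  return acc;
}
```

Processing 16 elements per step keeps the decoded values in a handful of vector registers, which fits the observation above that unrolling further did not help on ARM.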

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot · 2024-04-17T04:10:51Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124257

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit aa84d50 with merge base 0f6ce45:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mikekgfb

Thank you!

malfet · 2024-04-17T14:35:49Z

@pytorchbot merge

pytorchmergebot · 2024-04-17T14:38:26Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

malfet · 2024-04-17T16:01:36Z

@pytorchbot merge -f "Lint + MacOS builds are green"

pytorchmergebot · 2024-04-17T16:01:54Z

The merge job was canceled. If you believe this is a mistake, then you can re trigger it through pytorch-bot.

pytorchmergebot · 2024-04-17T16:04:16Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

mikekgfb

Thank you!

snadampal · 2024-04-18T16:04:21Z

Hi @malfet, how can I test this PR on aarch64 Linux?
Currently the build is broken on aarch64 Linux with this change, due to a sign mismatch. I modified it a bit and the build went through fine; next I want to test it.

nWEIdia · 2024-04-19T00:08:27Z

Nightly build failure: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50 and e.g. https://github.com/pytorch/pytorch/actions/runs/8734110986/job/23964095976

malfet · 2024-04-19T19:00:48Z

> Hi @malfet , how can I test this PR on aarch64 linux? currently the build is broken on aarch64 linux with this, due to sign mismatch. I have modified it a bit, build went through fine, next I wanted to test.

@snadampal here is the fix: #124511. But we really need some sort of CI to be able to spot these earlier than the nightly build. Right now it's tested on M1, which is the same CPU architecture but uses a different default compiler, one that is less stringent about type conversions.

snadampal · 2024-04-20T00:04:36Z

My top priority is to get my CI PR merged ASAP.
I'm making all the tests pass in the next two days. It would be great if you could review it in the meantime.

By unrolling middle loop by 16 elements and using neon to decode packed int4 to float32.
Unrolling entire `n` loop actually makes it a tad slower, probably because ARM has smaller register file than x86

Before/after performance running stories110M on M2Pro

| eager (before) | eager (after) | compile(before) | compile (after) |
| ---- | --- | -- | -- |
| 28 | 57 | 31 | 104 |

Pull Request resolved: pytorch#124257
Approved by: https://github.com/mikekgfb

Speedup int4mm_kernel with NEON

aa84d50

pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Apr 17, 2024

malfet requested review from kimishpatel and mikekgfb April 17, 2024 04:12

malfet added release notes: performance_as_product topic: improvements topic category labels Apr 17, 2024

mikekgfb approved these changes Apr 17, 2024

View reviewed changes

malfet added the ciflow/mps Run MPS tests (subset of trunk) label Apr 17, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 17, 2024

pytorchmergebot added the merging label Apr 17, 2024

pytorchmergebot added the Merged label Apr 17, 2024

pytorchmergebot closed this in 46324fe Apr 17, 2024

pytorchmergebot removed the merging label Apr 17, 2024

mikekgfb reviewed Apr 18, 2024

View reviewed changes

snadampal mentioned this pull request Apr 18, 2024

aarch64: cd: test openmp switch from libomp to libgomp #124353

Closed

Rohanjames1997 mentioned this pull request Apr 19, 2024

[NEON] Remove implicit type conversions in tinygemm_kernel #124508

Closed

github-actions bot deleted the malfet/enable-int4-neon branch June 1, 2024 01:58
