
Conversation

@malfet (Contributor) commented Apr 17, 2024

By unrolling the middle loop by 16 elements and using NEON to decode packed int4 to float32.
Unrolling the entire `n` loop actually makes it a tad slower, probably because ARM has a smaller register file than x86.
Before/after performance running stories110M on M2 Pro:

| eager (before) | eager (after) | compile (before) | compile (after) |
| -- | -- | -- | -- |
| 28 | 57 | 31 | 104 |
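For context, a minimal sketch (not the kernel from this PR) of how 16 packed int4 values can be widened to float32 with NEON intrinsics; the function name, nibble ordering, and scale/zero-point handling are illustrative assumptions:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Illustrative sketch only: widen 16 packed uint4 values (8 bytes, two nibbles
// per byte) to float32 and apply a scale/zero-point. The nibble order and the
// dequantization convention here are assumptions, not this PR's actual layout.
void decode_uint4_to_f32(const uint8_t* packed, float* out, float scale, float zero) {
  uint8x8_t bytes = vld1_u8(packed);               // 8 bytes -> 16 nibbles
  uint8x8_t lo = vand_u8(bytes, vdup_n_u8(0x0F));  // low nibbles
  uint8x8_t hi = vshr_n_u8(bytes, 4);              // high nibbles
  uint16x8_t lo16 = vmovl_u8(lo);                  // widen u8 -> u16
  uint16x8_t hi16 = vmovl_u8(hi);
  float32x4_t s = vdupq_n_f32(scale);
  float32x4_t z = vdupq_n_f32(zero);
  // Widen u16 -> u32, convert to float, then out = v * scale + zero.
  float32x4_t v0 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(lo16)));
  float32x4_t v1 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(lo16)));
  float32x4_t v2 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(hi16)));
  float32x4_t v3 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(hi16)));
  vst1q_f32(out + 0,  vfmaq_f32(z, v0, s));
  vst1q_f32(out + 4,  vfmaq_f32(z, v1, s));
  vst1q_f32(out + 8,  vfmaq_f32(z, v2, s));
  vst1q_f32(out + 12, vfmaq_f32(z, v3, s));
}
```

Unrolling the middle loop by 16 elements then corresponds to processing one such group of nibbles per iteration instead of decoding them one at a time in scalar code.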

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot (bot) commented Apr 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124257

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit aa84d50 with merge base 0f6ce45:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Apr 17, 2024
@malfet malfet requested review from kimishpatel and mikekgfb April 17, 2024 04:12
@mikekgfb (Contributor) left a comment:

Thank you!

@malfet malfet added the ciflow/mps Run MPS tests (subset of trunk) label Apr 17, 2024
@malfet (Contributor, Author) commented Apr 17, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 17, 2024
@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@malfet (Contributor, Author) commented Apr 17, 2024

@pytorchbot merge -f "Lint + MacOS builds are green"

@pytorchmergebot (Collaborator):
The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


@snadampal (Collaborator) commented Apr 18, 2024

Hi @malfet, how can I test this PR on aarch64 Linux?
Currently the build is broken on aarch64 Linux with this change, due to a sign mismatch. I have modified it a bit and the build went through fine; next I wanted to test.

@nWEIdia (Collaborator) commented Apr 19, 2024

@malfet (Contributor, Author) commented Apr 19, 2024

> Hi @malfet, how can I test this PR on aarch64 Linux? Currently the build is broken on aarch64 Linux with this change, due to a sign mismatch. I have modified it a bit and the build went through fine; next I wanted to test.

@snadampal here is the fix #124511, but we really need some sort of CI to be able to spot these earlier than nightly. Right now it's tested on M1, which is the same CPU arch, but with a different default compiler that is less stringent about type conversions.
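For readers hitting the same class of build break, a hypothetical illustration (not the actual offending code from this PR) of the kind of signed/unsigned mismatch that one toolchain's warning settings let through while another's, with warnings promoted to errors, rejects:

```cpp
#include <cstdint>

// Hypothetical stand-in: mixing a signed loop index with an unsigned bound
// triggers -Wsign-compare; a build that treats warnings as errors fails here,
// while a more lenient default configuration only warns.
void zero_rows(float* out, uint32_t n) {
  for (int i = 0; i < n; ++i) {   // signed/unsigned comparison
    out[i] = 0.0f;
  }
}

// One conventional fix: keep the index type consistent with the bound.
void zero_rows_fixed(float* out, uint32_t n) {
  for (uint32_t i = 0; i < n; ++i) {
    out[i] = 0.0f;
  }
}
```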

@snadampal (Collaborator):
My top priority is to get my CI PR merged ASAP.
I'm working to make all the tests pass in the next two days. It would be great if you could review it in the meantime.

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024 (Pull Request resolved: pytorch#124257, approved by: https://github.com/mikekgfb)
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024 (Pull Request resolved: pytorch#124257, approved by: https://github.com/mikekgfb)
@github-actions github-actions bot deleted the malfet/enable-int4-neon branch June 1, 2024 01:58
Labels
ciflow/mps (Run MPS tests, subset of trunk), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cpu (CPU specific problem, e.g. perf, algorithm), topic: improvements (topic category)
5 participants