[ROCm] performance optimization for index select #131713
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131713
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 348162f with merge base d355678
This comment was automatically generated by Dr. CI and updates every 15 minutes.
One nit change and one necessary change requested, but otherwise LGTM if CI is green.
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
@eqy Please let me know whether you are comfortable with me changing the default to 256 for the NVIDIA case. We can change all the places that use 128 to 256 as a follow-up.
Requested a small change. Otherwise, LGTM.
@pytorchbot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
@malfet Can you please review and approve this PR?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
As observed while working on #130994, 128 threads per block seems quite low. This PR increases the default to improve performance, and also slightly refactors the code to replace the hard-coded 128 for better maintainability.
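A minimal host-side sketch of the refactoring idea described above: replace the hard-coded 128 with a single tunable constant that the launch-size computation reads from. All names here are hypothetical illustrations, not the identifiers used in the actual PR.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical tunable (not the actual PR's identifier): one constant
// instead of a literal 128 scattered across the kernel launch sites.
constexpr int kMaxThreadsPerBlock = 256;  // raised from the old default of 128

// Pick a launch block size: use the tunable maximum, but never request
// more threads than there are elements to process (minimum of 1).
int indexSelectBlockSize(std::int64_t numElements) {
  std::int64_t clamped = std::max<std::int64_t>(1, numElements);
  return static_cast<int>(
      std::min<std::int64_t>(kMaxThreadsPerBlock, clamped));
}
```

With this shape, changing the default (e.g. per-architecture, as the PR does for MI300X) touches one definition instead of every call site.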
By increasing the default max threads per block from 128 to 256, I saw the "CUDA total" time for aten::index_select drop from 44.820ms to 33.608ms when profiling an embedding script. I also tested raising the default from 128 to 256, 512, and 1024 on several different types of devices, and observed the "CUDA total" time dropping further, with more latency improvement, as the number increases. Below is one example of the latency improvement ratio:
| Max threads per block | Latency improvement |
| --- | --- |
| 128 | 1x |
| 256 | 1.33x |
| 512 | 1.44x |
| 1024 | 1.49x |
To be conservative, 512 is used as the new default max for non-MI300X devices, which is 1.44x faster than 128 with the above profiling script.
For MI300X, 1024 is used, which is 1.61x faster than 128 with the same profiling script (512 would be 1.57x faster).
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo