[ROCm] performance optimization for index select #131713
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131713
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 348162f with merge base d355678
This comment was automatically generated by Dr. CI and updates every 15 minutes.
One nit change and one necessary change requested, but otherwise LGTM if CI is green.
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
@eqy Please let me know whether you are comfortable with me changing the default to 256 for the NVIDIA case. We can change all the places that use 128 to 256 as a follow-up.
Requested a small change. Otherwise, LGTM.
@pytorchbot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
@malfet Can you please review and approve this PR?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
As observed while working on #130994, 128 threads per block seems quite low. This PR increases the default to improve performance, and also slightly refactors the code to replace the hard-coded 128 for better maintainability.
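A minimal host-side sketch of the refactoring idea described above: replace the hard-coded 128 with a single tunable constant that the launch-size computation reads from. All names here are hypothetical illustrations, not the identifiers used in the actual PR.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical tunable (not the actual PR's identifier): one constant
// instead of a literal 128 scattered across the kernel launch sites.
constexpr int kMaxThreadsPerBlock = 256;  // raised from the old default of 128

// Pick a launch block size: use the tunable maximum, but never request
// more threads than there are elements to process (minimum of 1).
int indexSelectBlockSize(std::int64_t numElements) {
  std::int64_t clamped = std::max<std::int64_t>(1, numElements);
  return static_cast<int>(
      std::min<std::int64_t>(kMaxThreadsPerBlock, clamped));
}
```

With this shape, changing the default (e.g. per-architecture, as the PR does for MI300X) touches one definition instead of every call site.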
By increasing the default max threads per block from 128 to 256, I saw the "CUDA total" time for aten::index_select drop from 44.820ms to 33.608ms when profiling an embedding script. I also tested raising the default from 128 to 256, 512, and 1024 on several different types of devices, and observed the "CUDA total" time dropping further, with more latency improvement, as the number increases. Below is one example of the latency improvement ratio:
| Max threads per block | Latency improvement |
| --- | --- |
| 128 | 1x |
| 256 | 1.33x |
| 512 | 1.44x |
| 1024 | 1.49x |
To be conservative, 512 is used as the new default max for non-MI300X devices, which is 1.44x faster than 128 with the above profiling script.
For MI300X, 1024 is used, which is 1.61x faster than 128 with the same profiling script (512 would be 1.57x faster).
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo