fix for launching kernel invalid config error when calling embedding … #130994

hongxiayang · 2024-07-17T22:39:17Z

…with large index

Fixes #130806
When an output size of 2147483648 (= 131072 * 16384) is expected in the above issue, it threw the following error:
RuntimeError: HIP error: invalid configuration argument

What happened was that the second parameter passed to hipLaunchKernel was {2147483648,1,1}, which is far out of range.
Found two issues in Indexing.cu:
1: ptrdiff_t was used but it is signed int, so outTotalSize >= 2147483648 can cause overflow at https://github.com/pytorch/pytorch/blame/39493aa93419532957e6e5ee97cae842b53b8b59/aten/src/ATen/native/cuda/Indexing.cu#L1367
2: On ROCm, std::min -> ::min did not work as expected when outTotalSize >= 2147483648

As a result, 2147483648 was sent to hipLaunchKernel as the number of threads per block, which the GPU does not support. The original code intended to set 128 threads per block. That value is debatable, since the perf would not be good on the latest powerful GPUs (maybe a TODO item for a perf update), but at least it would not cause the invalid configuration argument error.
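To make the failure concrete, here is a minimal Python sketch of one possible way a 32-bit `min` could yield the observed value. The narrowing to a 32-bit int is an assumption for illustration only (the description says only that `::min` did not work as expected), and the helper names are hypothetical, not taken from Indexing.cu.

```python
import ctypes

INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Wrap a Python int to a signed 32-bit value (two's complement)."""
    return ctypes.c_int32(value).value


def to_uint32(value: int) -> int:
    """Reinterpret a value as unsigned 32-bit, like a dim3 field."""
    return ctypes.c_uint32(value).value


def block_dim_with_int_min(out_total_size: int, max_threads: int = 128) -> int:
    # Assumed failure mode: the size is narrowed to 32-bit int before min().
    narrowed = to_int32(out_total_size)
    return to_uint32(min(narrowed, max_threads))


def block_dim_with_wide_compare(out_total_size: int, max_threads: int = 128) -> int:
    # Workaround style: compare at full width, then pick the smaller value.
    return out_total_size if out_total_size < max_threads else max_threads


out_total_size = 131072 * 16384
print(out_total_size, out_total_size > INT32_MAX)   # 2147483648 True
print(block_dim_with_int_min(out_total_size))       # 2147483648
print(block_dim_with_wide_compare(out_total_size))  # 128
```

Under that assumption, sizes below 2147483648 are unaffected, which would be consistent with the error appearing only for very large outputs.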

[Test]
Ran the same code snippet from the issue and printed the output, its dim and numel(), which now look like this:

output=tensor([[ 0.4044, -0.0244, -0.6865,  ..., -0.7800,  0.1175,  1.6726],
        [-1.0866, -0.1609,  0.3538,  ...,  1.9105,  0.7882,  1.1583],
        [-2.2079,  0.3736,  0.3610,  ..., -0.2658, -0.0459,  1.3077],
        ...,
        [ 0.8753, -0.7482, -0.1978,  ...,  0.9016,  1.1501, -0.5178],
        [-1.5845, -0.6277,  1.4520,  ...,  0.5733, -2.1198, -0.0915],
        [-0.6310, -1.0239, -0.1910,  ...,  0.4309,  0.1630,  0.3239]],
       device='cuda:0'), dim=2, numel=2147483648

Added a large tensor unit test too.

/pytorch# pytest test/nn/test_embedding.py -k test_large_tensors
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1
collected 288 items / 287 deselected / 1 selected                                                                                                                                        
Running 1 items in this shard

test/nn/test_embedding.py .                                                                                                                                                        [100%]

=========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================

pytorch-bot · 2024-07-17T22:39:20Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130994

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 42d6034 with merge base fdd0a7f:

NEW FAILURE - The following job has failed:

trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral) (gh)
RuntimeError: doctests 1/1 failed!

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jeffdaily

Admittedly, I don't yet understand what went wrong and why this fixes it. But I do see some use of long and some of int64_t, so perhaps settle on using int64_t?

aten/src/ATen/native/cuda/Indexing.cu

hongxiayang · 2024-07-19T01:32:24Z

> Admittedly, I don't yet understand what went wrong and why this fixes it. But I do see some use of long and some of int64_t, so perhaps settle on using int64_t?

I updated the PR description to explain what went wrong. I hope that helps.

hongxiayang · 2024-07-19T03:22:13Z

There is more opportunity to refactor the code to make it better. Will leave it for future work as this is an urgent ask.

facebook-github-bot · 2024-07-19T03:46:22Z

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

eqy

Is it possible to add a test for this case (perhaps decorated with @largeTensorTest if necessary)?

hongxiayang · 2024-07-19T17:10:43Z

Added unit test.
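When sizing a large-tensor test like this, a rough memory estimate helps decide how much device memory the test needs. The sketch below assumes float32 elements (4 bytes each), which is an assumption about the tensors involved; the helper names are hypothetical.

```python
def tensor_bytes(numel: int, bytes_per_element: int = 4) -> int:
    """Raw storage for a dense tensor, ignoring allocator overhead."""
    return numel * bytes_per_element


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Output size reported in the PR description: 131072 * 16384 elements.
output_numel = 131072 * 16384
print(to_gib(tensor_bytes(output_numel)))  # 8.0
```

The embedding weight and the index tensor add to this, so the real requirement is somewhat higher than the output alone.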

jeffdaily

Approve if CI signal is good.

hongxiayang · 2024-07-19T22:14:11Z

> Approve if CI signal is good.

The only failing check is that the Meta internal diff is not in sync with the external PR, because I added a unit test as suggested by @eqy after @xw285cornell imported it. @xw285cornell, please import again to sync it. Thanks.

facebook-github-bot · 2024-07-20T03:42:48Z

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

xw285cornell · 2024-07-20T07:46:42Z

@pytorchbot merge

pytorchmergebot · 2024-07-20T07:48:28Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

xw285cornell · 2024-07-20T07:49:44Z

@hongxiayang @jeffdaily thanks for the fix!

Just wondering, "2: On ROCm, std::min -> ::min did not work as expected when outTotalSize>=2147483648", shall we fix this in rocm?

pytorchmergebot · 2024-07-20T08:29:50Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

Details for Dev Infra team

Raised by workflow job

facebook-github-bot · 2024-07-20T08:31:48Z

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

pytorchmergebot · 2024-07-20T08:33:19Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

xintwfb · 2024-07-20T19:22:41Z

aten/src/ATen/native/cuda/Indexing.cu

-  int64_t selfReduceDimSize = self_.size(dim);
-  ptrdiff_t numIndex = index.numel();
-  int64_t selfNumel = self_.numel();
+  uint64_t sliceSize = getSliceSize(self_, dim, index, source_);


Thanks @hongxiayang for the fix!
In the description you mentioned issues 1 and 2, and the diff looks like it mainly fixes 1 by using uint64_t, which has a larger representable range. But I am wondering whether we should actually fix 2 (On ROCm, std::min -> ::min did not work as expected) instead? (uint64_t has a much larger representable range, but in the extreme, if our LLM context length continues to grow, could it reach a point beyond the range of uint64_t?)

hongxiayang · 2024-07-24

This pull request fixed 1 using uint64_t and worked around 2 by using "<" instead of the "min" function. The complete fix belongs in the lower-level library, which should support 64-bit integers in addition to "int". I am following up on 2 now to see why such support was not there for PyTorch to use.
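As a rough illustration of the range question raised above, the following sketch compares the failing size against the limits of a few integer widths. It is plain arithmetic, not code from the PR, and the helper name is hypothetical.

```python
def max_value(bits: int, signed: bool) -> int:
    """Largest value representable by an integer of the given width."""
    return 2 ** (bits - 1) - 1 if signed else 2**bits - 1


failing_size = 131072 * 16384  # 2147483648 elements

for name, bits, signed in [("int32", 32, True), ("int64", 64, True), ("uint64", 64, False)]:
    limit = max_value(bits, signed)
    print(f"{name}: max={limit}, holds failing size: {failing_size <= limit}")

# How many times larger than the failing size a uint64 count can get.
print(max_value(64, False) // failing_size)  # 8589934591
```

A uint64 element count leaves room for sizes billions of times larger than the one in the issue, so the integer width is unlikely to be the first limit reached.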

pytorch#130994)

…with large index

Fixes pytorch#130806

When an output size of 2147483648 (= 131072 * 16384) is expected in the above issue, it threw the following error:
RuntimeError: HIP error: invalid configuration argument

What happened was that the second parameter passed to hipLaunchKernel was {2147483648,1,1}, which is far out of range.
Found two issues in Indexing.cu:
1: ptrdiff_t was used but it is signed int, so outTotalSize >= 2147483648 can cause overflow when doing [this](https://github.com/pytorch/pytorch/blame/39493aa93419532957e6e5ee97cae842b53b8b59/aten/src/ATen/native/cuda/Indexing.cu#L1367)
2: On ROCm, std::min -> ::min did not work as expected when outTotalSize >= 2147483648

As a result, 2147483648 was sent to hipLaunchKernel as the number of threads per block, which the GPU does not support. The original code intended to set 128 threads per block. That value is debatable, since the perf would not be good on the latest powerful GPUs (maybe a TODO item for a perf update), but at least it would not cause the `invalid configuration argument` error.

[Test]
Ran the same code snippet from the [issue](pytorch#130806) and printed the output, its dim and numel(), which now look like this:

```
output=tensor([[ 0.4044, -0.0244, -0.6865,  ..., -0.7800,  0.1175,  1.6726],
        [-1.0866, -0.1609,  0.3538,  ...,  1.9105,  0.7882,  1.1583],
        [-2.2079,  0.3736,  0.3610,  ..., -0.2658, -0.0459,  1.3077],
        ...,
        [ 0.8753, -0.7482, -0.1978,  ...,  0.9016,  1.1501, -0.5178],
        [-1.5845, -0.6277,  1.4520,  ...,  0.5733, -2.1198, -0.0915],
        [-0.6310, -1.0239, -0.1910,  ...,  0.4309,  0.1630,  0.3239]],
       device='cuda:0'), dim=2, numel=2147483648
```

Added a large tensor unit test too.

```
/pytorch# pytest test/nn/test_embedding.py -k test_large_tensors
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1
collected 288 items / 287 deselected / 1 selected
Running 1 items in this shard

test/nn/test_embedding.py .                                                                                                                                                        [100%]

=========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================
```

Pull Request resolved: pytorch#130994
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell

hongxiayang · 2024-07-23T14:07:31Z

> @hongxiayang @jeffdaily thanks for the fix!
>
> Just wondering, "2: On ROCm, std::min -> ::min did not work as expected when outTotalSize>=2147483648", shall we fix this in rocm?

I will follow up on this. Thanks.

As observed while working on this fix (#130994), 128 threads per block seems quite low. This PR increases the default to improve performance, and also slightly refactors the code to replace the hard-coded 128 for better maintenance.

By increasing the default max threads per block from 128 to 256, I saw the "CUDA total" time for `aten::index_select` drop from 44.820ms to 33.608ms when profiling the embedding script below:

```
input = torch.randint(low=0, high=16032, size=[131072], device="cuda")
w = torch.randn([16032, 16384], device="cuda")
with profiler.profile(record_shapes=True) as prof:
    x = torch.nn.functional.embedding(input, w)
```

I tested changing the default from 128 to 256, 512, and 1024 on several different types of devices, and observed the "CUDA total" time dropping further, with more latency improvement as the number increases. Below is one example of the latency improvement ratio:

| Max threads per block | Latency improvement |
| --- | --- |
| 128 | 1x |
| 256 | 1.33x |
| 512 | 1.44x |
| 1024 | 1.49x |

Using 512 as the new default max for non-mi300x to be conservative, which is 1.44x faster than using 128 with the above profiling script. Using 1024 for mi300x is 1.61x faster than using 128 with the same profiling script, and using 512 is 1.57x faster.

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: #131713
Approved by: https://github.com/jeffdaily, https://github.com/syed-ahmed, https://github.com/malfet
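The reported numbers can be cross-checked with a small sketch. The speedup calculation uses the two timings quoted above; the `default_max_threads` and `threads_per_block` helpers are hypothetical illustrations of the described defaults (512 for non-mi300x, 1024 for mi300x), not the actual code in Indexing.cu.

```python
def speedup(baseline_ms: float, new_ms: float) -> float:
    """Latency improvement ratio of new_ms relative to baseline_ms."""
    return baseline_ms / new_ms


def default_max_threads(is_mi300x: bool) -> int:
    # Defaults as described in the commit message; the flag is hypothetical.
    return 1024 if is_mi300x else 512


def threads_per_block(out_total_size: int, is_mi300x: bool) -> int:
    # Clamp with a plain comparison rather than a min() call.
    limit = default_max_threads(is_mi300x)
    return out_total_size if out_total_size < limit else limit


print(round(speedup(44.820, 33.608), 2))       # 1.33
print(threads_per_block(131072 * 16384, False))  # 512
print(threads_per_block(131072 * 16384, True))   # 1024
```

The 1.33 ratio matches the 256-thread row of the table, which suggests that row was derived from these two timings.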

hongxiayang · 2024-08-13T18:30:04Z

@pytorchbot cherry-pick --onto release/2.4 -c critical

pytorchbot · 2024-08-13T18:34:43Z

Cherry picking #130994

The cherry pick PR is at #133346 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

[v2.4.1] Release Tracker #132400 (comment)

Details for Dev Infra team

Raised by workflow job

#133346) fix for launching kernel invalid config error when calling embedding … (#130994)

…with large index

Fixes #130806

Pull Request resolved: #130994
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell

(cherry picked from commit 637ab85)

Co-authored-by: hongxyan <hongxyan@amd.com>

Currently std::min -> ::min does not work as expected on ROCm when input values are >= 2147483648.

Replace `std::min` with a ternary statement. Alternatively, `std::min` can be replaced by the explicitly typed `std::min<int64_t>`.

Fixes on ROCm:
test_sort_and_select.py::TestSortAndSelectCUDA::test_sort_large_cuda_float16
error:
RuntimeError: Cannot sort dimension of length 8192

Similar PR to fix large tensors on ROCm: #130994

Pull Request resolved: #161054
Approved by: https://github.com/jeffdaily

fix for launching kernel invalid config error when calling embedding …

27624e6

…with large index

pytorch-bot bot added the release notes: cuda release notes category label Jul 17, 2024

hongxiayang requested review from jeffdaily and xw285cornell July 17, 2024 22:42

pytorchbot added the open source label Jul 17, 2024

jeffdaily requested changes Jul 17, 2024

type matching

8d77de7

hongxiayang marked this pull request as ready for review July 19, 2024 03:20

hongxiayang requested a review from eqy as a code owner July 19, 2024 03:20

hongxiayang requested a review from jeffdaily July 19, 2024 03:28

eqy reviewed Jul 19, 2024

add a large tensor test for embedding

6996be0

hongxiayang requested a review from eqy July 19, 2024 17:09

lint

42d6034

jeffdaily approved these changes Jul 19, 2024

xw285cornell approved these changes Jul 20, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 20, 2024

pytorchmergebot added the merging label Jul 20, 2024

pytorchmergebot removed the merging label Jul 20, 2024

pytorchmergebot added the merging label Jul 20, 2024

pytorchmergebot added the Merged label Jul 20, 2024

pytorchmergebot closed this in 637ab85 Jul 20, 2024

pytorchmergebot removed the merging label Jul 20, 2024

xintwfb reviewed Jul 20, 2024

hongxiayang mentioned this pull request Jul 24, 2024

[ROCm] performance optimization for index select #131713

Closed

This was referenced Jul 26, 2024

[Bug][ROCm] The embedding layer does not support long inputs vllm-project/vllm#6807

Closed

[Build/CI][ROCm] Minor simplification to Dockerfile.rocm vllm-project/vllm#6811

Merged

henrylhtsang mentioned this pull request Jul 31, 2024

[BE][typing] fix types in common pruning #132309

Closed

pruthvistony added this to the 2.4.1 milestone Aug 13, 2024

pytorchbot mentioned this pull request Aug 13, 2024

[v2.4.1] Release Tracker #132400

Closed

hongxiayang mentioned this pull request Aug 15, 2024

fix for launching kernel invalid config error when calling embedding … #133346

Merged

atalman mentioned this pull request Aug 28, 2024

Release 2.4.1 validations checklist and cherry-picks #134694

Closed

40 tasks

hongxiayang mentioned this pull request Dec 5, 2024

On AMD GPUs (ROCm 5.7-6.2), cannot backpropagate loss tensor containing more than 2e8 elements #136291

Closed

dnikolaev-amd mentioned this pull request Aug 20, 2025

[ROCm] fix large tensor sort on MI350 #161054

Closed
