amd-sriram
Contributor

@amd-sriram amd-sriram commented Mar 26, 2025

This PR changes the flag that selects the correct streamType in the CUDAPluggableAllocator class on ROCm GPUs. The TORCH_HIP_VERSION flag does not work as intended on ROCm, so it is replaced with USE_ROCM. The problem affects Distributed Fused Adam in ROCm/APEX when the nccl_ub feature is used. The change has been tested with ROCm/APEX.

See PR ROCm/apex#184
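
As a rough illustration of the change described above (a sketch, not the actual diff), the stream type alias can be selected on `USE_ROCM` instead of `TORCH_HIP_VERSION`. The types `StubCudaStream` and `StubHipStream` and the helper `stream_backend()` are hypothetical stand-ins so that the snippet compiles on its own; the real header is assumed to use PyTorch's own CUDA and HIP stream classes.

```cpp
#include <string>

// Hypothetical stand-ins for the framework's stream classes.
struct StubCudaStream {
  static const char* backend() { return "cuda"; }
};
struct StubHipStream {
  static const char* backend() { return "hip"; }
};

// Select the stream type on USE_ROCM, the flag this PR switches to,
// rather than on TORCH_HIP_VERSION, which the PR reports does not
// work as intended on ROCm.
#if defined(USE_ROCM)
using streamType = StubHipStream;
#else
using streamType = StubCudaStream;
#endif

// Hypothetical helper: reports which backend the alias resolved to.
inline std::string stream_backend() {
  return streamType::backend();
}
```

Compiling with `-DUSE_ROCM` makes `streamType` resolve to the HIP stand-in; without it, the CUDA stand-in is used.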

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

Alter the flag so that CUDAPluggableAllocator uses the correct
streamType. This issue is impacting Distributed Fused Adam in
ROCm/APEX.

See PR ROCm/apex#184

Related Issue : https://ontrack-internal.amd.com/browse/SWDEV-519796

pytorch-bot bot commented Mar 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150010

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures, 26 Pending, 5 Unrelated Failures

As of commit ea4e08c with merge base de68ddc (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: rocm AMD GPU support for Pytorch label Mar 26, 2025

linux-foundation-easycla bot commented Mar 26, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: amd-sriram / name: Sriram Kumar (ea4e08c)

@amd-sriram amd-sriram marked this pull request as ready for review March 26, 2025 08:32
@amd-sriram
Contributor Author

@pytorchbot label "release notes: rocm"

@pytorch-bot pytorch-bot bot added the release notes: rocm mandatorylabel label Mar 26, 2025
@jataylo jataylo added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm labels Mar 26, 2025

pytorch-bot bot commented Mar 26, 2025

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot commented Mar 26, 2025

To add the ciflow label ciflow/periodic please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot commented Mar 26, 2025

To add the ciflow label ciflow/inductor-rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed ciflow/rocm Trigger "default" config CI on ROCm ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/inductor-rocm Trigger "inductor" config CI on ROCm labels Mar 26, 2025
@jataylo jataylo added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Mar 26, 2025
@soulitzer soulitzer requested a review from jeffdaily March 26, 2025 22:45
@soulitzer soulitzer added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 26, 2025
@amd-sriram
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 1, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@jeffdaily
Collaborator

@pytorchbot merge -f "unrelated CI failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex.

See PR ROCm/apex#184

Pull Request resolved: pytorch#150010
Approved by: https://github.com/jeffdaily
@jithunnair-amd
Collaborator

@pytorchbot cherry-pick --onto release/2.7

pytorch-bot bot commented May 20, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot cherry-pick: error: the following arguments are required: -c/--classification

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Try @pytorchbot --help for more info.

@jithunnair-amd
Collaborator

@pytorchbot cherry-pick --onto release/2.7 -c fixnewfeature

pytorchbot pushed a commit that referenced this pull request May 20, 2025
Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex.

See PR ROCm/apex#184

Pull Request resolved: #150010
Approved by: https://github.com/jeffdaily

(cherry picked from commit a19b667)
@pytorchbot
Collaborator

Cherry picking #150010

The cherry pick PR is at #153974 and it is recommended to link a fixnewfeature cherry pick PR with an issue. The following tracker issues are updated:

@jithunnair-amd jithunnair-amd requested a review from atalman May 20, 2025 20:59
atalman pushed a commit that referenced this pull request May 22, 2025
[ROCm] Update CUDAPluggableAllocator.h (#1984) (#150010)

Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex.

See PR ROCm/apex#184

Pull Request resolved: #150010
Approved by: https://github.com/jeffdaily

(cherry picked from commit a19b667)

Co-authored-by: Sriram Kumar <skishore@amd.com>
Labels
  • ciflow/inductor-rocm (Trigger "inductor" config CI on ROCm)
  • ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
  • ciflow/rocm (Trigger "default" config CI on ROCm)
  • ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: rocm (AMD GPU support for Pytorch)
  • open source
  • release notes: rocm (mandatorylabel)
  • triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)