
Conversation

dzhulgakov
Collaborator

Summary:
Main change is to bring Caffe2's superior error messages for CUDA initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning | throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other clean up changes:

  • always cache device_count() in a static variable
  • move all ASAN macros into c10

Test Plan:
Hard to unit-test because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Differential Revision: D22824329

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D22824329

@dr-ci

dr-ci bot commented Jul 29, 2020

💊 CI failures summary and remediations

As of commit c38b4d9 (more details on the Dr. CI page):


  • 1/4 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 3/4 broken upstream at merge base ec898b1 on Aug 04 from 10:12am to 11:35am PDT (2 commits; ec898b1 - 94e8676)

🚧 2 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

Since your merge base is older than viable/strict, run these commands:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
```

Check out the recency history of this "viable master" tracking branch.


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.


@dzhulgakov
Collaborator Author

@smessmer @iotamudelta @ezyang - do you know what magic tricks I need to do to make it work with HIP? I can see that device_count is special-cased in the hipify script: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/cuda_to_hip_mappings.py#L8069 so I don't get how it was working before :)

@dzhulgakov dzhulgakov requested review from ezyang, smessmer and ngimel July 29, 2020 21:09
@ngimel
Collaborator

ngimel commented Jul 29, 2020

cc @jeffdaily for hip questions. Windows failures are real.

@jeffdaily
Collaborator

HIP failures are:

```
build/lib/libtorch_hip.so: undefined reference to `c10::hip::device_count()'
build/lib/libtorch_hip.so: undefined reference to `c10::hip::device_count_ensure_non_zero()'
```

```
@@ -21,6 +21,7 @@ configure_file(
 # and headers you add
 set(C10_CUDA_SRCS
   CUDAStream.cpp
   CUDAFunctions.cpp
```
Collaborator

See note above about adding a new source file to the hipify mappings.

Collaborator Author

I've seen that, but I don't see any .cpp files listed in cuda_to_hip_mappings.py. Note that I'm adding a .cpp file for an existing CUDAFunctions.h file and it's already listed in cuda_to_hip_mappings.

Also there's some special casing for device_count() in it but I can't figure out how it works: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/cuda_to_hip_mappings.py#L8069

Collaborator Author

Seems like it was the same missing C10_CUDA_API macro problem.

Collaborator

@ngimel left a comment

Approving, modulo HIP and Windows failures.

@ngimel
Collaborator

ngimel commented Jul 29, 2020

The C10_CUDA_API macro is likely what you need for the visibility issues.


…gs (pytorch#42249)

Summary:
Pull Request resolved: pytorch#42249

Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other clean up changes:
* cache device_count() always in a static variable
* move all asan macros in c10

Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Differential Revision: D22824329

fbshipit-source-id: 69ffc73b70f27107e367533c36ee0fc55626dbf7

@facebook-github-bot
Contributor

This pull request has been merged in 06d978a.
