[c10/cuda] Reorganize device_count() and robustly surface ASAN warnings #42249
Conversation
This pull request was exported from Phabricator. Differential Revision: D22824329
💊 CI failures summary and remediations, as of commit c38b4d9 (more details on the Dr. CI page):

🚧 2 fixed upstream failures: these were probably caused by upstream breakages that were already fixed. Please rebase on the
Force-pushed a3d5dc7 to cc906ee
@smessmer @iotamudelta @ezyang - do you know what magic tricks I need to do to make it work with HIP? I can see that
cc @jeffdaily for HIP questions. Windows failures are real.
HIP failures are
```diff
@@ -21,6 +21,7 @@ configure_file(
 # and headers you add
 set(C10_CUDA_SRCS
   CUDAStream.cpp
+  CUDAFunctions.cpp
```
See the note above about adding new source files to the hipify mappings.
I've seen that, but I don't see any .cpp files listed in cuda_to_hip_mappings.py. Note that I'm adding a .cpp file for an existing CUDAFunctions.h, which is already listed in cuda_to_hip_mappings.
Also there's some special casing for device_count() in it but I can't figure out how it works: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/cuda_to_hip_mappings.py#L8069
Seems like it was the same missing C10_CUDA_API macro problem.
Approving, modulo HIP and Windows failures.
The C10_CUDA_API macro is likely what you need for the visibility issues.
Force-pushed cc906ee to c7c8507
Force-pushed c7c8507 to 22bf5f8
Force-pushed 22bf5f8 to 4c86dfc
[c10/cuda] Reorganize device_count() and robustly surface ASAN warnings (pytorch#42249)

Summary: Pull Request resolved: pytorch#42249

Main change is to bring Caffe2's superior error messages for CUDA initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning | throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other cleanup changes:

* always cache device_count() in a static variable
* move all ASAN macros into c10

Test Plan: hard to unit-test because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1, 2, 3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Differential Revision: D22824329
fbshipit-source-id: 69ffc73b70f27107e367533c36ee0fc55626dbf7
Force-pushed 4c86dfc to c38b4d9
This pull request has been merged in 06d978a.