
Conversation

dzhulgakov
Collaborator

Summary:
Main change is to bring Caffe2's superior error messages for CUDA initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning | throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other clean up changes:

  • always cache device_count() in a static variable
  • move all ASAN macros into c10

Test Plan:
Hard to unit-test because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Differential Revision: D22824329

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D22824329

@dr-ci

dr-ci bot commented Jul 29, 2020

💊 CI failures summary and remediations

As of commit c38b4d9 (more details on the Dr. CI page):


  • 1/4 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 3/4 broken upstream at merge base ec898b1 on Aug 04 from 10:12am to 11:35am PDT (2 commits; ec898b1 - 94e8676)

🚧 2 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

Since your merge base is older than viable/strict, run these commands:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
```

Check out the recency history of this "viable master" tracking branch.


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.


@dzhulgakov
Collaborator Author

@smessmer @iotamudelta @ezyang - do you know what magic tricks I need to do to make it work with HIP? I can see that device_count is special-cased in the hipify script: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/cuda_to_hip_mappings.py#L8069 so I don't get how it was working before :)

@dzhulgakov dzhulgakov requested review from ezyang, smessmer and ngimel July 29, 2020 21:09
@ngimel
Collaborator

ngimel commented Jul 29, 2020

cc @jeffdaily for hip questions. Windows failures are real.

@jeffdaily
Collaborator

HIP failures are:

```
build/lib/libtorch_hip.so: undefined reference to `c10::hip::device_count()'
build/lib/libtorch_hip.so: undefined reference to `c10::hip::device_count_ensure_non_zero()'
```

```
@@ -21,6 +21,7 @@ configure_file(
 # and headers you add
 set(C10_CUDA_SRCS
   CUDAStream.cpp
   CUDAFunctions.cpp
```
Collaborator

See note above about adding a new source file to the hipify mappings.

Collaborator Author

I've seen that, but I don't see any .cpp files listed in cuda_to_hip_mappings.py. Note that I'm adding a .cpp file for an existing CUDAFunctions.h file and it's already listed in cuda_to_hip_mappings.

Also there's some special casing for device_count() in it but I can't figure out how it works: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/cuda_to_hip_mappings.py#L8069

Collaborator Author

Seems like it was the same missing C10_CUDA_API macro problem.

Collaborator

@ngimel left a comment

Approving, modulo HIP and Windows failures.

@ngimel
Collaborator

ngimel commented Jul 29, 2020

The C10_CUDA_API macro is likely what you need for the visibility issues.


…gs (pytorch#42249)

Summary:
Pull Request resolved: pytorch#42249

Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other clean up changes:
* cache device_count() always in a static variable
* move all asan macros in c10

Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Differential Revision: D22824329

fbshipit-source-id: 69ffc73b70f27107e367533c36ee0fc55626dbf7

@facebook-github-bot
Contributor

This pull request has been merged in 06d978a.
