Closed
Labels
module: ci (Related to continuous integration)
module: cuda (Related to torch.cuda, and CUDA support in general)
oncall: releng (In support of CI and Release Engineering)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
🐛 Describe the bug
Initially this was reported only on 2.2, but it happens on trunk as well; see https://github.com/pytorch/pytorch/actions/runs/7342826148/job/19993362477 for an example.
Repro:
```
python test/test_ops.py -k test_dtypes__refs_nextafter_cuda
```
or
```
CUDA_LAUNCH_BLOCKING=1 python test/test_ops.py TestCommonCUDA
```
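For context on what the op under test computes: `torch.nextafter` follows the C99 `nextafter` semantics, returning the next representable floating-point value after the first argument in the direction of the second. A minimal sketch of those scalar semantics using Python's `math.nextafter` (no GPU or torch needed; tensor and dtype handling are not modeled here):

```python
import math

# Scalar nextafter semantics: the next representable float64 value
# after `x` in the direction of `y`.
x = 1.0
up = math.nextafter(x, math.inf)     # smallest float64 strictly greater than 1.0
down = math.nextafter(x, -math.inf)  # largest float64 strictly less than 1.0

assert up > x > down
# The step from 1.0 upward is one ulp, i.e. 2**-52 for float64.
assert up - x == 2.0 ** -52
```

The semantics themselves are not in question here; the failure below is in the CUDA kernel launch for particular dtypes and shapes.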
Error:
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
test_dtypes_nextafter_cuda errored - num_retries_left: 3
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 916, in test_wrapper
return test(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 970, in dep_fn
return fn(slf, *args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 1145, in only_fn
return fn(self, *args, **kwargs)
File "/var/lib/jenkins/workspace/test/test_ops.py", line 1458, in test_dtypes
self.fail(msg)
AssertionError: The supported dtypes for nextafter on device type cuda are incorrect!
The following dtypes did not work in forward but are listed by the OpInfo: {torch.float64, torch.bfloat16}.
Unexpected failures raised the following errors:
torch.float64 - CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
torch.bfloat16 - CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2652, in wrapper
method(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2652, in wrapper
method(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 416, in instantiated_test
result = test(self, **param_kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 922, in test_wrapper
raise Exception(
Exception: Caused by sample input at index 8: SampleInput(input=Tensor[size=(0, 1, 3), device="cuda:0", dtype=torch.float32], args=TensorList[Tensor[size=(0, 10, 3), device="cuda:0", dtype=torch.float32]], kwargs={}, broadcasts_input=True, name='')
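The sample input that triggers the failure is a zero-size broadcast: a (0, 1, 3) tensor against a (0, 10, 3) tensor. A hedged plain-Python sketch of the standard broadcasting rule (a stand-in for illustration, not PyTorch's implementation) shows the result is an empty (0, 10, 3) tensor, i.e. the kernel runs over zero elements, which is a classic edge case for CUDA kernels:

```python
from math import prod

def broadcast_shape(a, b):
    """NumPy/PyTorch-style broadcasting of two shapes (plain-Python sketch)."""
    a, b = tuple(a), tuple(b)
    n = max(len(a), len(b))
    a = (1,) * (n - len(a)) + a   # left-pad the shorter shape with 1s
    b = (1,) * (n - len(b)) + b
    out = []
    for x, y in zip(a, b):
        if x == 1:
            out.append(y)         # size-1 dim stretches to match
        elif y == 1 or x == y:
            out.append(x)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(out)

# The failing SampleInput: input (0, 1, 3) broadcast against arg (0, 10, 3).
shape = broadcast_shape((0, 1, 3), (0, 10, 3))
assert shape == (0, 10, 3)
assert prod(shape) == 0   # the result is empty: zero elements to process
```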
To execute this test, run the following from the base repo dir:
python test/test_ops.py -k test_dtypes_nextafter_cuda
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
ETEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failure
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
test_dtypes_nextafter_cuda errored - num_retries_left: 2
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2826, in setUp
set_rng_seed(SEED)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1852, in set_rng_seed
torch.manual_seed(seed)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
torch.cuda.manual_seed_all(seed)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 126, in manual_seed_all
_lazy_call(cb, seed_all=True)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 232, in _lazy_call
callable()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 124, in cb
default_generator.manual_seed(seed)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ETEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failure
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
test_dtypes_nextafter_cuda errored - num_retries_left: 1
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2826, in setUp
set_rng_seed(SEED)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1852, in set_rng_seed
torch.manual_seed(seed)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
torch.cuda.manual_seed_all(seed)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 126, in manual_seed_all
_lazy_call(cb, seed_all=True)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 232, in _lazy_call
callable()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 124, in cb
default_generator.manual_seed(seed)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ETEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failure
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
test_dtypes_nextafter_cuda errored - num_retries_left: 0
======================================================================
ERROR [0.004s]: test_dtypes_nextafter_cuda (__main__.TestCommonCUDA)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2826, in setUp
set_rng_seed(SEED)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1852, in set_rng_seed
torch.manual_seed(seed)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
torch.cuda.manual_seed_all(seed)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 126, in manual_seed_all
_lazy_call(cb, seed_all=True)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 232, in _lazy_call
callable()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 124, in cb
default_generator.manual_seed(seed)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
----------------------------------------------------------------------
Ran 1436 tests in 218.112s
FAILED (errors=1, skipped=597, expected failures=5)
Versions
cc @ptrblck @seemethere @malfet @pytorch/pytorch-dev-infra @mruberry @ZainRizvi