GPU tests can fail with invalid memory access due to compiler generating invalid code #116289

@atalman

Description

🐛 Describe the bug

This was initially reported only on the 2.2 release branch, but it happens on trunk as well; see https://github.com/pytorch/pytorch/actions/runs/7342826148/job/19993362477 for an example.

HUD link:
https://hud.pytorch.org/hud/pytorch/pytorch/release%2F2.2/1?per_page=50&name_filter=linux-focal-cuda12.1-py3.10-gcc9-sm86

Repro:
python test/test_ops.py -k test_dtypes__refs_nextafter_cuda
or
CUDA_LAUNCH_BLOCKING=1 python test/test_ops.py TestCommonCUDA
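For reference, `torch.nextafter` returns, elementwise, the next representable floating-point value after `input` in the direction of `other`, mirroring C's `nextafter`. A minimal CPU-side sketch of the semantics using Python's stdlib `math.nextafter` (float64 only; the failing CUDA kernel computes the same thing elementwise on tensors):

```python
import math

# The next representable float64 strictly greater than 1.0 (1.0 + one ulp).
up = math.nextafter(1.0, 2.0)

# The largest float64 strictly less than 1.0.
down = math.nextafter(1.0, 0.0)

# Moving up from 0.0 gives the smallest positive subnormal.
tiny = math.nextafter(0.0, 1.0)

print(up, down, tiny)
```

The failing dtypes in the log (float64, bfloat16) are just different bit widths of the same "step to the adjacent representable value" operation.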

Error:

CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

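Because CUDA kernel launches are asynchronous, an illegal memory access is often reported at some later, unrelated call (here, `torch.cuda.synchronize()` and even `manual_seed`). Setting `CUDA_LAUNCH_BLOCKING=1`, as in the repro above, forces synchronous launches so the error surfaces at the offending kernel. A sketch of setting it from Python rather than the shell (it must be set before the CUDA context is created):

```python
import os

# Must happen before the first CUDA call — safest is before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # imported *after* setting the variable in a real run
```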
    test_dtypes_nextafter_cuda errored - num_retries_left: 3
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 916, in test_wrapper
    return test(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 970, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 1145, in only_fn
    return fn(self, *args, **kwargs)
  File "/var/lib/jenkins/workspace/test/test_ops.py", line 1458, in test_dtypes
    self.fail(msg)
AssertionError: The supported dtypes for nextafter on device type cuda are incorrect!
The following dtypes did not work in forward but are listed by the OpInfo: {torch.float64, torch.bfloat16}.
Unexpected failures raised the following errors:
torch.float64 - CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

torch.bfloat16 - CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.



The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2652, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2652, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 416, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 922, in test_wrapper
    raise Exception(
Exception: Caused by sample input at index 8: SampleInput(input=Tensor[size=(0, 1, 3), device="cuda:0", dtype=torch.float32], args=TensorList[Tensor[size=(0, 10, 3), device="cuda:0", dtype=torch.float32]], kwargs={}, broadcasts_input=True, name='')

To execute this test, run the following from the base repo dir:
     python test/test_ops.py -k test_dtypes_nextafter_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
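The sample input that triggers the failure broadcasts a zero-element tensor: shape `(0, 1, 3)` against `(0, 10, 3)` yields an empty result of shape `(0, 10, 3)`, so a correct kernel must touch no memory at all. A pure-Python sketch of the PyTorch/NumPy broadcasting rule for these shapes (`broadcast_shapes` here is a hypothetical illustrative helper, not PyTorch's implementation):

```python
def broadcast_shapes(a, b):
    """Right-aligned elementwise broadcast of two shapes (PyTorch/NumPy rules)."""
    out = []
    # Walk from the trailing dimension; missing leading dims are treated as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and 1 not in (da, db):
            raise ValueError(f"incompatible shapes {a} and {b}")
        # A size-1 dim stretches to match the other; note 0 broadcasts with 1 to 0.
        out.append(db if da == 1 else da)
    return tuple(reversed(out))

# The shapes from the failing SampleInput: the result is empty along dim 0.
print(broadcast_shapes((0, 1, 3), (0, 10, 3)))
```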

ETEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failure
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

    test_dtypes_nextafter_cuda errored - num_retries_left: 2
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2826, in setUp
    set_rng_seed(SEED)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1852, in set_rng_seed
    torch.manual_seed(seed)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 126, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 232, in _lazy_call
    callable()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 124, in cb
    default_generator.manual_seed(seed)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


ETEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failure
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

    test_dtypes_nextafter_cuda errored - num_retries_left: 1
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2826, in setUp
    set_rng_seed(SEED)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1852, in set_rng_seed
    torch.manual_seed(seed)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 126, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 232, in _lazy_call
    callable()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 124, in cb
    default_generator.manual_seed(seed)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


ETEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failure
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

    test_dtypes_nextafter_cuda errored - num_retries_left: 0

======================================================================
ERROR [0.004s]: test_dtypes_nextafter_cuda (__main__.TestCommonCUDA)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2826, in setUp
    set_rng_seed(SEED)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1852, in set_rng_seed
    torch.manual_seed(seed)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 126, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 232, in _lazy_call
    callable()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/random.py", line 124, in cb
    default_generator.manual_seed(seed)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


----------------------------------------------------------------------
Ran 1436 tests in 218.112s

FAILED (errors=1, skipped=597, expected failures=5)

Versions

cc @ptrblck @seemethere @malfet @pytorch/pytorch-dev-infra @mruberry @ZainRizvi


Labels: module: ci (Related to continuous integration), module: cuda (Related to torch.cuda, and CUDA support in general), oncall: releng (In support of CI and Release Engineering), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Status: Done
