NaN tensor values problem for GTX16xx users (no problem on other devices) #7908

@YipKo

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training, Validation

Bug

I trained YOLOv5 on the demo dataset (coco128) and found that the box and obj losses are NaN. In addition, no detections appear on the validation images. This only happens on a GTX 1660 Ti in GPU mode; training on CPU, on Google Colab (Tesla K80), or on an RTX 2070 works fine.
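To make the symptom easy to check on another machine, here is a minimal sketch that reports which loss components are NaN. The helper name `nan_loss_components` and the sample values are hypothetical and not part of YOLOv5; only the box and obj entries mirror what I observed.

```python
import math


def nan_loss_components(losses):
    """Return the names of loss components whose value is NaN.

    `losses` maps a component name to anything float() accepts,
    e.g. a Python float or a zero-dimensional tensor.
    """
    return [name for name, value in losses.items() if math.isnan(float(value))]


if __name__ == "__main__":
    # Illustrative values only: box and obj are NaN as in this report,
    # the cls value is a made-up finite number.
    sample = {"box": float("nan"), "obj": float("nan"), "cls": 0.02}
    print(nan_loss_components(sample))
```

Running the same check on the CPU and on the GPU should show whether the NaN values are tied to the device.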
Environment

  • Windows 10 10.0.19044.1706
  • YOLOv5-6.1 (version 6.1)
  • Nvidia GTX 1660 TI, 6 GB
  • Python3.9
  • cudatoolkit-11.3.1
  • pytorch-1.11.0-py3.9_cuda11.3_cudnn8_0
  • (also tried pytorch-1.11.0-py3.9_cuda11.5_cudnn8_0)
  • (with dependencies installed correctly)

Minimal Reproducible Example

The command used for training is
python train.py

Additional

Other issues also discuss the same problem.

However, I have tried PyTorch built against CUDA 11.5 (whose cuDNN version is 8.3.0 > 8.2.2), and I also tried downloading cuDNN from NVIDIA and copying the DLL files into the relevant folder in torch/lib. Neither solved the problem.
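When swapping CUDA builds or cuDNN DLLs, it helps to confirm what PyTorch actually reports at runtime rather than trusting the package name. A minimal sketch follows; the function name `describe_torch_runtime` is my own, and it returns `None` for anything PyTorch cannot report (for example on a CPU-only build). cuDNN is reported as an integer, not a dotted version string.

```python
def describe_torch_runtime():
    """Collect the versions PyTorch reports at runtime; None where unavailable."""
    info = {"torch": None, "cuda": None, "cudnn": None, "gpu": None}
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this environment

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_runtime().items():
        print(f"{key}: {value}")
```

Comparing this output before and after replacing the DLLs shows whether the new cuDNN is really the one being loaded.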

Another workaround is to downgrade to PyTorch built against CUDA 10.2 (tested, and it works), but this is not currently feasible because CUDA 10.2 PyTorch builds are no longer available for Windows.

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

Labels: bug