Description
Search before asking
- I have searched the YOLOv5 issues and found no similar bug report.
YOLOv5 Component
Training, Validation
Bug
I trained YOLOv5 on the demo dataset (coco128) and found that the box and obj losses are NaN. Also, no detections appear on the validation images. This only happens on a GTX 1660 Ti (GPU mode); when I use the CPU, Google Colab (Tesla K80), or an RTX 2070 for training, everything works fine.
Environment
- Windows 10 10.0.19044.1706
- YOLOv5 v6.1
- NVIDIA GTX 1660 Ti, 6 GB
- Python 3.9
- cudatoolkit-11.3.1
- pytorch-1.11.0-py3.9_cuda11.3_cudnn8_0
- (also tried pytorch-1.11.0-py3.9_cuda11.5_cudnn8_0)
- (with dependencies installed correctly)
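To confirm which CUDA and cuDNN versions PyTorch actually loads at runtime (for example after swapping DLL files by hand), something like the following sketch may help. The function name `collect_env_info` is hypothetical and only for illustration; the script returns partial information if PyTorch is not installed.

```python
import importlib.util
import platform


def collect_env_info():
    """Collect the versions PyTorch reports at runtime (hypothetical helper)."""
    info = {"python": platform.python_version(), "os": platform.platform()}
    if importlib.util.find_spec("torch") is None:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info

    import torch

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda          # None on CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()   # None if cuDNN is unavailable
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

The reported `cudnn` value is the version of the library that was loaded, which may differ from the one named in the conda package string.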
Minimal Reproducible Example
The command used for training is
python train.py
Additional
There are issues here also discussing the same problem.
- FP16 inference with Cuda 11.1 returns NaN on Nvidia GTX 1660 pytorch/pytorch#58123
- In GPU mode generated image is all black with NaN tensor values (no problems in CPU mode) openai/glide-text2im#31
- https://discuss.pytorch.org/t/half-precision-convolution-cause-nan-in-forward-pass/117358/3
- 'NAN' in model features pytorch/pytorch#69449
- I am getting nan and no predictions at all. #5815
However, I have tried PyTorch built against CUDA 11.5 (whose cuDNN version is 8.3.0 > 8.2.2), and I also tried downloading cuDNN from NVIDIA and copying the DLL files into the relevant folder in torch/lib, but the problem persists.
One workaround is to downgrade to a PyTorch build for CUDA 10.2 (tested, and it works), but this is currently not feasible because CUDA 10.2 PyTorch builds are no longer available for Windows.
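Several of the linked reports describe NaNs coming from half-precision (FP16) convolutions on GTX 1660 cards. A minimal sketch for checking this outside of YOLOv5 is shown here; the function name `fp16_conv_produces_nan` and the layer and input sizes are arbitrary assumptions, and a single small convolution may not trigger the problem even on an affected setup.

```python
import importlib.util


def fp16_conv_produces_nan(seed: int = 0):
    """Run one FP16 convolution on the GPU and report whether the output has NaN.

    Returns True or False, or None if PyTorch or a CUDA device is unavailable.
    """
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed

    import torch

    if not torch.cuda.is_available():
        return None  # no CUDA device to test on

    torch.manual_seed(seed)
    # Arbitrary small layer and input, both cast to half precision on the GPU
    conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda().half()
    x = torch.rand(1, 3, 64, 64, device="cuda").half()
    with torch.no_grad():
        y = conv(x)
    return bool(torch.isnan(y).any().item())


if __name__ == "__main__":
    result = fp16_conv_produces_nan()
    if result is None:
        print("Could not test: PyTorch or CUDA is unavailable")
    else:
        print("FP16 convolution output contains NaN:", result)
```

If this prints `True` on the GTX 1660 Ti but `False` on the other GPUs, that would suggest the NaN losses originate in the half-precision path rather than in the YOLOv5 training code.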
Are you willing to submit a PR?
- Yes I'd like to help by submitting a PR!