Always get error "ConnectionResetError: [Errno 104] Connection reset by peer" #9127

@MabinogiX

Issue description

Someone has reported this error before, but it could not be reproduced. I get it every time during training. The code keeps running after the error is printed, so the message appears again and again. It does not seem to have any effect on training.

Code example

import torch

# lmdbDataset, resizeNormalize, alignCollate and opt are defined elsewhere (not shown).
train_dataset = lmdbDataset(root=opt.trainroot, transform=resizeNormalize(size=(592, 32)))
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=opt.batchSize,
    shuffle=True, sampler=None,
    num_workers=int(opt.workers),
    collate_fn=alignCollate())
# Create a fresh iterator and fetch a single batch from it.
train_iter = iter(train_loader)
cpu_images, cpu_texts, cpu_lengths = next(train_iter)

ERROR message:

ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fec6928a9b0>>
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
    self._shutdown_workers()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
    self.worker_result_queue.get()
  File "/root/anaconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/root/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
    self._shutdown_workers()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
    self.worker_result_queue.get()
  File "/root/anaconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get  
    return _ForkingPickler.loads(res)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/root/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError:
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fec6928a9b0>>
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
    self._shutdown_workers()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
    self.worker_result_queue.get()
  File "/root/anaconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/root/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError:
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fec6928a9b0>>
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
    self._shutdown_workers()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
    self.worker_result_queue.get()
  File "/root/anaconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/root/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError:
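
A note on why training keeps running: the "Exception ignored in: <bound method _DataLoaderIter.__del__ ...>" lines indicate that the exception was raised inside a `__del__` method. Python does not propagate exceptions from `__del__`; it reports them on stderr and carries on. The sketch below illustrates only that general behaviour with the standard library. `NoisyIterator` and `run_step` are hypothetical names made up for the illustration, not part of PyTorch, and the sketch assumes CPython, where `del` on the last reference runs `__del__` immediately. It does not reproduce the DataLoader problem itself.

```python
import contextlib
import io


class NoisyIterator:
    """Hypothetical stand-in for an object whose cleanup step fails."""

    def __del__(self):
        # Simulate a failing shutdown step, similar in spirit to the error above.
        raise ConnectionResetError(104, "Connection reset by peer")


def run_step():
    it = NoisyIterator()
    del it  # __del__ raises here, but the exception is not propagated
    return "training continues"


# Capture stderr to show that the error is only reported, never raised.
buffer = io.StringIO()
with contextlib.redirect_stderr(buffer):
    result = run_step()

print(result)
print(buffer.getvalue())
```

The captured text contains the `ConnectionResetError` report even though `run_step()` returns normally, which would be consistent with the observation that the messages do not interrupt training.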

System Info

  • PyTorch or Caffe2: PyTorch
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • OS: CentOS 7
  • PyTorch version: 0.4.0
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.0
