I get a segmentation fault when trying to differentiate BatchNorm2d twice (double backward). A simple example that reproduces the error is the network:
BatchNorm2d --> Linear --> exp --> sum
Removing either BatchNorm2d or exp fixes the problem.
I am on the master branch, using Python 2.7, CUDA 8.0 and cuDNN 6.0. The error can be reproduced with the following code:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
class BatchNormTest(nn.Module):
    def __init__(self, num_classes=2):
        super(BatchNormTest, self).__init__()
        self.bn = nn.BatchNorm2d(3)
        self.linear = nn.Linear(3*4*4, num_classes)

    def forward(self, x):
        out = x
        # the following line leads to SEGFAULT
        # no SEGFAULT when commented out
        out = self.bn(out)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out
b = 4
net = BatchNormTest()
use_cuda = True
inputs = Variable(torch.rand(b,3,4,4), requires_grad=True)
if use_cuda:
    net.cuda()
    inputs = inputs.cuda()
output = net(inputs)
# this line leads to SEGFAULT
loss1 = torch.sum(torch.exp(output))
## whereas this line does not
# loss1 = torch.sum(output)
grad_params = torch.autograd.grad(loss1, inputs, create_graph=True)
grad = grad_params[0]
loss = torch.sum(grad)
loss.backward()
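For reference, the double-backward pattern the script exercises can be reduced to the calls below. This is only a minimal sketch of the autograd API involved (the Linear layer is dropped and the shapes are illustrative); it is not a second, separately verified repro.

import torch
import torch.nn as nn
from torch.autograd import Variable

bn = nn.BatchNorm2d(3)
x = Variable(torch.rand(4, 3, 4, 4), requires_grad=True)
y = torch.sum(torch.exp(bn(x)))

# First-order gradient w.r.t. the input, keeping the graph so that it
# can be differentiated a second time.
grad_x, = torch.autograd.grad(y, x, create_graph=True)

# Second-order backward through the first gradient.
torch.sum(grad_x).backward()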
gdb output:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffb698d700 (LWP 701)]
torch::autograd::BatchNormBackward::apply (this=0x4edfc718, grad_outputs=...) at torch/csrc/autograd/functions/batch_normalization.cpp:177
warning: Source file is more recent than executable.
177 grad_weight,
(gdb) where
#0 torch::autograd::BatchNormBackward::apply (this=0x4edfc718, grad_outputs=...) at torch/csrc/autograd/functions/batch_normalization.cpp:177
#1 0x00007fffecf0a392 in call_function (task=...) at torch/csrc/autograd/engine.cpp:162
#2 torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffedf93b00 <engine>, task=...) at torch/csrc/autograd/engine.cpp:167
#3 0x00007fffecf0bf39 in torch::autograd::Engine::thread_main (this=this@entry=0x7fffedf93b00 <engine>, queue=..., device=device@entry=0) at torch/csrc/autograd/engine.cpp:117
#4 0x00007fffecf27d1a in PythonEngine::thread_main (this=0x7fffedf93b00 <engine>, queue=..., device=0) at torch/csrc/autograd/python_engine.cpp:23
#5 0x00007fffecf106ee in operator()<std::shared_ptr<torch::autograd::ReadyQueue>, int, void> (__object=<optimized out>, this=<optimized out>)
at /private/home/hongyizmit/.conda/envs/torchmaster/gcc/include/c++/functional:601
#6 _M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /private/home/hongyizmit/.conda/envs/torchmaster/gcc/include/c++/functional:1732
#7 operator() (this=<optimized out>) at /private/home/hongyizmit/.conda/envs/torchmaster/gcc/include/c++/functional:1720
#8 std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (torch::autograd::Engine::*)(std::shared_ptr<torch::autograd::ReadyQueue>, int)> (torch::autograd::Engine*, std::shared_ptr<torch::autograd::ReadyQueue>, int)
> >::_M_run() (this=<optimized out>) at /private/home/hongyizmit/.conda/envs/torchmaster/gcc/include/c++/thread:115
#9 0x00007fffcf81d260 in ?? () from /private/home/hongyizmit/.conda/envs/torchmaster/lib/libstdc++.so.6
#10 0x00007ffff77c8184 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0