Description
🐛 Describe the bug
On an ARM Mac (I'm using an M2), a CPU conv1d call with torch>=1.12.0 is more than an order of magnitude slower than with torch<=1.11.0 (about 48 ms versus 1.3 ms in the profiles below). It appears that from 1.12.0 onward NNPACK is enabled on this architecture, but instead of speeding the convolution up it slows it down.
import torch
import torch.nn.functional as F
device = torch.device("cpu")
inputs = torch.randn(3300, 16, 30).to(device)
filters = torch.randn(20, 16, 5).to(device)
with torch.autograd.profiler.profile() as prof:
    output = F.conv1d(inputs, filters)
print(prof)
For torch==1.11.0 (and a few lower versions I tried):
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Name                              Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
aten::conv1d                           0.15%       2.000us       100.00%       1.310ms       1.310ms             1
aten::convolution                      0.31%       4.000us        99.85%       1.308ms       1.308ms             1
aten::_convolution                     0.69%       9.000us        99.54%       1.304ms       1.304ms             1
aten::unsqueeze                        0.23%       3.000us         0.31%       4.000us       4.000us             1
aten::as_strided                       0.08%       1.000us         0.08%       1.000us       1.000us             1
aten::unsqueeze                        0.00%       0.000us         0.46%       6.000us       6.000us             1
aten::as_strided                       0.46%       6.000us         0.46%       6.000us       6.000us             1
aten::thnn_conv2d                      0.23%       3.000us        97.79%       1.281ms       1.281ms             1
aten::_slow_conv2d_forward            96.79%       1.268ms        97.56%       1.278ms       1.278ms             1
aten::empty                            0.00%       0.000us         0.00%       0.000us       0.000us             1
aten::view                             0.15%       2.000us         0.15%       2.000us       2.000us             1
aten::empty                            0.08%       1.000us         0.08%       1.000us       1.000us             1
aten::resize_                          0.53%       7.000us         0.53%       7.000us       7.000us             1
aten::squeeze                          0.15%       2.000us         0.31%       4.000us       4.000us             1
aten::as_strided                       0.15%       2.000us         0.15%       2.000us       2.000us             1
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.310ms
But for torch==1.12.0 (and higher versions, including 2.0.1):
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Name                                     Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
aten::conv1d                                  0.01%       3.000us       100.00%      47.969ms      47.969ms             1
aten::convolution                             0.01%       4.000us        99.99%      47.966ms      47.966ms             1
aten::_convolution                            0.04%      17.000us        99.99%      47.962ms      47.962ms             1
aten::unsqueeze                               0.01%       3.000us         0.01%       4.000us       4.000us             1
aten::as_strided                              0.00%       1.000us         0.00%       1.000us       1.000us             1
aten::unsqueeze                               0.00%       0.000us         0.00%       0.000us       0.000us             1
aten::as_strided                              0.00%       0.000us         0.00%       0.000us       0.000us             1
aten::_nnpack_available                       0.00%       0.000us         0.00%       0.000us       0.000us             1
aten::_nnpack_spatial_convolution            99.92%      47.933ms        99.93%      47.936ms      47.936ms             1
aten::empty                                   0.00%       0.000us         0.00%       0.000us       0.000us             1
aten::zeros                                   0.00%       1.000us         0.01%       3.000us       3.000us             1
aten::empty                                   0.00%       1.000us         0.00%       1.000us       1.000us             1
aten::zero_                                   0.00%       1.000us         0.00%       1.000us       1.000us             1
aten::squeeze                                 0.00%       2.000us         0.01%       5.000us       5.000us             1
aten::as_strided                              0.01%       3.000us         0.01%       3.000us       3.000us             1
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 47.969ms
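To put a number on the gap, the two profiles above can be compared directly. The short sketch below only does arithmetic on figures copied from those tables; the dictionary and function names are made up for illustration. It suggests the slowdown is roughly 37x, both for the whole call and for the convolution kernel alone.

```python
# Timings in milliseconds, copied from the two profiler tables above.
total_ms = {"1.11.0": 1.310, "1.12.0": 47.969}   # "Self CPU time total"
kernel_ms = {"1.11.0": 1.268, "1.12.0": 47.933}  # _slow_conv2d_forward vs _nnpack_spatial_convolution


def slowdown(timings, old="1.11.0", new="1.12.0"):
    """Return how many times slower `new` is than `old`."""
    return timings[new] / timings[old]


print(f"whole call: {slowdown(total_ms):.1f}x slower")
print(f"kernel only: {slowdown(kernel_ms):.1f}x slower")
```

Since the kernel accounts for over 96% of the time in both runs, the two ratios are nearly the same, which points at the kernel swap rather than dispatch overhead.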
Building from source with "USE_NNPACK=0 python setup.py install" disabled NNPACK and produced the same profile as torch==1.11.0, with a total time of about 1.3 ms.
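For anyone who cannot rebuild from source, a runtime check might be a lighter way to test the same hypothesis. The sketch below assumes a PyTorch release that ships a torch.backends.nnpack module with is_available() and a flags(enabled=...) context manager. That module is an assumption here and may well be absent from the versions discussed in this report, so the code falls back gracefully when it is missing. The helper names nnpack_status and conv1d_nnpack_disabled are hypothetical.

```python
import importlib


def _load_nnpack_backend():
    """Return torch.backends.nnpack if this PyTorch ships it, else None (assumption: newer releases only)."""
    try:
        return importlib.import_module("torch.backends.nnpack")
    except ImportError:
        return None


def nnpack_status():
    """Summarize whether this build includes NNPACK and offers a runtime switch."""
    import torch  # imported lazily so the helper can be pasted into any session

    backend = _load_nnpack_backend()
    return {
        "torch": torch.__version__,
        # None means "cannot tell from the public API in this release"
        "built_with_nnpack": None if backend is None else bool(backend.is_available()),
        "runtime_toggle": backend is not None and hasattr(backend, "flags"),
    }


def conv1d_nnpack_disabled(inputs, filters):
    """Run F.conv1d with NNPACK switched off for the duration of the call."""
    import torch.nn.functional as F

    backend = _load_nnpack_backend()
    if backend is None or not hasattr(backend, "flags"):
        raise RuntimeError("this PyTorch build has no runtime NNPACK switch")
    with backend.flags(enabled=False):
        return F.conv1d(inputs, filters)
```

If nnpack_status() reports a runtime toggle, profiling conv1d_nnpack_disabled(inputs, filters) with the tensors from the script above should show whether the time drops back toward the 1.11.0 figure without a rebuild.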
If this is consistent across all ARM Mac CPUs, could NNPACK be disabled by default on these architectures? Or is there a better backend for M1/M2 CPUs?
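To help check whether the slowdown reproduces on other machines, here is a minimal timing sketch that others could run. A single profiled call can include one-time setup costs, so this version does a few warm-up calls and reports the median of repeated runs. The tensor shapes are the ones from the script above; the helper names median_ms and bench_conv1d, and the warm-up and repeat counts, are arbitrary choices for illustration.

```python
import statistics
import time


def median_ms(fn, warmup=3, repeats=20):
    """Median wall-clock time of fn() in milliseconds, measured after warm-up calls."""
    for _ in range(warmup):
        fn()  # discard first calls, which may include one-time initialization
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def bench_conv1d():
    """Time the conv1d call from this report on the CPU; returns (torch version, median ms)."""
    import torch  # imported here so median_ms stays usable without PyTorch
    import torch.nn.functional as F

    inputs = torch.randn(3300, 16, 30)
    filters = torch.randn(20, 16, 5)
    with torch.no_grad():  # forward pass only, as in the original script
        return torch.__version__, median_ms(lambda: F.conv1d(inputs, filters))
```

Calling `version, ms = bench_conv1d()` and posting both values together with the CPU model would make results from different ARM Macs directly comparable.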
Versions
OS: macOS 13.4 (arm64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.27.3
Libc version: N/A
Python version: 3.8.17 (default, Jul 5 2023, 15:35:58) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.4-arm64-arm-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Apple M2 Max
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==1.12.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] torch 1.12.0 pypi_0 pypi