NNPACK slows down M1/M2 Mac CPU

### 🐛 Describe the bug

On an ARM Mac (I'm using an M2), torch>=1.12.0 is slower than torch<=1.11.0 by more than an order of magnitude (about 48 ms vs. about 1.3 ms for the `conv1d` call below). It appears that from 1.12.0 onward NNPACK is enabled on this architecture, but instead of speeding the convolution up it slows it down.

```python
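import torch
import torch.nn.functional as F

# Everything runs on the CPU; no GPU/MPS backend is involved.
device = torch.device("cpu")
inputs = torch.randn(3300, 16, 30).to(device)   # (batch, in_channels, length)
filters = torch.randn(20, 16, 5).to(device)     # (out_channels, in_channels, kernel_size)

# Profile a single conv1d call to see which convolution kernel is dispatched.
with torch.autograd.profiler.profile() as prof:
    output = F.conv1d(inputs, filters)

print(prof)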
```

For torch==1.11.0 (and a few lower versions I tried):
```
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  aten::conv1d         0.15%       2.000us       100.00%       1.310ms       1.310ms             1  
             aten::convolution         0.31%       4.000us        99.85%       1.308ms       1.308ms             1  
            aten::_convolution         0.69%       9.000us        99.54%       1.304ms       1.304ms             1  
               aten::unsqueeze         0.23%       3.000us         0.31%       4.000us       4.000us             1  
              aten::as_strided         0.08%       1.000us         0.08%       1.000us       1.000us             1  
               aten::unsqueeze         0.00%       0.000us         0.46%       6.000us       6.000us             1  
              aten::as_strided         0.46%       6.000us         0.46%       6.000us       6.000us             1  
             aten::thnn_conv2d         0.23%       3.000us        97.79%       1.281ms       1.281ms             1  
    aten::_slow_conv2d_forward        96.79%       1.268ms        97.56%       1.278ms       1.278ms             1  
                   aten::empty         0.00%       0.000us         0.00%       0.000us       0.000us             1  
                    aten::view         0.15%       2.000us         0.15%       2.000us       2.000us             1  
                   aten::empty         0.08%       1.000us         0.08%       1.000us       1.000us             1  
                 aten::resize_         0.53%       7.000us         0.53%       7.000us       7.000us             1  
                 aten::squeeze         0.15%       2.000us         0.31%       4.000us       4.000us             1  
              aten::as_strided         0.15%       2.000us         0.15%       2.000us       2.000us             1  
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.310ms
```

But for torch==1.12.0 (and higher versions, including 2.0.1):
```
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         aten::conv1d         0.01%       3.000us       100.00%      47.969ms      47.969ms             1  
                    aten::convolution         0.01%       4.000us        99.99%      47.966ms      47.966ms             1  
                   aten::_convolution         0.04%      17.000us        99.99%      47.962ms      47.962ms             1  
                      aten::unsqueeze         0.01%       3.000us         0.01%       4.000us       4.000us             1  
                     aten::as_strided         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                      aten::unsqueeze         0.00%       0.000us         0.00%       0.000us       0.000us             1  
                     aten::as_strided         0.00%       0.000us         0.00%       0.000us       0.000us             1  
              aten::_nnpack_available         0.00%       0.000us         0.00%       0.000us       0.000us             1  
    aten::_nnpack_spatial_convolution        99.92%      47.933ms        99.93%      47.936ms      47.936ms             1  
                          aten::empty         0.00%       0.000us         0.00%       0.000us       0.000us             1  
                          aten::zeros         0.00%       1.000us         0.01%       3.000us       3.000us             1  
                          aten::empty         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                          aten::zero_         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                        aten::squeeze         0.00%       2.000us         0.01%       5.000us       5.000us             1  
                     aten::as_strided         0.01%       3.000us         0.01%       3.000us       3.000us             1  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 47.969ms
```
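Both profiles above time a single cold call, so part of the gap could in principle be first-call overhead. A minimal sketch for checking that the difference holds in steady state: warm up, repeat the call, and report the median. The helper name `median_time_ms` and its `warmup`/`repeats` defaults are my own choices for illustration, not part of PyTorch.

```python
import importlib.util
import statistics
import time


def median_time_ms(fn, warmup=3, repeats=20):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (first-call setup, caches)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# Only runs the conv1d measurement when torch is installed.
if __name__ == "__main__" and importlib.util.find_spec("torch") is not None:
    import torch
    import torch.nn.functional as F

    # Same shapes as in the report above.
    inputs = torch.randn(3300, 16, 30)
    filters = torch.randn(20, 16, 5)
    with torch.no_grad():
        ms = median_time_ms(lambda: F.conv1d(inputs, filters))
    print(f"torch {torch.__version__}: conv1d median {ms:.3f} ms")
```

Running the same script under each installed torch version gives directly comparable numbers.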

Building from source with `USE_NNPACK=0 python setup.py install` disables NNPACK and produces the same profile as torch==1.11.0, with a total time of about 1.3 ms.
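To confirm whether a given install was built with NNPACK, one option is to look at the build configuration string from `torch.__config__.show()`, which typically lists a `USE_NNPACK=...` entry. The sketch below parses that entry; the helper name `nnpack_build_flag` is hypothetical. It also tries the `torch.backends.nnpack` module, which as far as I know exists only in newer releases (not in the versions tested above), so that part is guarded and should be treated as an assumption.

```python
import importlib
import importlib.util
import re


def nnpack_build_flag(config_text):
    """Return the USE_NNPACK value found in a build-config string, or None."""
    match = re.search(r"USE_NNPACK=(\w+)", config_text)
    return match.group(1) if match else None


# Only inspects torch when it is installed.
if __name__ == "__main__" and importlib.util.find_spec("torch") is not None:
    import torch
    import torch.nn.functional as F

    print("USE_NNPACK:", nnpack_build_flag(torch.__config__.show()))

    # Assumption: torch.backends.nnpack is present only in newer releases.
    try:
        nnpack = importlib.import_module("torch.backends.nnpack")
    except ImportError:
        nnpack = None

    if nnpack is not None:
        print("NNPACK available:", nnpack.is_available())
        inputs = torch.randn(3300, 16, 30)
        filters = torch.randn(20, 16, 5)
        # Temporarily disable NNPACK for this call without rebuilding.
        with nnpack.flags(enabled=False):
            output = F.conv1d(inputs, filters)
        print("output shape:", tuple(output.shape))
```

If the runtime switch is available, profiling the call inside the `with` block should show whether dispatch falls back to the non-NNPACK kernel.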

If this is consistent across all ARM Mac CPUs, could NNPACK be disabled by default on these architectures? Or is there a better backend for M1/M2 CPUs?

### Versions

OS: macOS 13.4 (arm64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.27.3
Libc version: N/A

Python version: 3.8.17 (default, Jul  5 2023, 15:35:58)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.4-arm64-arm-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Apple M2 Max

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==1.12.0
[conda] numpy                     1.24.4                   pypi_0    pypi
[conda] torch                     1.12.0                   pypi_0    pypi

