NNPACK slow down M1/M2 Mac CPU #107534

@kmzzhang

🐛 Describe the bug

On an ARM Mac (I'm using an M2), torch>=1.12.0 is more than an order of magnitude slower than torch<=1.11.0 for the conv1d call below (about 48 ms versus about 1.3 ms). It appears that from 1.12.0 onward NNPACK is enabled on this architecture, but instead of speeding the convolution up, it slows it down.

import torch
import torch.nn.functional as F
device = torch.device("cpu")
inputs = torch.randn(3300, 16, 30).to(device)
filters = torch.randn(20, 16, 5).to(device)

with torch.autograd.profiler.profile() as prof:
    output = F.conv1d(inputs, filters)

print(prof)

For torch==1.11.0 (and a few earlier versions I tried):

------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  aten::conv1d         0.15%       2.000us       100.00%       1.310ms       1.310ms             1  
             aten::convolution         0.31%       4.000us        99.85%       1.308ms       1.308ms             1  
            aten::_convolution         0.69%       9.000us        99.54%       1.304ms       1.304ms             1  
               aten::unsqueeze         0.23%       3.000us         0.31%       4.000us       4.000us             1  
              aten::as_strided         0.08%       1.000us         0.08%       1.000us       1.000us             1  
               aten::unsqueeze         0.00%       0.000us         0.46%       6.000us       6.000us             1  
              aten::as_strided         0.46%       6.000us         0.46%       6.000us       6.000us             1  
             aten::thnn_conv2d         0.23%       3.000us        97.79%       1.281ms       1.281ms             1  
    aten::_slow_conv2d_forward        96.79%       1.268ms        97.56%       1.278ms       1.278ms             1  
                   aten::empty         0.00%       0.000us         0.00%       0.000us       0.000us             1  
                    aten::view         0.15%       2.000us         0.15%       2.000us       2.000us             1  
                   aten::empty         0.08%       1.000us         0.08%       1.000us       1.000us             1  
                 aten::resize_         0.53%       7.000us         0.53%       7.000us       7.000us             1  
                 aten::squeeze         0.15%       2.000us         0.31%       4.000us       4.000us             1  
              aten::as_strided         0.15%       2.000us         0.15%       2.000us       2.000us             1  
------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.310ms

But for torch==1.12.0 (and later versions, including 2.0.1):

-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         aten::conv1d         0.01%       3.000us       100.00%      47.969ms      47.969ms             1  
                    aten::convolution         0.01%       4.000us        99.99%      47.966ms      47.966ms             1  
                   aten::_convolution         0.04%      17.000us        99.99%      47.962ms      47.962ms             1  
                      aten::unsqueeze         0.01%       3.000us         0.01%       4.000us       4.000us             1  
                     aten::as_strided         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                      aten::unsqueeze         0.00%       0.000us         0.00%       0.000us       0.000us             1  
                     aten::as_strided         0.00%       0.000us         0.00%       0.000us       0.000us             1  
              aten::_nnpack_available         0.00%       0.000us         0.00%       0.000us       0.000us             1  
    aten::_nnpack_spatial_convolution        99.92%      47.933ms        99.93%      47.936ms      47.936ms             1  
                          aten::empty         0.00%       0.000us         0.00%       0.000us       0.000us             1  
                          aten::zeros         0.00%       1.000us         0.01%       3.000us       3.000us             1  
                          aten::empty         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                          aten::zero_         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                        aten::squeeze         0.00%       2.000us         0.01%       5.000us       5.000us             1  
                     aten::as_strided         0.01%       3.000us         0.01%       3.000us       3.000us             1  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 47.969ms
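To put a number on the gap, the totals from the two tables can be divided directly. This is only arithmetic on the figures reported here (a single profiler run per version), not a new measurement; the variable names are illustrative.

```python
# Totals copied from the two profiler tables in this report (milliseconds).
total_ms = {"1.11.0": 1.310, "1.12.0": 47.969}

# Self CPU time of the dominant convolution kernel in each run.
kernel_ms = {
    "aten::_slow_conv2d_forward": 1.268,          # torch==1.11.0
    "aten::_nnpack_spatial_convolution": 47.933,  # torch==1.12.0
}

overall_slowdown = total_ms["1.12.0"] / total_ms["1.11.0"]
kernel_slowdown = (
    kernel_ms["aten::_nnpack_spatial_convolution"]
    / kernel_ms["aten::_slow_conv2d_forward"]
)

print(f"overall: {overall_slowdown:.1f}x, kernel: {kernel_slowdown:.1f}x")
# prints "overall: 36.6x, kernel: 37.8x"
```

Almost all of the difference sits in the convolution kernel itself; the surrounding ops (unsqueeze, squeeze, as_strided) take a few microseconds in both runs.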

Building from source with "USE_NNPACK=0 python setup.py install" disables NNPACK and gives the same profile as torch==1.11.0, with a total time of about 1.3 ms.
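Since each profile above is a single call, a repeated wall-clock measurement may help confirm that the gap is not a first-call artifact. Below is a minimal sketch, assuming the same tensor shapes as the snippet above; `median_time_ms` and `time_conv1d` are hypothetical helper names, and the warm-up and repeat counts are arbitrary.

```python
import statistics
import time


def median_time_ms(fn, repeats=20, warmup=3):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def time_conv1d():
    """Time the conv1d call from this report; requires torch to be installed."""
    import torch
    import torch.nn.functional as F

    inputs = torch.randn(3300, 16, 30)
    filters = torch.randn(20, 16, 5)
    with torch.no_grad():
        return median_time_ms(lambda: F.conv1d(inputs, filters))
```

Running `print(time_conv1d())` under each install (1.11.0, 1.12.0, and the USE_NNPACK=0 build) should give numbers comparable to the profiler totals if the slowdown is steady-state.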

If this is consistent across all ARM Mac CPUs, perhaps NNPACK could be disabled on these architectures by default? Or are there better backends for these M1/M2 CPUs?
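As a possible alternative to rebuilding, here is a sketch of switching NNPACK off at runtime. It assumes the installed build exposes a `torch.backends.nnpack.flags(enabled=...)` context manager, which I believe exists only in releases newer than the ones tested above; on builds without it the helper silently does nothing. `nnpack_disabled` is a hypothetical name, and the torch module is passed in as a parameter so the helper can be exercised without importing torch.

```python
import contextlib


def nnpack_disabled(torch_module):
    """Return a context manager that turns NNPACK off if the build allows it.

    Assumption: newer builds provide torch_module.backends.nnpack.flags.
    When that attribute is missing, a no-op context manager is returned,
    so the wrapped code still runs but NNPACK stays as it was.
    """
    backends = getattr(torch_module, "backends", None)
    nnpack = getattr(backends, "nnpack", None)
    flags = getattr(nnpack, "flags", None)
    if flags is None:
        return contextlib.nullcontext()
    return flags(enabled=False)
```

Usage would look like `with nnpack_disabled(torch): output = F.conv1d(inputs, filters)`. If the profile then shows `aten::_slow_conv2d_forward` instead of `aten::_nnpack_spatial_convolution`, the switch took effect.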

Versions

OS: macOS 13.4 (arm64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.27.3
Libc version: N/A

Python version: 3.8.17 (default, Jul 5 2023, 15:35:58) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.4-arm64-arm-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Apple M2 Max

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==1.12.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] torch 1.12.0 pypi_0 pypi

Metadata

Assignees

No one assigned

    Labels

    module: nnpack (Related to our NNPack integration)
    triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
