Implement Metal Backend for macOS #1039
Conversation
Force-pushed from 2a9f3f0 to 26a488e.
Force-pushed from 774bb60 to 152123a.
The Metal backend is extracted from the CoreML backend. This removes all of the CoreML processing and leaves only the Metal backend, maximizing compatibility with KataGo distributed training.
- Updated project definition to conditionally include languages (CXX, Swift) based on the USE_BACKEND variable.
- Updated the `processScoreValues` function to use `modelVersion` instead of a generic `version`.
- Refactored score value data extraction with proper assertions based on the model version.
- Removed unused variables related to score values and cleaned up memory management for better efficiency and clarity.
Added `-Wno-cast-qual` and `-Wno-c11-extensions` to the CMAKE_CXX_FLAGS for AppleClang builds. This suppresses specific warnings related to cast qualifiers and C11 extensions, giving a cleaner build output while keeping the existing warnings for other compilers.
This commit introduces a new section in the Compiling.md file that provides detailed instructions for compiling KataGo on macOS. It includes prerequisites, installation commands, and specific recommendations for backend configuration, ensuring users have a clear guide to successfully compile and run KataGo on Mac systems.
- Removed redundant model dimensions (modelXLen and modelYLen), as they were replaced with nnXLen and nnYLen.
- Consolidated policy result element variables, simplifying their calculations and reducing memory footprint.
- Removed unused modelPolicyResultChannels from the InputBuffers structure.
Force-pushed from 152123a to 0ceb90f.
I've looked over the code now. Thanks for writing this and getting this all tested! I left some questions as individual inline comments, and also have a few further overall questions:
- Would it be possible to move the cpp/macos folder to cpp/external/macos and update any relevant references and paths to work properly with it there? That way, externally licensed code is better contained in one place, the cpp/external folder.
- Since I've looked at the code now, from here forward, would it be possible to avoid rebase/force-pushing when updating the branch? That way, it's easier to review because I can better follow incremental diffs, without the commits of the entire branch seeming to change from Git's perspective.
- I see that one of the commits pushed was to dehardcode the FP32 data type in places. However, am I correct that this does not go the full way to supporting FP16? Given that the testing for Metal so far does not cover FP16 and the amount of error it introduces for this backend, I would actually prefer that users cannot activate FP16 in the backend ("auto" should choose FP32 and explicitly setting FP16 should be rejected), and that the generalization of data types cannot result in Metal deciding on its own to use FP16 depending on the device. I wanted to confirm - is this the case right now?
Thanks!
/// This function applies the Mish activation function on the input tensor `x`. The Mish function is defined as
/// x * tanh(Softplus(x)), where Softplus(x) is defined as log(1 + exp(min(x, 10.39))) if x < 10.39 and x otherwise.
/// The threshold of softplus is modified to 10.39, which is different from the original 20. This is because
/// exp(10.39) = 32532.666936 < 32767.0 < 65504.0, so the result of exp(10.39) can be represented by float16. If the threshold
/// of softplus is 20, the result of exp(20) is 485165195.40979004, which is out of range of float16.
Where in the code is this threshold of 10.39 or 20 being applied?
The current logic seems to just unconditionally apply exp and log. How does it avoid floating point overflow?
A threshold value of 10.39 or 20 was used in a previous revision, during which FP16 was supported in the Metal backend. In the current implementation, however, the logic unconditionally applies exp and log operations for FP32 data. Because the KataGo model normalizes inputs toward one, removing the threshold from the mish activation does not result in significant error. On the contrary, this change has led to a slight performance gain in the Metal backend, while maintaining very low numerical error.
The code comments regarding the threshold are retained above the mish function, as I may revisit FP16 support in the future to evaluate its impact on throughput and numerical accuracy in the Metal backend.
I have reintroduced the threshold in eda158f.
cpp/neuralnet/metalbackend.h (outdated)
 * @brief The x length of the CoreML model.
 */
int modelXLen = COMPILE_MAX_BOARD_LEN;

/**
 * @brief The y length of the CoreML model.
 */
int modelYLen = COMPILE_MAX_BOARD_LEN;
Are these unused now?
Thanks for finding these unused variables. I have removed them from the source code in 86ab38f.
 * @param W Width.
 * @param inputsUseNHWC Flag indicating if the input data is currently in NHWC format.
 */
void MetalProcess::convertNCHW(
I think I prefer the previous version of the code, where there was:
if(inputsUseNHWC != false) {
throw StringError("Metal backend: inputsUseNHWC = false required, other configurations not supported");
}
It means that we don't have the additional complexity of conversion here and it reduces the number of configurations to test. The KataGo engine above this is perfectly capable of specifying the input directly in NCHW format so that conversion is not needed, and there is no meaningful end-user functionality that is enabled by allowing KataGo's engine to supply a less efficient format and then to have the backend run an additional routine to convert it.
To make the tests still easy to run for the ones that try to set NHWC, the only additional thing that needs to be added, I think, is to tests/testcommon.cpp, in TestCommon::overrideForBackends, an additional case:
#elif defined(USE_METAL_BACKEND)
if(inputsNHWC != false) {
cout << "Backend is Metal, ignoring args and forcing inputsNHWC=false" << endl;
inputsNHWC = false;
}
if(useNHWC != false) {
cout << "Backend is Metal, ignoring args and forcing useNHWC=false" << endl;
useNHWC = false;
}
I'm cautious about adding a condition for the Metal backend in tests/testcommon.cpp. Although it would be logically valid, doing so introduces Metal-specific logic into a shared test file that I don't primarily maintain. My main responsibility lies in metalbackend.{h,cpp,swift}, and I aim to keep changes outside of those files minimal. This approach helps keep the Metal backend modular and easier for me to maintain.
Regarding test coverage, I can run your test cases to ensure both NCHW and NHWC formats are exercised. From my perspective, this should be sufficient and does not raise any concerns.
Additionally, since the Metal API natively supports the NHWC layout, the Metal backend has the potential to deliver better performance in future optimizations.
Okay cool. So let's leave this aspect alone for now.
There are actually an uncomfortable number of places in the current KataGo code outside of the backend code where inputsNHWC and possibly one or two other settings are hardcoded based on backend-specific knowledge as to the preferred orientation. So, I propose doing the following in a future version (not the coming release).
- I add a direct way in nninterface.h for all backends to report:
  - their default/preferred setting for this and any other similar parameters,
  - which settings are supported, even if not preferred.
- Metal can then use this interface to report to the backend-agnostic part of KataGo that NCHW is the preferred input format (skipping the conversion of the format). Optionally, you could then simplify the code by removing the conversion code, because the same interface can be used to inform KataGo that NHWC is not supported, in a backend-general way, and it will be a reliable part of the contract that KataGo will never pass that format down to the backend.
- If later you implement optimized support for NHWC as the faster format (which presumably will handle it directly rather than use the conversion logic), it will be easy to just change what the backend returns in that interface to report that NHWC is now supported and is the preferred format instead.
- In exactly the same way, OpenCL and TensorRT will also use this interface, so everywhere we can remove the backend-specific hacks about input format.
Does that sound reasonable?
It sounds great!
Even though I dehardcoded the FP32 data type in places, this does not add FP16 support at this moment. In other words, FP16 cannot be enabled in this Metal backend right now.
This Metal backend does not support FP16 at this moment, so FP16 has been removed from the configuration files for clarity.
This PR introduces a GPU backend for macOS using Metal, enabling full support for distributed training in KataGo. This addition benefits macOS users who wish to contribute to KataGo training and self-play.
Verification
The Metal backend has been tested using the `testgpuerror` command, with results demonstrating minimal floating-point error compared to the reference implementation: