Metal MSM v2 executes MSM over the BN254 curve on Apple GPUs using the Metal Shading Language (MSL). Unlike v1, which naively split the work into smaller tasks, v2 takes Tal and Koh's WebGPU MSM from ZPrize 2023 and the cuZK [LWY+23] approach as its references.
By adopting sparse matrices, it improves the Pippenger algorithm [Pip76] with a more memory-efficient storage format, and it reuses well-studied sparse-matrix algorithms, such as sparse matrix-vector multiplication (SpMV) and sparse matrix transposition, in both the preprocessing phase (e.g., radix sort via sparse matrix transpose) and the bucket-accumulation phase to achieve high parallelism.
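To make the transpose-as-radix-sort idea concrete, here is a minimal CPU-side sketch (ours, not the actual Metal kernels): for one Pippenger window, each point contributes to the bucket given by its scalar chunk, and transposing that points-by-chunk sparse matrix into CSR form groups point indices by bucket in a single counting pass.

```rust
/// For one window with `num_buckets` possible chunk values, return
/// (row_ptr, col_idx): the point indices whose scalar chunk equals
/// bucket b live in col_idx[row_ptr[b]..row_ptr[b + 1]].
fn transpose_to_buckets(chunks: &[u32], num_buckets: usize) -> (Vec<u32>, Vec<u32>) {
    // 1. Count how many points fall into each bucket.
    let mut row_ptr = vec![0u32; num_buckets + 1];
    for &c in chunks {
        row_ptr[c as usize + 1] += 1;
    }
    // 2. Exclusive prefix sum gives each bucket's starting offset.
    for b in 0..num_buckets {
        row_ptr[b + 1] += row_ptr[b];
    }
    // 3. Scatter point indices into their bucket's slot.
    let mut cursor = row_ptr.clone();
    let mut col_idx = vec![0u32; chunks.len()];
    for (i, &c) in chunks.iter().enumerate() {
        col_idx[cursor[c as usize] as usize] = i as u32;
        cursor[c as usize] += 1;
    }
    (row_ptr, col_idx)
}

fn main() {
    // Chunk values (bucket indices) for 6 points in a 3-bit window.
    let chunks = [5u32, 1, 5, 0, 1, 7];
    let (row_ptr, col_idx) = transpose_to_buckets(&chunks, 8);
    // Bucket 1 now owns points 1 and 4; bucket 5 owns points 0 and 2.
    println!("row_ptr = {row_ptr:?}\ncol_idx = {col_idx:?}");
}
```

On the GPU, the counting and prefix-sum steps map to parallel kernels, which is what makes this formulation attractive for bucket accumulation.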
We took the WebGPU MSM reference and tuned it for all input scales by auto-adjusting the workgroup size of each cuZK shader based on the SIMD width and the number of GPU cores, squeezing out better GPU utilization. In addition, dynamic window sizes speed up small and medium inputs (2^14 to 2^18) by eliminating unused sparse-matrix columns.
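As a rough illustration of these two knobs (the heuristics and function names below are ours, not the crate's actual tuning logic), the selection might look like:

```rust
/// Pick a threadgroup size that is a multiple of the SIMD width and
/// leaves enough threadgroups to keep every GPU core busy.
fn workgroup_size(simd_width: u32, gpu_cores: u32, work_items: u32) -> u32 {
    let per_core = (work_items / gpu_cores).max(simd_width);
    // Align down to the SIMD width, within Metal's 1024-thread ceiling.
    (per_core / simd_width * simd_width).clamp(simd_width, 1024)
}

/// Illustrative Pippenger heuristic: the window size grows with
/// log2(n), so small inputs get narrow windows and far fewer
/// (mostly empty) buckets, i.e., fewer unused sparse-matrix columns.
fn window_size(input_size: u32) -> u32 {
    let log_n = 32 - input_size.leading_zeros() - 1; // floor(log2 n)
    if log_n < 14 { (log_n / 2).max(2) } else { log_n - 6 }
}

fn main() {
    // e.g. an Apple M3: 32-wide SIMD groups, 10 GPU cores.
    println!("workgroup = {}", workgroup_size(32, 10, 1 << 16));
    println!("window    = {}", window_size(1 << 16));
}
```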
One thing to highlight: our implementation runs most of the computation on the GPU, yet it is still slower than CPU-only solutions such as Arkworks. However, because we target client-side devices with limited resources, a hybrid approach that leverages both the CPU and the GPU for MSM tasks and combines the results at the end can yield an implementation slightly faster than a pure-CPU one (sketched just below). Check the write-up below for estimated speedups with this hybrid method.
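A minimal sketch of that hybrid idea, assuming the `metal_variable_base_msm` entry point and signature used in the examples below; the `hybrid_msm` helper and the `gpu_share` split ratio are ours and would need per-device tuning:

```rust
use ark_bn254::{Fr as ScalarField, G1Affine, G1Projective as G};
use ark_ec::VariableBaseMSM;
use mopro_msm::msm::metal_msm::metal_variable_base_msm;
use std::thread;

/// Run part of the MSM on the GPU while the CPU handles the rest,
/// then add the two partial sums.
fn hybrid_msm(bases: &[G1Affine], scalars: &[ScalarField], gpu_share: f64) -> G {
    // Send the first `gpu_share` fraction of the input to the GPU.
    let split = (bases.len() as f64 * gpu_share) as usize;
    let (gpu_b, cpu_b) = bases.split_at(split);
    let (gpu_s, cpu_s) = scalars.split_at(split);
    thread::scope(|s| {
        // GPU half on its own thread; the Metal call blocks until done.
        let gpu = s.spawn(|| metal_variable_base_msm(gpu_b, gpu_s).unwrap());
        // CPU half runs concurrently on this thread via Arkworks.
        let cpu = G::msm(cpu_b, cpu_s).unwrap();
        // Combine the two partial MSM results.
        cpu + gpu.join().unwrap()
    })
}
```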
Metal MSM v2 works with arkworks v0.4.x; just include the crate in your Cargo.toml:

```toml
mopro-msm = { git = "https://github.com/zkmopro/gpu-acceleration.git", tag = "v0.2.0" }
```
Next, invoke MSM from your Rust code:
```rust
use mopro_msm::msm::metal_msm::{
    metal_variable_base_msm,
    test_utils::generate_random_bases_and_scalars, // optional helper
};

fn main() {
    let input_size = 1 << 16;
    let (bases, scalars) = generate_random_bases_and_scalars(input_size);
    let msm_result = metal_variable_base_msm(&bases, &scalars);
    println!("Result: {:?}", msm_result);
}
```
Because it’s compatible with Arkworks, you can seamlessly swap between Metal MSM and the Arkworks MSM implementation.
```rust
#[cfg(test)]
mod tests {
    use super::*;
    use ark_bn254::{Fr as ScalarField, G1Projective as G};
    use ark_ec::{CurveGroup, VariableBaseMSM};
    use ark_std::{test_rng, UniformRand};

    #[test]
    fn test_msm() {
        let input_size = 1 << 10;

        // Generate random EC points and scalars with Arkworks.
        let mut rng = test_rng();
        let bases = (0..input_size)
            .map(|_| G::rand(&mut rng).into_affine())
            .collect::<Vec<_>>();
        let scalars = (0..input_size)
            .map(|_| ScalarField::rand(&mut rng))
            .collect::<Vec<_>>();

        let metal_msm_result = metal_variable_base_msm(&bases, &scalars).unwrap();
        let arkworks_msm_result = G::msm(&bases, &scalars).unwrap();
        assert_eq!(metal_msm_result, arkworks_msm_result); // both backends agree
    }
}
```
Benchmarks on the BN254 curve were run on a MacBook Air with an M3 chip, with test-case setup time excluded. All times are in milliseconds; the parenthesized factor compares each scheme against the Arkworks CPU baseline (a negative factor means that many times slower, a positive factor that many times faster).
| Scheme | 2^12 | 2^14 | 2^16 | 2^18 | 2^20 | 2^22 | 2^24 |
|---|---|---|---|---|---|---|---|
| Arkworks v0.4.x (CPU, Baseline) | 6 | 19 | 69 | 245 | 942 | 3,319 | 14,061 |
| Metal MSM v0.1.0 (GPU) | 143 (-23.8x) | 273 (-14.4x) | 1,730 (-25.1x) | 10,277 (-41.9x) | 41,019 (-43.5x) | 555,877 (-167.5x) | N/A |
| Metal MSM v0.2.0 (GPU) | 134 (-22.3x) | 124 (-6.5x) | 253 (-3.7x) | 678 (-2.8x) | 1,702 (-1.8x) | 5,390 (-1.6x) | 22,241 (-1.6x) |
| ICME WebGPU MSM (GPU) | N/A | N/A | 2,719 (-39.4x) | 5,418 (-22.1x) | 17,475 (-18.6x) | N/A | N/A |
| ICICLE-Metal v3.8.0 (GPU) | 59 (-9.8x) | 54 (-2.8x) | 89 (-1.3x) | 149 (+1.6x) | 421 (+2.2x) | 1,288 (+2.6x) | 4,945 (+2.8x) |
Side notes:
- For ICME WebGPU MSM, input size 2^12 crashes machines with M3 chips; any sizes not listed on the project's GitHub page are shown as "N/A".
- For Metal MSM v0.1.0, the 2^24 benchmark was abandoned because it exceeded practical runtime.
Environment: M1 Pro, macOS 15.2, curve `ark_bn254`, dataset size 2^20 unless stated otherwise. Values are medians of 5 runs.
| Metric | v1 | v2 | Gain |
|---|---|---|---|
| End-to-end latency | 10.3 s | 0.42 s | ×24 |
| GPU occupancy | 32 % | 76 % | +44 pp |
| CPU share | 19 % | <3 % | −16 pp |
| Peak VRAM | 1.6 GB | 220 MB | −7.3× |
Key changes:
- a single sparse-matrix kernel eliminates most kernel launches and memory thrashing
- CSR buckets keep data on-device → near-zero host↔GPU traffic (see the SpMV sketch below)
- on-GPU radix sort parallelizes preprocessing
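Continuing the CSR sketch from earlier (again ours, not the shipped kernels), bucket accumulation then becomes a sparse matrix-vector product whose "multiply-add" is elliptic-curve point addition; on the GPU, one thread or SIMD group would own one bucket row, and both the CSR arrays and the points stay in device memory:

```rust
use ark_bn254::G1Projective as G;
use ark_std::Zero;

/// SpMV-style bucket accumulation: for each bucket b, sum the points
/// whose scalar chunk equals b, using the CSR arrays produced by the
/// transpose step.
fn accumulate_buckets(row_ptr: &[u32], col_idx: &[u32], points: &[G]) -> Vec<G> {
    (0..row_ptr.len() - 1)
        .map(|b| {
            col_idx[row_ptr[b] as usize..row_ptr[b + 1] as usize]
                .iter()
                .fold(G::zero(), |acc, &i| acc + points[i as usize])
        })
        .collect()
}
```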
Roadmap:
- Modern Dependencies: Update to `objc2` and `objc2-metal` (objc2)
- Metal 4: Adopt the latest Metal 4 features
- Refactor with SIMD in mind:
  - Instruction-level parallelism using vector types for faster FMA within SIMD groups
  - Memory coalescing to increase locality (e.g., structure of arrays instead of array of structures; see the layout sketch after this list)
  - Optimized input reading patterns (e.g., `[X_i || Y_i]_0^{n-1}` instead of separate coordinate arrays)
  - Latency hiding and occupancy fine-tuning
  - Minimized thread divergence
- CPU-GPU Hybrid: Research interleaving with a CPU MSM crate and update to `arkworks 0.5`
- Advanced Algorithms:
  - Elastic MSM [ZHY+24] implementation
  - Faster modular reduction with LogJump (article by Wei Jie, Barrett-Montgomery)
- Cross-platform: WGSL support with a native execution environment
- Crypto Math Library: Maintain a Metal/WebGPU crypto math library
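To illustrate the memory-coalescing item above (a sketch under our own naming, not the crate's current layout), the array-of-structures vs. structure-of-arrays difference looks like this for points with 8 x 32-bit limbs per coordinate:

```rust
// AoS: thread i reads aos[i]; limb 0 of adjacent threads sits
// size_of::<PointAos>() = 64 bytes apart, so loads are strided.
#[repr(C)]
struct PointAos {
    x: [u32; 8],
    y: [u32; 8],
}

// SoA: limb k of point i lives at x_limbs[k * n + i]; adjacent threads
// read adjacent 4-byte words, so a 32-wide SIMD group issues one
// coalesced 128-byte transaction per limb instead of 32 scattered loads.
struct PointsSoa {
    n: usize,
    x_limbs: Vec<u32>, // length n * 8, limb-major
    y_limbs: Vec<u32>, // length n * 8, limb-major
}

impl PointsSoa {
    fn x_limb(&self, point: usize, limb: usize) -> u32 {
        self.x_limbs[limb * self.n + point]
    }
}
```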
This work was initially sponsored by a joint grant from PSE and 0xPARC. It is currently incubated by PSE.