
Metal MSM

Metal MSM v2 executes multi-scalar multiplication (MSM) over the BN254 curve on Apple GPUs using the Metal Shading Language (MSL). Unlike v1, which naively split the work into smaller tasks, v2 takes Tal and Koh's WebGPU MSM from ZPrize 2023 and the cuZK [LWY+23] approach as its references.

By adopting sparse matrices, it improves on the Pippenger algorithm [Pip76] with a more memory-efficient storage format, and it reuses well-studied sparse-matrix algorithms, such as sparse matrix–vector multiplication (SMVP) and sparse-matrix transposition, in both the preprocessing phase (e.g., radix sort via sparse-matrix transpose) and the bucket-accumulation phase to achieve high parallelism.
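
For intuition, here is a CPU-side sketch of that idea for a single Pippenger window: the (point → window digit) map defines a sparse matrix, its transpose (computed below with a counting-sort pass, the serial analogue of the radix sort mentioned above) groups point indices by bucket, and bucket accumulation then reduces each row independently, one GPU thread per bucket in the real implementation. Plain i64 values stand in for EC points; this illustrates the technique, it is not the project's actual kernels.

// Illustrative CPU sketch of cuZK-style bucket accumulation for ONE window.
// i64 addition stands in for EC point addition.

/// Group point indices by their window digit via a counting-sort pass,
/// effectively transposing the (point -> digit) sparse matrix into CSR form.
fn transpose_to_buckets(digits: &[usize], num_buckets: usize) -> (Vec<usize>, Vec<usize>) {
    // row_ptr[b]..row_ptr[b+1] will index the points falling into bucket b
    let mut row_ptr = vec![0usize; num_buckets + 1];
    for &d in digits {
        row_ptr[d + 1] += 1;
    }
    for b in 0..num_buckets {
        row_ptr[b + 1] += row_ptr[b];
    }
    let mut col_idx = vec![0usize; digits.len()];
    let mut next = row_ptr.clone();
    for (point_idx, &d) in digits.iter().enumerate() {
        col_idx[next[d]] = point_idx;
        next[d] += 1;
    }
    (row_ptr, col_idx)
}

/// SMVP-like step: each bucket independently sums its points
/// (one GPU thread per bucket in the real implementation).
fn accumulate(points: &[i64], row_ptr: &[usize], col_idx: &[usize]) -> Vec<i64> {
    row_ptr
        .windows(2)
        .map(|w| col_idx[w[0]..w[1]].iter().map(|&i| points[i]).sum())
        .collect()
}

fn main() {
    let points = [10i64, 20, 30, 40];
    let digits = [1usize, 0, 1, 2]; // window digit of each scalar, window size s = 2
    let (row_ptr, col_idx) = transpose_to_buckets(&digits, 1 << 2);
    println!("{:?}", accumulate(&points, &row_ptr, &col_idx)); // [20, 40, 40, 0]
}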

We took the WebGPU MSM reference and tuned it for all input scales by auto-adjusting the workgroup size of each cuZK shader based on the SIMD width and GPU core count, squeezing out better GPU utilization. In addition, dynamic window sizes speed up small and medium inputs (2^14 – 2^18) by eliminating unused sparse-matrix columns.
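
For reference, the window size s fixes the Pippenger shape: BN254 scalars are 254 bits, so there are ceil(254/s) windows with 2^s buckets (sparse-matrix columns) each, which is why shrinking the window for small inputs removes columns that would otherwise sit empty. A minimal sketch of the derived quantities (the runtime selection rule itself is tuned per device and input size, and is not reproduced here):

/// Derived Pippenger quantities for a given window size. The rule that picks
/// `window_bits` per input size is Metal MSM's tuned policy and is not
/// reproduced here; this only shows what that choice controls.
fn pippenger_shape(window_bits: u32) -> (u32, u32) {
    const SCALAR_BITS: u32 = 254; // BN254 scalar field width
    let num_windows = (SCALAR_BITS + window_bits - 1) / window_bits; // ceil(254 / s)
    let buckets_per_window = 1u32 << window_bits; // sparse-matrix columns per window
    (num_windows, buckets_per_window)
}

fn main() {
    for s in [8u32, 12, 16] {
        let (w, b) = pippenger_shape(s);
        println!("s = {s}: {w} windows x {b} buckets per window");
    }
}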

One thing to highlight: our implementation runs most of the computation on the GPU, yet it is still slower than CPU-only solutions such as Arkworks. However, because we target client-side devices with limited resources, a hybrid approach, using both CPU and GPU for MSM and combining the results at the end, can yield an implementation slightly faster than a pure-CPU one. Check the write-up below for estimated speedups with this hybrid method.
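
Because MSM is additive over disjoint slices of its input, the hybrid split is conceptually simple: compute a GPU share and a CPU share separately, then add the two partial results. A minimal sketch, assuming the metal_variable_base_msm signature shown in the examples below and Arkworks' G::msm for the CPU half; the 50/50 split and sequential execution are our simplifications, not the project's tuned policy:

use ark_bn254::{Fr as ScalarField, G1Affine, G1Projective as G};
use ark_ec::VariableBaseMSM;
use mopro_msm::msm::metal_msm::metal_variable_base_msm;

// Hybrid MSM sketch: MSM(P, k) = MSM(P[..m], k[..m]) + MSM(P[m..], k[m..]).
fn hybrid_msm(bases: &[G1Affine], scalars: &[ScalarField]) -> G {
    // Split point: 50/50 here; in practice tuned to each device's CPU/GPU balance.
    let m = bases.len() / 2;
    // GPU half (run sequentially here for brevity; a real implementation
    // would dispatch it concurrently with the CPU half).
    let gpu_part = metal_variable_base_msm(&bases[..m], &scalars[..m]).unwrap();
    // CPU half via Arkworks.
    let cpu_part = G::msm(&bases[m..], &scalars[m..]).unwrap();
    // MSM is one big sum, so the partial results simply add up.
    gpu_part + cpu_part
}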

How to use

Metal MSM v2 works with arkworks v0.4.x; just include the crate in your Cargo.toml.

mopro-msm = { git = "https://github.com/zkmopro/gpu-acceleration.git", tag = "v0.2.0" }

Next, invoke MSM within your Rust code.

use mopro_msm::msm::metal_msm::{
    metal_variable_base_msm,
    test_utils::generate_random_bases_and_scalars,    // optional
};

fn main() {
    let input_size = 1 << 16;

    // Generate random affine bases and scalars for a quick smoke test
    let (bases, scalars) = generate_random_bases_and_scalars(input_size);

    // metal_variable_base_msm returns a Result; unwrap for brevity
    let msm_result = metal_variable_base_msm(&bases, &scalars).unwrap();

    println!("Result: {:?}", msm_result);
}

Because it’s compatible with Arkworks, you can seamlessly swap between Metal MSM and the Arkworks MSM implementation.

#[cfg(test)]
mod tests {
    use super::*;
    use ark_bn254::{Fr as ScalarField, G1Projective as G};
    use ark_ec::{CurveGroup, VariableBaseMSM};
    use ark_std::{UniformRand, test_rng};

    #[test]
    fn test_msm() {
        let input_size = 1 << 10;

        // Generate random EC points and scalars with Arkworks
        let mut rng = test_rng();
        let bases = (0..input_size)
            .map(|_| G::rand(&mut rng).into_affine())
            .collect::<Vec<_>>();
        let scalars = (0..input_size)
            .map(|_| ScalarField::rand(&mut rng))
            .collect::<Vec<_>>();

        let metal_msm_result = metal_variable_base_msm(&bases, &scalars).unwrap();
        let arkworks_msm_result = G::msm(&bases, &scalars).unwrap();

        assert_eq!(metal_msm_result, arkworks_msm_result); // GPU result matches the Arkworks CPU result
    }
}

Benchmark

Benchmarks were run on the BN254 curve on a MacBook Air (M3 chip), with test-case setup time excluded. All times are in milliseconds; the multiplier in parentheses is relative to the Arkworks CPU baseline (negative = slower than baseline, positive = faster).

| Scheme | 2^12 | 2^14 | 2^16 | 2^18 | 2^20 | 2^22 | 2^24 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Arkworks v0.4.x (CPU, Baseline) | 6 | 19 | 69 | 245 | 942 | 3,319 | 14,061 |
| Metal MSM v0.1.0 (GPU) | 143 (-23.8x) | 273 (-14.4x) | 1,730 (-25.1x) | 10,277 (-41.9x) | 41,019 (-43.5x) | 555,877 (-167.5x) | N/A |
| Metal MSM v0.2.0 (GPU) | 134 (-22.3x) | 124 (-6.5x) | 253 (-3.7x) | 678 (-2.8x) | 1,702 (-1.8x) | 5,390 (-1.6x) | 22,241 (-1.6x) |
| ICME WebGPU MSM (GPU) | N/A | N/A | 2,719 (-39.4x) | 5,418 (-22.1x) | 17,475 (-18.6x) | N/A | N/A |
| ICICLE-Metal v3.8.0 (GPU) | 59 (-9.8x) | 54 (-2.8x) | 89 (-1.3x) | 149 (+1.6x) | 421 (+2.2x) | 1,288 (+2.6x) | 4,945 (+2.8x) |

Side notes:

  • For ICME WebGPU MSM, input size 2^12 crashes machines with M3 chips; any sizes not listed on the project's GitHub page are shown as "N/A".
  • For Metal MSM v0.1.0, the 2^24 benchmark was abandoned because it exceeded practical runtime.

Profiling summary (v1 vs v2)

Environment: M1 Pro, macOS 15.2, curve ark_bn254, dataset size 2^20 unless stated otherwise. Medians of 5 runs.

| metric | v1¹ | v2² | gain |
| --- | --- | --- | --- |
| end-to-end latency | 10.3 s | 0.42 s | ×24 |
| GPU occupancy | 32 % | 76 % | +44 pp |
| CPU share | 19 % | <3 % | –16 pp |
| peak VRAM | 1.6 GB | 220 MB | –7.3× |

Key changes:

  • a single sparse-matrix kernel eliminates most kernel launches and memory thrashing
  • CSR buckets keep data on-device → near-zero host↔GPU traffic
  • on-GPU radix sort makes preprocessing parallel

Benchmarking (how-to)

Metal MSM v0.2.0 ships two Criterion benches:

| Command | What it measures |
| --- | --- |
| cargo bench -p mopro-msm --bench shaders -- --sample-size 100 --warm-up-time 3 --measurement-time 10 | Pure GPU time of each individual Metal shader (decompose, transpose, SMVP, PBPR, …) across several input sizes. Host work is minimal, so the numbers reflect kernel performance. |
| cargo bench -p mopro-msm --bench e2e -- --sample-size 100 --warm-up-time 3 --measurement-time 10 | The full metal_variable_base_msm pipeline, including host-side pre- and post-processing. Good for an overall sanity check, less ideal for micro-optimisation. |

Flags after the double dash (--) are passed straight to Criterion, letting you tweak sample count or timing budget. Typical knobs:

  • --sample-size <n> – how many timed iterations to record (default 50).
  • --warm-up-time <seconds> – CPU/GPU warm-up before sampling starts.
  • --measurement-time <seconds> – total timed run per benchmark case.
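
For example, a quicker, rougher run of the shader bench (flag values chosen arbitrarily):

cargo bench -p mopro-msm --bench shaders -- --sample-size 20 --warm-up-time 1 --measurement-time 5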

The shaders bench uses the same logarithmic dataset sizes as the e2e bench (2^10, 2^12, 2^16). Modify LOG_SIZES inside benches/shaders.rs if you need different scales.
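
For reference, that constant presumably looks something like the following; verify the actual type and values against benches/shaders.rs:

// Hypothetical shape of the dataset-size knob in benches/shaders.rs
const LOG_SIZES: [u32; 3] = [10, 12, 16]; // benchmarks run at 2^10, 2^12, 2^16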

When tuning kernels, focus on the shaders bench first; improvements there should propagate to the end-to-end numbers automatically.

Future

Technical Improvements

  • Modern Dependencies: update to objc2 and objc2-metal
  • Metal 4: Adopt latest Metal 4 features
  • Refactor with SIMD in mind:
    • Instruction-level parallelism using vector types for faster FMA within SIMD groups
    • Memory coalescing to increase locality (e.g., structure-of-arrays instead of array-of-structures; see the sketch after this list)
    • Optimized input reading patterns (e.g., interleaved [X_i || Y_i] for i = 0..n-1 instead of separate X and Y arrays)
    • Latency hiding and occupancy fine-tuning
    • Minimize thread divergence
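
To make the structure-of-arrays point concrete, here is an illustrative layout comparison; the eight-limb representation and the type names are ours, not the crate's actual types:

// Illustrative AoS vs SoA layouts for point coordinates; a 256-bit field
// element is shown as eight u32 limbs.

/// Array-of-structures: the X and Y limbs of one point sit next to each
/// other, so a kernel that only reads X strides over unused Y words
/// (poor memory coalescing on the GPU).
struct PointAos {
    x: [u32; 8],
    y: [u32; 8],
}

/// Structure-of-arrays: all X limbs are packed together, then all Y limbs,
/// so neighbouring GPU threads read neighbouring words (good coalescing).
struct PointsSoa {
    xs: Vec<[u32; 8]>,
    ys: Vec<[u32; 8]>,
}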

Algorithm & Integration

  • CPU-GPU Hybrid: research interleaving with a CPU MSM crate and update to arkworks 0.5
  • Advanced Algorithms:

Platform Expansion

  • Cross-platform: WGSL support with native execution environment
  • Crypto Math Library: Maintain a Metal/WebGPU crypto math library

Community

  • X account:
  • Telegram group:

Acknowledgements

This work was initially sponsored by a joint grant from PSE and 0xPARC. It is currently incubated by PSE.

Footnotes

  1. https://hackmd.io/@yaroslav-ya/rJkpqc_Nke

  2. https://hackmd.io/@yaroslav-ya/HyFA7XAQll


License

Dual-licensed under Apache-2.0 (LICENSE-APACHE) and MIT (LICENSE-MIT).
