
Metal MSM

Metal MSM v2 executes multi-scalar multiplication (MSM) over the BN254 curve on Apple GPUs using the Metal Shading Language (MSL). Unlike v1, which naively split the work into smaller tasks, v2 takes Tal and Koh's WebGPU MSM from ZPrize 2023 and the cuZK [LWY+23] approach as its references.

By adopting sparse matrices, it improves on the Pippenger algorithm [Pip76] with a more memory-efficient storage format, and it reuses well-studied sparse-matrix algorithms, such as sparse matrix–vector multiplication (SMVP) and sparse-matrix transposition, in both the preprocessing phase (e.g., radix sort via sparse-matrix transpose) and the bucket-accumulation phase to achieve high parallelism.
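
For intuition, here is a CPU-side sketch of that idea for a single Pippenger window: the (point → window digit) map defines a sparse matrix, its transpose (computed below with a counting-sort pass, the serial analogue of the radix sort mentioned above) groups point indices by bucket, and bucket accumulation then reduces each row independently, one GPU thread per bucket in the real implementation. Plain i64 values stand in for EC points; this illustrates the technique, it is not the project's actual kernels.

// Illustrative CPU sketch of cuZK-style bucket accumulation for ONE window.
// i64 addition stands in for EC point addition.

/// Group point indices by their window digit via a counting-sort pass,
/// effectively transposing the (point -> digit) sparse matrix into CSR form.
fn transpose_to_buckets(digits: &[usize], num_buckets: usize) -> (Vec<usize>, Vec<usize>) {
    // row_ptr[b]..row_ptr[b+1] will index the points falling into bucket b
    let mut row_ptr = vec![0usize; num_buckets + 1];
    for &d in digits {
        row_ptr[d + 1] += 1;
    }
    for b in 0..num_buckets {
        row_ptr[b + 1] += row_ptr[b];
    }
    let mut col_idx = vec![0usize; digits.len()];
    let mut next = row_ptr.clone();
    for (point_idx, &d) in digits.iter().enumerate() {
        col_idx[next[d]] = point_idx;
        next[d] += 1;
    }
    (row_ptr, col_idx)
}

/// SMVP-like step: each bucket independently sums its points
/// (one GPU thread per bucket in the real implementation).
fn accumulate(points: &[i64], row_ptr: &[usize], col_idx: &[usize]) -> Vec<i64> {
    row_ptr
        .windows(2)
        .map(|w| col_idx[w[0]..w[1]].iter().map(|&i| points[i]).sum())
        .collect()
}

fn main() {
    let points = [10i64, 20, 30, 40];
    let digits = [1usize, 0, 1, 2]; // window digit of each scalar, window size s = 2
    let (row_ptr, col_idx) = transpose_to_buckets(&digits, 1 << 2);
    println!("{:?}", accumulate(&points, &row_ptr, &col_idx)); // [20, 40, 40, 0]
}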

We took the WebGPU MSM reference and tuned it for all input scales by auto-adjusting the workgroup size of each cuZK shader based on the SIMD width and GPU core count, squeezing out better GPU utilization. In addition, dynamic window sizes speed up small and medium inputs (2^14 – 2^18) by eliminating unused sparse-matrix columns.
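
For reference, the window size s fixes the Pippenger shape: BN254 scalars are 254 bits, so there are ceil(254/s) windows with 2^s buckets (sparse-matrix columns) each, which is why shrinking the window for small inputs removes columns that would otherwise sit empty. A minimal sketch of the derived quantities (the runtime selection rule itself is tuned per device and input size, and is not reproduced here):

/// Derived Pippenger quantities for a given window size. The rule that picks
/// `window_bits` per input size is Metal MSM's tuned policy and is not
/// reproduced here; this only shows what that choice controls.
fn pippenger_shape(window_bits: u32) -> (u32, u32) {
    const SCALAR_BITS: u32 = 254; // BN254 scalar field width
    let num_windows = (SCALAR_BITS + window_bits - 1) / window_bits; // ceil(254 / s)
    let buckets_per_window = 1u32 << window_bits; // sparse-matrix columns per window
    (num_windows, buckets_per_window)
}

fn main() {
    for s in [8u32, 12, 16] {
        let (w, b) = pippenger_shape(s);
        println!("s = {s}: {w} windows x {b} buckets per window");
    }
}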

One thing to highlight: our implementation runs most of the computation on the GPU, yet it is still slower than CPU-only solutions such as Arkworks. However, because we target client-side devices with limited resources, a hybrid approach, using both CPU and GPU for MSM and combining the results at the end, can yield an implementation slightly faster than a pure-CPU one. Check the write-up below for estimated speedups with this hybrid method.
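
Because MSM is additive over disjoint slices of its input, the hybrid split is conceptually simple: compute a GPU share and a CPU share separately, then add the two partial results. A minimal sketch, assuming the metal_variable_base_msm signature shown in the examples below and Arkworks' G::msm for the CPU half; the 50/50 split and sequential execution are our simplifications, not the project's tuned policy:

use ark_bn254::{Fr as ScalarField, G1Affine, G1Projective as G};
use ark_ec::VariableBaseMSM;
use mopro_msm::msm::metal_msm::metal_variable_base_msm;

// Hybrid MSM sketch: MSM(P, k) = MSM(P[..m], k[..m]) + MSM(P[m..], k[m..]).
fn hybrid_msm(bases: &[G1Affine], scalars: &[ScalarField]) -> G {
    // Split point: 50/50 here; in practice tuned to each device's CPU/GPU balance.
    let m = bases.len() / 2;
    // GPU half (run sequentially here for brevity; a real implementation
    // would dispatch it concurrently with the CPU half).
    let gpu_part = metal_variable_base_msm(&bases[..m], &scalars[..m]).unwrap();
    // CPU half via Arkworks.
    let cpu_part = G::msm(&bases[m..], &scalars[m..]).unwrap();
    // MSM is one big sum, so the partial results simply add up.
    gpu_part + cpu_part
}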

How to use

Metal MSM v2 works with arkworks v0.4.x; just include the crate in your Cargo.toml.

mopro-msm = { git = "https://github.com/zkmopro/gpu-acceleration.git", tag = "v0.2.0" }

Next, invoke MSM within your Rust code.

use mopro_msm::msm::metal_msm::{
    metal_variable_base_msm,
    test_utils::generate_random_bases_and_scalars,    // optional
};

fn main() {
    let input_size = 1 << 16;

    // Generate random affine bases and scalars for a quick smoke test
    let (bases, scalars) = generate_random_bases_and_scalars(input_size);

    // metal_variable_base_msm returns a Result; unwrap for brevity
    let msm_result = metal_variable_base_msm(&bases, &scalars).unwrap();

    println!("Result: {:?}", msm_result);
}

Because it’s compatible with Arkworks, you can seamlessly swap between Metal MSM and the Arkworks MSM implementation.

#[cfg(test)]
mod tests {
    use super::*;
    use ark_bn254::{Fr as ScalarField, G1Projective as G};
    use ark_ec::{CurveGroup, VariableBaseMSM};
    use ark_std::{UniformRand, test_rng};

    #[test]
    fn test_msm() {
        let input_size = 1 << 10;

        // Generate random EC points and scalars with Arkworks
        let mut rng = test_rng();
        let bases = (0..input_size)
            .map(|_| G::rand(&mut rng).into_affine())
            .collect::<Vec<_>>();
        let scalars = (0..input_size)
            .map(|_| ScalarField::rand(&mut rng))
            .collect::<Vec<_>>();

        let metal_msm_result = metal_variable_base_msm(&bases, &scalars).unwrap();
        let arkworks_msm_result = G::msm(&bases, &scalars).unwrap();

        assert_eq!(metal_msm_result, arkworks_msm_result); // GPU result matches the Arkworks CPU result
    }
}

Benchmark

Benchmarks were run on the BN254 curve on a MacBook Air (M3 chip), with test-case setup time excluded. All times are in milliseconds; the multiplier in parentheses is relative to the Arkworks CPU baseline (negative = slower than baseline, positive = faster).

| Scheme | 2^12 | 2^14 | 2^16 | 2^18 | 2^20 | 2^22 | 2^24 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Arkworks v0.4.x (CPU, Baseline) | 6 | 19 | 69 | 245 | 942 | 3,319 | 14,061 |
| Metal MSM v0.1.0 (GPU) | 143 (-23.8x) | 273 (-14.4x) | 1,730 (-25.1x) | 10,277 (-41.9x) | 41,019 (-43.5x) | 555,877 (-167.5x) | N/A |
| Metal MSM v0.2.0 (GPU) | 134 (-22.3x) | 124 (-6.5x) | 253 (-3.7x) | 678 (-2.8x) | 1,702 (-1.8x) | 5,390 (-1.6x) | 22,241 (-1.6x) |
| ICME WebGPU MSM (GPU) | N/A | N/A | 2,719 (-39.4x) | 5,418 (-22.1x) | 17,475 (-18.6x) | N/A | N/A |
| ICICLE-Metal v3.8.0 (GPU) | 59 (-9.8x) | 54 (-2.8x) | 89 (-1.3x) | 149 (+1.6x) | 421 (+2.2x) | 1,288 (+2.6x) | 4,945 (+2.8x) |

Side notes:

  • For ICME WebGPU MSM, input size 2^12 crashes machines with M3 chips; any sizes not listed on the project's GitHub page are shown as "N/A".
  • For Metal MSM v0.1.0, the 2^24 benchmark was abandoned because it exceeded practical runtime.

Profiling summary (v1 vs v2)

Environment: M1 Pro, macOS 15.2, curve ark_bn254, dataset size 2^20 unless stated otherwise. Medians of 5 runs.

| metric | v1¹ | v2² | gain |
| --- | --- | --- | --- |
| end-to-end latency | 10.3 s | 0.42 s | ×24 |
| GPU occupancy | 32 % | 76 % | +44 pp |
| CPU share | 19 % | <3 % | –16 pp |
| peak VRAM | 1.6 GB | 220 MB | –7.3× |

Key changes:

  • a single sparse-matrix kernel eliminates most kernel launches and memory thrashing
  • CSR buckets keep data on-device → near-zero host↔GPU traffic
  • on-GPU radix sort makes preprocessing parallel

Benchmarking (how-to)

Metal MSM v0.2.0 ships two Criterion benches:

| Command | What it measures |
| --- | --- |
| cargo bench -p mopro-msm --bench shaders -- --sample-size 100 --warm-up-time 3 --measurement-time 10 | Pure GPU time of each individual Metal shader (decompose, transpose, SMVP, PBPR, …) across several input sizes. Host work is minimal, so the numbers reflect kernel performance. |
| cargo bench -p mopro-msm --bench e2e -- --sample-size 100 --warm-up-time 3 --measurement-time 10 | The full metal_variable_base_msm pipeline, including host-side pre- and post-processing. Good for an overall sanity check, less ideal for micro-optimisation. |

Flags after the double dash (--) are passed straight to Criterion, letting you tweak sample count or timing budget. Typical knobs:

  • --sample-size <n> – how many timed iterations to record (default 50).
  • --warm-up-time <seconds> – CPU/GPU warm-up before sampling starts.
  • --measurement-time <seconds> – total timed run per benchmark case.
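
For example, a quicker, rougher run of the shader bench (flag values chosen arbitrarily):

cargo bench -p mopro-msm --bench shaders -- --sample-size 20 --warm-up-time 1 --measurement-time 5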

The shaders bench uses the same logarithmic dataset sizes as the e2e bench (2^10, 2^12, 2^16). Modify LOG_SIZES inside benches/shaders.rs if you need different scales.
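
For reference, that constant presumably looks something like the following; verify the actual type and values against benches/shaders.rs:

// Hypothetical shape of the dataset-size knob in benches/shaders.rs
const LOG_SIZES: [u32; 3] = [10, 12, 16]; // benchmarks run at 2^10, 2^12, 2^16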

When tuning kernels, focus on the shaders bench first; improvements there should propagate to the end-to-end numbers automatically.

Future

Technical Improvements

  • Modern Dependencies: update to objc2 and objc2-metal
  • Metal 4: Adopt latest Metal 4 features
  • Refactor with SIMD in mind:
    • Instruction-level parallelism using vector types for faster FMA within SIMD groups
    • Memory coalescing to increase locality (e.g., structure-of-arrays instead of array-of-structures; see the sketch after this list)
    • Optimized input reading patterns (e.g., interleaved [X_i || Y_i] for i = 0..n-1 instead of separate X and Y arrays)
    • Latency hiding and occupancy fine-tuning
    • Minimize thread divergence
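
To make the structure-of-arrays point concrete, here is an illustrative layout comparison; the eight-limb representation and the type names are ours, not the crate's actual types:

// Illustrative AoS vs SoA layouts for point coordinates; a 256-bit field
// element is shown as eight u32 limbs.

/// Array-of-structures: the X and Y limbs of one point sit next to each
/// other, so a kernel that only reads X strides over unused Y words
/// (poor memory coalescing on the GPU).
struct PointAos {
    x: [u32; 8],
    y: [u32; 8],
}

/// Structure-of-arrays: all X limbs are packed together, then all Y limbs,
/// so neighbouring GPU threads read neighbouring words (good coalescing).
struct PointsSoa {
    xs: Vec<[u32; 8]>,
    ys: Vec<[u32; 8]>,
}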

Algorithm & Integration

  • CPU-GPU Hybrid: research interleaving with a CPU MSM crate and update to arkworks 0.5
  • Advanced Algorithms:

Platform Expansion

  • Cross-platform: WGSL support with native execution environment
  • Crypto Math Library: Maintain a Metal/WebGPU crypto math library

Community

  • X account:
  • Telegram group:

Acknowledgements

This work was initially sponsored by a joint grant from PSE and 0xPARC. It is currently incubated by PSE.

Footnotes

  1. https://hackmd.io/@yaroslav-ya/rJkpqc_Nke

  2. https://hackmd.io/@yaroslav-ya/HyFA7XAQll


License

Dual-licensed under Apache-2.0 (LICENSE-APACHE) and MIT (LICENSE-MIT).
