Conversation

ggerganov (Member)
Use #12850 to improve mat-mat MUL_MAT_ID performance:

  • Map src1 [n_embd, n_expert_used, n_tokens] -> hsrc1 [n_embd, n_tokens, n_expert]
  • Perform a regular mat-mat multiplication src0 x hsrc1 with a dynamic row count neh11(expert_id)
  • Unmap the result back to dst
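The three steps above can be sketched in plain Python (not the Metal implementation; names such as `ids`, `origin`, and `neh11` are illustrative, with only the tensor shapes taken from the description):

```python
# Hedged sketch of the MUL_MAT_ID batching idea: group activation rows
# by expert, run one regular matmul per expert, then scatter back.

def matmul(A, B):
    # A: [m][k], B: [k][n] -> [m][n], naive triple loop for clarity
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(n)]
            for i in range(m)]

def mul_mat_id(src0, src1, ids, n_expert):
    # src0[e]: per-expert weight matrix [n_out][n_embd]
    # src1[token][slot]: activation row [n_embd] per (token, expert slot)
    # ids[token][slot]: which expert handles that row
    n_tokens = len(src1)
    n_slots = len(src1[0])

    # Step 1: "map" -- group rows by expert (hsrc1), remembering where
    # each row came from so it can be unmapped later.
    hsrc1 = [[] for _ in range(n_expert)]
    origin = [[] for _ in range(n_expert)]
    for t in range(n_tokens):
        for s in range(n_slots):
            e = ids[t][s]
            hsrc1[e].append(src1[t][s])
            origin[e].append((t, s))

    # Step 2: one regular mat-mat multiply per expert; the row count
    # neh11(e) = len(hsrc1[e]) is dynamic per expert.
    dst = [[None] * n_slots for _ in range(n_tokens)]
    for e in range(n_expert):
        if not hsrc1[e]:
            continue
        w_t = [list(col) for col in zip(*src0[e])]  # [n_embd][n_out]
        out = matmul(hsrc1[e], w_t)                 # [neh11(e)][n_out]
        # Step 3: "unmap" -- scatter results back to their (token, slot).
        for row, (t, s) in zip(out, origin[e]):
            dst[t][s] = row
    return dst
```

The win comes from step 2: instead of many tiny per-row multiplies, each expert's rows are processed as one dense mat-mat multiplication, which is what benefits large batches.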
./scripts/compare-commits.sh master gg/metal-mm-id-opt -m models/qwen3-30b-a3b/ggml-model-f16.gguf -m models/qwen3-30b-a3b/ggml-model-q8_0.gguf -m models/qwen3-30b-a3b/ggml-model-q4_0-pure.gguf -m models/mixtral-8x7b-32k-fast/ggml-model-q4_0.gguf -m models/nomic-embed-text-v2-moe/ggml-model-f16.gguf -fa 1 -p 512 -n 0 -t 1
| Model | Test | t/s master | t/s gg/metal-mm-id-opt | Speedup |
| --- | --- | --- | --- | --- |
| llama 8x7B Q4_0 | pp512 | 295.20 | 651.44 | 2.21 |
| nomic-bert-moe 475M F16 | pp512 | 13083.98 | 24008.05 | 1.83 |
| qwen3moe 30B.A3B F16 | pp512 | 344.46 | 1400.07 | 4.06 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 759.53 | 1359.49 | 1.79 |
| qwen3moe 30B.A3B Q8_0 | pp512 | 707.46 | 1350.71 | 1.91 |

github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal labels May 8, 2025
@ggerganov ggerganov merged commit 611aa91 into master May 9, 2025
53 checks passed
@ggerganov ggerganov deleted the gg/metal-mm-id-opt branch May 9, 2025 12:15
LostRuins pushed a commit to LostRuins/koboldcpp that referenced this pull request May 9, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 9, 2025
* origin/master: (39 commits)
server : vision support via libmtmd (ggml-org#12898)
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
metal : optimize MoE for large batches (ggml-org#13388)
CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
llama : do not crash if there is no CPU backend (ggml-org#13395)
CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
llama-run: add support for downloading models from ModelScope (ggml-org#13370)
mtmd : fix batch_view for m-rope (ggml-org#13397)
llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
ci : limit write permission to only the release step + fixes (ggml-org#13392)
mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
server : (webui) fix a very small misalignment (ggml-org#13387)
server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
convert : support rope_scaling type and rope_type (ggml-org#13349)
mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
context : allow cache-less context for embeddings (ggml-org#13108)
...