
Conversation

ngxson
Collaborator

@ngxson ngxson commented Jun 27, 2025

Fix #14415

TODO:

@github-actions github-actions bot added the python python script changes label Jun 27, 2025
@ngxson
Collaborator Author

ngxson commented Jun 27, 2025

Ok, getting somewhere now. The model runs, but outputs gibberish:

[UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧]

@ubergarm

Thanks for working on this!

I got the same-looking output trying llama-server on ngxson/xsn/hunyuan-moe@51886a47a with the freshly converted bf16.

The only odd things I noticed were:

  1. I had to pip install tiktoken to get it to convert
  2. Conversion had an odd warning WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.
  3. On startup llama-server printed these warnings:
load: control-looking token: 127957 '<|endoftext|>' was not control-type; this is probably a bug in the model. its type will be overridden
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect

Tested on an AMD 7965WX 24x Core 256GB DDR5@4800 + Dual RTX A6000 (96GB Total VRAM) rig.

👈 a few more commands and logs fwiw

convert

python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/ \
    /mnt/raid/models/tencent/Hunyuan-A13B-Instruct/

...

llama-server

model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf

./build/bin/llama-server \
  --model "$model" \
  -fa \
  -ctk f16 -ctv f16 \
  -c 8192 \
  -ts 48,48 \
  -ngl 10 \
  --threads 24 \
  --host 127.0.0.1 \
  --port 8080

...

client

>>> User:

Tell a funny joke in English.

>>> Assistant:

[UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧]

@arch-btw
Contributor

arch-btw commented Jun 27, 2025

I don't know as much about this as you guys, but could it be that the tokenizer is splitting characters like 新 ("new") into raw bytes?

So the UTF-8 sequence 0xe696b0 becomes 3 separate bytes (e6, 96, b0). And the other character 旧 ("old") splits into 3 bytes as well (e6, 97, a7).

And so the fragments get wrapped in [UNK_BYTE_] prefixes. The token stream becomes corrupt in the output and sort of traps the model in a "new --> old" loop, which then blocks normal text generation?

Because common Chinese characters always use 3 bytes in UTF-8:

  • 新 converts to b'\xe6\x96\xb0' (3 bytes)
  • 旧 converts to b'\xe6\x97\xa7' (3 bytes)

It matches the error: [UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧]
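
A quick way to double-check those byte sequences (a minimal standalone snippet, not tied to the tokenizer code):

# print the UTF-8 byte sequences of the two characters seen in the output
for ch in ("新", "旧"):
    data = ch.encode("utf-8")
    print(ch, data.hex(), f"({len(data)} bytes)")
# 新 e696b0 (3 bytes)
# 旧 e697a7 (3 bytes)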

@ngxson
Collaborator Author

ngxson commented Jun 27, 2025

The cgraph is still not correct. Testing with this tiny random weight: https://huggingface.co/ngxson/hunyuan-moe-tiny-random/tree/main

Seems like the problem is in the self-attention block

@kooshi
Contributor

kooshi commented Jun 28, 2025

I don't know if the improvements I am seeing are from your last WIP commit or from my edits to the convert script, but I currently get almost intelligible responses.

The changes I made were:

  • specify the BOS token explicitly, as it is incorrect in Hunyuan's config.json: self.gguf_writer.add_bos_token_id(127959)
  • use tokenizer.special_tokens.values() instead of tokenizer.get_added_vocab() to determine control tokens
  • skip lm_head.weight as the embedding weights are tied
  • changed the base model from LlamaModel to TextModel for a more generic foundation

my edits are here: https://github.com/kooshi/llama.cpp/tree/hunyuan
Full disclaimer: I have no idea what I'm doing. The BOS token was definitely broken, though.
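
For context, the shape of those edits in convert_hf_to_gguf.py terms is roughly the following (a hypothetical sketch, not the actual diff; the branch above is authoritative):

@ModelBase.register("HunYuanMoEV1ForCausalLM")
class HunYuanMoEModel(TextModel):  # base on TextModel instead of LlamaModel
    model_arch = gguf.MODEL_ARCH.HUNYUAN_MOE

    def set_vocab(self):
        super().set_vocab()
        # config.json reports the wrong BOS id, so pin it explicitly
        self.gguf_writer.add_bos_token_id(127959)

    def modify_tensors(self, data_torch, name, bid):
        # the embeddings are tied, so the separate lm_head tensor is redundant
        if name == "lm_head.weight":
            return []
        return super().modify_tensors(data_torch, name, bid)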

> hello
<think>[UNK_BYTE_0x0a>
]Okay,[UNK_BYTE_0x20 the]the[UNK_BYTE_0x20 user]user[UNK_BYTE_0x20 said]said[UNK_BYTE_0x20 "]"hello".[UNK_BYTE_0x20 I]I[UNK_BYTE_0x20 need]need[UNK_BYTE_0x20 to]to[UNK_BYTE_0x20 respond]respond[UNK_BYTE_0x20 appropriately]appropriately.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]First,[UNK_BYTE_0x20 hello]hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello[UNK_BYTE_0x20 there]there![UNK_BYTE_0x0a!

][UNK_BYTE_0x0a!

]Hi[UNK_BYTE_0x20 there]there.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hi.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hello.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hi.[UNK_BYTE_0x0a.

][UNK_BYTE_0x0a.

]Hey.[UNK_BYTE_0x0a.

(continues forever)

@ngxson
Collaborator Author

ngxson commented Jun 28, 2025

The more I look at the upstream implementation, the more I wonder whether it actually works.

My Mac M3 Ultra can't load the original model even though it has 512GB of RAM.

Now, testing with the tiny weight: switching between eager and sdpa gives different outputs, which indicates that one of the two attention implementations is buggy.

Also, flash_attn does not work at all; they haven't even verified that code path before shipping (NameError: name 'flash_attn_func' is not defined).

And more importantly, attention_mask is None everywhere, even when using the example code provided on HF.

If that is true, it means they messed up badly this time.

@Downtown-Case

modeling_hunyuan.py is basically identical to the file for the old hunyuan-large, with 1 changed line:

https://www.diffchecker.com/P3e0hQM5/

https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/Hunyuan-A52B-Instruct/

And hunyuan.py (the actual model class here) is largely copied from modeling_hunyuan.py, including unused features like CLA:

https://www.diffchecker.com/P9FIR5OD/

In other words, it's almost Hunyuan-Large? I'm not sure why the HF attention implementations would be bugged. But other reimplementations like vLLM's seem to work, so maybe they can shed some light on this:

quinnrong94/vllm@5302fbf

@Downtown-Case

Downtown-Case commented Jun 28, 2025

I take that back, apparently vllm is only sometimes working with A13B, heh:

ikawrakow/ik_llama.cpp#561 (comment)

vllm-project/vllm#20183

vllm-project/vllm#20114

@Noeda
Contributor

Noeda commented Jun 28, 2025

I got the original model from Hugging Face working coherently on pure CPU. It uses the HunYuanSdpaAttention codepath.

This is all tentative as I just got it running at all:

If I compare logits for a single-token prompt, I get a very similar logit distribution from llama.cpp and from the HF implementation. With more than one token, things look different. I'm going purely by numerical token IDs for llama.cpp, since the tokenizer is messed up as observed (I tried 'a', token 64, as the single-token prompt and '12', tokens (16, 17), for the two-token test, e.g. llama-eval-callback --no-escape --model hunyuan-q8.gguf -n 1 -c 512 -p '12').

This is with the combined code from @ngxson and @kooshi, with the .gguf made using @kooshi's code (I took the latest efforts I saw here in the discussion as a starting point).
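
(For reference, the HF side of that logit comparison can be as small as the sketch below; it assumes the same local model path as the test script further down:)

#!/usr/bin/env python
# dump HF logits for the single-token prompt (token 64 == 'a') to eyeball
# against the llama-eval-callback output
from transformers import AutoModelForCausalLM
import torch

model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu",
                                             torch_dtype=torch.bfloat16, trust_remote_code=True)

with torch.no_grad():
    logits = model(input_ids=torch.tensor([[64]])).logits
print(logits[0, -1].float().topk(10))  # top-10 logits of the last position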


Below in the dropdown is the transformers test program that makes coherent text for me (up to 100 tokens, because I was too impatient to try longer prompts). I think installing accelerate and asking it to use bfloat16 really helps with memory. I think that would make it run on the M3 512GB machine too; IIRC when I did this for dots.llm1 I really had to use bfloat16 to not run out of memory.

My machine has 256GB of memory, a Hetzner server with a modern AMD EPYC CPU. I do have a Mac Studio (M2, 192GB) as well but for CPU work this Hetzner is usually much faster.

(I don't know why asking it to use bfloat16 helps; maybe it doesn't make giant copies of tensors or something when you ask it to do that. It's just something I observed and never checked what it's doing behind the scenes.)

test.py

This is a version of the example code from the Huggingface page that I modified a bit.

#!/usr/bin/env python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re

def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'

        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)

        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
                                                          enable_thinking=True # Toggle thinking mode (default: True)
                                                      )

        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=20)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()
stdout of test.py

The output includes both the token IDs and the decoded text (two print() calls). To run this, you need to install accelerate into your Python environment for the device_map argument.

(hunyuan) shannon@soga ~/hunyuan_llama.cpp/hf> ./test.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.09it/s]
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
tensor([[127958,   8144,    264,   2875,  12399,    315,    279,   7720,    315,
           5912,  10368, 127962,  14023,    771,    397,  33413,     11,    358,
           1205,    311,   3350,    264,   2875,  12399,    922,    279,   7720,
            315,   5912,  10368,     13,   6914]])
<|startoftext|>Write a short summary of the benefits of regular exercise<|extra_0|><think>
Okay, I need to write a short summary about the benefits of regular exercise. Let

I'm on and off this weekend trying to also figure out where computation graph is off exactly. If I find out before someone else does, I'll let you all know.

(Runs surprisingly fast on transformers+CPU, I'm used to that combo being extraordinarily slow. It is still very slow, just not like "it will take 30 minutes to make 10 tokens" slow).

@jacekpoplawski
Contributor

Is it possible to load this model in 4-bit precision using Transformers? Does bitsandbytes support this model? I’m limited to a total of 72GB of VRAM across several GPUs, so bfloat16 won’t work for me.

@ubergarm

ubergarm commented Jun 28, 2025

@jacekpoplawski

Is it possible to load this model in 4-bit precision using Transformers? Does bitsandbytes support this model? I’m limited to a total of 72GB of VRAM across several GPUs, so bfloat16 won’t work for me.

Their official inference script for running the int4 quant on vllm is using --dtype bfloat16

(still didn't work for me though)

@Noeda
Contributor

Noeda commented Jun 28, 2025

To add to @ubergarm's options, I did notice there are some quantized versions like https://huggingface.co/tencent/Hunyuan-A13B-Instruct-FP8 or https://huggingface.co/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4 (they look like they are designed to work with transformers at first glance; I've never in my entire life run vLLM or sglang even once).

The GPTQ-Int4 one has a single model.safetensors at 43.7GB which maybe works. One would hope 😉

Haven't tried any of them. For computation-graph work it feels better to use whatever is the highest precision I can run conveniently.

@ngxson
Collaborator Author

ngxson commented Jun 28, 2025

If someone can run it, could you please verify whether attention_mask inside HunYuanDecoderLayer has a non-None value? Thanks.

@Noeda
Contributor

Noeda commented Jun 28, 2025

(hunyuan) shannon@soga ~/hunyuan_llama.cpp/hf> ./test2.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.91it/s]
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None

@ngxson is this the part you wanted to check for None, i.e. the attention_mask argument to forward()?

(screenshot: the added print inside HunYuanDecoderLayer's forward())

Edit: took a bigger screenshot to show more clearly where I put that: HunYuanDecoderLayer's forward(). The line numbers you see won't match the original because I have more print() debugging at the top of the file and other hacky stuff I added.

Stdout tail, because the first paste is cut off; I see None throughout the entire run. Output looks coherent.

Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
Attention mask foobar:  None
tensor([[127958,   8144,    264,   2875,  12399,    315,    279,   7720,    315,
           5912,  10368, 127962,  14023,    771,    397,  33413,     11,    358,
           1205,    311,   3350,    264,   2875,  12399,    922,    279,   7720,
            315,   5912,  10368,     13,   6914]])
<|startoftext|>Write a short summary of the benefits of regular exercise<|extra_0|><think>
Okay, I need to write a short summary about the benefits of regular exercise. Let

Edit2: I'm going to let this thing generate a full response, which might take a while. But I feel this might be a bit short as a test; it almost verbatim mentions the prompt in the <think>, so maybe it's about to repeat itself or something. I'll paste it as a new comment when it's done. I just want more confirmation that the HF implementation itself works beyond very short generations.
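
(For anyone who wants to reproduce the attention_mask check without editing the remote modeling code, a forward-pre-hook sketch like the one below should also work; it assumes the decoder layers live at model.model.layers, which I haven't verified for this particular modeling file:)

#!/usr/bin/env python
# print whether an attention_mask reaches each decoder layer, without editing
# modeling_hunyuan.py; requires torch >= 2.0 for with_kwargs=True
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu",
                                             torch_dtype=torch.bfloat16, trust_remote_code=True)

def report_mask(module, args, kwargs):
    # attention_mask is usually a keyword argument; if the modeling code passes it
    # positionally it shows up in args instead
    print("positional args:", len(args), "| attention_mask kwarg:", kwargs.get("attention_mask"))

for layer in model.model.layers:  # assumed location of the HunYuanDecoderLayer stack
    layer.register_forward_pre_hook(report_mask, with_kwargs=True)

ids = tokenizer("hello", return_tensors="pt").input_ids
with torch.no_grad():
    model.generate(ids, max_new_tokens=1)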

@Noeda
Contributor

Noeda commented Jun 28, 2025

Full response example of the transformers version; I gave it a 5000-token max:

stdout from test2.py (I cut off all the parts that said attention mask is None)
tensor([[127958,   8144,    264,   2875,  12399,    315,    279,   7720,    315,
           5912,  10368, 127962,  14023,    771,    397,  33413,     11,    358,
           1205,    311,   3350,    264,   2875,  12399,    922,    279,   7720,
            315,   5912,  10368,     13,   6914,    757,   1212,    555,  89746,
           1148,    358,   1440,     13,   5629,     11,   7106,   2890,   7720,
             25,  96931,    279,   4851,     11,  36050,  35855,     11,   8779,
            449,   4785,   6373,     13,   5112,  10723,   2890,     25,  26338,
           8631,     11,  18547,     11,  18710,     13,  10926,   1101,  67232,
           4907,   5990,     13,   8840,     11,    323,   1317,   9860,   6392,
           1093,  18189,   5326,    315,  21249,  19338,   2345,   8747,  16629,
             11,  63308,     11,   1063,  51423,     13,   8840,     11,    323,
          25702,   7720,     11,   1093,   2731,   5044,    477,   8271,    734,
             13,   6914,    757,  31335,   1521,   3585,    382,   3563,    449,
            459,  17219,    430,   5415,   5912,  10368,    706,  12387,   7720,
             13,   5112,   1464,   1523,   1139,   7106,     11,  10723,     11,
            323,   7344,   1023,  11306,     13,   1789,   7106,     25,   4851,
           2890,    320,   4620,    261,   4851,     11,   4827,   6680,   7410,
            705,   4785,   6373,    320,  22464,     82,  25247,     11,  22890,
          16124,    705,  22852,   1887,    320,  37860,    570,  38895,     25,
            842,  16751,   1354,     11,  26338,   8631,  56592,  16708,     11,
           3698,   1900,  18710,     13,  73235,     25,  57924,   5357,     11,
           5044,     11,   7344,  32174,  25702,  18174,     13,   7429,     11,
           3674,   7720,    422,   1912,  23783,     11,    719,   7344,    430,
            596,  10309,     13,  14998,    311,   2567,    433,  64694,     11,
            779,   7344,    220,     19,     12,     20,   1401,   3585,     13,
          35106,    503,  71921,     13,   7557,   2771,    433,  28555,     13,
           6914,    757,   1817,    422,    358,  13942,   4205,     13,   8840,
             11,   4907,   5990,   2345,  64562,    649,   5376,  61784,     11,
            539,   1120,   8395,  25247,     13,  22335,     11,    430,    596,
           3062,     13,   2100,  63179,    682,   1521,   1139,    264,  56887,
          14646,     13,   6914,    757,  10165,   1473,  31504,  10368,   6209,
            264,   7029,   2134,    315,   7720,    369,   8244,   1664,  33851,
             13,  13101,   2740,     11,    433,  96931,    279,   4851,     11,
          18899,  35855,    323,  46301,    279,   5326,    315,   4787,   1093,
          63308,    323,   4851,   8624,     11,   1418,  86387,    304,   4785,
           6373,   1555,  52703,  20252,    323,  16124,   4857,     13,  49693,
            750,     11,    433,  31854,    279,   4984,    315,    842,  16751,
           1354,     11,  18189,   8631,     11,  18547,     11,    323,  13803,
            315,  18710,     11,    323,  57924,  25702,    734,     11,  56028,
           5357,     11,   5044,     11,    323,  13893,  80430,   4325,  14228,
          10723,  18174,     13,  23212,     11,   5912,   5820,  12231,    988,
           4907,   5990,    555,  18899,  61784,    323,  11815,    264,  16643,
          22852,   1887,     11,  18189,  17563,   5326,     13,  32255,     11,
           1521,   6372,  17210,    311,    264,   5129,     11,  39345,     11,
            323,    810,  24770,   2324,    382,  14524,     11,    374,    430,
           2288,   1317,     30,  10926,  74481,     13,   6914,    757,   1518,
             13,    330,  31504,  10368,   5825,  62387,    582,  25489,   7720,
             13,  13101,   2740,     11,    433,  96931,    279,   4851,     11,
          73115,   6680,   7410,     11,  52797,   4785,   6373,     11,    323,
          67232,  40368,     13,  49693,    750,     11,    433,  19786,    842,
          16751,   1354,     11,  18189,   8631,     11,  18547,     11,    323,
          18710,     11,   1418,  47594,   5357,    323,   5044,     13,   1102,
           1101,  12992,   4907,   5990,    323,   1253,   7781,  25702,  18174,
             13,  28993,     11,    433,  39990,    264,   5129,     11,  39345,
           2324,   1210,   3011,    596,   2731,     13,   4497,  64694,     13,
           4343,    369,  32373,     13,  22335,     11,    430,   4375,     13,
           7557,   2771,    311,   6420,   1401,   5789,   2085,   3794,   2288,
          11944,     13,   3011,   1288,   3504,    433,    627,    524,  27963,
            397,     27,   9399,    397,  31504,  10368,  28421,  28254,   7720,
           4028,   7106,     11,  10723,     11,    323,  25702,  31576,     13,
          13101,   2740,     11,    433,  96931,    279,   4851,     11,  36050,
          35855,     11,    323,  73115,   6680,   7410,     11,  18189,    279,
           5326,    315,   4851,   8624,     11,  63308,     11,    323,  12943,
             13,   1102,  52797,   4785,   6373,    555,  20252,  25247,    323,
           4857,  16025,  16124,     11,   1418,   1101,  47594,  22852,    734,
             13,  49693,    750,     11,  10368,  31854,    842,  16751,    258,
           4984,     11,  46649,  23747,   8631,     11,  18547,     11,    323,
          13803,    315,  18710,     11,    323,  67232,   5357,     11,   5044,
             11,    323,  14604,  56062,     13,   1102,   4726,  12231,    988,
           4907,   5990,    555,  18899,  61784,    323,   1253,   7781,   4325,
          14228,  25702,  18174,     13,  21153,   3210,     11,   1521,   6372,
          12192,    264,   5129,     11,  39345,     11,    323,    810,  24770,
           2324,    627,    524,   9399,     29, 127960]])
<|startoftext|>Write a short summary of the benefits of regular exercise<|extra_0|><think>
Okay, I need to write a short summary about the benefits of regular exercise. Let me start by recalling what I know. First, physical health benefits: strengthens the heart, improves circulation, helps with weight management. Then mental health: reduces stress, anxiety, depression. Maybe also boosts energy levels. Oh, and long-term stuff like reducing risk of chronic diseases—diabetes, hypertension, some cancers. Oh, and cognitive benefits, like better memory or brain function. Let me organize these points.

Start with an introduction that states regular exercise has numerous benefits. Then break down into physical, mental, and maybe other categories. For physical: heart health (stronger heart, lower blood pressure), weight management (burns calories, builds muscle), immune system (maybe). Mental: endorphins, reduces stress/anxiety, combats depression. Cognitive: enhances focus, memory, maybe delays cognitive decline. Also, social benefits if group exercises, but maybe that's optional. Need to keep it concise, so maybe 4-5 key points. Avoid jargon. Make sure it flows. Let me check if I missed anything. Oh, energy levels—exercise can increase stamina, not just burn calories. Yeah, that's important. So summarize all these into a coherent paragraph. Let me draft:

Regular exercise offers a wide range of benefits for overall well-being. Physically, it strengthens the heart, improving circulation and lowering the risk of conditions like hypertension and heart disease, while aiding in weight management through calorie burning and muscle building. Mentally, it triggers the release of endorphins, reducing stress, anxiety, and symptoms of depression, and enhances cognitive function, boosting focus, memory, and potentially delaying age-related mental decline. Additionally, regular activity elevates energy levels by improving stamina and supports a stronger immune system, reducing illness risk. Together, these effects contribute to a longer, healthier, and more balanced life.

Wait, is that too long? Maybe shorten. Let me see. "Regular exercise provides multifaceted benefits. Physically, it strengthens the heart, lowers blood pressure, aids weight management, and boosts immunity. Mentally, it releases endorphins, reducing stress, anxiety, and depression, while enhancing focus and memory. It also increases energy levels and may delay cognitive decline. Overall, it promotes a longer, healthier life." That's better. More concise. Check for clarity. Yeah, that works. Make sure to mention key areas without getting too detailed. That should cover it.
</think>
<answer>
Regular exercise delivers profound benefits across physical, mental, and cognitive domains. Physically, it strengthens the heart, improves circulation, and lowers blood pressure, reducing the risk of heart disease, hypertension, and stroke. It aids weight management by burning calories and building lean muscle, while also enhancing immune function. Mentally, exercise triggers endorphin release, alleviating stress, anxiety, and symptoms of depression, and boosts focus, memory, and emotional resilience. It further elevates energy levels by improving stamina and may delay age-related cognitive decline. Collectively, these effects promote a longer, healthier, and more balanced life.
</answer><|eos|>

Code is almost the same as before; pasting for reproducibility:

test2.py
#!/usr/bin/env python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re

def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'

        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)

        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
                                                          enable_thinking=True # Toggle thinking mode (default: True)
                                                      )

        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=5000)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()

The output looks normal to me and it answered the prompt. It does look to me like it works.

CPU-only, 256GB Hetzner server.

@kzjeef kzjeef left a comment

Great work!

I've pulled the code and tested it on an x86 CPU server; fp16 and int8 inference works, but the results seem not quite as accurate as when running on vLLM.

Just some comments about the model version and the chat template.

@@ -6436,6 +6439,155 @@ def set_gguf_parameters(self):
super().set_gguf_parameters()
self.gguf_writer.add_audio_stack_factor(self.global_config["stack_factor"])


@ModelBase.register("HunYuanMoEV1ForCausalLM")
class HunYuanMoEModel(TextModel):

Could you align with Hunyuan's naming and use the V1 version suffix?


@ModelBase.register("HunYuanMoEV1ForCausalLM")
class HunYuanMoEModel(TextModel):
model_arch = gguf.MODEL_ARCH.HUNYUAN_MOE

Could you also add the version suffix to the arch name, matching the arch name in the model's config.json?

@@ -656,6 +657,7 @@ class MODEL_TENSOR(IntEnum):
MODEL_ARCH.DOTS1: "dots1",
MODEL_ARCH.ARCEE: "arcee",
MODEL_ARCH.ERNIE4_5: "ernie4_5",
MODEL_ARCH.HUNYUAN_MOE: "hunyuan-moe",

hunyuan-moe-v1 would be a better name for future model updates.

@@ -117,6 +117,7 @@ extern "C" {
LLAMA_VOCAB_PRE_TYPE_LLAMA4 = 33,
LLAMA_VOCAB_PRE_TYPE_PIXTRAL = 34,
LLAMA_VOCAB_PRE_TYPE_SEED_CODER = 35,
LLAMA_VOCAB_PRE_TYPE_HUNYUAN = 36,

Adding a version suffix to the vocab type would be better too.

@@ -77,6 +77,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_DOTS1, "dots1" },
{ LLM_ARCH_ARCEE, "arcee" },
{ LLM_ARCH_ERNIE4_5, "ernie4_5" },
{ LLM_ARCH_HUNYUAN_MOE, "hunyuan-moe" },

Same here.

@@ -665,6 +668,21 @@ int32_t llm_chat_apply_template(
if (add_ass) {
ss << "<|response|>";
}
} else if (tmpl == LLM_CHAT_TEMPLATE_HUNYUAN_MOE) {
// tencent/Hunyuan-A13B-Instruct

Shouldn't the chat template of Hunyuan A13B be a much more complex one, with a quick/slow-think option?

Also, the model enables slow thinking by default.

Does llama.cpp have an option like enable_thinking in the Hugging Face example?

@@ -1656,6 +1657,10 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
tokenizer_pre == "seed-coder") {
pre_type = LLAMA_VOCAB_PRE_TYPE_SEED_CODER;
clean_spaces = false;
} else if (
tokenizer_pre == "hunyuan") {

Same here for the tokenizer version.

@@ -815,6 +815,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
if chkhsh == "1431a23e583c97432bc230bff598d103ddb5a1f89960c8f1d1051aaa944d0b35":
# ref: https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0
res = "minerva-7b"
if chkhsh == "7e57df22b1fe23a7b1e1c7f3dc4e3f96d43a4eb0836d0c6bdc3436d7b2f1c664":
# ref: https://huggingface.co/tencent/Hunyuan-A13B-Instruct
res = "hunyuan"

The model name would be better as hunyuan-a13b.

@@ -137,6 +137,7 @@ class TOKENIZER_TYPE(IntEnum):
{"name": "chatglm-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-chat", "chkhsh": "81d72c7348a9f0ebe86f23298d37debe0a5e71149e29bd283904c02262b27516"},
{"name": "glm4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-hf", "chkhsh": "a1336059768a55c99a734006ffb02203cd450fed003e9a71886c88acf24fdbc2"},
{"name": "minerva-7b", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0", "chkhsh": "1431a23e583c97432bc230bff598d103ddb5a1f89960c8f1d1051aaa944d0b35"},
{"name": "hunyuan", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tencent/Hunyuan-A13B-Instruct", "chkhsh": "7e57df22b1fe23a7b1e1c7f3dc4e3f96d43a4eb0836d0c6bdc3436d7b2f1c664"},

The model name should be hunyuan-a13b. From my sources, they will release more LLM models soon, so we'd better add some identifier for the model.

Collaborator Author

This is the tokenizer name, not the model name.

@BahamutRU

Perfectly working commit, can you review and approve this, pls? @ggerganov 🥺🙏

@qingy1337

Perfectly working commit, can you review and approve this, pls? @ggerganov 🥺🙏

I think the logits still have to be verified between the GGUF and the original model implementation (disabling the custom expert router mechanism) first. There hasn't been an update yet from @ngxson as to whether it does match.

@bennmann

bennmann commented Jul 8, 2025

Based on community testing, these merges produce coherent output:

#14425 (comment)

Investigating the router block of code from #14425 (comment) is just a small future improvement.

I encourage merging based on the evidence so far. Great-looking model.

@kooshi
Contributor

kooshi commented Jul 8, 2025

For the record, when I skimmed the vllm PR that added the "inference only" model code, it did not appear to implement the custom expert selection either.

I would also vote to merge as is, unless someone with the time and hardware can do some deeper comparisons with vllm at f16.

In the meantime, it's quite usable.

@qingy1337

qingy1337 commented Jul 8, 2025

Just adding my +1 for merge; I went and tested the latest code with Q6_K from bullerwins/Hunyuan-A13B-Instruct-GGUF:

./llama-server -m ~/Hunyuan-A13B-Instruct-Q6_K-00001-of-00002.gguf -ngl 99 -c 16384 --host 0.0.0.0 --port 8181 --jinja

On H100 it looks really nice in terms of speed:

slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 179
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 179, n_tokens = 179, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 179, n_tokens = 179
slot      release: id  0 | task 0 | stop processing: n_past = 3211, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     450.27 ms /   179 tokens (    2.52 ms per token,   397.54 tokens per second)
       eval time =   36626.59 ms /  3033 tokens (   12.08 ms per token,    82.81 tokens per second)

Also llama-bench just for completeness:

ubuntu@lumpy-iris-fox-65d7c85d9b-vp97w:~/llama.cpp/build/bin$ ./llama-bench -m ~/Hunyuan-A13B-Instruct-Q6_K-00001-of-00002.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-moe A13B Q6_K          |  61.44 GiB |    80.39 B | CUDA       |  99 |           pp512 |       1956.58 ± 8.77 |
| hunyuan-moe A13B Q6_K          |  61.44 GiB |    80.39 B | CUDA       |  99 |           tg128 |         87.45 ± 1.40 |

build: e5fe0892 (5813)

Notes:

  • Nothing really noticeably wrong with the model; I tested with a couple of MATH-500 Level 5 questions and it got them all right.
  • No weird formatting issues in outputs.
  • /no_think & /think works as expected.

It looks good!

@ggerganov
Member

For the record, when I skimmed the vllm PR that added the "inference only" model code, it did not appear to implement the custom expert selection either.

Ok, that sounds like a good explanation.

I would also vote to merge as is, unless someone with the time and hardware can do some deeper comparisons with vllm at f16.

@kooshi Earlier you said that the model behaves weird. Did something change?

@ggerganov ggerganov merged commit 8f22dc0 into ggml-org:master Jul 8, 2025
51 checks passed
@kooshi
Contributor

kooshi commented Jul 8, 2025

@kooshi Earlier you said that the model behaves weird. Did something change?

The weirdness I was seeing may have been from my settings, or perhaps inherent to the model. It was quite smart, just stumbled over its own <answer> formatting in multiturn chats sometimes. I can't run the vllm version to compare (3 gpus and the model doesn't yet support pipeline parallel), so I'm not sure where the issue lies, if there is any.

Edit: thinking back, I was running it with --presence-penalty, just because I was using my Qwen settings. That could have thrown it off. @ubergarm also reported multiturn issues, but was also using my settings IIRC.

@Downtown-Case

Downtown-Case commented Jul 8, 2025

@kooshi In my testing, it's extremely sensitive to sampling. The model is very prone to looping, very sensitive to prompt formatting, yet "uncertain" about its own think formatting. That formatting is also multiple tokens (not a single token), which gives it more opportunities to mess up.

A relatively high MinP seems to help it behave. But the default sampling in some UIs would definitely trip it up.

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 8, 2025
* origin/master:
model : fix hunyuan moe chat template (ggml-org#14584)
model : add SmolLM3 (ggml-org#14581)
memory : fix broken batch splits for recurrent cache (ggml-org#14575)
vulkan : fix rope with partial rotation and non-cont src (ggml-org#14582)
server: Add ability to mount server at prefix (ggml-org#14544)
model : add hunyuan moe (ggml-org#14425)
vulkan: increase timeout for CI (ggml-org#14574)
cuda : fix rope with partial rotation and non-cont src (ggml-org#14580)
CUDA: add bilinear interpolation for upscale (ggml-org#14563)
musa: fix build warnings (unused variable) (ggml-org#14561)
llama : fix incorrect minicpm3 v_states shape (ggml-org#14571)
llama : remove ggml_cont where possible (ggml-org#14568)
@ddh0
Contributor

ddh0 commented Jul 8, 2025

This model is broken for me. I converted the HF weights to GGUF this morning after the PR was merged and made a fresh Q4_K_M quantization. I'm getting lots of broken output and, as @Downtown-Case mentioned, the model doesn't seem to know how to format its own messages. It will close and open the <think> and <answer> blocks at random and often generates EOS early. I suspect a RoPE issue but I haven't been able to find it yet.

@ubergarm

ubergarm commented Jul 8, 2025

@ddh0 did you try the very latest version that is a few hours old with the chat template fix: #14584 ?

I'm re-testing perplexity with that now

@ddh0
Contributor

ddh0 commented Jul 8, 2025

Oh let me see. I'll try that now.

@ddh0
Contributor

ddh0 commented Jul 8, 2025

It's working! I fed the model the entire llama.h file as it currently appears:

↕️ Click to expand llama-server console output ...

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
build: 5849 (6efcd659) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 192.168.68.66, port: 20480, http threads: 15
main: loading model
srv    load_model: loading model '/opt/workspace/gguf/Hunyuan-A13B-Instruct-Q4_K_X.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 15956 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 482 tensors from /opt/workspace/gguf/Hunyuan-A13B-Instruct-Q4_K_X.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = hunyuan-moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Hunyuan-A13B-Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Hunyuan
llama_model_loader: - kv   5:                         general.size_label str              = A13B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = tencent-hunyuan-a13b
llama_model_loader: - kv   8:                       general.license.link str              = https://github.com/Tencent-Hunyuan/Hu...
llama_model_loader: - kv   9:                    hunyuan-moe.block_count u32              = 32
llama_model_loader: - kv  10:                 hunyuan-moe.context_length u32              = 262144
llama_model_loader: - kv  11:               hunyuan-moe.embedding_length u32              = 4096
llama_model_loader: - kv  12:            hunyuan-moe.feed_forward_length u32              = 3072
llama_model_loader: - kv  13:           hunyuan-moe.attention.head_count u32              = 32
llama_model_loader: - kv  14:        hunyuan-moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                 hunyuan-moe.rope.freq_base f32              = 11158840.000000
llama_model_loader: - kv  16: hunyuan-moe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                   hunyuan-moe.expert_count u32              = 64
llama_model_loader: - kv  18: hunyuan-moe.expert_shared_feed_forward_length u32              = 3072
llama_model_loader: - kv  19:     hunyuan-moe.expert_feed_forward_length u32              = 3072
llama_model_loader: - kv  20:              hunyuan-moe.expert_used_count u32              = 8
llama_model_loader: - kv  21:            hunyuan-moe.expert_shared_count u32              = 1
llama_model_loader: - kv  22:              hunyuan-moe.rope.scaling.type str              = none
llama_model_loader: - kv  23:            hunyuan-moe.rope.scaling.factor f32              = 1.000000
llama_model_loader: - kv  24: hunyuan-moe.rope.scaling.original_context_length u32              = 262144
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = hunyuan
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,128167]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,128167]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,127698]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 127959
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 127960
llama_model_loader: - kv  32:          tokenizer.ggml.seperator_token_id u32              = 127962
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 127961
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {% set loop_messages = messages %}\n{%...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - kv  36:                          general.file_type u32              = 15
llama_model_loader: - kv  37:                      quantize.imatrix.file str              = /opt/workspace/imatrices/Hunyuan-A13B...
llama_model_loader: - kv  38:                   quantize.imatrix.dataset str              = imatrix-training-full-3
llama_model_loader: - kv  39:             quantize.imatrix.entries_count u32              = 352
llama_model_loader: - kv  40:              quantize.imatrix.chunks_count u32              = 320
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q8_0:   64 tensors
llama_model_loader: - type q4_K:  161 tensors
llama_model_loader: - type q5_K:   96 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 45.38 GiB (4.85 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 210
load: token to piece cache size = 0.7868 MB
print_info: arch             = hunyuan-moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 64
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = none
print_info: freq_base_train  = 11158840.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: model type       = A13B
print_info: model params     = 80.39 B
print_info: general.name     = Hunyuan-A13B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128167
print_info: n_merges         = 127698
print_info: BOS token        = 127959 '<|bos|>'
print_info: EOS token        = 127960 '<|eos|>'
print_info: EOT token        = 127957 '<|endoftext|>'
print_info: SEP token        = 127962 '<|extra_0|>'
print_info: PAD token        = 127961 '<|pad|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 127957 '<|endoftext|>'
print_info: EOG token        = 127960 '<|eos|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        CUDA0 model buffer size =  1922.66 MiB
load_tensors:   CPU_Mapped model buffer size = 46459.90 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 98304
llama_context: n_ctx_per_seq = 98304
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 1024
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 11158840.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (98304) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.49 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size = 12288.00 MiB
llama_kv_cache_unified: size = 12288.00 MiB ( 98304 cells,  32 layers,  1 seqs), K (f16): 6144.00 MiB, V (f16): 6144.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:      CUDA0 compute buffer size =  1168.00 MiB
llama_context:  CUDA_Host compute buffer size =   400.01 MiB
llama_context: graph nodes  = 2183
llama_context: graph splits = 98 (with bs=1024), 66 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 98304
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 98304
main: model loaded
main: chat template, chat_template: {% set loop_messages = messages %}
{% if tools %}
    {% set weekday_map = {'Monday': '星期一', 'Tuesday': '星期二', 'Wednesday': '星期三', 'Thursday': '星期四', 'Friday': '星期五', 'Saturday': '星期六', 'Sunday': '星期日'} %}
    {% set weekday_cn = weekday_map[strftime_now('%A')] %}
    {% set datetime_str = strftime_now('%Y-%m-%d %H:%M:%S') %}
    {% set datetime_str = datetime_str + ' ' + weekday_cn %}
    {% for message in loop_messages %}
        {% if 'content' in message %}
            {% set content = message['content'] %}
        {% else %}
            {% set content = '' %}
        {% endif %}
        {% if loop.index0 == 0 %}
            {% set content_tmp = '你是一位函数组合专家。你会得到一个问题和一组可能的函数。根据问题,你需要进行一个或多个函数/工具调用以实现目的。
如果没有一个函数可以使用,请直接使用自然语言回复用户,以助手:开头。
如果给定的问题缺少函数所需的参数,请使用自然语言进行提问,向用户询问必要信息,以助手:开头。
如果调用结果已经足够回答用户问题,请对历史结果进行总结,使用自然语言回复用户,以助手:开头。
你应该只在工具调用部分返回函数调用。如果你决定调用任何函数,你必须将其格式化为<tool_calls>[{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},...]</tool_calls>。你不应该在回复中包含任何其他文本。以下是你可以调用的函数列表,格式为JSON。
' %}
            {% set content_tmp = content_tmp + '
' + tools | tojson + '
' %}
            {% if message['role'] == 'system' %}
                {% set content_tmp = content_tmp + '
额外要求:
' + content + '

如果你决定返回函数调用,请将其格式化为<tool_calls>[{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},...]</tool_calls>,不得包含其他文本。如果额外要求里有格式要求,请忽略,以此处为准。
否则,请参考开头说的三种情况,以助手:开头进行回复。

如果额外要求里有时间信息,就以额外要求里的时间为准,否则,参考当前时间:' + datetime_str %}
                {% set content = '<|startoftext|>' + content_tmp + '<|extra_4|>' %}
            {% elif message['role'] == 'user' %}
                {% set content_tmp = content_tmp + '
如果你决定返回函数调用,请将其格式化为<tool_calls>[{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},...]</tool_calls>,不得包含其他文本。
否则,请参考开头说的三种情况,以助手:开头进行回复。

当前时间:' + datetime_str %}
                {% set content_tmp = '<|startoftext|>' + content_tmp + '<|extra_4|>'%}
                {% set content = content_tmp + '用户:' + content + '<|extra_0|>' %}
            {% endif %}
        {% else %}
            {% if message['role'] == 'user' %}
                {% set content = '用户:' + content + '<|extra_0|>' %}
            {% elif message['role'] == 'assistant' %}
                {% if 'tool_calls' in message %}
                    {% set tool_calls = message['tool_calls'] %}
                    {% set ns = namespace(tool_calls="[") %}
                    {% for tool_call in tool_calls %}
                        {% set function = tool_call['function'] %}
                        {% set name = function['name'] %}
                        {% set ns.tool_calls = ns.tool_calls + '{"name": "' + name + '", '%}
                        {% set arguments = function['arguments'] %}
                        {% if arguments is not string %}
                            {% set arguments = arguments | tojson %}
                        {% endif %}
                        {% set ns.tool_calls = ns.tool_calls + '"arguments": ' + arguments + '}' %}
                        {% if not loop.last %}
                            {% set ns.tool_calls = ns.tool_calls + ', '%}
                        {% endif %}
                    {% endfor %}
                    {% set ns.tool_calls = ns.tool_calls + ']' %}
                    {% set content = content + '<tool_calls>' + ns.tool_calls + '</tool_calls>' %}
                {% else %}
                    {% set content = '助手:' + content %}
                {% endif %}
                {% set content = content + '<|eos|>' %}
            {% elif message['role'] == 'tool' %}
                {% if content is not string %}
                    {set content = content | tojson }
                {% endif %}
                {% set content = '<tool_response>' + content + '</tool_response>' %}
                {% set content = content + '<|extra_0|>' %}
            {% endif %}
        {% endif %}
    {{- content -}}
    {% endfor %}
{% else %}
    {% set context = {'has_head': true} %}
    {% for message in loop_messages %}
        {% if 'content' in message %}
            {% set content = message['content'] %}
        {% else %}
            {% set content = '' %}
        {% endif %}
        {% if loop.index0 == 0 %}
            {% if content == '' %}
                {% set _ = context.update({'has_head': false}) %}
            {% elif message['role'] == 'system' %}
                {% set content = '<|startoftext|>' + content + '<|extra_4|>' %}
            {% endif %}
        {% endif %}
        {% if message['role'] == 'user' %}
            {% if loop.index0 == 1 and not context.has_head %}
                {% set content = '<|startoftext|>' + content %}
            {% endif %}
            {% if loop.index0 == 1 and context.has_head %}
                {% set content = content + '<|extra_0|>' %}
            {% else %}
                {% set content = '<|startoftext|>' + content + '<|extra_0|>' %}
            {% endif %}
        {% elif message['role'] == 'assistant' %}
            {% set content = content + '<|eos|>' %}
        {% elif message['role'] == 'tool' %}
            {% set content = content + '<|extra_0|>' %}
        {% endif %}
        {{- content -}}
    {% endfor %}
{% endif %}
{%- if enable_thinking is defined and enable_thinking is false %}
    {{- '<think>\n\n</think>\n' }}
{%- endif %}, example_format: '<|startoftext|>You are a helpful assistant<|extra_4|><|startoftext|>Hello<|extra_0|><|startoftext|>Hi there<|eos|><|startoftext|>How are you?<|extra_0|>'
main: server is listening on http://192.168.68.66:20480 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 98304, n_keep = 0, n_prompt_tokens = 17856
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 4096, progress = 0.229391
slot update_slots: id  0 | task 0 | kv cache rm [4096, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 4096, progress = 0.458781
slot update_slots: id  0 | task 0 | kv cache rm [8192, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 12288, n_tokens = 4096, progress = 0.688172
slot update_slots: id  0 | task 0 | kv cache rm [12288, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 16384, n_tokens = 4096, progress = 0.917563
slot update_slots: id  0 | task 0 | kv cache rm [16384, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 17856, n_tokens = 1472, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 17856, n_tokens = 1472
slot      release: id  0 | task 0 | stop processing: n_past = 22788, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =   79030.58 ms / 17856 tokens (    4.43 ms per token,   225.94 tokens per second)
       eval time =  575503.98 ms /  4933 tokens (  116.66 ms per token,     8.57 tokens per second)
      total time =  654534.55 ms / 22789 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.68.58 200


↕️ Click to expand model response ...

<think>
Alright, here's my breakdown of the `llama.h` file. I'm seeing a header file for the `llama.cpp` library, which is pretty popular for running LLaMA and other large language models on consumer-grade hardware. My goal is to give a concise summary of its structure, purpose, and key components.

First, the file begins with include guards, which is good practice to prevent multiple inclusions. It's pulling in some GGML-related headers – `ggml.h`, `ggml-cpu.h`, `ggml-backend.h`, and `ggml-opt.h` – which tells me this library is built on top of GGML, a GPU-accelerated math library. I also see standard C libraries like `stddef.h`, `stdint.h`, `stdio.h`, and `stdbool.h`.

Then, there's some conditional compilation for defining `LLAMA_SHARED` and `LLAMA_API`. This is typical for library development, allowing for export of symbols on Windows when building as a DLL and import otherwise. Next, a bunch of constants and defines, like `LLAMA_DEFAULT_SEED`, `LLAMA_TOKEN_NULL`, and magic numbers like `LLAMA_FILE_MAGIC_GGLA`, which seem to be used for identifying file formats. I see version numbers too, like `LLAMA_SESSION_VERSION` and `LLAMA_STATE_SEQ_VERSION`.

Now the real meat begins – the C interface declarations. I see structs like `llama_vocab`, `llama_model`, `llama_context`, and `llama_sampler`. `llama_vocab` likely handles the vocabulary, which is crucial for tokenization. `llama_model` probably holds the loaded model parameters. `llama_context` seems to manage the state during inference, like the current sequence and KV cache. `llama_sampler` is for the sampling strategies used during text generation, like greedy or top-p.

There are several typedefs for common types like `llama_token`, `llama_seq_id`, and pointers like `llama_memory_t`. Enums are everywhere, defining types like `llama_vocab_type`, `llama_rope_type`, `llama_token_type`, etc. These enums provide a clean way to specify different options for things like tokenization methods, attention types, and memory management.

I see a struct `llama_model_params` which looks like the configuration for loading a model. It has info about devices, tensor types, offloading options, progress callbacks, etc. `llama_context_params` seems similar but for the context itself, defining things like batch sizes, thread counts, and attention settings. `llama_model_quantize_params` is probably for the quantization process, allowing models to run on devices with less memory.

The API functions are the heart of the library. I see functions for initializing the backend (`llama_backend_init`), loading models (`llama_model_load_from_file`, `llama_model_load_from_splits`), managing memory (`llama_memory_clear`, `llama_memory_seq_rm`), and running inference (`llama_decode`, `llama_encode`). There are also functions for tokenization (`llama_tokenize`, `llama_detokenize`), chat templates (`llama_chat_apply_template`), and sampling (`llama_sampler_chain_init`, `llama_sampler_sample`).

I see some deprecated functions, indicated by the `DEPRECATED` macro. These are old interfaces that have been replaced by newer ones. For example, `llama_free_model` is deprecated in favor of `llama_model_free`.

There's also a section for "Adapters", which suggests a way to modify models, likely through LoRA (Low-Rank Adaptation). `llama_adapter_lora` and related functions handle loading, applying, and managing these adapters.

Finally, there are functions for performance measurement (`llama_perf_context_print`), logging (`llama_log_set`), and training-related features (`llama_opt_init`, `llama_opt_epoch`).

Overall, this `llama.h` file provides a comprehensive C API for interacting with the `llama.cpp` library. It covers everything from loading models, managing memory and context, tokenizing text, running inference with different sampling strategies, and even fine-tuning through adapters. The use of structs and enums for configuration makes the API expressive and flexible. The comments suggest that some parts are still evolving, like the sampling API and memory management features. It's clear that this library aims to be efficient, with support for GPU acceleration and various quantization strategies, making large language models accessible on a wide range of hardware.
</think>
<answer>
Okay, here's a detailed analysis of the `llama.h` header file, focusing on its purpose, structure, and key components.

**Overall Purpose:**

This header file defines the public C interface for the `llama.cpp` library. `llama.cpp` is an open-source C/C++ library that implements the LLaMA (Large Language Model Meta AI) architecture, enabling efficient inference of large language models on consumer-grade GPUs and CPUs. The header declares the data structures (structs, enums, typedefs) and function signatures that external applications would use to interact with the library.

**Key Components:**

1.  **Includes and Basic Definitions:**
    *   `#ifndef LLAMA_H ... #endif`: Standard include guard to prevent multiple inclusions of the header.
    *   `#include "ggml.h"`, `#include "ggml-cpu.h"`, `#include "ggml-backend.h"`, `#include "ggml-opt.h"`: Includes the core GGML library headers. GGML is a GPU-accelerated math library, which `llama.cpp` leverages for performance. `ggml-cpu.h` provides CPU-specific backend implementations, while `ggml-backend.h` and `ggml-opt.h` offer backend abstraction and optimization capabilities.
    *   `#include <stddef.h>`, `#include <stdint.h>`, `#include <stdio.h>`, `#include <stdbool.h>`: Standard C library headers for common types (`size_t`, `int32_t`, etc.), I/O (`stdio.h`), and boolean values (`stdbool.h`).
    *   **Macros:**
        *   `LLAMA_SHARED`, `LLAMA_API`: Conditional compilation to define `LLAMA_API`. This is crucial for library distribution. On Windows (`_WIN32`), if `LLAMA_SHARED` is defined and the library is being built (`LLAMA_BUILD`), it exports symbols using `__declspec(dllexport)`. If included by an application, it imports symbols using `__declspec(dllimport)`. On other platforms (Linux, macOS), it uses `__attribute__((visibility ("default")))` to make symbols visible by default. If `LLAMA_SHARED` is not defined, `LLAMA_API` is empty.
        *   `DEPRECATED(func, hint)`: A macro to mark functions as deprecated, providing a compiler warning with a specific hint (the second argument). This helps users migrate code to newer API versions.
        *   `LLAMA_DEFAULT_SEED`: Default value for random number generation seeds.
        *   `LLAMA_TOKEN_NULL`: A special value (`-1`) to represent a null or invalid token.
        *   **File Magic Numbers:** Constants like `LLAMA_FILE_MAGIC_GGLA`, `LLAMA_FILE_MAGIC_GGSN`, `LLAMA_FILE_MAGIC_GGSQ` (0x67676c61u, 0x6767736eu, 0x67677371u) are used as file signatures to identify different types of GGUF model files. GGLA might relate to a specific GGUF variant or internal format. GGSN and GGSQ are used for GGUF state/session files.
        *   **Version Numbers:** `LLAMA_SESSION_MAGIC` (using GGSN magic), `LLAMA_SESSION_VERSION` (9), `LLAMA_STATE_SEQ_MAGIC` (using GGSQ magic), and `LLAMA_STATE_SEQ_VERSION` (2) are used to structure session/state files written by the library.

2.  **C Interface:** The `extern "C"` block indicates that the following C++ code (if present) should be compiled as C, ensuring compatibility with C applications.
    *   **Struct Definitions:** These are the core data structures users interact with.
        *   `struct llama_vocab;`: Represents the vocabulary used by the model (token mapping, special tokens, tokenizer type).
        *   `struct llama_model;`: Holds the loaded model parameters (weights, configurations).
        *   `struct llama_context;`: Represents an inference session. It holds the current state, KV cache, memory manager, and parameters for this specific generation run.
        *   `struct llama_sampler;`: Manages sampling strategies for token selection during generation.
        *   `typedef struct llama_memory_i * llama_memory_t;`: A forward declaration for a pointer to an internal memory management structure. (Note: `struct llama_kv_cache` is deprecated in favor of `llama_memory_t`).
        *   **Token Types:** `llama_pos` (position), `llama_token` (integer ID), `llama_seq_id` (ID for a sequence within a multi-sequence context).
        *   **Enums:** Extensive enums provide type-safe options for various configurations.
            *   `llama_vocab_type`: Defines how tokens are represented (SPM, BPE, WPM, UGM, RWKV). This affects tokenizer and decoder behavior.
            *   `llama_vocab_pre_type`: Specifies preprocessing methods for special tokens, potentially tied to specific model families or tokenizers.
            *   `llama_rope_type`: Defines Rotary Position Embedding (RoPE) types (None, Normal, NeoX, MROPE, Vision). RoPE is crucial for capturing relative positions in long sequences.
            *   `llama_token_type`: (TODO: remove) Enum for token attributes (e.g., normal, control, user-defined). The comment indicates these are temporary until per-token attributes are natively supported.
            *   `llama_token_attr`: Detailed attributes for individual tokens (e.g., normalization, stripping whitespace, single-word mode).
            *   `llama_ftype`: Enum for model weight formats (F32, F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, IQ2_XXS, IQ3_XS, IQ4_NL, BF16, etc.). This dictates how model weights are stored and used, impacting memory usage and precision.
            *   `llama_rope_scaling_type`: Enum for scaling factors in RoPE (None, Linear, YARN, LongRoPE).
            *   `llama_pooling_type`: Enum for output embedding aggregation methods (None, Mean, CLS, Rank).
            *   `llama_attention_type`: Enum for attention mechanisms (Causal, Non-Causal).
            *   `llama_split_mode`: Enum for model parallelism strategies (None, Layer-wise, Row-wise tensor parallelism).
            *   `llama_model_kv_override_type`, `llama_model_tensor_buft_override`: Structs for overriding model KV cache types or tensor buffer types, useful for advanced use cases like custom memory allocation.
    *   **Typedefs:** Simplifies usage of common types like `int32_t`, `int64_t`, `size_t`.
    *   **Token Data Structures:**
        *   `llama_token_data`: A struct holding a token's ID, log-odds (raw score), and probability.
        *   `llama_token_data_array`: Holds an array of `llama_token_data`, along with metadata like sorting status. Used by samplers.

3.  **Helper Functions:**
    *   `llama_progress_callback`, `llama_abort_callback`: Function pointer types for progress and abort notifications.
    *   `llama_batch`: A struct to pass input data (tokens, embeddings, positions) and parameters (logits output flag) to `llama_decode`. It supports multiple sequences (multi-prompt).
    *   `llama_model_kv_override`, `llama_model_tensor_buft_override`: Structs for specifying overrides for KV cache types and tensor buffer types during model loading.

4.  **Model Parameters and Context Defaults:**
    *   `llama_model_default_params()`, `llama_context_default_params()`, `llama_sampler_chain_default_params()`, `llama_model_quantize_default_params()`: Functions that return instances of the parameter structs with sensible default values. These are useful for quick initialization.

5.  **Core Library Functions:**
    *   **Initialization/Finalization:**
        *   `llama_backend_init()`: Initializes the GGML backend (e.g., CUDA, Metal, OpenCL, CPU). Call once at the start.
        *   `llama_backend_free()`: Cleans up the backend resources. Call once at the end.
        *   `llama_numa_init()`: (Optional) Sets NUMA awareness for memory allocation.
        *   `llama_attach_threadpool()`, `llama_detach_threadpool()`: Manages threadpools for computation and batching.
    *   **Model Loading/Saving:**
        *   `llama_model_load_from_file()`, `llama_model_load_from_splits()`: Loads a `.gguf` model file (potentially split into multiple parts).
        *   `llama_model_save_to_file()`: Saves the model to a file.
        *   `llama_model_free()`: Frees memory associated with a loaded `llama_model` object.
    *   **Context Management:**
        *   `llama_init_from_model()`, `llama_new_context_with_model()`: Creates an inference context from a loaded model.
        *   `llama_free()`: Frees all resources associated with a `llama_context`.
        *   `llama_get_model()`, `llama_get_memory()`: Accessors for the model and memory within a context.
        *   `llama_time_us()`: Gets current time in microseconds.
        *   `llama_max_devices()`, `llama_max_parallel_sequences()`: Utility functions.
        *   `llama_supports_mmap()`, `llama_supports_mlock()`, `llama_supports_gpu_offload()`, `llama_supports_rpc()`: Check for feature support.
        *   `llama_n_ctx()`, `llama_n_batch()`, `llama_n_ubatch()`, `llama_n_seq_max()`: Get context configuration.
        *   `llama_model_n_ctx_train()`, `llama_model_n_embd()`, `llama_model_n_layer()`, `llama_model_n_head()`, `llama_model_n_head_kv()`, `llama_model_n_swa()`: Get model architecture details.
        *   `llama_model_rope_freq_scale_train()`: Gets RoPE frequency scaling factor.
        *   `llama_model_n_cls_out()`, `llama_model_cls_label()`: Get classifier head details.
        *   `llama_vocab_type()`, `llama_vocab_n_tokens()`: Query vocabulary information.
        *   `llama_model_meta_val_str()`, `llama_model_meta_count()`, `llama_model_meta_key_by_index()`, `llama_model_meta_val_str_by_index()`: Functions to inspect GGUF metadata.
        *   `llama_model_desc()`: Gets a human-readable description of the model.
        *   `llama_model_size()`: Gets the total size of model parameters in bytes.
        *   `llama_model_chat_template()`: Retrieves the chat template.
        *   `llama_model_n_params()`: Gets the total number of parameters.
        *   `llama_model_has_encoder()`, `llama_model_has_decoder()`: Check model type.
        *   `llama_model_decoder_start_token()`: Gets the token ID to start decoder generation.
        *   `llama_model_is_recurrent()`: Checks if the model is recurrent (e.g., RWKV).
    *   **Memory Management (Modern KV Cache):**
        *   `llama_memory_clear()`: Clears memory contents and optionally data buffers.
        *   `llama_memory_seq_rm()`, `llama_memory_seq_cp()`, `llama_memory_seq_keep()`, `llama_memory_seq_add()`, `llama_memory_seq_div()`: Functions to manipulate token sequences within the memory manager (e.g., removing a sequence, copying between sequences, adding offsets to positions). These replace the older KV cache deprecation methods.
        *   `llama_memory_can_shift()`: Checks if memory supports shifting operations.
    *   **KV Cache (Deprecated):** Functions like `llama_kv_self_...` and `llama_kv_...` (clear, seq_rm, seq_cp, etc.) are marked as deprecated, indicating the shift to `llama_memory_t`. These old functions likely operated directly on the context's KV cache structure.
    *   **State Management (Sessions/Checkpoints):**
        *   Functions like `llama_state_get_size()`, `llama_state_get_data()`, `llama_state_set_data()` allow saving and restoring the entire inference state (logits, embeddings, memory) to/from disk. This is useful for resuming generation or sharing checkpoints.
        *   Functions like `llama_state_seq_get_size()`, `llama_state_seq_get_data()`, `llama_state_seq_set_data()` handle state for individual sequences, crucial for multi-turn conversations or stateful generation using memory.
        *   `llama_state_load_file()`, `llama_state_save_file()`, `llama_state_seq_load_file()`, `llama_state_seq_save_file()` are convenience wrappers for these functions.
    *   **Decoding (Inference):**
        *   `llama_encode()`: Processes input tokens using the encoder part of the model (if present in an encoder-decoder model).
        *   `llama_decode()`: The core function for generating text. It processes the input batch using the decoder, utilizes the KV cache, and updates the context state. It returns status codes and allows for error handling.
        *   `llama_set_n_threads()`, `llama_n_threads()`, `llama_n_threads_batch()`: Configure and query the number of threads for generation and batching.
        *   `llama_set_embeddings()`: Controls whether the model should compute and return output embeddings.
        *   `llama_set_causal_attn()`: Enables/disables causal attention masking.
        *   `llama_set_warmup()`: (Likely experimental) Activates warmup strategies where all model tensors are loaded into VRAM.
        *   `llama_set_abort_callback()`: Sets a callback to allow user intervention to abort generation.
        *   `llama_synchronize()`: Ensures all pending GPU/CPU operations are complete.
    *   **Accessing Results:**
        *   `llama_get_logits()`: Retrieves the raw logits from the last `llama_decode` call.
        *   `llama_get_logits_ith()`: Gets logits for a specific token index within the logits array.
        *   `llama_get_embeddings()`, `llama_get_embeddings_ith()`: Retrieves output token embeddings.
        *   `llama_get_embeddings_seq()`: Retrieves embeddings for a specific sequence ID.
    *   **Vocabulary Access:**
        *   `llama_vocab_get_text()`, `llama_vocab_get_score()`, `llama_vocab_get_attr()`: Get information about a specific token.
        *   `llama_vocab_is_eog()`, `llama_vocab_is_control()`: Check token properties.
        *   `llama_vocab_bos()`, `llama_vocab_eos()`, etc.: Access special tokens (Beginning-of-Sentence, End-of-Sentence, etc.).
    *   **Tokenization:**
        *   `llama_tokenize()`: Converts text input into a sequence of token IDs.
        *   `llama_token_to_piece()`: Converts an integer token ID back into its string representation (piece).
        *   `llama_detokenize()`: Converts an array of token IDs back into text.
    *   **Chat Templates:**
        *   `llama_chat_apply_template()`: Formats a conversation (array of `llama_chat_message` structs) into a single prompt string using a specified template. It supports built-in templates.
        *   `llama_chat_builtin_templates()`: Retrieves a list of available built-in template names.
    *   **Sampling API:**
        *   `llama_sampler_chain_init()`: Creates a sampling chain, which allows chaining multiple sampling methods (e.g., top-k followed by temperature).
        *   `llama_sampler_chain_add()`: Adds a sampler (like greedy, top-p, mirostat) to the chain.
        *   `llama_sampler_chain_get()`, `llama_sampler_chain_n()`, `llama_sampler_chain_remove()`: Manage the samplers within a chain.
        *   `llama_sampler_free()`: Frees memory associated with a sampler chain.
        *   Various `llama_sampler_init_...` functions define different sampling strategies (Greedy, Top-k, Top-p, Minimum-p, Typical, Temp, Mirostat, Grammar-based, etc.). Each returns a `llama_sampler` pointer.
        *   `llama_sampler_name()`, `llama_sampler_accept()`, `llama_sampler_apply()`, `llama_sampler_reset()`, `llama_sampler_clone()`, `llama_sampler_free()`: Functions to operate on a `llama_sampler` object once obtained.
        *   `llama_sampler_sample()`: A helper to perform sampling from the last token's logits using a specified sampler in a chain and accept the chosen token.
    *   **Performance Utilities:**
        *   `llama_perf_context_data`, `llama_perf_sampler_data`: Structs to store performance metrics for context-level and sampler-level operations.
        *   `llama_perf_context()`, `llama_perf_context_print()`, `llama_perf_context_reset()`: Functions to measure and report context-based timing.
        *   `llama_perf_sampler()`, `llama_perf_sampler_print()`, `llama_perf_sampler_reset()`: Functions to measure and report sampler-based timing.
    *   **Training (Experimental):**
        *   `llama_opt_init()`, `llama_opt_epoch()`: Functions related to fine-tuning/optimization passes, using callback functions for parameter filtering, optimizer parameter calculation, and epoch callbacks. This part seems less mature or feature-complete than the inference part.

**Summary:**

`llama.h` provides a comprehensive C API for interacting with the `llama.cpp` library. It abstracts the underlying complexity of model loading, memory management, GPU acceleration (via GGML), tokenization, and diverse sampling strategies. The use of structs and enums for configuration, clear function naming, and detailed comments make it relatively easy to integrate `llama.cpp` into applications for tasks like text generation, chatbots, and other NLP applications. The API supports various quantization levels, model splitting, multi-turn generation via memory management, and customizable sampling, caterering to a wide range of hardware capabilities and user needs. The presence of deprecated functions and experimental APIs (like the training part) suggests the library is actively developed and evolving.
</answer>

Very impressive!
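For anyone skimming the summary above, here's a minimal sketch of the load and teardown flow it describes, using only entry points named in the response. This is just a sketch: exact signatures can differ between llama.cpp versions, and the model path, `n_gpu_layers` and `n_ctx` values are placeholders.

```c
// Minimal sketch of the llama.h flow summarized above (not a full example).
// Model path and parameter values are placeholders; signatures may differ by version.
#include "llama.h"
#include <stdio.h>

int main(void) {
    llama_backend_init();                                   // init ggml backends once per process

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                              // placeholder: offload layers if a GPU backend is built

    struct llama_model * model =
        llama_model_load_from_file("Hunyuan-A13B-Instruct-BF16.gguf", mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 8192;                                   // placeholder context size

    struct llama_context * ctx = llama_init_from_model(model, cparams);
    if (!ctx) {
        fprintf(stderr, "failed to create context\n");
        llama_model_free(model);
        return 1;
    }

    // ... the llama_tokenize / llama_decode / llama_sampler_* generation loop would go here ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

The generation loop itself is left out here; the examples in the repo cover that part.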

@danielhanchen
Contributor

Re perplexity values - I'm getting PPL increasing from 1 to 3 to 5 to 7 to 27 to 37 and now 227 :(

@kzjeef

kzjeef commented Jul 9, 2025

Re perplexity values - I'm getting PPL increasing from 1 to 3 to 5 to 7 to 27 to 37 and now 227 :(

Hi @danielhanchen

This doesn't sound good. Did you apply this MR when testing?

The chat template should be fixed by this PR:
#14584

@danielhanchen
Contributor

@kzjeef Yes I recompiled from source - I'll see how high the PPL goes - I'll still try to make some quants!
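(For anyone following along, the quants are produced from the converted BF16 GGUF with llama-quantize, roughly like this; file names are placeholders and the --imatrix file is optional:)

./llama.cpp/llama-quantize --imatrix imatrix.dat Hunyuan-A13B-Instruct-BF16.gguf Hunyuan-A13B-Instruct-Q4_K_M.gguf Q4_K_M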

@kooshi
Contributor

kooshi commented Jul 9, 2025

The tested PPL has been absurdly high in every test of the Instruct model, including the official implementation, despite it being coherent in chats. The base model gives a perfectly reasonable score: #14425 (comment)

If anyone can verify what it's actually predicting it might help (probably trying to start with <answer> or something).

I hope it doesn't get in the way of the heuristics for the dynamic quants. I always look forward to them.
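(For reference, the PPL figures being compared in this thread come from llama-perplexity runs along these lines; the model path and test file are placeholders:)

./llama.cpp/llama-perplexity -m Hunyuan-A13B-Instruct-BF16.gguf -f wiki.test.raw -ngl 99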

@kzjeef

kzjeef commented Jul 9, 2025

The tested PPL has been absurdly high in every test of the Instruct model, including the official implementation, despite it being coherent in chats. The base model gives a perfectly reasonable score: #14425 (comment)

If anyone can verify what it's actually predicting it might help (probably trying to start with <answer> or something).

About the reasoning parser, where does it live in llama.cpp? I'm working on vLLM's reasoning parser (vllm-project/vllm#20625); maybe someone, or I, can port it to llama.cpp.
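To make concrete what such a parser has to do for this model, here's a tiny self-contained sketch (not the vLLM or llama.cpp implementation, just the tag splitting) that separates a response into its <think> and <answer> parts:

```c
// Minimal sketch of the tag splitting a Hunyuan-A13B reasoning parser needs to do.
// Not the vLLM or llama.cpp implementation; just plain string scanning.
#include <stdio.h>
#include <string.h>

// Return a pointer to the text between `open` and `close` inside `s`,
// writing its length to *len, or NULL if either tag is missing.
static const char * extract_between(const char * s, const char * open,
                                    const char * close, size_t * len) {
    const char * start = strstr(s, open);
    if (!start) return NULL;
    start += strlen(open);
    const char * end = strstr(start, close);
    if (!end) return NULL;
    *len = (size_t)(end - start);
    return start;
}

int main(void) {
    const char * response =
        "<think>step-by-step reasoning...</think>\n<answer>the final answer.</answer>";
    size_t n = 0;
    const char * think = extract_between(response, "<think>", "</think>", &n);
    if (think)  printf("reasoning: %.*s\n", (int) n, think);
    const char * answer = extract_between(response, "<answer>", "</answer>", &n);
    if (answer) printf("answer:    %.*s\n", (int) n, answer);
    return 0;
}
```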

Actually, we have tested some complex math cases internally after this PR (#14584), and it looks good.

I hope it doesn't get in the way of the heuristics for the dynamic quants. I always look forward to them.

@danielhanchen
Contributor

danielhanchen commented Jul 9, 2025

So my imatrix run gets Final estimate: PPL = 188.6129 +/- 1.33950, which is very high. I also used the chat template directly, and the PPL increases over time.

However, as someone mentioned, I think it's due to <answer></answer>, which makes the PPL shoot up. I'm not 100% sure, but it's very likely.

Quants at https://huggingface.co/unsloth/Hunyuan-A13B-Instruct-GGUF

Usage:

./llama.cpp/llama-cli -hf unsloth/Hunyuan-A13B-Instruct-GGUF:Q4_K_XL -ngl 99 --jinja --temp 0.7 --top-k 20 --top-p 0.8 --repeat-penalty 1.05
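(The imatrix mentioned above comes from a llama-imatrix run roughly like the following; the calibration file is a placeholder and I'm omitting the other flags:)

./llama.cpp/llama-imatrix -m Hunyuan-A13B-Instruct-BF16.gguf -f calibration.txt -ngl 99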

@jukofyork
Collaborator

jukofyork commented Jul 9, 2025

The tested PPL has been absurdly high in every test of the Instruct model, including the official implementation, despite it being coherent in chats. The base model gives a perfectly reasonable score: #14425 (comment)

If anyone can verify what it's actually predicting it might help (probably trying to start with <answer> or something).

I hope it doesn't get in the way of the heuristics for the dynamic quants. I always look forward to them.

Probably the easiest way to see what is causing this is to start generation from a single BOS token and see what it generates with temperature = 1.
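(Something along these lines should do it, assuming an empty prompt still gets a BOS token prepended; the model path is a placeholder:)

./llama.cpp/llama-cli -m Hunyuan-A13B-Instruct-BF16.gguf -ngl 99 -p "" --temp 1.0 -n 256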

EDIT: I think this also shows it might be time to consider letting llama-imatrix use data wrapped inside chat templates (even if the data isn't really chat data). Some of the newer models use extremely detailed templates and far more fine-tuning data than they used to, so this sort of problem is only likely to get worse in the future!

qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Jul 10, 2025
* model : add hunyuan moe

* tokenizer ok

* fix tensor name

* cgraph init

* chat template

* wip

* almost working

* skip embed, fix bos

* cleanup

* yarn scaling

* cleanup

* correct rope type

* failed token fix

* ntk alpha freq_base

* tokenization working

* cleanup and pr changes

* vocab_size sanity check

* ntk alpha generic

* Update convert_hf_to_gguf.py

* Apply suggestions from code review

* fix regression

* fix style

---------

Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
@ggerganov ggerganov added the hot Something that is hot label Jul 11, 2025
olek-tether pushed a commit to tetherto/qvac-ext-lib-llama.cpp that referenced this pull request Aug 15, 2025
* sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)

* ggml : do not output unprintable characters on GGUF load failure (#14381)

* ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)

* ggml-cpu: add nnpa compile flag

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

* ggml-cpu: add fp16->fp32 nnpa first

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

* ggml-cpu: add fp32->fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

* ggml-cpu: better variable names

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

* docs: update s390x docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

* ggml-cpu: add debugging prints to see if dlf16 is correct

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix print vs printf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix float placeholder

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: ensure fp16 and fp32 load and stores are called

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fp16 load ensured to hit

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove sigint from fp16 store

for some reason, the function is not getting a hit when debugged with
    gdb. we will need to investigate further

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: nnpa switch to vec_xst test

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to vec_xst for 4 element loops also

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rework noop

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove noop, general code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: clarify variable naming

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add breakpoint for debugging

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: test fix for conversion failure

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: disable fp32->fp16 nnpa conversions for now

there are some conversion failures in nnpa that require the eyes of an
ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to elif macro

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix compiler types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: change to typedef vector types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add 4 element loops for fp32->fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: clarified vector naming

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bring back fp32->fp16 store nnpa

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add nnpa macro check in ggml-impl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add missing __func__

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: diagnose why __NNPA__ macro is not being defined

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: import vecintrin.h to fix compiler errors

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update macro tests

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to importing ggml-cpu-impl instead

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix macro declaration

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: test more macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add debug prints

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bruteforce macro definitions

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move macro definitions

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add ggml-impl.h to cmakelists

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to private macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 157f856c34589566151630e294563a420702db39)

* ggml-cpu: move things around

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bring back compile macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to quotes for import

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add compiler error macro

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add s390x detection in ggml-src

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bring back compile definitions

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: undo cmakelists work

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove typedefs.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove typedef from cmakelists

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add ggml-impl.h future notes

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add todo comment for future reference

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: clarify naming of dlf16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove unnecessary target compile definitions

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: update broken huggingface link for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix duplicate func names during compile

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: fix duplicate func names during compile"

This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: refactor fp16<->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix missing simd-mappings.h import in quants.c

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix missing simd-mappings.h within repack

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix amx mmq missing simd-mappings.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: attempt at fixing loongarch failing build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move nnpa together with other fp16<->fp32 simd

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix wrong refactor of ggml-base

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: remove dependency on ggml-cpu from ggml-base

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove mistaken fallback macro

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

* ggml: move ggml_table_f32_f16 to ggml-cpu.c

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: extern c ggml_table_f32_f16 + chore docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bring back ggml_table_f32_f16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: bring back ggml_table_f32_f16"

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* fix ggml time initialization

* fix f32_f16 table init

* remove extra line

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: slaren <slarengh@gmail.com>

* musa: enable fp16 mma (all) and cublas on qy2 (#13842)

* musa: enable fp16 mma (all) and cublas on qy2

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* docs: update s390x documentation + add faq (#14389)

* docs: update s390x documentation + add faq

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add s390x z17 build q&a

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* metal : batch rows copy in a single threadgroup (#14384)

* metal : batch rows copy in a single threadgroup

ggml-ci

* metal : handle some edge cases when threadgroup size is not a power of 2

ggml-ci

* metal : add special-case mat-vec mul for ne00 == 4 (#14385)

ggml-ci

* llama : return mistral-v7-tekken as default template only (#14390)

* cmake: regen vulkan shaders when shaders-gen sources change (#14398)

* Add shaders-gen sources as target deps

* model : gemma3n text-only (#14400)

* gemma3n

* add llm_graph_input_one

* convert : fix broken sentencepiece vocab (#14416)

* ggml : add ggml_set_rows (#14274)

* ggml : add ggml_set_rows

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.

ref: #8366

* use I64 for indices

* ggml : add repeat impl for i64

* ggml : add ggml_is_contiguous_rows

* ggml : ggml_set_rows support broadcast

* ggml : ggml_set_rows support quantized dst

ggml-ci

* ggml : support GGML_TYPE_F32 ".from_float" trait

* ggml : ggml_set_rows update comment + better index name

* tests : add ggml_set_rows

* metal : add ggml_set_rows implementation

ggml-ci

* ggml : simplify forward_dup_f32

* ggml : fix supports_op

* tests : add comment to set_rows

* ggml : leave the repeat_i64 for a separate PR

ggml-ci

* ggml : set_rows use std::min instead of MIN

* ggml : better error message for set_rows unsupported type

* metal : perform op->type check only once

* tests : more consistent implementation + more tests

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* recurrent : call balloc split_reset() in init_batch() (#14414)

ggml-ci

* graph : make llm_graph_context destructor virtual (#14410)

ggml-ci

* vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)

This setting needs to be passed through to vulkan-shaders-gen

* ci : fix windows build and release (#14431)

* fix async_mode bug (#14432)

* model : add support for ERNIE 4.5 0.3B model (#14408)

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang <weizhao.ouyang@arm.com>

* vulkan: lock accesses of pinned_memory vector (#14333)

* vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)

* CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)

* CUDA: add bf16 and f32 support to cublas_mul_mat_batched

* Review: add type traits and make function more generic

* Review: make check more explicit, add back comments, and fix formatting

* Review: fix formatting, remove useless type conversion, fix naming for bools

* vulkan: Add fusion support for RMS_NORM+MUL (#14366)

* vulkan: Add fusion support for RMS_NORM+MUL

- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.

* extract some common fusion logic

* fix -Winconsistent-missing-override

* move ggml_can_fuse to a common function

* build fix

* C and C++ versions of can_fuse

* move use count to the graph to avoid data races and double increments when used in multiple threads

* use hash table lookup to find node index

* change use_counts to be indexed by hash table slot

* minimize hash lookups

style fixes

* last node doesn't need single use.
fix type.
handle mul operands being swapped.

* remove redundant parameter

---------

Co-authored-by: slaren <slarengh@gmail.com>

* ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)

* implement unary REGLU/GEGLU/SWIGLU cpu ops

* relax constraints

* duplicate shape of source

* fix ggml_vec_geglu_f16

* special case gated ops

* implement unary REGLU/GEGLU/SWIGLU cuda ops

* tighten constraints again

* refactor into GGML_GLU_OP

* metal : add glu kernels

ggml-ci

* add CUDA_GLU_BLOCK_SIZE [no ci]

* more constraints and use 64bit ints

ggml-ci

* 64bit multiplication [no ci]

* implement swapped variants (cpu/cuda)

* update comment [no ci]

ggml-ci

* Vulkan: Add GLU ops and shaders

* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate

* ggml : implement GLU for split up/gate (#14181)

* implement GLU for split up/gate

* add tests for ggml_glu_split

* Vulkan: Implement glu_split logic and shader support

* add split to logging [no ci]

* SYCL: refactor element_size ops and add split up and gate support to gated kernels

* SYCL: switch GEGLU to use tanh approximation

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>

* GGML: increase OP count in assertion

* Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

* vulkan: Increase workgroup size for GLU, for performance (#14345)

* vulkan: Increase workgroup size for GLU, for performance

* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup

* merge fix

* metal : add support for split and swap

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)

* SYCL: disable faulty fp16 exp kernel (#14395)

* SYCL: disable faulty fp16 CPU exponent for now

* Revert "SYCL: disable faulty fp16 CPU exponent for now"

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

* SYCL: disable faulty fp16 CPU exponent for now

* Fix logic of disabling exponent kernel

* server : fix appearance of the chats list context menu for Safari (#14322)

* server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)

* initial commit for handling extra template kwargs

* enable_thinking and assistant prefill cannot be enabled at the same time

* can set chat_template_kwargs in command line

* added doc

* fixed formatting

* add support for extra context in generic template init

* coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* coding standard:  common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestions from code review

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix merge conflict

* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)

* normalize environment variable name

* simplify code

* prefill cannot be used with thinking models

* compatibility with the new reasoning-budget parameter

* fix prefill for non thinking models

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>

* scripts : make the shell scripts cross-platform (#14341)

* cmake : Remove redundant include path in CMakeLists.txt (#14452)

* Update docker.yml

Modified docker.yml so that this workflow no longer runs on a schedule; if you want to run the workflow, you can trigger it manually.

* Remove redundant include path in CMakeLists.txt

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

* Enable scheduled Docker image builds

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

* test-backend-ops : disable llama test (#14461)

* ggml-cpu: sycl: Re-enable exp f16 (#14462)

* metal : disable fast-math for some cpy kernels (#14460)

* metal : disable fast-math for some cpy kernels

ggml-ci

* cont : disable for q4_1

ggml-ci

* cont : disable for iq4_nl

ggml-ci

* memory : correctly handle failure in apply() (#14438)

ggml-ci

* Add Conv2d for CPU (#14388)

* Conv2D: Add CPU version

* Half decent

* Tiled approach for F32

* remove file

* Fix tests

* Support F16 operations

* add assert about size

* Review: further formatting fixes, add assert and use CPU version of fp32->fp16

* opencl : add GEGLU, REGLU, SWIGLU (#14456)

* ggml-quants : rename best_mad to best_error (ggml/1283)

This commit renames the variable `best_mad` to `best_error` in the
`make_qkx2_quants` function.

The motivation for this is that the name `best_mad` can be somewhat
confusing if mean absolute deviation (MAD) is not in use.

* ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)

* add "align corners" mode for bilinear upscale, and allow downscaling
* add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
* test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners

* sync : ggml

ggml-ci

* ggml : remove trailing whitespace (#0)

* add GELU_ERF (#14455)

* vulkan: Split large mul_mat_id to fit in shared memory (#14451)

* CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)

* [CANN]update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon <757486878@qq.com>

* Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon <757486878@qq.com>

* fix editorconfig

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>

* Add Vulkan images to docker.md (#14472)

Right now it's not easy to find those.

* ci : disable fast-math for Metal GHA CI (#14478)

* ci : disable fast-math for Metal GHA CI

ggml-ci

* cont : remove -g flag

ggml-ci

* ggml : Callback before abort (#14481)

* Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.

* Return previous callback to allow callback chaining

* style fixes

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* github : add OpenCL backend to issue templates (#14492)

* ci : add OpenCL to labeler workflow (#14496)

* opencl : update upscale to support align corners (#14488)

* opencl : skip empty nodes on cgraph compute (#14491)

* simple-chat : fix context-exceeded condition (#14494)

* simple-chat : fix context-exceeded condition

ggml-ci

* cont : fix n_ctx_used computation

ggml-ci

* opencl : fix possible buffer overflow in dump_tensor (#14490)

* ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)

ggml-ci

* vulkan: support softmax/FA batch and broadcast (#14449)

* CUDA: broadcasting for FlashAttention mask (#14500)

* CUDA: add softmax broadcast (#14475)

* CUDA: add softmax broadcast

* Pass by const ref

* Review: Use blockDims for indexing, remove designated initializers

* Add TODO for noncontigous input/output

* Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)

* ggml : add version function to get lib version (ggml/1286)

* ggml : add version function to get lib version

This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.

The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.

Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```

* ggml : add ggml_commit()

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* sync : ggml

ggml-ci

* llama : initial Mamba-2 support (#9126)

* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when running Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON

* cuda : graceful fallback for Mamba-1 models with weird embd size

* gguf-py : add support for chat template jinja files (#14508)

* add support for chat template jinja files

* remove gemma3n hack

* CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)

* ggml : remove kompute backend (#14501)

ggml-ci

* ggml : fix FA mask dim 2 and 3 (#14505)

* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1

* kv-cache : use ggml_set_rows (#14285)

* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci

* convert : correct gemma 3n conversion (#14450)

* convert : correct gemma 3n conversion

* rm redundant code

* Fix conditional enabling following arch checks for ggml-sycl (#14504)

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>

* ggml: backward pass for split swiglu (#14483)

* vulkan: support mixed/deepseekR1 FA head sizes (#14509)

* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes

* opencl : broadcast for soft_max (#14510)

* ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)

* CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)

Co-authored-by: luyuhong <luyuhong@kylinos.cn>

* batch : add n_used count (#14512)

ggml-ci

* graph : prepare for 4D mask (#14515)

ggml-ci

* batch : add optional for sequential equal split (#14511)

ggml-ci

* metal : disable fast math in all quantize kernels (#14528)

ggml-ci

* test-backend-ops: add support for specifying output format (#14368)

* test-backend-ops: add support for specifying output format

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Add build_commit and build_number in test_result

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* refactor

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Get build commit from ggml_commit()

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Merge errors into test_operation_info && address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* remove visitor nonsense

* remove visitor comment

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

---------

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>

* eval-callback : check for empty input (#14539)

* opencl: add GELU_ERF (#14476)

* server : fix assistant prefilling when content is an array (#14360)

* vulkan: Handle updated FA dim2/3 definition (#14518)

* vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.

* handle null mask for gqa

* allow gqa with dim3>1

* vulkan: fix rms_norm+mul fusion (#14545)

The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.

* vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)

Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>

* CUDA: add bf16 and i32 to getrows (#14529)

* llama : remove ggml_cont where possible (#14568)

* llama : fix incorrect minicpm3 v_states shape (#14571)

* musa: fix build warnings (unused variable) (#14561)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* CUDA: add bilinear interpolation for upscale (#14563)

* cuda : fix rope with partial rotation and non-cont src (#14580)

* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci

* vulkan: increase timeout for CI (#14574)

* model : add hunyuan moe (#14425)

* model : add hunyuan moe

* tokenizer ok

* fix tensor name

* cgraph init

* chat template

* wip

* almost working

* skip embed, fix bos

* cleanup

* yarn scaling

* cleanup

* correct rope type

* failed token fix

* ntk alpha freq_base

* tokenization working

* cleanup and pr changes

* vocab_size sanity check

* ntk alpha generic

* Update convert_hf_to_gguf.py

* Apply suggestions from code review

* fix regression

* fix style

---------

Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>

* server: Add ability to mount server at prefix (#14544)

* Add server_prefix

* Correct server path env

* Rename cli flag to --api-prefix

* Change all to api_prefix

* vulkan : fix rope with partial rotation and non-cont src (#14582)

* memory : fix broken batch splits for recurrent cache (#14575)

Splits producing more than one ubatch per batch for recurrent models
were broken with #14512.

This fixes it by moving the completeness check after the ubatch split loop.

* model : add SmolLM3 (#14581)

* Init - first pass.

* Model -> ModelBase.

* fix errors in conversion.

* Update the graph.

* up.

* up.

* wip

* cgraph ok

* rm redundant code

---------

Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>

* model : fix hunyuan moe chat template (#14584)

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* vulkan: optimize flash attention split_k_reduce (#14554)

* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).

* convert : fix smollm3 jinja template (#14586)

* model : add support for Falcon-H1 family (#14534)

* v1

* push more fixes

* another fix

* fix

* more fixes

* minor fix

* more cleaning on python code

* python fixes

* changed precision for multipliers float 32->64

* fixes

* another fix

* fix

* pre-norm -> norm

* fix

* Revert "fix"

This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.

* fix

* small fix ffn_norm

* try

* mix instead of max

* fix vocab size

* conflict solve

* fixed multipliers

* falcon-h1 specific vocab resolved

* read arch from gguf.MODEL_ARCH

* mamba_d_ssm added to d_inner find_hparam

* remove unused functions from gguf_writer.py

* override modify_tensors instead of get_tensors

* fix conversion and d_inner

* added some cb functions for debugging purposes

* inp_out_ids moved outside of layers loop

* mup_vec create as float64

* fix rope_theta

* injected mup

* clean ups

* rm extra space

* rm unused MAMBA_CHUNK_SIZE

* rm unused key

* add bos False

* changed ROPE_TYPE

* cleaning debugging stuff

* cleaning debug quant

* fix comment

* some cleanups

* some cleanups

* Update src/llama-model-loader.cpp

* more cleanups

* moe cleanups

* d_ssm -> d_inner;

* cleaning unused hparams

* cleanup

* more cleanups

* more cleanups on python conversion;

* minor cleanups

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* remove todo

* added falcon-h1

* tensor not required

* clean

* remove unneeded attributes

* more cleanups and fixed conversion

* remove final_norm

* flake8 fixes

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* flake8 fixes

* Update src/llama-hparams.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* added hashes

* Update src/llama-arch.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update the update file

* Revert "update the update file"

This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.

* fix: address suggestions

* fix: update convert_hf_to_gguf.py

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* d_inner fixed

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* reshaping ssm_norm for 34B

* removing generate_mup

* remove duplicates metadata keys

* rm comment

* final comment

* fix unused args

* fix constants

* fix bad merge

* Update src/llama-model.cpp

Co-authored-by: compilade <git@compilade.net>

* falcon-h1: remove unused ssm_in_b and bad merge

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* falcon-h1: fix last comment

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* falcon-h1: revert add_add_bos(False)

* falcon-h1: fix tied weights

* falcon-h1: remove whitespace

* falcon-h1: fix wrong size param

* falcon-h1: fix whitespace issues

---------

Co-authored-by: younesbelkada <younes.belkada@tii.ae>
Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: compilade <git@compilade.net>

* llama : remove unintended whitespace (#14592)

* model : add skt/A.X-4.0 model vocabulary (#14589)

* ggml : prevent integer overflow in gguf tensor size calculation (#14595)

* ggml : add ggml_scale_bias (#14417)

* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code looks more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
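
For readers who want the semantics rather than the kernel plumbing: a NumPy sketch of what the new fused op is assumed to compute (a scale followed by a constant bias, which is also what the per-element `ggml_vec_mad1_f32` helper would evaluate); the exact C signatures are not reproduced here:

```python
import numpy as np

def scale_bias(x, s, b):
    # assumed semantics: y = x * s + b, with scalars s and b broadcast over the tensor
    return x * s + b

x = np.arange(4, dtype=np.float32)
print(scale_bias(x, 2.0, 0.5))   # [0.5 2.5 4.5 6.5]
```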

* llama : support Jamba hybrid Transformer-Mamba models (#7531)

* wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba
(and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

* llama : use std::find for seq_nodes in llama_rs_cache

* llama : state checkpoints for recurrent models

* llama : correctly handle more edge cases for the rs cache

* llama : rename many llama_kv_cache_* functions

* llama : remove useless return value for some llama_cache_* functions

* llama : rethink recurrent state cell counts

* llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.

* llama : gracefully fail when not finding hybrid slot

* llama : support Jamba

* llama : fix BERT inference without KV cache

* convert-hf : check for unprocessed Jamba experts

* convert-hf : support Mini-Jamba conversion

* llama : fix Jamba quantization sanity checks

* llama : sequence-length-aware batch splitting

* llama : use equal-sequence-length sub-batches for recurrent models

* ggml : simplify SSM-related operators

* llama : make recurrent state slot allocation contiguous

* llama : adapt internal uses of batches to llama_ubatch

* llama : fix batch split output count for embeddings

* llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.
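
For illustration only, one textbook way to apply a reordering with the minimum number of swaps is to walk the permutation's cycles; this sketch shows the general technique and is not necessarily the exact strategy used in the commit:

```python
def reorder_min_swaps(values, order):
    """Move element i of `values` to position order[i], i.e. result[order[i]] = values[i],
    using at most len(values) - number_of_cycles swaps (the minimum possible)."""
    values = list(values)
    order = list(order)
    for start in range(len(order)):
        while order[start] != start:
            j = order[start]
            values[start], values[j] = values[j], values[start]
            order[start], order[j] = order[j], order[start]
    return values

print(reorder_min_swaps(["a", "b", "c", "d"], [2, 0, 3, 1]))  # ['b', 'd', 'a', 'c']
```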

* llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.

* llama : avoid copies for simple batch splits

* ggml : make ggml_ssm_scan not modify its source tensors

* llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.

* llama : fix .base() compilation error on Windows

* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
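
The equivalence being exploited here is that a depthwise causal conv is just "multiply each sliding window by the kernel, then sum over the window axis". A NumPy sketch of that identity (shapes and window layout are illustrative, not the exact ggml tensor layout):

```python
import numpy as np

d_conv, d_inner, n_tokens = 4, 8, 5
rng = np.random.default_rng(0)
kernel = rng.standard_normal((d_conv, d_inner)).astype(np.float32)            # per-channel weights
x = rng.standard_normal((n_tokens + d_conv - 1, d_inner)).astype(np.float32)  # past state + new tokens

# Reference: depthwise causal conv, one np.convolve per channel
ref = np.stack([np.convolve(x[:, c], kernel[::-1, c], mode="valid")
                for c in range(d_inner)], axis=1)                              # (n_tokens, d_inner)

# Same result phrased as MUL + a sum over rows: gather the sliding windows,
# broadcast-multiply by the kernel, reduce over the window axis
windows = np.stack([x[t:t + d_conv] for t in range(n_tokens)])                 # (n_tokens, d_conv, d_inner)
out = (windows * kernel).sum(axis=1)

assert np.allclose(ref, out, atol=1e-5)
```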

* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it,
and this makes Mamba's conv step slightly faster.

* mamba : fix non-contiguous usage of ggml_silu

* llama : session saving and reloading for hybrid models

* convert_hf : fix Jamba conversion

* llama : fix mixed signedness comparison

* llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

* llama : begin renaming llama_past back to llama_kv_cache

* llama : remove implicit recurrent state rollbacks

* llama : partially apply clang-format style

* convert : fix jamba conv1d shape squeezing

* graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).

* model : add Jamba to Mamba-specific hparams printing

* jamba : remove redundant nullptr initializations

* model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : make falcon-h1 use shared mamba2 layer builder

* memory : avoid referring to KV in recurrent cache logs

* gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* llama : remove llm_graph_input_one (#14603)

* cuda : support Falcon-H1 state size for SSM_SCAN (#14602)

* cmake : llguidance build parser library only (#14608)

* cmake : bump llguidance version to v1.0.1 (#14609)

* llama : minor coding style fix for smollm3 (#14605)

* SYCL: Initial set_rows kernel implementation (#14562)

* SYCL: Initial set_rows kernel implementation

* Revert max_threads to 256

* Refactor set_rows and address review comments

* Deduplicate conversion function

* Remove guard before kernel launch and refactor

* Fix and add back SFINAE

* cmake : do not search for curl libraries by ourselves (#14613)

* cmake : do not search for curl libraries by ourselves

* run : do not search for curl libraries by ourselves

* Docs: script to auto-generate ggml operations docs (#14598)

* Docs: script to auto-generate ggml operations docs

* Review: formatting changes + change github action

* Use built-in types instead of typing

* docs : add BLAS and Metal ops

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Smoldocling support (#14597)

* support for smoldocling

* fixed merge conflicts

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>

* merge conflicts

* pre tokenizer merge fix

* convert : fix smollm3 jinja template (#14586)

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>

* support for smoldocling

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>

* fixed merge conflicts

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* safetensors tensor mapping

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>

* added back accidentally-removed clean spaces for hunyuan

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* updated hash and reordered model list

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update include/llama.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* removed old tensor name

* removed tensor mappings -> handled by smolvlm

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: compilade <git@compilade.net>

* opencl: add `set_rows` for `f16` and `f32` (#14547)

* opencl: add `set_rows` for `f16` and `f32`

* opencl: better choose workgroup size for `set_rows`

* opencl: add tiled mul_mat_f16_f32 (#14535)

* add tiled mul_mat_f16_f32

* fix trailing whitespace

* add insightful comments

* model : Granite Four (#13550)

* wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba
(and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

* llama : use std::find for seq_nodes in llama_rs_cache

* llama : state checkpoints for recurrent models

* llama : correctly handle more edge cases for the rs cache

* llama : rename many llama_kv_cache_* functions

* llama : remove useless return value for some llama_cache_* functions

* llama : rethink recurrent state cell counts

* llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.

* llama : gracefully fail when not finding hybrid slot

* llama : support Jamba

* llama : fix BERT inference without KV cache

* convert-hf : check for unprocessed Jamba experts

* convert-hf : support Mini-Jamba conversion

* llama : fix Jamba quantization sanity checks

* llama : sequence-length-aware batch splitting

* llama : use equal-sequence-length sub-batches for recurrent models

* ggml : simplify SSM-related operators

* llama : make recurrent state slot allocation contiguous

* llama : adapt internal uses of batches to llama_ubatch

* llama : fix batch split output count for embeddings

* llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.

* llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.

* llama : avoid copies for simple batch splits

* llama : use im2col and mul_mat to perform convolution for Mamba

This removes the need for ggml_ssm_conv!!!
But performance seems slightly worse on my system,
especially for prompt processing.
Maybe ggml_mul_mat isn't optimized for small row sizes?
More performance testing is needed before GGML_OP_SSM_CONV can be removed.
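
A NumPy sketch of the im2col-plus-matmul idea for the depthwise causal conv (the real change of course uses ggml tensors and `ggml_mul_mat`; the layout below is purely illustrative):

```python
import numpy as np

d_conv, d_inner, n_tokens = 4, 8, 6
rng = np.random.default_rng(1)
kernel = rng.standard_normal((d_inner, d_conv))              # one small filter per channel
x = rng.standard_normal((d_inner, n_tokens + d_conv - 1))    # channel-major, includes past state

# im2col: for every output position, gather its d_conv-wide input window
cols = np.stack([x[:, t:t + d_conv] for t in range(n_tokens)], axis=1)  # (d_inner, n_tokens, d_conv)

# The conv is now a batched mat-vec: each channel's windows times that channel's filter
y = np.einsum("ctk,ck->ct", cols, kernel)                    # (d_inner, n_tokens)

# Reference: direct sliding-window multiply-accumulate
ref = np.stack([(x[:, t:t + d_conv] * kernel).sum(axis=1) for t in range(n_tokens)], axis=1)
assert np.allclose(y, ref)
```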

* ggml : make ggml_ssm_scan not modify its source tensors

* llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.

* llama : fix .base() compilation error on Windows

* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL

* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it,
and this makes Mamba's conv step slightly faster.

* llama : rename llama_cache to llama_past

This can be changed back later if the name change is wrong.
I was renaming the functions anyway to generalize kv-cache-related
functions to hybrid and recurrent model architectures.
I think llama_past is a better name than llama_cache for a combined
kv cache and recurrent state cache, because the states it contains
pretty much always come before the newly-added ones for any particular
sequence. Also 'llama_past_clear' sounds more obvious in what it does
than 'llama_kv_cache_clear'. The future is what the models generate.
(For embeddings, the kv cache isn't really used anyway)

Still, I'm open to better suggestions.

* examples : replace llama_kv_cache_seq_* with llama_past_seq_*

* mamba : fix non-contiguous usage of ggml_silu

* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : session saving and reloading for hybrid models

* convert_hf : fix Jamba conversion

* llama : fix mixed signedness comparison

* llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

* llama : begin renaming llama_past back to llama_kv_cache

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* llama : remove implicit recurrent state rollbacks

* llama : partially apply clang-format style

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* feat: Add conversion for Bamba models

This is borrowed and adapted from the original implementation
https://github.com/ggml-org/llama.cpp/pull/10810

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add Granite 4 conversion

This is a manual copy from my draft branch
https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Plumb bamba through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add bamba to llama_arch_is_hybrid_recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add optional mamba ssm_in bias tensor

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add template specialization for get_arr to load a vector<uint32_t> for layer index arr in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use an explicit bool to determine mamba vs mamba2

This allows other architectures like bamba and granitemoehybrid to use
mamba2 without a growing architecture `if` statement inside the mamba
implementation.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Isolate mamba(2) and granite attention layer building in static methods

This will allow these layer-builder methods to be used from other build
structs without complex inheritance.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use per-layer sizes in granite build_attention_layer

Also no need to pass in kv cache since it's already in the inp_attn

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First (broken) pass at end-to-end Bamba implementation

It generates (garbage) tokens! Still lots of debugging to do.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Only do Granite multipliers if set

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Pull granite ffn portion into a static function and reuse in hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(py): Allow gguf duplicate keys if they match by value and type

This is helpful for hybrid models that want to do gguf param setting by
calling multiple parent classes without needing to make those parent
classes try/except on every attempt to set a gguf value.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
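
A minimal sketch of the behaviour being described, with a hypothetical `add_key_value` wrapper (this is not the real gguf-py writer code, only an illustration of "accept a duplicate key when both value and type match"):

```python
class KVStore:
    """Illustrative only: tolerate repeated keys if value and type are identical."""
    def __init__(self):
        self.kv = {}

    def add_key_value(self, key, value, vtype):
        if key in self.kv:
            old_value, old_type = self.kv[key]
            if old_value == value and old_type == vtype:
                return  # identical duplicate set by a parent class: silently accept
            raise ValueError(f"duplicate key {key!r} with conflicting value or type")
        self.kv[key] = (value, vtype)

w = KVStore()
w.add_key_value("general.architecture", "granitehybrid", "string")
w.add_key_value("general.architecture", "granitehybrid", "string")  # ok: same value and type
# w.add_key_value("general.architecture", "bamba", "string")        # would raise
```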

* refactor(py): Simplify granitemoehybrid conversion to use parents better

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add GRANITE_MOE_HYBRID through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Support GRANITE_MOE_HYBRID in llama-model

This re-uses the Bamba code paths heavily and simply adds the missing parts
for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Fix flake8 errors

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix recurrent cache get after rebase

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins

The challenge here is to give both the non-hybrid classes (llm_build_mamba
and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access
to the same intermediate "base class" functionality (build_mamba*_layer,
build_granite_attention_layer) without running into trouble with diamond
inheritance of llm_graph_context. Due to the non-trivial initialization
that happens in llm_graph_context, diamond inheritance results in multiple
initializations of the common base which cause problems around the unique
ptrs. I wanted to get away from `self->` everywhere, but this is still a
bit cleaner than making those methods static I think.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Implement the full copy-paste version to duplicate the layer builders

This follows the pattern where the type of input is pinned to the type of
memory and that is used to dispatch to the correct version of `build_rs` /
`build_attn`. There's a lot of code duplication that can hopefully be
pulled into common functions in the graph later.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid

I've gone back and forth a lot about how/whether to try to implement reuse of the
"child model" layer types for hybrid models. At the end of the day, I think
hybrid models are their own beast and even if their layers are inspired by
other models, they should maintain control of their own layer building (in
other words, the copy-paste method). Given that, the name should reflect
that this is not a generic hybrid model builder, but rather a granite-
specific hybrid model builder that can do MoE (granite 4) or dense (bamba).

As part of this, I also cleaned up dangling comments from previous attempts
at using static methods for reusability.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON

* memory : correctly handle failure in apply()

ggml-ci

* style: Remove TODO for adding first hybrid models to the switch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* docs: Fix comment about duplicate key check

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Conform to standard way of initializing inp_out_ids

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* convert : fix jamba conv1d shape squeezing

* fix: Fix input initialization in granite_hybrid after removal of hybrid inputs

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use llm_graph_context_mamba in llm_build_granite_hybrid

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins

The key is for the mixin classes (llm_graph_context_mamba,
llm_graph_context_granite) to use virtual inheritance from
llm_graph_context. This allows the common members to exist only once in the
class hierarchy. The downside is that llm_graph_context will be
re-initialized once for each parent (i.e. 2x for a single mixin, 3x for two
mixins, etc.).

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).

* model : add Jamba to Mamba-specific hparams printing

* fix: Fix input setup after upstream merge

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* jamba : remove redundant nullptr initializations

* model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* feat: Add support for dense FFN in GraniteMoeHybrid

This was already partially supported via reusing the granite ffn builder,
and there may be models that leverage this architecture going forward. The
naming is a bit odd, but in the transformers version, it reuses the same
model class and simply has zero regular experts and a single shared expert
(which is the same as a single dense FFN).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
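
A tiny NumPy sketch of why "zero routed experts plus one shared expert" collapses to a plain dense FFN (the gated-FFN shape here is illustrative, not the GraniteMoeHybrid graph code):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def ffn(x, w_gate, w_up, w_down):
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def moe_with_shared(x, routed_experts, routed_weights, shared):
    out = ffn(x, *shared)                        # the shared expert always runs
    for w, e in zip(routed_weights, routed_experts):
        out = out + w * ffn(x, *e)               # routed experts (none in the dense case)
    return out

rng = np.random.default_rng(2)
d, h = 16, 32
shared = (rng.standard_normal((d, h)), rng.standard_normal((d, h)), rng.standard_normal((h, d)))
x = rng.standard_normal((3, d))

# With zero routed experts the MoE layer is exactly the dense FFN
assert np.allclose(moe_with_shared(x, [], [], shared), ffn(x, *shared))
```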

* feat: Add support for dense FFN tensor names on c++ side

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use child inputs for Falcon H1 after merge resolution

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary prefix on tensor constants

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : make falcon-h1 use shared mamba2 layer builder

* memory : avoid referring to KV in recurrent cache logs

* fix: Revert order changes for Falcon H1 to stay consistent with upstream

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

* refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid

The only key difference is the use of rope which is now set via
rope_finetuned in the hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove use of diamond inheritance

Per PR discussion, it's simpler to keep this with basic inheritance and not
introduce the complexity of virtual inheritance and multiple inheritance

https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Log mamba params for Granite Hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unused ssm_in_b

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv

This matches how recurrent vs attention heads are identified for Jamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
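
For illustration, the per-layer check this implies, in hypothetical Python (mirroring the idea that a hybrid layer with zero KV heads is treated as recurrent, as already done for Jamba):

```python
# Hypothetical sketch: classify hybrid-model layers from per-layer KV head counts
n_head_kv = [0, 0, 0, 8, 0, 0, 0, 8]    # e.g. mostly SSM layers with periodic attention

def is_recurrent(layer_idx):
    return n_head_kv[layer_idx] == 0     # no KV heads -> Mamba/SSM layer

print([i for i, n in enumerate(n_head_kv) if n > 0])   # attention layers: [3, 7]
```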

* fix: Remove unused template expansion for get_arr

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Review cleanup in convert_hf_to_gguf

The gist is to be explicit about which base class is being used with the
multiple inheritance setup

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Undo hidden warnings about duplicate identical keys in add_key_value

After further discussion, this encourages sloppy overwriting in the model
converters

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: If not using ROPE, context is "infinite"

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* doc: Add a comment outlining expected duplicate key warnings

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary duplicate keys in converter

Co-authored-by: Francis Couture-Harpin <git@compilade.net>

(thanks for the sharp eyes and patience!)

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* vocab : add midm-2.0 model pre-tokenizer (#14626)

* llama : move enum llama_vocab_pre_type to implementation (#14631)

ggml-ci

* readme : add hot PRs (#14636)

* readme : add hot PRs

* cont

* readme : update title

* readme : hot PRs links

* cont

* HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634)

* model : support LiquidAI LFM2 hybrid family (#14620)

**Important**
LFM2 was [merged](https://github.com/huggingface/transformers/pull/39340) into transformers, but has not yet been released.
To convert into gguf, install transformers from source
```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```

* vulkan: optimizations for deepseek prompt processing (#14555)

* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

* vulkan: increase coopmat2 mul_mat_id tile size

* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)

* vulkan: support SET_ROWS (#14587)

* vulkan: support SET_ROWS

Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.

* vulkan: optimize set_rows

Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
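
For readers unfamiliar with the op being ported, a NumPy sketch of what set_rows does at the tensor level (destination rows selected by an index tensor are overwritten with the source rows); the Vulkan details such as workgroup sizing and fastmod are not modeled:

```python
import numpy as np

def set_rows(dst, src, row_ids):
    # dst[row_ids[i], :] = src[i, :] for every source row i
    dst = dst.copy()
    dst[row_ids] = src
    return dst

dst = np.zeros((6, 4), dtype=np.float32)
src = np.arange(8, dtype=np.float32).reshape(2, 4)
print(set_rows(dst, src, np.array([4, 1])))
# rows 4 and 1 now hold src[0] and src[1]; every other row stays zero
```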

* server : fix pooled embedding output (#14645)

* vulkan : implement ggml_roll (ggml/1290)

ggml-ci

* vulkan : implement bilinear interpolation (ggml/1291)

ggml-ci

* sync : ggml

ggml-ci

* vulkan : remove unused vars (#0)

ggml-ci

* sync : ggml

* CUDA: add set rows for f32 and f16 (#14551)

* CUDA: add set rows for f32 and f16

* Review: change kernel params, use strides from host

* Use 1-d kernel

* Review: use int64_t for blockDim.x, rename nb->s for clarity

* docs : add LFM2 to models section (#14650)

* readme : add LFM2 to models section

* fix copy paste...

* tests : cover lfm2 cases in test_ssm_conv (#14651)

* cmake : Add CMake presets for Linux and GCC (#14656)

* metal : Add missing unary ops Metal support (#14660)

* ggml : add build-time message to remind about ggml_set_rows (#14661)

ggml-ci

* cuda : add ELU support (#14657)

* cuda : add set rows for bf16 (#14664)

* quantize : fix minor logic flaw in --tensor-type (#14572)

* llama : add jinja template for rwkv-world (#14665)

* llama : add jinja template for rwkv-world

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* sycl: Batched mulmat rework for oneDNN dispatch (#14617)

* SY…

Successfully merging this pull request may close these issues.

Feature Request: Hunyuan-A13B model support