
Conversation

saood06
Collaborator

@saood06 saood06 commented Jul 3, 2025

Port of ggml-org/llama.cpp#14118

It compiles. Testers welcome.

Edit: I tested it myself a tiny bit (more testers still welcome); see the comment below.

Hugging Face links to the models: instruct, base

@saood06
Collaborator Author

saood06 commented Jul 3, 2025

Tested a bit in the CLI; it seems to work.

Command:
./bin/llama-cli -m /mnt/sda/dots_inst/Dots_Inst-128x8.7B-BF16.gguf -s 12345 -p "The meaning of life" -t 48 --numa distribute -n 32 -c 8192 -fa

Output:

The meaning of life is to find your gift. The purpose of life is to give it away. — Willam James

This is as much as I had patience for. Warmup does not seem to actually load in all the experts, so tokens trickle in very slowly; I'm not sure if that is the norm for the CLI with MoE models (I know this isn't an issue for me with Deepseek models on server or sweep-bench).

I also noticed the model is labeled incorrectly: it says model ftype = IQ1_S - 1.5625 bpw even though it is BF16. I found the issue: when I updated constants.py for LlamaFileType I used ggml.h instead of llama.h (I only now realized that both have ftype info and that they differ, not sure why).
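For reference, here is a rough sketch (assuming llama.cpp's gguf-py package and its LlamaFileType enum; the model path is a placeholder) of how to check which ftype a GGUF actually carries:

```python
# Rough sketch: read general.file_type from a GGUF and map it through the
# LlamaFileType enum from gguf-py's constants.py (path below is a placeholder).
from gguf import GGUFReader
from gguf.constants import LlamaFileType

reader = GGUFReader("/path/to/model.gguf")
field = reader.get_field("general.file_type")
ftype = int(field.parts[-1][0])  # scalar value stored in the file
print(ftype, LlamaFileType(ftype).name)  # a BF16 model should report MOSTLY_BF16
```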

@firecoperana
Collaborator

I am testing with UD-Q4_K_XL, and it is working. I noticed an issue: if I leave the system prompt empty, the response sometimes becomes unrelated to my question. With a system prompt, it is fine. Do you also see this? I have the same issue when I run it on mainline.

@saood06
Collaborator Author

saood06 commented Jul 3, 2025

I am testing with UD-Q4_K_XL, and it is working.

Thanks.

I noticed an issue: if I leave the system prompt empty, the response sometimes becomes unrelated to my question. With a system prompt, it is fine. Do you also see this? I have the same issue when I run it on mainline.

If it exists in mainline too, then maybe it is a problem with the model itself? I haven't seen it, but I haven't tested the model beyond my comment above.

@saood06 saood06 requested a review from ikawrakow July 3, 2025 04:59
@firecoperana
Collaborator

I also see that the response pauses for a few seconds whenever it generates a comma, which more than halves the generation speed. If I prompt it to avoid outputting commas in the response, I don't see any pause. Mainline does not have this issue because it does not output commas in the response.

Screenshot of the quant that I use:
[screenshot: GGUF metadata showing BOS token = 11 ',']

BOS token is ",", which should be changed to -1 according to this post:
https://huggingface.co/gghfez/dots.llm1.inst-GGUF/discussions/1
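For anyone without the screenshot, a rough sketch (using llama.cpp's gguf-py package; the path is a placeholder) of how to check which token a GGUF stores as BOS:

```python
# Rough sketch: print the stored BOS token id and the token text it maps to
# (path is a placeholder; affected quants print: 11 ',').
from gguf import GGUFReader

reader = GGUFReader("/path/to/dots.gguf")
bos_id = int(reader.get_field("tokenizer.ggml.bos_token_id").parts[-1][0])
tokens = reader.get_field("tokenizer.ggml.tokens")
bos_text = bytes(tokens.parts[tokens.data[bos_id]]).decode("utf-8")
print(bos_id, repr(bos_text))
```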

@saood06
Collaborator Author

saood06 commented Jul 3, 2025

I also see that the response pauses for a few seconds whenever it generates a comma, which more than halves the generation speed. If I prompt it to avoid outputting commas in the response, I don't see any pause. Mainline does not have this issue because it does not output commas in the response.

Interesting that you are using Q4_K_XL: there are a lot of reports of certain quants of some Qwen-based models (and this is one of those) pausing whenever they encounter a comma.

2 users here who narrow it down to certain quants of some Qwen based models:
#464 (comment)
#464 (comment)

2 users here who identify it happening with commas, and causing performance issues:
#476 (comment)
#476 (comment) (this one even shows the effect on video)

The first sighting on this GitHub that I know about:
#380 (comment)

I'm not sure what the root cause is, but I wouldn't investigate it with this model; I think the smallest model it is reported on is Qwen3-30B-A3B-128K-UD-Q4_K_XL.

@firecoperana
Collaborator

The following fix works for me:
[screenshot: llama.cpp change setting this architecture's default BOS token to null]
Not sure if there is a better way.

@saood06
Collaborator Author

saood06 commented Jul 4, 2025

Not sure if there is a better way.

That fix is only for the incorrect BOS token (not the commas causing pausing, right?). To me this looks like an issue with existing models caused by the convert script, which is where the fix should happen (with workarounds like [this](https://huggingface.co/gghfez/dots.llm1.inst-GGUF/discussions/1) for existing models).

Both config.json and tokenizer_config.json set it to null, which makes the loader fall back to the default, but that doesn't seem to be correct for this model at least.
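Something along these lines is what I mean (a minimal sketch of the idea, not the actual convert script code):

```python
# Minimal sketch of the proposed convert-time handling (not the actual
# convert script): if the HF configs explicitly set the BOS token to null,
# record "no BOS token" (-1) instead of leaving the key unset and letting
# the loader fall back to a per-architecture default.
import json
from pathlib import Path
from typing import Optional

def resolve_bos_token_id(model_dir: str) -> Optional[int]:
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    tok_cfg = json.loads((Path(model_dir) / "tokenizer_config.json").read_text())
    if cfg.get("bos_token_id") is None and tok_cfg.get("bos_token") is None:
        return -1  # explicit "no BOS token"
    return cfg.get("bos_token_id")
```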

@firecoperana
Collaborator

Without the fix, the model uses the comma as the BOS token, which causes the pause, at least for the quant I'm using. See the screenshot I posted: id 11 is the comma. After I set it to null, the comma is not used as the BOS token.

@saood06
Collaborator Author

saood06 commented Jul 4, 2025

Without the fix, the model uses the comma as the BOS token, which causes the pause, at least for the quant I'm using. See the screenshot I posted: id 11 is the comma. After I set it to null, the comma is not used as the BOS token.

Well, the comma still causes a pause (I'm assuming) whenever it is generated; setting the BOS token only avoids encountering it as the BOS token.

I've seen the screenshot you posted, and I also see the wrong BOS token (BOS token = 11 ',') in my own GGUF, which I converted from safetensors to BF16 as part of the testing here.

Using --override-kv tokenizer.ggml.bos_token_id=int:-1, as in the post you linked above, fixes it for affected models, but for future models not to be affected I think the convert script needs to set it explicitly, rather than changing the default as the llama.cpp change you suggested does.

@ikawrakow
Owner

@saood06 What are your plans with this PR? Are you disagreeing with the V reshaping comment, or is it about the BOS token, or perhaps both?

@saood06
Collaborator Author

saood06 commented Jul 9, 2025

@saood06 What are your plans with this PR?

Sorry, I kept pushing off further testing, but I just pushed a commit with both of the recommended changes.

I tested all four combinations of -fa and -fmoe and it works (without the V cur changes, non-FA was outputting garbage).

Are you disagreeing [...] about the BOS token

I still think the better solution would have been for the convert script to set it to -1 when config.json has it set to null, instead of leaving it unset and changing the default for this architecture. But given that every GGUF I saw on Hugging Face has this issue, changing the default makes sense, so that users don't have to set --override-kv tokenizer.ggml.bos_token_id=int:-1 (assuming they even know to do that) or use some other workaround for existing GGUFs.

I also changed the warmup behavior to work with this model (a MoE without a BOS token). It is still the same hacky solution, but it now accounts for models without a BOS token, and warmup worked properly for me this time (not sure why it didn't with BOS set to token id 11, ',').

Edit: Also handled the merge conflicts.
