Support for dots.llm1 models #573
Conversation
Tested a bit in the CLI; it seems to work. Command: Output: "The meaning of life is to find your gift. The purpose of life is to give it away." — Willam James. This is as much as I had patience for: warmup seems not to actually load in all the experts, so tokens trickle in very slowly. I'm not sure if that is the norm for the CLI on MoE models (it isn't an issue for me with Deepseek models on server or sweep-bench). I also noticed it is wrongly labeled, as it says |
I am testing using UD-Q4_K_XL, and it is working. I noticed an issue: if I leave the system prompt empty, sometimes the response becomes unrelated to my question. With a system prompt, it is fine. Do you also see this? I have the same issue when I run it from mainline. |
Thanks.
If it exists in mainline, then maybe it is a problem with the model? I haven't seen it, but I haven't tested the model further than my comment above. |
I also see that the response pauses for a few seconds whenever it generates a comma, which more than halves the generation speed. If I prompt it to avoid outputting commas in the response, I don't see any pauses. Mainline does not have this issue because it does not output commas in the response. Screenshot of the quant that I use: the BOS token is ",", which should be changed to -1 according to this post: |
Interesting, you are using
There are other reports of this:
- 2 users here who narrow it down to certain quants of some Qwen-based models:
- 2 users here who identify it happening with commas and causing performance issues:
- The first sighting on GitHub I know about:
I'm not sure what the root cause is, but I wouldn't investigate it with this model; I think the smallest model it is reported on is |
That fix is only for the incorrect BOS token (not for the commas causing pausing, right?), which to me seems like an issue with existing models caused by the convert script, which is where the fix should happen (with workarounds like [this](https://huggingface.co/gghfez/dots.llm1.inst-GGUF/discussions/1) for existing models). Both the |
Without the fix, the model uses the comma as the BOS token, which causes the pause, at least for the quant I'm using. See the screenshot I posted: id 11 is the comma. After I set it to null, the comma is not used as the BOS token. |
Well, the comma still causes a pause (I'm assuming) even if you avoid encountering it as the BOS token by changing the BOS token setting. I've seen the screenshot you posted, and I also see the wrong BOS token ( Using |
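To make the disagreement above concrete, here is a hypothetical sketch (not the actual llama.cpp code; the function name and parameters are illustrative) of how an effective BOS token might be resolved from GGUF metadata. A quant that stores token id 11 (",") as BOS will prepend a comma to every prompt unless the id is cleared (set to null/-1) or `add_bos_token` is disabled:

```python
def effective_bos(bos_token_id, add_bos_token=True):
    """Return the BOS id to prepend, or None when no BOS should be added.

    bos_token_id:  id from tokenizer.ggml.bos_token_id, or None/-1 when unset.
    add_bos_token: the tokenizer.ggml.add_bos_token flag.
    """
    if not add_bos_token:
        return None
    if bos_token_id is None or bos_token_id < 0:
        return None  # model genuinely has no BOS token (e.g. dots.llm1)
    return bos_token_id

# The broken quant discussed above: BOS id 11 (",") gets prepended.
assert effective_bos(11) == 11
# After the fix (id set to -1/null), nothing is prepended.
assert effective_bos(-1) is None
assert effective_bos(None) is None
```

Note this only removes the comma from the prompt prefix; if the pause is tied to generating the comma token itself, fixing the BOS id alone would not address it, which is the point made above.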
@saood06 What are your plans with this PR? You are disagreeing with the |
… handle models without BOS
Sorry, I kept pushing off testing this more, but I just pushed a commit with both of the recommended changes. I tested all four
I still think the better solution would have been for the convert script to set it to
I also changed the warmup behavior to work with this model (a MoE without a BOS token). It is still the same hacky solution, but now it does account for models without a BOS token, and it warmed up properly for me now (not sure why it wasn't with BOS set to [token id 11/
Edit: Also handled the merge conflicts. |
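The warmup change described above can be sketched as follows. This is a hypothetical illustration (names and the fallback choice are assumptions, not the real API): build the warmup batch from whichever special tokens actually exist, so a MoE model without a BOS token still gets tokens pushed through it during warmup instead of an empty batch:

```python
def warmup_tokens(bos_id, eos_id):
    """Pick tokens for a warmup decode; ids of None or -1 mean 'not present'."""
    tokens = [t for t in (bos_id, eos_id) if t is not None and t >= 0]
    if not tokens:
        tokens = [0]  # neither special token exists: fall back to some valid id
    return tokens

assert warmup_tokens(1, 2) == [1, 2]
assert warmup_tokens(-1, 2) == [2]   # no BOS (the dots.llm1 case): use EOS
assert warmup_tokens(-1, -1) == [0]  # neither: still warm up with something
```

For a MoE model, pushing at least one token through the graph during warmup is what touches the expert weights; with a BOS-only warmup and no BOS token, nothing would be loaded, which matches the slow token trickle reported earlier in the thread.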
Port of ggml-org/llama.cpp#14118
It compiles. Testers welcome.
Edit: Tested myself a tiny bit (more testers still welcome), see comment below.
Hugging Face links to the models: instruct, base