Labels: bug (Something isn't working), nvidia (Issues relating to Nvidia GPUs and CUDA)
Description
What is the issue?
When using the generate endpoint, it intermittently misses the last token right before the "done" message:
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.463348938Z","response":" Bear","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.475993178Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.488651949Z","response":" Elephant","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.50131158Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.51400078Z","response":" Gor","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.539481043Z","response":"","done":true,"done_reason":"stop","total_duration":8790953777,"load_duration":8080650494,"
In the example above, the token that should complete "Gorilla" is never emitted before the done response, so we are left with just "Gor".
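A client reassembles the generated text by concatenating every response field until done is true, which makes the truncation easy to see. A minimal sketch, assuming the stream is saved to a file (hypothetical name stream.ndjson) and jq is installed:

# -j prints raw output with no separators, mirroring how a client joins the tokens
jq -j '.response' stream.ndjson
# for the excerpt above this prints " Bear, Elephant, Gor"; "illa" never arrives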
Here's the curl command to reproduce this:
curl -X POST http://127.0.0.1:11434/api/generate \
  -H 'Host: 127.0.0.1:11434' \
  -H 'Content-Type: application/json' \
  -H 'Connection: Keep-Alive' \
  -H 'Accept-Language: en-GB,*' \
  -H 'User-Agent: Mozilla/5.0' \
  --compressed \
  -d '{"model": "adrienbrault/nous-hermes2theta-llama3-8b:q8_0", "prompt": "\n<|im_start|>user\nYou will think of a number. Then you will list that many animals. Do not write any other words only the animal. Be terse in your response.<|im_end|>\n<|im_start|>assistant", "raw": true, "stream": true, "keep_alive": -1, "options": {"seed": 99, "num_predict": 1024, "num_ctx": 4096, "stop": ["<end>", "user:", "assistant:"], "num_batch": 1, "temperature": 0.5, "top_k": 40, "top_p": 0.9}}'
I have only seen this with one model so far (adrienbrault/nous-hermes2theta-llama3-8b:q8_0), so the model may well be a factor. However, I don't get this problem with the chat endpoint for that model; I only get it with the generate endpoint. I'm using raw mode and stream=true.
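For reference, the equivalent chat-endpoint request looks roughly like this; /api/chat applies the model's chat template itself, so raw mode and the <|im_start|> markers aren't needed (a sketch based on the documented messages format, not a verbatim command):

curl -X POST http://127.0.0.1:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"model": "adrienbrault/nous-hermes2theta-llama3-8b:q8_0", "messages": [{"role": "user", "content": "You will think of a number. Then you will list that many animals. Do not write any other words only the animal. Be terse in your response."}], "stream": true, "keep_alive": -1, "options": {"seed": 99, "num_predict": 1024, "num_ctx": 4096, "stop": ["<end>", "user:", "assistant:"], "num_batch": 1, "temperature": 0.5, "top_k": 40, "top_p": 0.9}}'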
OS
Linux
GPU
Nvidia
CPU
AMD
Ollama version
0.3.9