Generate endpoint intermittently misses final token before done #6707

@tarbard

Description

What is the issue?

When using the generate endpoint, it intermittently misses the last token right before the "done" message:

{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.463348938Z","response":" Bear","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.475993178Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.488651949Z","response":" Elephant","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.50131158Z","response":",","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.51400078Z","response":" Gor","done":false}
{"model":"adrienbrault/nous-hermes2theta-llama3-8b:q8_0","created_at":"2024-09-09T08:04:47.539481043Z","response":"","done":true,"done_reason":"stop","total_duration":8790953777,"load_duration":8080650494,"

In the above example, the token that should complete "Gorilla" is not emitted before the done response, so we just get "Gor".
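
If it helps, here is a minimal Python sketch of how I'm checking for the truncation programmatically. It assumes a local Ollama instance on the default port and uses the requests library; the payload mirrors the curl command below. It just accumulates the streamed tokens and prints the final text so a cut-off last word is easy to spot:

import json
import requests

# Same payload as the curl reproduction below (local Ollama on the default port assumed).
payload = {
    "model": "adrienbrault/nous-hermes2theta-llama3-8b:q8_0",
    "prompt": "\n<|im_start|>user\nYou will think of a number. Then you will list that many animals. "
              "Do not write any other words only the animal. Be terse in your response.<|im_end|>\n<|im_start|>assistant",
    "raw": True,
    "stream": True,
    "keep_alive": -1,
    "options": {"seed": 99, "num_predict": 1024, "num_ctx": 4096,
                "stop": ["<end>", "user:", "assistant:"],
                "num_batch": 1, "temperature": 0.5, "top_k": 40, "top_p": 0.9},
}

with requests.post("http://127.0.0.1:11434/api/generate", json=payload, stream=True) as r:
    text = ""
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each streamed line is a JSON object; "response" carries the next token.
        text += chunk.get("response", "")
        if chunk.get("done"):
            break

# When the bug hits, the accumulated text ends mid-word, e.g. "... Gor" instead of "... Gorilla".
print(repr(text))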

Here's the curl command to reproduce this:

curl -H 'Host: 127.0.0.1:11434' -H 'Content-Type: application/json' -H 'Connection: Keep-Alive' --compressed \
  -H 'Accept-Language: en-GB,*' -H 'User-Agent: Mozilla/5.0' \
  -X POST http://127.0.0.1:11434/api/generate \
  -d '{"model": "adrienbrault/nous-hermes2theta-llama3-8b:q8_0", "prompt": "\n<|im_start|>user\nYou will think of a number. Then you will list that many animals. Do not write any other words only the animal. Be terse in your response.<|im_end|>\n<|im_start|>assistant", "raw": true, "stream": true, "keep_alive": -1, "options": {"seed": 99, "num_predict": 1024, "num_ctx": 4096, "stop": ["<end>", "user:", "assistant:"], "num_batch": 1, "temperature": 0.5, "top_k": 40, "top_p": 0.9}}'

I have only seen this with one model so far (adrienbrault/nous-hermes2theta-llama3-8b:q8_0), so the model may well be a factor. However, I don't get this problem with the chat endpoint for that model; it only happens with the generate endpoint. I'm using raw mode and stream=true. For comparison, a rough chat-endpoint equivalent is sketched below.
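
This is the chat-endpoint comparison I've been running. The messages payload is my own translation of the raw prompt above (so treat the exact framing as an approximation), but the model and sampling options are the same:

import json
import requests

# Rough /api/chat equivalent of the generate reproduction above,
# with the raw prompt rewritten as a plain user message (my approximation).
payload = {
    "model": "adrienbrault/nous-hermes2theta-llama3-8b:q8_0",
    "messages": [{
        "role": "user",
        "content": "You will think of a number. Then you will list that many animals. "
                   "Do not write any other words only the animal. Be terse in your response.",
    }],
    "stream": True,
    "keep_alive": -1,
    "options": {"seed": 99, "num_predict": 1024, "num_ctx": 4096,
                "stop": ["<end>", "user:", "assistant:"],
                "num_batch": 1, "temperature": 0.5, "top_k": 40, "top_p": 0.9},
}

with requests.post("http://127.0.0.1:11434/api/chat", json=payload, stream=True) as r:
    text = ""
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Chat responses stream the token inside message.content rather than "response".
        text += chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break

print(repr(text))  # in my runs the last animal name arrives intact here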

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.3.9

Labels

bug (Something isn't working), nvidia (Issues relating to Nvidia GPUs and CUDA)
