Releases: ollama/ollama
v0.11.8
What's Changed
- gpt-oss now has flash attention enabled by default for systems that support it
- Improved load times for gpt-oss
Full Changelog: v0.11.7...v0.11.8
v0.11.7
DeepSeek-V3.1
DeepSeek-V3.1 is now available to run via Ollama.
This model supports hybrid thinking, meaning thinking can be enabled or disabled by setting think in Ollama's API:
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-v3.1",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "think": true
}'
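The same request can also be built from a script. The following is a minimal Python sketch using only the standard library. It assumes a local Ollama server on the default port shown in the curl command; the helper names build_chat_request and post_chat are illustrative and not part of any Ollama library, and the "stream": false field is added here so that the server returns a single JSON object.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl example


def build_chat_request(prompt: str, think: bool, model: str = "deepseek-v3.1") -> dict:
    """Build the JSON body for /api/chat with thinking switched on or off."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
        "stream": False,  # ask for one JSON response instead of a stream
    }


def post_chat(payload: dict, url: str = OLLAMA_URL) -> dict:
    """Send the payload to a running Ollama server and return the decoded reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the payload needs no server; only post_chat() touches the network.
payload = build_chat_request("why is the sky blue?", think=True)
print(json.dumps(payload, indent=2))
```

With a server running, post_chat(payload) sends the request; passing think=False to build_chat_request disables thinking for that request.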
In Ollama's CLI, thinking can be enabled or disabled by running the /set think or /set nothink commands.
Turbo (in preview)
DeepSeek-V3.1 has over 671B parameters, and so a large amount of VRAM is required to run it. Ollama's Turbo mode (in preview) provides access to powerful hardware in the cloud you can use to run the model.
Turbo via Ollama's app
- Download Ollama for macOS or Windows
- Select deepseek-v3.1:671b from the model selector
- Enable Turbo
Turbo via Ollama's CLI and libraries
- Create an account on ollama.com/signup
- Follow the docs for Ollama's CLI to authenticate your Ollama installation
- Run the following:
OLLAMA_HOST=ollama.com ollama run deepseek-v3.1
For instructions on using Turbo with Ollama's Python and JavaScript libraries, see the docs.
What's Changed
- Fixed issue where multiple models would not be loaded on CPU-only systems
- Ollama will now work with models that skip outputting the initial <think> tag (e.g. DeepSeek-V3.1)
- Fixed issue where text would be emitted when there is no opening <think> tag from a model
- Fixed issue where tool calls containing { or } would not be parsed correctly
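To illustrate the missing opening <think> tag case, here is a hypothetical Python sketch of a tolerant splitter. It is not Ollama's parser; it only shows one way to treat the opening tag as optional, under the assumption that the model is known to start its output in thinking mode. The function name split_thinking is made up for this example.

```python
from typing import Tuple

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_thinking(output: str) -> Tuple[str, str]:
    """Split raw model output into (thinking, answer).

    The opening tag is optional: if only the closing tag is present,
    everything before it is treated as thinking.
    """
    before, sep, after = output.partition(CLOSE_TAG)
    if not sep:
        # No closing tag at all: treat the whole output as the answer.
        return "", output.strip()
    thinking = before.strip()
    if thinking.startswith(OPEN_TAG):
        thinking = thinking[len(OPEN_TAG):]
    return thinking.strip(), after.strip()


# With and without the opening tag, the split is the same.
print(split_thinking("<think>Rayleigh scattering.</think>Because of scattering."))
print(split_thinking("Rayleigh scattering.</think>Because of scattering."))
```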
New Contributors
- @zoupingshi made their first contribution in #12028
Full Changelog: v0.11.6...v0.11.7
v0.11.6
What's Changed
- Ollama's app will now switch between chats faster
- Improved layout of messages in Ollama's app
- Fixed issue where a command prompt window would show when Ollama's app detected an old version of Ollama running
- Improved performance when using flash attention
- Fixed boundary case when encoding text using BPE
Full Changelog: v0.11.5...v0.11.6
v0.11.5
What's Changed
- Performance improvements for the gpt-oss models
- New memory management: this release of Ollama includes improved memory management for scheduling models on GPUs, leading to better VRAM utilization, better model performance and fewer out-of-memory errors. These new memory estimations can be enabled with OLLAMA_NEW_ESTIMATES=1 ollama serve and will soon be enabled by default.
- Improved multi-GPU scheduling and reduced VRAM allocation when using more than 2 GPUs
- Ollama's new app will now remember the selections for default model, Turbo and Web Search between restarts
- Fix error when parsing bad harmony tool calls
- OLLAMA_FLASH_ATTENTION=1 will also enable flash attention for pure-CPU models
- Fixed OpenAI-compatible API not supporting reasoning_effort
- Reduced size of installation on Windows and Linux
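As an illustration of the reasoning_effort fix, the sketch below builds a request body for the OpenAI-compatible API in Python. It is a hedged example: the URL assumes a local server on the default port and the OpenAI-style /v1/chat/completions path, the helper name build_completion_request is hypothetical, and the accepted values (low, medium, high) are taken from the gpt-oss release notes.

```python
import json

# Assumed endpoint: local Ollama server, OpenAI-style chat completions path.
OPENAI_COMPAT_URL = "http://localhost:11434/v1/chat/completions"
VALID_EFFORTS = ("low", "medium", "high")


def build_completion_request(prompt: str, effort: str, model: str = "gpt-oss:20b") -> dict:
    """Build an OpenAI-style chat completion body with a reasoning effort level."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }


# Printing the body requires no server; POST it to OPENAI_COMPAT_URL to use it.
body = build_completion_request("Summarize flash attention in one line.", "low")
print(json.dumps(body, indent=2))
```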
New Contributors
- @vorburger made their first contribution in #11755
- @dan-and made their first contribution in #10678
- @youzichuan made their first contribution in #11880
Full Changelog: v0.11.4...v0.11.5
v0.11.4
v0.11.3
What's Changed
- Fixed issue where gpt-oss would consume too much VRAM when split across GPU & CPU or multiple GPUs
- Statically link C++ libraries on Windows for better compatibility
Full Changelog: v0.11.2...v0.11.3
v0.11.2
What's Changed
- Fix crash in gpt-oss when using KV cache quantization
- Fix gpt-oss bug with "currentDate" not defined
Full Changelog: v0.11.1...v0.11.2
v0.11.0
Welcome OpenAI's gpt-oss models
Ollama partners with OpenAI to bring its latest state-of-the-art open weight models to Ollama. The two models, 20B and 120B, bring a whole new local chat experience, and are designed for powerful reasoning, agentic tasks, and versatile developer use cases.
Feature highlights
- Agentic capabilities: Use the models’ native capabilities for function calling, web browsing (Ollama is providing a built-in web search that can be optionally enabled to augment the model with the latest information), Python tool calls, and structured outputs.
- Full chain-of-thought: Gain complete access to the model's reasoning process, facilitating easier debugging and increased trust in outputs.
- Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
- Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
Quantization - MXFP4 format
OpenAI utilizes quantization to reduce the memory footprint of the gpt-oss models. The models are post-trained with quantization of the mixture-of-experts (MoE) weights to MXFP4 format, where the weights are quantized to 4.25 bits per parameter. The MoE weights are responsible for 90+% of the total parameter count, and quantizing these to MXFP4 enables the smaller model to run on systems with as little as 16GB memory, and the larger model to fit on a single 80GB GPU.
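The 4.25 bits per parameter figure can be reproduced with a short calculation. The Python sketch below assumes the MXFP4 block layout as described in the OCP Microscaling formats specification: blocks of 32 four-bit elements that share one 8-bit scale. The 18 billion weight count is only an illustrative figure (90% of a nominal 20B parameters), not an exact number for either model, and the function names are hypothetical.

```python
def mxfp4_bits_per_weight(block_size: int = 32, element_bits: int = 4, scale_bits: int = 8) -> float:
    """Average storage per weight when each block shares a single scale."""
    return (block_size * element_bits + scale_bits) / block_size


def weights_gb(num_weights: int, bits_per_weight: float) -> float:
    """Storage in gigabytes (10^9 bytes) for the given number of weights."""
    return num_weights * bits_per_weight / 8 / 1e9


bits = mxfp4_bits_per_weight()
print(bits)  # 4.25

# Illustrative only: 90% of a nominal 20B parameters held in MoE weights.
moe_weights = 18_000_000_000
print(f"MXFP4:  {weights_gb(moe_weights, bits):.2f} GB")
print(f"16-bit: {weights_gb(moe_weights, 16):.2f} GB")
```

The shared scale adds 8/32 = 0.25 bits on top of the 4 bits per element, which is where the extra quarter bit comes from.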
Ollama supports the MXFP4 format natively, without additional quantizations or conversions. New kernels were developed for Ollama’s new engine to support the MXFP4 format.
Ollama collaborated with OpenAI to benchmark against their reference implementations to ensure Ollama’s implementations have the same quality.
Get started
You can get started by downloading the latest Ollama version (v0.11).
The models can be downloaded directly in Ollama’s new app or via the terminal:
ollama run gpt-oss:20b
ollama run gpt-oss:120b
What's Changed
- kvcache: Enable SWA to retain additional entries by @jessegross in #11611
- kvcache: Log contents of cache when unable to find a slot by @jessegross in #11658
Full Changelog: v0.10.1...v0.11.0
v0.10.1
What's Changed
- Fixed unicode character input for Japanese and other languages in Ollama's new app
- Fixed AMD download URL in the logs for ollama serve
New Contributors
- @skools-here made their first contribution in #11579
Full Changelog: v0.10.0...v0.10.1
v0.10.0
Ollama's new app
Ollama's new app is available for macOS and Windows: Download Ollama
What's Changed
- ollama ps will now show the context length of loaded models
- Improved performance in gemma3n models by 2-3x
- Parallel request processing now defaults to 1. For more details, see the FAQ
- Fixed issue where tool calling would not work correctly with granite3.3 and mistral-nemo models
- Fixed issue where Ollama's tool calling would not work correctly if a tool's name was part of another one, such as add and get_address
- Improved performance when using multiple GPUs by 10-30%
- Ollama's OpenAI-compatible API will now support WebP images
- Fixed issue where ollama show would report an error
- ollama run will more gracefully display errors
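The tool-name fix above concerns names that contain one another, such as add inside get_address. The Python sketch below illustrates that class of bug in general terms; it is not Ollama's code, and both lookup functions are hypothetical. A lookup that searches for tool names as substrings of the model output picks the wrong tool, while comparing the parsed name exactly does not.

```python
import json
from typing import List, Optional


def find_tool_by_substring(output: str, tool_names: List[str]) -> Optional[str]:
    """Naive lookup: return the first tool whose name appears anywhere in the output."""
    for name in tool_names:
        if name in output:
            return name
    return None


def find_tool_exact(called_name: str, tool_names: List[str]) -> Optional[str]:
    """Safer lookup: the parsed name must equal a registered tool name."""
    return called_name if called_name in tool_names else None


tools = ["add", "get_address"]
output = '{"name": "get_address", "arguments": {"user": "alice"}}'

# "add" is a substring of "get_address", so the naive lookup returns the wrong tool.
print(find_tool_by_substring(output, tools))

# Parsing the call first and matching the whole name selects the intended tool.
print(find_tool_exact(json.loads(output)["name"], tools))
```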
New Contributors
- @sncix made their first contribution in #11189
- @mfornet made their first contribution in #11425
- @haiyuewa made their first contribution in #11427
- @warting made their first contribution in #11461
- @ycomiti made their first contribution in #11462
- @minxinyi made their first contribution in #11502
- @ruyut made their first contribution in #11528
Full Changelog: v0.9.6...v0.10.0