Releases: LostRuins/koboldcpp
koboldcpp-1.98.1
Kokobold edition
- NEW: TTS.cpp model support has been integrated into KoboldCpp, providing access to new Text-To-Speech models. The TTS.cpp project (repo here) was developed by @mmwillet, and a modified version has now been added into KoboldCpp, bringing support for 3 new Text-To-Speech models: Kokoro, Parler and Dia.
  - Of the above models, Kokoro is the most recommended for general use.
  - Uses the GGML library in KoboldCpp, although the new ops are CPU only, so Kokoro provides the best speed taking size into consideration. You can expect speeds of 2x realtime for Kokoro (fastest), 0.5x realtime for Parler, and 0.1x realtime for Dia (slowest).
  - To use, simply download the GGUF model and load it in the 'Audio' tab as a TTS model (a minimal request sketch follows these release notes). Note: WavTokenizer is not required for these models. Please use the `no_espeak` versions; KoboldCpp has custom IPA mappings for English and espeak is not supported.
  - KoboldAI Lite provides automatic mapping for the speaker voices. If you wish to use a custom voice for Kokoro, the supported voices are `af_alloy`, `af_aoede`, `af_bella`, `af_heart`, `af_jessica`, `af_kore`, `af_nicole`, `af_nova`, `af_river`, `af_sarah`, `af_sky`, `am_adam`, `am_echo`, `am_eric`, `am_fenrir`, `am_liam`, `am_michael`, `am_onyx`, `am_puck`, `am_santa`, `bf_alice`, `bf_emma`, `bf_isabella`, `bf_lily`, `bm_daniel`, `bm_fable`, `bm_george`, `bm_lewis`. Only English speech is properly supported.
- Thanks to @wbruna, image generation has been updated and received multiple improvements:
  - Added separate flash attention and conv2d toggles for image generation: `--sdflashattention` and `--sdconvdirect`
  - Added the ability to use q8 for Image Generation model quantization, in addition to the existing q4. `--sdquant` now accepts a parameter [0/1/2] that specifies the quantization level, similar to `--quantkv`
- Added `--overridenativecontext` flag which allows you to easily override the expected trained context of a model when determining automatic RoPE scaling. If you didn't get that, you don't need this feature.
- Seed-OSS support is merged, including instruct templates for thinking and non-thinking modes.
- Further improvements to tool calling and audio transcription handling
- Fixed Stable Diffusion 3.5 loading issue
- Embedding models now default to the lower of the current model max context and trained context. Should help with Qwen3 embedding models. This can be adjusted with the `--embeddingsmaxctx` override.
- Improved the server identifier header for better compatibility with some libraries
- Termux `android_install.sh` script can now launch existing downloaded models
- Minor chat adapter fixes, including Kimi.
- Added alias for `--tensorsplit`
- Benchmark CSV formatting fix.
- Updated Kobold Lite, multiple fixes and improvements
- Scenario picker can now load any adventure or chat scenario in Instruct mode.
- Slightly increased default amount to generate.
- Improved file saving behavior, try to remember previously used filename.
- Improved KaTeX rendering to handle additional cases
- Improved streaming UI for code block streaming at the start of any turn.
- Added setting to embed generated TTS audio into the context as part of the AI's turn.
- Minor formatting fixes
- Added Vision 👁️ and Auditory 🦻 support indicators for inline multimodal media content.
- Added Seed-OSS instruct templates. Note that Thinking regex must be set manually for this model by changing the think tag.
- Overhauled the narration and media adding system, allowing TTS audio to be manually added with `Add File`.
- Merged new model support, fixes and improvements from upstream
Hotfix 1.98.1 - Fixed Kokoro for better accuracy and quality, added 4096 as a `--blasbatchsize` option, fixed Windows 7 functionality, fixed flash attention issues, synced some new updates from upstream.
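As mentioned in the TTS notes above, here is a minimal sketch of requesting speech from a running instance with a Kokoro model loaded in the 'Audio' tab. The OpenAI-compatible `/v1/audio/speech` route and the exact field names are assumptions based on KoboldCpp's OpenAI API emulation; verify them against `--help` or the wiki for your version.

```python
# Minimal sketch: synthesize speech from a running KoboldCpp instance with a
# Kokoro TTS model loaded. The /v1/audio/speech route and field names are
# assumptions based on the OpenAI API emulation; verify via --help or the wiki.
import json
import urllib.request

payload = {
    "model": "kokoro",               # served TTS model name (assumed flexible)
    "input": "Hello from KoboldCpp!",
    "voice": "af_heart",             # any supported Kokoro voice from the list above
}
req = urllib.request.Request(
    "http://localhost:5001/v1/audio/speech",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    audio = resp.read()              # response body is the raw audio data
with open("out.wav", "wb") as f:
    f.write(audio)
```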
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
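For a programmatic connection, here is a minimal sketch against the native KoboldAI generate API at the address above. The `/api/v1/generate` route and the `results[0].text` response shape follow the standard KoboldCpp API; adjust host and port to your launch settings.

```python
# Minimal sketch of the native KoboldAI generate API exposed by KoboldCpp.
import json
import urllib.request

payload = {"prompt": "Once upon a time,", "max_length": 80}
req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["results"][0]["text"])    # the generated continuation
```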
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.97.4
- Merged support for GLM4.5 family of models
- Merged support for GPT-OSS models (note that this model performs poorly if OpenAI instruct templates are not obeyed. To use it in raw story mode, append `<|start|>assistant<|channel|>final<|message|>` to memory)
- Merged support for Voxtral (Voxtral Small 24B is better than Voxtral Mini 3B, but both are not great. See ggml-org#14862 (comment))
- Added `/ping` stub endpoint to permit usage on Runpod serverless (a quick health-check sketch follows these notes).
- Allow MoE layers to be easily kept on CPU with the `--moecpu (layercount)` flag. Using this flag without a number will keep all MoE layers on CPU.
- Clearer indication of support for each multimodal modality (Vision/Audio)
- Increased max length of terminal prints allowed in debugmode.
- Do not attempt context shifting for any mrope models.
- Adjusted some adapter instruct templates, tweaked mistral template.
- Handle empty objects returned by tool calls, also remove misinterpretation of the tool calls instruct tag within ChatML autoguess.
- Allow multiple tool calls to be chained, and allow them to be triggered by any role.
- WebSearch: fixed URL parameter parsing
- Increased regex stack size limit for MSVC builds (fix for mistral models).
- Updated Kobold Lite, multiple fixes and improvements
- Added 2 more save slots
- Added a (+/-) modifier field for Adventure mode rolls
- Fixed deleting wrong image if multiple selected images are identical.
- Button to insert textDB separator
- Improved mid-streaming rendering
- Slightly lowered default rep pen
- Simplified Mistral template, added GPT-OSS Harmony template
- Merged new model support, fixes and improvements from upstream
Hotfix 1.97.1 - More template fixes, now shows generated token's ID in debugmode terminal log, fixed flux loading speed regression, Vulkan BSOD fixed.
Hotfix 1.97.2 - Fix CLBlast regression, limit vulkan bsod fix to nvidia only, updated lite, merged upstream fixes.
Hotfix 1.97.3 - Fix a regression with GPT-OSS that resulted in incoherence
Hotfix 1.97.4 - Fixed OldPC CUDA builds when flash attention was not used. This broke after 1.95 and is now fixed.
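As referenced in the notes above, a quick sketch of how the new `/ping` stub can back a serverless readiness probe; the default host and port are assumptions.

```python
# Readiness-probe sketch for the new /ping stub (e.g. on Runpod serverless).
import urllib.request

with urllib.request.urlopen("http://localhost:5001/ping", timeout=5) as resp:
    print(resp.status)  # 200 indicates the server is up and responding
```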
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.96.2
- NEW: Now supports audio inputs for models (in addition to existing vision inputs). Specifically, support for Qwen 2.5 Omni 3B has been added (the 3B is better than the 7B which cannot understand music).
- Use it similar to existing vision models - you download the base model and then the mmproj and load both.
- You can then launch KoboldCpp, and upload your images/audio in the KoboldAI Lite UI, and ask the AI questions about them.
- Multiple images and audio files can be used together, though be aware that you will need a high context especially for large audio files.
- The 3B seems to perform better than the 7B. The 7B hallucinates on music very hard.
- Added miniaudio: `.wav`, `.mp3` and `.flac` files are now supported on all audio endpoints (Whisper transcribe and multimodal audio)
- Fixes for gemma3n incoherence, should be working out of the box now.
- Fixes to allow the new Jamba 1.7 models to work. Note that context shift and fast forwarding cannot be used on Jamba.
- Allow automatically resuming incomplete model downloads if aria2c is used.
- Prints some system information to the terminal on startup to aid future debugging
- Added emulation for the OpenAI `/v1/images/generations` endpoint for image generation (see the sketch after these notes)
- Fixed noscript image generation
- Apply nsigma masking (thanks @Reithan)
- Allow flash attention to be used with image generation (thanks @wbruna)
- Backwards compatibility for the `json_schema` field improved.
- Ensured that `finish_reason` is always sent last with no additional text content on the same chunk.
- Important Change: Default context size is now 8k (up from 4k) to better represent modern models. This may affect your memory usage. Existing kcpps configs are unaffected.
- Important Change: The flag `--usecublas` has been renamed to `--usecuda`. Backwards compatibility for the old flag name is retained, but you're recommended to change to the new name.
- Added new AutoGuess templates for Kimi K2, Jamba and Dots. Hunyuan A13B template is not included as the ideal template cannot be determined.
- Improved formatting of multimodal chunk handling
- Fixes for remotetunnel not starting on some linux systems.
- Updated Kobold Lite, multiple fixes and improvements
- Aesthetic UI has been completely refactored and slightly simplified for easier management. Most functionality should be unchanged.
- Allow connecting to OpenAI endpoints without a key.
- Added more experimental flags to control audio compression, autoguess tags and unsaved file warnings.
- Allow uploading audio files and embedding them into your saved stories, lamejs mp3 encoder added.
- Allow audio capture from microphone to embed into story
- Added shortcut for inserting instructions into memory
- Allow disabling default stop sequences.
- Breaking Change: Attached image and audio data is no longer stored inline in the story, but instead as metadata in the savefile
- Save files from past versions are 100% forwards compatible, but any new media files in future saves are only partially backwards compatible - all media saved in future versions will not be accessible when re-opened in past versions of the UI.
- This is required to handle the large size of audio files. All old savefiles will upgrade perfectly fine, but you can't add new media and then access it back in old versions again.
- Fixed a few html parsing bugs.
- Merged new model support, fixes and improvements from upstream
Hotfix 1.96.1 - Fixed a few UI issues, fixed loading large multipart models, adjusted autoguess templates by @kallewoof, merged exaone 4 support
Hotfix 1.96.2 - Splits a batch into smaller batches when processing if it fails, updated lite with a few minor fixes, increase max img2img size
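As referenced in the notes above, a minimal sketch for the emulated OpenAI images endpoint. This requires an image model loaded in KoboldCpp; the payload follows the standard OpenAI shape, and exactly which optional fields the emulation honors is an assumption, so consult the wiki.

```python
# Sketch: generate an image via the emulated OpenAI /v1/images/generations
# endpoint. Field support beyond "prompt" is an assumption.
import base64
import json
import urllib.request

payload = {"prompt": "a watercolor fox in a forest", "n": 1, "size": "512x512"}
req = urllib.request.Request(
    "http://localhost:5001/v1/images/generations",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
with open("fox.png", "wb") as f:
    f.write(base64.b64decode(body["data"][0]["b64_json"]))  # OpenAI-style response
```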
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.95.1
- NEW: Added support for Flux Kontext: This is a powerful image editing model based on Flux that can edit images using natural language. Easily replace backgrounds, edit text, or add extra items into your images. You can download a ready-to-use kcppt template here, simply load it into KoboldCpp and all necessary model files will be downloaded on launch. Then open StableUI at http://localhost:5001/sdui, add your prompt, reference images and generate. Thanks to @stduhpf for the sd.cpp implementation!
- Photomaker now supports uploading multiple reference images, same as Kontext. Up to 4 reference images are accepted.
- Merged upstream support and added AutoGuess template for Gemma3n (text only) and ERNIE.
- Further grammar sampling speedups from caching by @Reithan
- Fixed a bug when combining save states with draft models.
- Fixed an issue where prompt processing encountered errors after the KV refactor
- Fixed support for python 3.13 (thanks @tsite)
- Updated Kobold Lite, multiple fixes and improvements
- Fixed Push-to-Talk on mobile, added Toggle to Talk (voice input) option.
- Improved some error handling for aborted streaming
- Fixed some linebreaks in corpo chat mode
- Fixed a bug in thinking regex
- Merged new model support, fixes and improvements from upstream
Hotfix 1.95.1 - Fixed error when using swa together with flash attention.
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.94.2
are we comfy yet?
- NEW: Added unpacked mini-launcher: Now when unpacking KoboldCpp to a directory, a 5MB mini pyinstaller launcher is also generated in that same directory, that allows you to easily start an unpacked KoboldCpp without needing to install python or other dependencies. You can copy the unpacked directory and use it anywhere (thanks @henk717)
- NEW: Chroma Image Generation Support: Merged support for the Chroma model, a new architecture based on Flux Schnell (thanks @stduhpf)
- NEW: Added PhotoMaker Face Cloning: Use `--sdphotomaker` to load PhotoMaker along with any SDXL based model. Then open KoboldCpp SDUI and upload any reference image in the PhotoMaker input to clone the face! Works in all modes (inpaint/img2img/text2img).
- Swapping .gguf models in admin mode now allows overriding the config with a different one as well (both are customizable).
- Improved GBNF grammar performance by attempting culled grammar search first (thanks @Reithan)
- Allow changing the main GPU with `--maingpu` when loading multi-gpu setups. The main GPU uses more VRAM and has a larger performance impact. By default it is the first GPU.
- Added configurable soft resolution limits and VAE tiling limits (thanks @wbruna), also fixed VAE tiling artifacts.
- Added `--sdclampedsoft` which provides "soft" total resolution clamping instead (e.g. 640 would allow 640x640, 512x768 and 768x512 images); it can be combined with `--sdclamped` which provides hard clamping (no dimension can exceed it). A small illustration follows these notes.
- Added `--sdtiledvae` which replaces `--sdnotile`: allows specifying a size beyond which VAE tiling is applied.
- Use `--embeddingsmaxctx` to limit the max context length for embedding models (if you run out of memory, this will help)
- Added `--embeddingsgpu` to allow offloading embeddings model layers to GPU. This is NOT recommended as it doesn't provide much speedup, since embedding models already use the GPU for processing even without dedicated offload.
- Display available RAM on startup, display version number in terminal window title
- ComfyUI emulation now covers the `/upload/image` endpoint which allows Img2Img ComfyUI workflows. Files are stored temporarily in memory only.
- Added more performance stats for token speeds and timings.
- Updated Kobold Lite, multiple fixes and improvements
- Fixed Chub.ai importer again
- Added card importer for char-archive.evulid.cc
- Added option to import image from webcam
- Allow markdown when streaming current turn
- Improved CSS import sanitizer (thanks @PeterPeet)
- Word Frequency Search (inspired by @trincadev's MyGhostWriter)
- Allow usermods and CSS to be loaded from file.
- Added WebSearch for corpo mode
- Added Img2Img support for ComfyUI backends
- Added ability to use custom OpenAI endpoint for TextDB embedding model
- Minor linting and splitter/merge tool by @ehoogeveen-medweb
- Fixed lookahead scanning for Author's note insertion point
- Merged new model support, fixes and improvements from upstream
Hotfix 1.94.1 - Minor bugfixes, fixed ollama compatible vision, added avx/avx2 detection for backend auto-selection, cleaned up oldpc builds to only include oldpc files.
Hotfix 1.94.2 - Fixed an issue with swa models when context is full, try to fix a vulkan oom regression
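To make the soft vs hard clamping distinction above concrete, here is a small illustration of the two checks; this is not KoboldCpp's actual code, just the arithmetic the notes describe.

```python
# --sdclampedsoft caps the total pixel area (e.g. 640 allows 640x640, 512x768,
# 768x512), while --sdclamped caps each dimension independently.

def passes_soft_clamp(width: int, height: int, limit: int) -> bool:
    """Soft clamp: total area must not exceed limit*limit pixels."""
    return width * height <= limit * limit

def passes_hard_clamp(width: int, height: int, limit: int) -> bool:
    """Hard clamp: neither dimension may exceed the limit."""
    return width <= limit and height <= limit

assert passes_soft_clamp(512, 768, 640)      # 393216 <= 409600 pixels
assert not passes_hard_clamp(512, 768, 640)  # 768 exceeds the 640 cap
```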
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Deprecation Reminder: Binary filenames have been renamed. The files named `koboldcpp_cu12.exe`, `koboldcpp_oldcpu.exe`, `koboldcpp_nocuda.exe`, `koboldcpp-linux-x64-cuda1210`, and `koboldcpp-linux-x64-cuda1150` have been removed. Please switch to the new filenames.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.93.2
those left behind
- NEW: Added Windows Shell integration. You can now associate `.gguf` files to open automatically in KoboldCpp (e.g. double clicking a gguf). If another kcpp instance is already running locally on the same port, it will be replaced. The default handler can be installed/uninstalled from the 'Extras' tab (thanks @henk717)
  - This is handled by the `/api/extra/shutdown` API, which can only be triggered from localhost (a call sketch follows these notes).
  - Will not affect instances started without the `--singleinstance` flag. All this is automatic when you launch via Windows shell integration.
- NEW: Added an option to simply unload a model from the admin API, the server will free the memory but continue to run. You can then switch to a different model via the admin panel in Lite.
- NEW: Added Save and Load States (sessions). This allows you to take a Savestate Snapshot of the current context, and then reload it again later at any time. Available over the admin API, you can trigger it from the admin panel in Lite.
- Works similarly to 'session files' in llama.cpp, but the snapshot states are stored entirely in memory.
- Used correctly, it can allow you to swap between multiple different sessions/chats without any reprocessing at all.
- There are 3 available slots to use (total 4 including the current session).
- Fixed a regression with flash attention not working for some GPUs in the previous version.
- Added a text LoRA scale option. Removed text LoRA base as it was no longer used in modern ggufs. If provided it will be silently ignored.
- Function/Tool calling can now use higher temperatures (up to 1.0)
- Added more Ollama compatibility endpoints.
- Fixed a few clip skip issues in image generation.
- Added an adapter flag `add_sd_step_limit` to limit max image generation step counts.
- Fixed crash on thread count 0.
- Matched a few common OpenAI TTS voice IDs
- Fixed a ctx bug with embeddings (still does not work with qwen3 embed, but should work with most others)
- KoboldCpp Colab now uses KoboldCpp's internal downloader instead of downloading the models first externally.
- Updated Kobold Lite, multiple fixes and improvements
- Added support for embeddings models into KoboldAI Lite's TextDB (thanks @esolithe)
- Added support for saving and loading world info files independently (thanks @esolithe)
- NEW: Added new "Smart" Image Autogeneration mode. This allows the AI to decide when it should generate images, and create image prompt automatically.
- Added a new scenario: Replaced defunct aetherroom.club with prompts.forthisfeel.club
- Added support for importing cards from character-tavern.com
- Improved Tavern World Info support
- Added support for welcome messages in corpo mode.
- Fixed copy to clipboard not working for some browsers.
- Interactive Storywriter scenario fix: now no longer overwrites your regex settings. However, hiding input text is now off by default.
- Added a toggle to make a usermod permanent. Use with caution.
- Markdown fixes, also prevent your username from being overwritten when changing chat scenario.
- Merged fixes and improvements from upstream
Hotfix 1.93.1 - Fixed a crash due to outdated VC runtime dlls, fixed a bad adapter, added base64 embeddings support, added webcam upload support for KoboldAI Lite Add Image, fixed chubai importer, added more options for idle response trigger times.
Hotfix 1.93.2 - Reverted back to VS2019 + CUDA 12.1 for the Windows build to solve reports of crashes. Fixed issues with the embeddings endpoint. Added `--embeddingsmaxctx` option.
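As referenced in the shell integration notes above, a minimal sketch of invoking the localhost-only shutdown API that replaces an already-running instance. The route name comes from the notes; the POST method and empty body are assumptions.

```python
# Sketch: ask a locally running KoboldCpp instance to shut down, as the
# Windows shell integration does before starting a replacement instance.
import urllib.request

req = urllib.request.Request(
    "http://localhost:5001/api/extra/shutdown", data=b"", method="POST"
)
try:
    urllib.request.urlopen(req, timeout=5)
except OSError:
    pass  # nothing was listening on this port, or the instance already exited
```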
Important Breaking Changes (File Naming Change Notice):
- For improved clarity and ease of use, many binaries are being RENAMED.
- Please observe the new name changes for your automated scripts to avoid disruption:
- Linux:
  - `koboldcpp-linux-x64-cuda1210` is now `koboldcpp-linux-x64` (Cuda12, AVX2, Newer PCs)
  - `koboldcpp-linux-x64-cuda1150` is now `koboldcpp-linux-x64-oldpc` (Cuda11, AVX1, Older PCs)
  - `koboldcpp-linux-x64-nocuda` is still `koboldcpp-linux-x64-nocuda` (No CUDA)
- Windows:
  - `koboldcpp_cu12.exe` is now `koboldcpp.exe` (Cuda12, AVX2, Newer PCs)
  - `koboldcpp_oldcpu.exe` is now `koboldcpp-oldpc.exe` (Cuda11, AVX1, Older PCs)
  - `koboldcpp_nocuda.exe` is now `koboldcpp-nocuda.exe` (No CUDA)
- If you are using our official URLs or docker images, this should be handled automatically, but ensure your docker image is up-to-date.
- If you are using platforms that do not support the main build, you can continue using the `oldpc` builds, which remain on Cuda11 and AVX1 and will continue to be maintained. The Cuda12+ version on the main build may be subject to change in future.
- For now, both filenames are uploaded to avoid breaking existing scripts. The old filenames will be removed soon, so please update.
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Deprecation Warning: The files named `koboldcpp_cu12.exe`, `koboldcpp_oldcpu.exe`, `koboldcpp_nocuda.exe`, `koboldcpp-linux-x64-cuda1210`, and `koboldcpp-linux-x64-cuda1150` will be removed very soon. Please switch to the new filenames.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.92.1
early bug is for the birds edition
- Added support for SWA mode which uses much less memory for the KV cache; use `--useswa` to enable.
  - Note: SWA mode is not compatible with ContextShifting, and may result in degraded output when used with FastForwarding.
- Fixed an off-by-one error in some cases when Fast Forwarding that resulted in degraded output.
- Greatly improved tool calling by enforcing grammar on the output field names, and doing the automatic tool selection as a separate pass. Tool calling should be much more reliable now.
- Added model size information in the Huggingface search and download menu
- CLI terminal output is now truncated in the middle of very long strings instead of at the end.
- Fixed unicode path handling for Image Generation models.
- Enabled threadpools, this should result in a speedup for Qwen3MoE.
- Merged Vision support for Llama4 models, simplified some vision preprocessing code.
- Fixes for prompt formatting for GLM4 models. GLM4 batch processing on Vulkan is fixed (thanks @0cc4m).
- Fixed incorrect AutoGuess adapter for some Mistral models. Also fixed some KoboldCppAutomatic placeholder tag replacements.
- AI Horde advertised context now matches the main max context by default. This can be changed.
- Disable `--showgui` if `--skiplauncher` is used
- StableUI now increments clip_skip and seed correctly when generating multiple images in a batch (thanks @wbruna)
- clip_skip is now stored inside image metadata, and random seed's actual number is also indicated.
- Added DDIM sampler for image generation.
- Added a simple optional python reqs install script in `launch.cmd` for launching when run from unpacked directories.
- Updated Kobold Lite, multiple fixes and improvements
- Integrated dPaste.org (open source pastebin) as a platform for quickly sharing Save Files. You can also use a self-hosted instance by changing the endpoint URL. You can now share stories as a single URL with `Save/Load > Share > Export Share as Web URL`
- Added an option to allow Horizontal Stacking of multiple images in one row.
- Fixed importing of Chub.AI character cards as they changed their endpoint.
- Added support for RisuAI V3 character cards (.charx archive format), also fixed KAISTORY handling.
- SSE streaming is now the default for all cases. It can be disabled in Advanced Settings.
- Changed markdown renderer to render markdown separately for each instruct turn.
- Better passthrough for KoboldCppAutomatic instruct preset, especially with split tags.
- Added an option to use TTS from Pollinations API, which routes through OpenAI TTS models. Note that this TTS service has a server-side censorship via a content filter that I cannot control.
- Lite now sends stop sequences in OpenAI Chat Completions Endpoint mode (up to 4)
- Added ST based randomizer macros like `{{roll:3d6}}` (thanks @hu-yijie)
- Added new Immortal sampler preset by Jeb Carter
- In polled streaming mode, you can fetch last generated text if the request fails halfway.
- Added an exit button when editing raw text in corpo mode.
- Re-enabled a debug option for using raw placeholder tags on request. Not recommended.
- Added a debug option that allows changing the connected API at runtime.
- Merged fixes and improvements from upstream
Hotfix 1.92.1 - Fixes for a GLM4 vulkan bug, allow extra EOG tokens to trigger a stop.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3 etc) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
kcpp_tools_rolling
This release contains the latest KoboldCpp tools used to convert and quantize models. Alternatively, you can also use the tools released by the llama.cpp project; they should be cross-compatible. The binaries here will be periodically updated.
koboldcpp-1.91
Entering search mode edition
- NEW: Huggingface Model Search Tool - Grabbing a model has never been easier! KoboldCpp now comes with a HF model browser so you can search and find the GGUF models you like directly from huggingface. Simply search for and select the model, and it will be downloaded before launch.
- Embedded aria2c downloader for windows builds - this provides extremely fast downloads and is automatically used when downloading models via provided URLs.
- Added CUDA target for compute capability 3.5. This may allow KoboldCpp to be used with K6000, GTX 780, K80. I have received some success stories - if you do try, share your experiences on the discussions page!
- Reduced CUDA binary sizes by switching most cuda cc targets to virtual, thanks to a good suggestion from Johannes at ggml-org#13135
- Improved ComfyUI emulation, can now adapt to any kind of workflow so long as there is a KSampler node connected to a text prompt somewhere in it.
- Fixed GLM-4 prompt handling even for quants with incorrect BOS set.
- Added support for Classifier-Free Guidance (CFG) since I wanted to mess with it. At long last I have finally added CFG, but I don't really like it - results are not great. Anyway, if you wish to use it, simply check `Enable Guidance` or use `--enableguidance`, then set a negative prompt and CFG scale from the lite tokens menu. Note that guidance doubles KV usage and halves generation speed. Overall, it was a disappointing addition and not really worth the effort.
- StableUI now clears the queue when cancelling a generation
- Further fixes for Zenity/YAD in multilingual environments
- Removed flash attention limits and warnings for Vulkan
- Updated Kobold Lite, multiple fixes and improvements
- Important Change: KoboldCppAuto is now the default instruct preset. This will let the KoboldCpp backend automatically choose the correct instruct tags to use at runtime, based on the model loaded. This is done transparently in the backend and not visible to the user. If it doesn't work properly, you can always still switch to your preferred instruct format (e.g. Alpaca).
- NEW: Corpo mode now supports Text mode and Adventure mode as well, making it usable in all 4 modes.
- Added quick save and delete buttons for corpo mode.
- Added Pollinations.ai as an option for TTS and Image Gen (optional online service)
- Instruct placeholders are now always used (but you can change what they map to, including themselves)
- Added confirmation box for loading from slots
- Improved think tag handling and output formatting.
- Added a new scenario: Nemesis
- Chat match any name is no longer on by default
- Fixed autoscroll jumping on edit in corpo mode
- Fix char spec v2 embedded WI import by @Cohee1207
- Merged fixes and improvements from upstream
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.90.2
Qwen of the line edition
- NEW: Android Termux Auto-Installer - You can now setup KoboldCpp via Termux on Android with a single command, which triggers an automated installation script. Check it out here. Install Termux from F-Droid, then run the command with internet accessible, and everything will be setup, downloaded, compiled and configured for instant use with a Gemma3-1B model.
- Merged support for Qwen3. Now also triggers `--nobostoken` automatically if a model's metadata explicitly indicates no_bos_token; it can still be enabled manually for other models.
- Fixes for THUDM GLM-4; note that this model enforces `--blasbatchsize 16` or smaller in order to get coherent output.
- Merged overhaul to the Qwen2.5vl projector. Both old (HimariO version) and new (ngxson version) mmprojs should work, retaining backwards compatibility. However, you should update to the new projectors.
- Merged functioning Pixtral support. Note that Pixtral is very token heavy, about 4000 tokens for a 1024px image; you can try increasing max `--contextsize` or lowering `--visionmaxres`.
- Added support for OpenAI Structured Outputs in the chat completions API, which also accepts the schema when sent as a stringified JSON object in the "grammar" field. You can use this to enforce JSON outputs with a specific schema (see the sketch after these notes).
- `--blasbatchsize -1` now exclusively uses a batch size of 1 when processing the prompt. Also permitted `--blasbatchsize 16`, which replicates the old behavior (a batch of 16 does not trigger GEMM).
- KCPP API server now correctly handles explicitly set nulled fields.
- Fixed Zenity/YAD detection not working correctly in the previous version.
- Improved input sanitization when launching and passing a url as a model param. Also, for better security, `--onready` shell commands can still be used as a CLI parameter, but cannot be embedded into a .kcppt or .kcpps file.
- More robust checks for system glslc when building vulkan shaders.
- Improved auto gpu layers when loading multi-part GGUF models (on 1 gpu), also slightly tightened memory estimation, and accounts for quantized KV when guessing layers.
- Added new flag `--mmprojcpu` that allows you to load and run the projector on CPU while keeping the main model on GPU.
- noscript mode randomizes generated image names to prevent browser caching.
- Updated Kobold Lite, multiple fixes and improvements
- Increased default tokens generated and slider limits (can be overridden)
- ChatGLM-4 and Qwen3 (chatml think/nothinking) presets added. You can disable thinking in Qwen3 by swapping between ChatML (No Thinking) and normal ChatML.
- Added toggle to disable LaTeX while leaving markdown enabled
- Merged fixes and improvements from upstream
- Hotfix 1.90.1:
- Reworked thinking tags handling. ChatML (No thinking) is removed, instead, thinking can be forced or prevented for all instruct formats (Settings > Tokens > CoT).
- More GLM4 fixes, now works fine with larger batches on CUDA, on vulkan glm4 ubatch size is still limited to 16.
- Some chat completions parsing fixes.
- Updated Lite with a new scenario
- Hotfix 1.90.2:
- Pulled further upstream updates. Massive file size increase caused by ggml-org#13199, I can't do anything about it. Don't ask me.
- NEW: Added a huggingface model search tool! Now you can find, browse and download models straight from huggingface.
- Increased `--defaultgenamount` range
- Tried to fix the YAD GUI launcher
- Added rudimentary websocket spoof for ComfyUI, increased comfyui compatibility.
- Fixed a few parsing issues for nulled chat completions params
- Automatically handle multipart file downloading, up to 9 parts.
- Fixed rope config not saving correctly to kcpps sometimes
- Merged fixes for Plamo models, thanks to @CISC
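As referenced in the Structured Outputs note above, a minimal sketch against the chat completions endpoint. The `response_format`/`json_schema` shape follows the standard OpenAI API; the exact field coverage in KoboldCpp's implementation is an assumption (the notes state it also accepts the schema as a stringified JSON object in the "grammar" field).

```python
# Sketch: enforce a JSON schema on chat completions output via the new
# Structured Outputs support, using the OpenAI-style response_format field.
import json
import urllib.request

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
payload = {
    "model": "koboldcpp",  # model name is typically ignored by the backend
    "messages": [{"role": "user", "content": "Give me a character as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "character", "schema": schema},
    },
}
req = urllib.request.Request(
    "http://localhost:5001/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])  # schema-conforming JSON text
```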
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.