Windows


Setup llama.cpp servers for Windows

Show the llama-vscode menu (Ctrl+Shift+M) and select "Install/upgrade llama.cpp" (if not done yet). After that, add/select the models you want to use.

The instructions below are kept for reference; it is now easier to simply add a model from the menu and select it.

Code completion server

Used for
- code completion

LLM type
- FIM (fill in the middle)

Instructions

Install llama.cpp

`winget install llama.cpp`
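
To verify the installation, you can print the version and build info:

`llama-server --version`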

OR

Download the release files for Windows from the llama.cpp releases page (https://github.com/ggml-org/llama.cpp/releases). For CPU, use `llama-*-bin-win-cpu-*.zip`. For Nvidia GPUs, use `llama-*-bin-win-cuda-*-x64.zip`, and if you don't have CUDA drivers installed, also `cudart-llama-bin-win-cuda-*-x64.zip`.
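
For example, assuming you downloaded a CUDA build (C:\llama below is just an example destination folder), you could extract the archives with PowerShell. Both zips must end up in the same folder so llama-server.exe can find the CUDA runtime DLLs:

`Expand-Archive llama-*-bin-win-cuda-*-x64.zip -DestinationPath C:\llama`

`Expand-Archive cudart-llama-bin-win-cuda-*-x64.zip -DestinationPath C:\llama`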

Run llama.cpp server

No GPUs

`llama-server.exe --fim-qwen-1.5b-default --port 8012`  

With GPUs

`llama-server.exe --fim-qwen-1.5b-default --port 8012 -ngl 99`  

The `-ngl 99` flag offloads all model layers to the GPU. If you've installed llama.cpp with winget, you can omit the `.exe` suffix and use just `llama-server` in the commands.

Now you can start using the llama-vscode extension for code completion.
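
To check that the server is up, you can query its health endpoint (curl.exe ships with Windows 10 and later; the port matches the `--port` value used above):

`curl.exe http://localhost:8012/health`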

More details about the llama.cpp server can be found in its documentation.

Chat server

Used for
- Chat with AI
- Chat with AI with project context
- Edit with AI
- Generate commit message

LLM type
- Chat Models

Instructions

Same as for the code completion server, but use a chat model and slightly different parameters.

CPU-only:

`llama-server.exe -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011`

With Nvidia GPUs and CUDA drivers installed:

- More than 16 GB VRAM:

`llama-server.exe -hf ggml-org/Qwen2.5-Coder-7B-Instruct-Q8_0-GGUF --port 8011 -np 2 -ngl 99`

- Less than 16 GB VRAM:

`llama-server.exe -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF --port 8011 -np 2 -ngl 99`

- Less than 8 GB VRAM:

`llama-server.exe -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011 -np 2 -ngl 99`
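
Once the chat server is running, you can sanity-check it through the OpenAI-compatible API exposed by llama-server; the JSON payload below is just a minimal example:

`curl.exe http://localhost:8011/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}"`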

Embeddings server

Used for
- Chat with AI with project context

LLM type
- Embedding

Instructions
Same as for the code completion server, but use an embeddings model and slightly different parameters.

`llama-server.exe -hf nomic-ai/nomic-embed-text-v2-moe-GGUF:Q8_0 --port 8010 -ub 2048 -b 2048 --ctx-size 2048 --embeddings`
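
To confirm the embeddings endpoint responds, you can send a test request (the input string is arbitrary):

`curl.exe http://localhost:8010/v1/embeddings -H "Content-Type: application/json" -d "{\"input\":\"hello world\"}"`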