Windows


Setup llama.cpp servers for Windows

Show the llama-vscode menu (Ctrl+Shift+M) and select "Install/upgrade llama.cpp" (if not done yet). After that, add/select the models you want to use.

The instructions below are kept for reference; it is now easier to simply add a model from the menu and select it.

Code completion server

Used for
- code completion

LLM type
- FIM (fill in the middle)

Instructions

Install llama.cpp

`winget install llama.cpp`
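
To verify the installation, you can print the version and build info:

`llama-server --version`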

OR

Download the release files for Windows from the llama.cpp releases page (https://github.com/ggml-org/llama.cpp/releases). For CPU, use `llama-*-bin-win-cpu-*.zip`. For Nvidia GPUs, use `llama-*-bin-win-cuda-*-x64.zip`, and if you don't have CUDA drivers installed, also `cudart-llama-bin-win-cuda-*-x64.zip`.
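
For example, assuming you downloaded a CUDA build (C:\llama below is just an example destination folder), you could extract the archives with PowerShell. Both zips must end up in the same folder so llama-server.exe can find the CUDA runtime DLLs:

`Expand-Archive llama-*-bin-win-cuda-*-x64.zip -DestinationPath C:\llama`

`Expand-Archive cudart-llama-bin-win-cuda-*-x64.zip -DestinationPath C:\llama`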

Run llama.cpp server

No GPUs

`llama-server.exe --fim-qwen-1.5b-default --port 8012`  

With GPUs

`llama-server.exe --fim-qwen-1.5b-default --port 8012 -ngl 99`  

The `-ngl 99` flag offloads all model layers to the GPU. If you've installed llama.cpp with winget, you can omit the `.exe` suffix and use just `llama-server` in the commands.

Now you can start using the llama-vscode extension for code completion.
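
To check that the server is up, you can query its health endpoint (curl.exe ships with Windows 10 and later; the port matches the `--port` value used above):

`curl.exe http://localhost:8012/health`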

More details about the llama.cpp server can be found in its documentation.

Chat server

Used for
- Chat with AI
- Chat with AI with project context
- Edit with AI
- Generate commit message

LLM type
- Chat Models

Instructions

Same as for the code completion server, but use a chat model and slightly different parameters.

CPU-only:

`llama-server.exe -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011`

With Nvidia GPUs and CUDA drivers installed:

- More than 16 GB VRAM:

`llama-server.exe -hf ggml-org/Qwen2.5-Coder-7B-Instruct-Q8_0-GGUF --port 8011 -np 2 -ngl 99`

- Less than 16 GB VRAM:

`llama-server.exe -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF --port 8011 -np 2 -ngl 99`

- Less than 8 GB VRAM:

`llama-server.exe -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011 -np 2 -ngl 99`
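
Once the chat server is running, you can sanity-check it through the OpenAI-compatible API exposed by llama-server; the JSON payload below is just a minimal example:

`curl.exe http://localhost:8011/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}"`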

Embeddings server

Used for
- Chat with AI with project context

LLM type
- Embedding

Instructions
Same as for the code completion server, but use an embeddings model and slightly different parameters.

`llama-server.exe -hf nomic-ai/nomic-embed-text-v2-moe-GGUF:Q8_0 --port 8010 -ub 2048 -b 2048 --ctx-size 2048 --embeddings`
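
To confirm the embeddings endpoint responds, you can send a test request (the input string is arbitrary):

`curl.exe http://localhost:8010/v1/embeddings -H "Content-Type: application/json" -d "{\"input\":\"hello world\"}"`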