Windows
Show the llama-vscode menu (Ctrl+Shift+M) and select "Install/upgrade llama.cpp" (if you haven't already). After that, add or select the models you want to use.
The instructions below are kept for reference, but it is now easier to simply add a model from the menu and select it.
Used for
- Code completion
LLM type
- FIM (fill in the middle)
Instructions
`winget install llama.cpp`
OR
Download the Windows release files for llama.cpp from releases. For CPU-only use, download `llama-*-bin-win-cpu-*.zip`. For Nvidia GPUs, download `llama-*-bin-win-cuda-*-x64.zip`, and if you don't have CUDA installed, also `cudart-llama-bin-win-cuda-*-x64.zip`.
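For example, assuming you downloaded a CPU build (the file name below is hypothetical; use the actual asset name from the releases page), you could extract it and verify the binary from PowerShell:
`Expand-Archive llama-b1234-bin-win-cpu-x64.zip -DestinationPath C:\llama.cpp`
`C:\llama.cpp\llama-server.exe --version`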
No GPUs
`llama-server.exe --fim-qwen-1.5b-default --port 8012`
With GPUs
`llama-server.exe --fim-qwen-1.5b-default --port 8012 -ngl 99`
If you've installed llama.cpp with winget, you can omit the .exe suffix and use just llama-server in the commands.
Now you can start using the llama-vscode extension for code completion.
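To sanity-check that the server is running before using the extension, you can query the health endpoint that llama-server exposes (adjust the port if you changed it):
`curl http://127.0.0.1:8012/health`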
More details about the llama.cpp server
Used for
- Chat with AI
- Chat with AI with project context
- Edit with AI
- Generate commit message
LLM type
- Chat Models
Instructions
Same as the code completion server, but use a chat model and slightly different parameters.
CPU-only:
`llama-server.exe -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011`
With Nvidia GPUs and CUDA drivers installed
- more than 16GB VRAM
`llama-server.exe -hf ggml-org/Qwen2.5-Coder-7B-Instruct-Q8_0-GGUF --port 8011 -np 2 -ngl 99`
- less than 16GB VRAM
`llama-server.exe -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF --port 8011 -np 2 -ngl 99`
- less than 8GB VRAM
`llama-server.exe -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011 -np 2 -ngl 99`
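As a quick smoke test for any of the chat server variants above, you can call the OpenAI-compatible chat endpoint that llama-server exposes (a minimal sketch; the escaped quoting is for the Windows command line):
`curl http://127.0.0.1:8011/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}"`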
Used for
- Chat with AI with project context
LLM type
- Embedding
Instructions
Same as the code completion server, but use an embeddings model and slightly different parameters.
`llama-server.exe -hf nomic-ai/nomic-embed-text-v2-moe-GGUF:Q8_0 --port 8010 -ub 2048 -b 2048 --ctx-size 2048 --embeddings`
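To verify that embeddings are being served (the --embeddings flag enables this endpoint), a minimal check against the OpenAI-compatible API:
`curl http://127.0.0.1:8010/v1/embeddings -H "Content-Type: application/json" -d "{\"input\":\"hello world\"}"`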