
Setup llama.cpp servers for Mac

Show the llama-vscode menu (Ctrl+Shift+M) and select "Install/upgrade llama.cpp" (if not yet done). After that, add/select the models you want to use.

The instructions below are kept for reference, but there is now an easier way: add a model from the menu and select it.

Prerequisites - Homebrew
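
If Homebrew is not installed yet, it can be installed with the official command from brew.sh:

`/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`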

Code completion server

Used for
- code completion

LLM type
- FIM (fill in the middle)

Instructions

  1. Install llama.cpp with the command:
`brew install llama.cpp`  
  2. Download the LLM model and run the llama.cpp server (combined in one command):
  • If you have more than 16GB VRAM:
`llama-server -hf ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0 --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256`  
  • If you have less than 16GB VRAM:
`llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF:Q8_0 --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256`  

If the model file is not available locally (first run), it will be downloaded (this can take some time) and the llama.cpp server will then be started.
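
To confirm the server is running, you can query it from a terminal. The first command below uses llama-server's /health endpoint; the second is a minimal sketch of a FIM request against the /infill endpoint (the input_prefix/input_suffix field names follow the llama.cpp server API, and the port matches the --port value above):

`curl http://127.0.0.1:8012/health`  
`curl http://127.0.0.1:8012/infill -H "Content-Type: application/json" -d '{"input_prefix": "def add(a, b):\n    return ", "input_suffix": "\n"}'`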

Chat server

Used for
- Chat with AI
- Chat with AI with project context
- Edit with AI
- Generate commit message

LLM type
- Chat Models

Instructions
Same as for the code completion server, but use a chat model and slightly different parameters.

CPU-only:

`llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011 -np 2`  

With Nvidia GPUs and CUDA drivers installed:

  • more than 16GB VRAM
`llama-server -hf ggml-org/Qwen2.5-Coder-7B-Instruct-Q8_0-GGUF --port 8011 -np 2`  
  • less than 16GB VRAM
`llama-server -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF --port 8011 -np 2`  
  • less than 8GB VRAM
`llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011 -np 2` 
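
Once the chat server is running, it can be sanity-checked through its OpenAI-compatible endpoint. A minimal sketch (the port matches the --port value above; the model field can be omitted since the server has a single model loaded):

`curl http://127.0.0.1:8011/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Write hello world in Python"}]}'`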

Embeddings server

Used for
- Chat with AI with project context

LLM type
- Embedding

Instructions
Same as for the code completion server, but use an embeddings model and slightly different parameters.

`llama-server -hf ggml-org/Nomic-Embed-Text-V2-GGUF --port 8010 -ub 2048 -b 2048 --ctx-size 2048 --embeddings`  
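
To verify the embeddings server, you can call its OpenAI-compatible endpoint (a minimal sketch; this requires the --embeddings flag from the command above):

`curl http://127.0.0.1:8010/v1/embeddings -H "Content-Type: application/json" -d '{"input": "Hello world"}'`

The response should contain a JSON object with an embedding vector for the input text.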