
Conversation


@ochafik ochafik commented Feb 25, 2025

Description

This adds a provider for llama.cpp's llama-server, leveraging its recently introduced grammar-constrained, universal tool call support.

Structural changes:

  • Refactored system.ts:
    • Optionally return tool definitions to pass to ApiHandler.createMessage (if ModelInfo.supportsTools)
    • Add provider-specific formatting of tool call examples (implemented with llama.cpp API calls, cached); see the sketch below
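
For illustration, a minimal sketch of the shape this refactoring suggests. Only ApiHandler, createMessage, ModelInfo and supportsTools come from the actual change; ToolDefinition and the remaining field names are hypothetical:

interface ToolDefinition {
    name: string
    description: string
    inputSchema: Record<string, unknown> // JSON schema for the tool's parameters
}

interface ModelInfo {
    maxTokens?: number
    supportsTools?: boolean // gates whether tool definitions are passed along
}

interface ApiHandler {
    getModel(): { id: string; info: ModelInfo }
    // tools is only populated when ModelInfo.supportsTools is true
    createMessage(systemPrompt: string, messages: unknown[], tools?: ToolDefinition[]): AsyncIterable<unknown>
}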

Limitations:

  • No streaming yet, so you may want to launch llama-server with the --verbose flag to see what's happening (especially when the KV cache isn't hot yet on the first call)
    • llama-server doesn't support tool calls in streaming mode yet
    • I'm translating the tool_calls output back to the XML-ish format Cline expects, so this would also need a revamp to support streaming
  • Calls llama-server's /apply-template endpoint with the freshly introduced add_generation_prompt param to compute tool-call deltas. This is used in the system prompt to express the few-shot tool-use examples in the call syntax native to the model, which helps it tremendously (especially when small); see the sketch after this list.
  • Needs an async call to find out the model's max tokens, which makes the code slightly awkward (getModel may return default values before the props are fetched; we may want to make getModel async as a follow-up?)
  • Qwen 2.5 Coder support needs ggml-org/llama.cpp#12034 ("tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars"; under review, not merged yet)
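
A rough sketch of the /apply-template dance mentioned in the list above. The request/response shapes here are my assumptions from this description, not verified API docs:

// Renders `messages` through the model's Jinja template via llama-server.
async function applyTemplate(baseUrl: string, messages: object[], addGenerationPrompt: boolean): Promise<string> {
    const res = await fetch(`${baseUrl}/apply-template`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ messages, add_generation_prompt: addGenerationPrompt }),
    })
    const { prompt } = await res.json()
    return prompt
}

// Rendering the same conversation with and without a trailing assistant
// tool call, then diffing, yields the model's native tool-call syntax,
// ready to drop into the system prompt as a few-shot example.
async function toolCallExampleDelta(baseUrl: string, toolCall: object): Promise<string> {
    const base = [{ role: "user", content: "List open issues" }]
    const withCall = [...base, { role: "assistant", content: null, tool_calls: [toolCall] }]
    const prefix = await applyTemplate(baseUrl, base, /* addGenerationPrompt */ true)
    const full = await applyTemplate(baseUrl, withCall, false)
    return full.startsWith(prefix) ? full.slice(prefix.length) : full // naive prefix diff
}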

TODOs / follow-ups:

Note that switching to native tool use for other providers should be possible (even for Claude, assuming it doesn't mess with its caching), but one would need to find a nice syntax in which to format the tool call examples.

Test Procedure

I haven't tested this thoroughly yet, but I did check that the system prompt didn't budge (the only diff is in Example 6, where the JSON arguments are now fully indented, incl. labels & assignees).

I mostly had to disable MCP to save prompt space; tested successfully with long-context-window models (128k tokens):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/tool-bench-prod
cmake -B build -DLLAMA_CURL=1
cmake --build build -t llama-server --parallel --config Release
alias llama-server=./build/bin/llama-server

llama-server --jinja -fa -hf unsloth/phi-4-GGUF:Q6_K -c 131072 -ngl 999
llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L -c 0 -ngl 999
llama-server --jinja -fa -hf unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF -c 0 -ngl 999
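
Relatedly, the async max-tokens lookup from the Limitations section could look roughly like this against a server launched as above (the field path is an assumption, hedged in the comments):

// llama-server reports per-slot settings via GET /props; the exact field
// path (default_generation_settings.n_ctx) is assumed here.
async function fetchMaxTokens(baseUrl: string): Promise<number> {
    const res = await fetch(`${baseUrl}/props`)
    const props = await res.json()
    return props?.default_generation_settings?.n_ctx ?? 4096 // fallback until props arrive
}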

Random prompts tested:

  • write a web app that creates qr codes
  • write a lisp parser in c++
  • write a trie in c++
  • build a todo list web app that persists in local storage

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update

Pre-flight Checklist

  • Changes are limited to a single feature, bugfix or chore (split larger changes into separate PRs); happy to split this
  • Tests are passing (npm test) and code is formatted and linted (npm run format && npm run lint)
  • I have created a changeset using npm run changeset (required for user-facing changes)
  • I have reviewed contributor guidelines

Screenshots

(screenshot attached to the PR)

Additional Notes

There are still some TODOs, but they could be done as follow-ups.


Important

Adds llama.cpp provider with tool call support, updating API handling, configuration, and UI components to integrate llama-server.

  • Behavior:
    • Adds LlamaCppHandler in llama.cpp.ts to support llama.cpp's llama-server with tool call functionality.
    • Updates createMessage function across multiple providers to include tools parameter.
    • Implements provider-specific tool call formatting in system.ts.
  • Configuration:
    • Adds llama.cpp as a new ApiProvider in api.ts.
    • Updates ApiOptions component to include llama.cpp configuration fields.
    • Adds llamaCppBaseUrl and llamaCppApiKey to ApiHandlerOptions.
  • UI:
    • Updates ChatTextArea.tsx and ApiOptions.tsx to support llama.cpp in the UI.
    • Adds llama.cpp specific fields in the settings UI for base URL and API key.
  • Misc:
    • Adds llamaCppModelInfoSaneDefaults in api.ts for default model info.
    • Updates ClineProvider.ts to handle llama.cpp specific configurations.
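
A condensed sketch of the configuration surface described above. The identifiers llamaCppBaseUrl, llamaCppApiKey, llamaCppModelInfoSaneDefaults, ApiProvider and ApiHandlerOptions come from the summary; the field values and the abridged provider union are illustrative assumptions:

type ApiProvider = "anthropic" | "openrouter" | "ollama" | "llama.cpp" // abridged

interface ApiHandlerOptions {
    llamaCppBaseUrl?: string // e.g. http://localhost:8080
    llamaCppApiKey?: string  // llama-server can run without auth, so optional
    // ...existing options for other providers
}

// Defaults used until the real model props are fetched from the server.
const llamaCppModelInfoSaneDefaults = {
    maxTokens: -1, // placeholder; replaced once /props is fetched
    supportsTools: true,
    supportsImages: false,
}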

This description was created by Ellipsis for 6760db3. It will automatically update as commits are pushed.


changeset-bot bot commented Feb 25, 2025

🦋 Changeset detected

Latest commit: 6760db3

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:
  • claude-dev (Minor)


@ochafik ochafik marked this pull request as ready for review February 25, 2025 15:27
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


❌ Changes requested. Reviewed everything up to 6760db3 in 3 minutes and 15 seconds

More details
  • Looked at 1648 lines of code in 26 files
  • Skipped 0 files when reviewing.
  • Skipped posting 10 drafted comments based on config settings.
1. src/shared/api.ts:1446
  • Draft comment:
    In normalizeApiConfiguration, the 'llama.cpp' case returns an empty model ID and uses the default model info. Confirm if this behavior is intended and well documented, as it may lead to UI inconsistencies.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. webview-ui/src/components/settings/ApiOptions.tsx:1127
  • Draft comment:
    The UI for the "llama.cpp" provider appears consistent. Verify that its configuration instructions and placeholder text properly inform the user on how to configure the provider.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    This comment is asking the PR author to verify that the configuration instructions and placeholder text are correct. This falls under the rule of not asking the author to double-check or ensure something is correct, which is not allowed.
3. webview-ui/src/components/chat/ChatTextArea.tsx:757
  • Draft comment:
    For the 'llama.cpp' provider case, the code returns the literal string "undefined" as the model display name. This is inconsistent with the other cases (which use a formatted provider:model id) and may confuse users. Consider displaying a more descriptive label (for example, simply 'llama.cpp' or another appropriate placeholder) instead of "undefined".
  • Reason this comment was not posted:
    Marked as duplicate.
4. webview-ui/src/components/settings/ApiOptions.tsx:1127
  • Draft comment:
    Ensure that the labels and placeholders for the 'llama.cpp' provider (Base URL and API Key) align with user expectations. Although the setup is consistent with other providers, consider if any additional guidance or fallback defaults are needed given that the corresponding selected model id remains empty.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 20% <= threshold 50%
    The comment is asking the author to ensure alignment with user expectations and consider additional guidance or defaults. It doesn't provide a specific suggestion or point out a clear issue with the code. It seems to be more of a general suggestion rather than a specific actionable comment.
5. src/api/providers/bedrock.ts:93
  • Draft comment:
    Typo detected: In the comment on line 93, the phrase 'executing a an AWS provider chain' should be corrected to 'executing an AWS provider chain' (remove the extra 'a').
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
6. src/api/providers/bedrock.ts:110
  • Draft comment:
    Typo detected: In the comment on line 110, the phrase 'will already be already provided' contains a duplicated 'already'. Please remove one for clarity.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
7. src/api/providers/openai-native.ts:33
  • Draft comment:
    Typo: In the comment on line 33, please change 'o1 doesnt support streaming, non-1 temp, or system prompt' to 'o1 doesn't support streaming, non-1 temp, or system prompt'.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
8. src/api/providers/openrouter.ts:72
  • Draft comment:
    Typo in comment: 'a image_url type message' should be 'an image_url type message'.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
9. src/core/webview/ClineProvider.ts:330
  • Draft comment:
    Typo: In the comment explaining the codicon font usage, 'css fileinto' should be updated to 'css file into' for clarity.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
10. src/core/webview/ClineProvider.ts:1720
  • Draft comment:
    Potential Typo: The global state key 'qwenApiLine' is used in getState() while the secret key uses 'qwenApiKey'. Consider verifying if 'qwenApiLine' is intended or if it should be 'qwenApiKey' to maintain consistency.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_PPNIES0QeEzoZdBI



@@ -755,6 +755,8 @@ const ChatTextArea = forwardRef<HTMLTextAreaElement, ChatTextAreaProps>(
return `${selectedProvider}:${apiConfiguration.lmStudioModelId}`
case "ollama":
return `${selectedProvider}:${apiConfiguration.ollamaModelId}`
case "llama.cpp":
return "undefined"
Contributor


In the switch for provider display names, the case for "llama.cpp" returns the literal string "undefined". Consider providing a meaningful default label instead of the literal "undefined".

Suggested change
return "undefined"
return "llama.cpp:default"

@saoudrizwan
Contributor

It seems like we're not passing in any tools to the requests – these are already defined in the system prompt, which allows Cline to work with models/APIs that don't support tool calling.

Not sure how much of a demand there is for llama.cpp as a provider, but perhaps we could limit the change to that? Feel free to re-open

@ochafik
Author

ochafik commented Feb 27, 2025

It seems like we're not passing in any tools to the requests – these are already defined in the system prompt, which allows Cline to work with models/APIs that don't support tool calling.

@saoudrizwan It does (great job btw!), but the prompt is very optimised for Claude (XML-heavy) and confuses smaller local models, especially when they've been fine-tuned to call tools with JSON-based syntaxes (and with tool descriptions in a format chosen by their Jinja template: typically just verbatim JSON schema, but sometimes the template converts the tool signatures to TypeScript, e.g. for functionary templates). Besides, it's not just XML vs. JSON: most models now use special tokens to signal tool call boundaries, so it's best to show them examples in the format they know.

In a nutshell, this PR:

  • Passes the tools in the query only for llama.cpp:
    • this allows triggering grammar-constrained tool calling (increasingly robust even for very small models at any temperature),
    • it ensures models are presented with tools in a familiar format.
  • Expresses the tool call examples (in the system prompt) using the native tool call style of the model.
    • This reinforces the model's tool-calling intuitions.
    • This makes it much more likely that the model's output triggers the grammar-constrained generation (specific to each model's tool call style format), which all but guarantees tool call schema compliance.
  • Ensures the rest of Cline is unmodified:
    • the tool call formatter for the examples defaults to the XML syntax,
    • the tool calls returned by llama.cpp are retrofitted into the expected XML syntax so Cline's parser feels at home (see the sketch below).
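
To make the retrofitting concrete, a hypothetical sketch. The OpenAI-style tool_calls shape is standard, but Cline's exact tag conventions are assumed here, not taken from this diff:

interface LlamaCppToolCall {
    function: { name: string; arguments: string } // arguments is a JSON-encoded string
}

// Retrofits a tool call into the XML-ish syntax Cline's parser expects,
// e.g. <read_file><path>src/main.ts</path></read_file>
function toolCallToXml(call: LlamaCppToolCall): string {
    const args = JSON.parse(call.function.arguments) as Record<string, unknown>
    const params = Object.entries(args)
        .map(([key, value]) => `<${key}>${typeof value === "string" ? value : JSON.stringify(value)}</${key}>`)
        .join("\n")
    return `<${call.function.name}>\n${params}\n</${call.function.name}>`
}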

Not sure how much of a demand there is for llama.cpp as a provider, but perhaps we could limit the change to that? Feel free to re-open

I only got disappointing results with the existing OpenAI-Compatible provider talking to llama.cpp.

But if you talk to Qwen 2.5 Coder 32B or even Phi 4 with the right prompt, they work wonders :-D

cc/ @ngxson @ggerganov

@ochafik
Author

ochafik commented Feb 27, 2025

Not sure how much of a demand there is for llama.cpp as a provider

Note that llama-server now also powers HuggingFace Inference Endpoints, so it's not just for local AI enthusiasts.

@DavyA

DavyA commented May 9, 2025

Any news? Is there a way to run Cline with llama.cpp?
