
Conversation


@ochafik ochafik commented Feb 25, 2025

Description

This adds a provider for llama.cpp's llama-server, leveraging its recently introduced grammar-constrained, universal tool call support.

Structural changes:

  • Refactored system.ts:
    • Optionally return tool definitions to pass to ApiHandler.createMessage (if ModelInfo.supportsTools)
    • Add provider-specific formatting of tool call examples (implemented with llama.cpp API calls, cached); see the sketch below
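
For illustration, a minimal sketch of the shape this refactoring suggests. Only ApiHandler, createMessage, ModelInfo and supportsTools come from the actual change; ToolDefinition and the remaining field names are hypothetical:

interface ToolDefinition {
    name: string
    description: string
    inputSchema: Record<string, unknown> // JSON schema for the tool's parameters
}

interface ModelInfo {
    maxTokens?: number
    supportsTools?: boolean // gates whether tool definitions are passed along
}

interface ApiHandler {
    getModel(): { id: string; info: ModelInfo }
    // tools is only populated when ModelInfo.supportsTools is true
    createMessage(systemPrompt: string, messages: unknown[], tools?: ToolDefinition[]): AsyncIterable<unknown>
}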

Limitations:

  • No streaming yet, so you may want to launch llama-server with the --verbose flag to see what's happening (especially when the KV cache isn't hot yet on the first call)
    • llama-server doesn't support tool calls in streaming mode yet
    • I'm translating the tool_calls output back to the XML-ish format Cline expects, so this would also need a revamp to support streaming
  • Calls llama-server's /apply-template endpoint with the freshly introduced add_generation_prompt param to compute tool-call deltas. This is used in the system prompt to express the few-shot tool-use examples in the call syntax native to the model, which helps it tremendously (especially when small); see the sketch after this list.
  • Needs an async call to find out the model's max tokens, which makes the code slightly awkward (getModel may return default values before the props are fetched; we may want to make getModel async as a follow-up?)
  • Qwen 2.5 Coder support needs ggml-org/llama.cpp#12034 ("tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars"; under review, not merged yet)
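
A rough sketch of the /apply-template dance mentioned in the list above. The request/response shapes here are my assumptions from this description, not verified API docs:

// Renders `messages` through the model's Jinja template via llama-server.
async function applyTemplate(baseUrl: string, messages: object[], addGenerationPrompt: boolean): Promise<string> {
    const res = await fetch(`${baseUrl}/apply-template`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ messages, add_generation_prompt: addGenerationPrompt }),
    })
    const { prompt } = await res.json()
    return prompt
}

// Rendering the same conversation with and without a trailing assistant
// tool call, then diffing, yields the model's native tool-call syntax,
// ready to drop into the system prompt as a few-shot example.
async function toolCallExampleDelta(baseUrl: string, toolCall: object): Promise<string> {
    const base = [{ role: "user", content: "List open issues" }]
    const withCall = [...base, { role: "assistant", content: null, tool_calls: [toolCall] }]
    const prefix = await applyTemplate(baseUrl, base, /* addGenerationPrompt */ true)
    const full = await applyTemplate(baseUrl, withCall, false)
    return full.startsWith(prefix) ? full.slice(prefix.length) : full // naive prefix diff
}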

TODOs / follow-ups:

Note that switching to native tool use for other providers should be possible (even for Claude, assuming it doesn't mess with its caching), but one would need to find a nice syntax in which to format the tool call examples.

Test Procedure

I haven't tested this thoroughly yet, but I did check that the system prompt didn't budge (the only diff is in Example 6, where the JSON arguments are now fully indented, incl. labels & assignees).

I mostly had to disable MCP to save prompt space; tested successfully with long-context-window models (128k tokens):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/tool-bench-prod
cmake -B build -DLLAMA_CURL=1
cmake --build build -t llama-server --parallel --config Release
alias llama-server=./build/bin/llama-server

llama-server --jinja -fa -hf unsloth/phi-4-GGUF:Q6_K -c 131072 -ngl 999
llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L -c 0 -ngl 999
llama-server --jinja -fa -hf unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF -c 0 -ngl 999
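
Relatedly, the async max-tokens lookup from the Limitations section could look roughly like this against a server launched as above (the field path is an assumption, hedged in the comments):

// llama-server reports per-slot settings via GET /props; the exact field
// path (default_generation_settings.n_ctx) is assumed here.
async function fetchMaxTokens(baseUrl: string): Promise<number> {
    const res = await fetch(`${baseUrl}/props`)
    const props = await res.json()
    return props?.default_generation_settings?.n_ctx ?? 4096 // fallback until props arrive
}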

Random prompts tested:

  • write a web app that creates qr codes
  • write a lisp parser in c++
  • write a trie in c++
  • build a todo list web app that persists in local storage

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update

Pre-flight Checklist

  • Changes are limited to a single feature, bugfix or chore (split larger changes into separate PRs); happy to split this
  • Tests are passing (npm test) and code is formatted and linted (npm run format && npm run lint)
  • I have created a changeset using npm run changeset (required for user-facing changes)
  • I have reviewed contributor guidelines

Screenshots

(screenshot attached to the PR)

Additional Notes

There are still some TODOs, but they could be done as follow-ups.


Important

Adds llama.cpp provider with tool call support, updating API handling, configuration, and UI components to integrate llama-server.

  • Behavior:
    • Adds LlamaCppHandler in llama.cpp.ts to support llama.cpp's llama-server with tool call functionality.
    • Updates createMessage function across multiple providers to include tools parameter.
    • Implements provider-specific tool call formatting in system.ts.
  • Configuration:
    • Adds llama.cpp as a new ApiProvider in api.ts.
    • Updates ApiOptions component to include llama.cpp configuration fields.
    • Adds llamaCppBaseUrl and llamaCppApiKey to ApiHandlerOptions.
  • UI:
    • Updates ChatTextArea.tsx and ApiOptions.tsx to support llama.cpp in the UI.
    • Adds llama.cpp specific fields in the settings UI for base URL and API key.
  • Misc:
    • Adds llamaCppModelInfoSaneDefaults in api.ts for default model info.
    • Updates ClineProvider.ts to handle llama.cpp specific configurations.
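
A condensed sketch of the configuration surface described above. The identifiers llamaCppBaseUrl, llamaCppApiKey, llamaCppModelInfoSaneDefaults, ApiProvider and ApiHandlerOptions come from the summary; the field values and the abridged provider union are illustrative assumptions:

type ApiProvider = "anthropic" | "openrouter" | "ollama" | "llama.cpp" // abridged

interface ApiHandlerOptions {
    llamaCppBaseUrl?: string // e.g. http://localhost:8080
    llamaCppApiKey?: string  // llama-server can run without auth, so optional
    // ...existing options for other providers
}

// Defaults used until the real model props are fetched from the server.
const llamaCppModelInfoSaneDefaults = {
    maxTokens: -1, // placeholder; replaced once /props is fetched
    supportsTools: true,
    supportsImages: false,
}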

This description was created by Ellipsis for 6760db3. It will automatically update as commits are pushed.


changeset-bot bot commented Feb 25, 2025

🦋 Changeset detected

Latest commit: 6760db3

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:
  • claude-dev (Minor)


@ochafik ochafik marked this pull request as ready for review February 25, 2025 15:27
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


❌ Changes requested. Reviewed everything up to 6760db3 in 3 minutes and 15 seconds

More details
  • Looked at 1648 lines of code in 26 files
  • Skipped 0 files when reviewing.
  • Skipped posting 10 drafted comments based on config settings.
1. src/shared/api.ts:1446
  • Draft comment:
    In normalizeApiConfiguration, the 'llama.cpp' case returns an empty model ID and uses the default model info. Confirm if this behavior is intended and well documented, as it may lead to UI inconsistencies.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. webview-ui/src/components/settings/ApiOptions.tsx:1127
  • Draft comment:
    The UI for the "llama.cpp" provider appears consistent. Verify that its configuration instructions and placeholder text properly inform the user on how to configure the provider.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    This comment is asking the PR author to verify that the configuration instructions and placeholder text are correct. This falls under the rule of not asking the author to double-check or ensure something is correct, which is not allowed.
3. webview-ui/src/components/chat/ChatTextArea.tsx:757
  • Draft comment:
    For the 'llama.cpp' provider case, the code returns the literal string "undefined" as the model display name. This is inconsistent with the other cases (which use a formatted provider:model id) and may confuse users. Consider displaying a more descriptive label (for example, simply 'llama.cpp' or another appropriate placeholder) instead of "undefined".
  • Reason this comment was not posted:
    Marked as duplicate.
4. webview-ui/src/components/settings/ApiOptions.tsx:1127
  • Draft comment:
    Ensure that the labels and placeholders for the 'llama.cpp' provider (Base URL and API Key) align with user expectations. Although the setup is consistent with other providers, consider if any additional guidance or fallback defaults are needed given that the corresponding selected model id remains empty.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 20% <= threshold 50%
    The comment is asking the author to ensure alignment with user expectations and consider additional guidance or defaults. It doesn't provide a specific suggestion or point out a clear issue with the code. It seems to be more of a general suggestion rather than a specific actionable comment.
5. src/api/providers/bedrock.ts:93
  • Draft comment:
    Typo detected: In the comment on line 93, the phrase 'executing a an AWS provider chain' should be corrected to 'executing an AWS provider chain' (remove the extra 'a').
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
6. src/api/providers/bedrock.ts:110
  • Draft comment:
    Typo detected: In the comment on line 110, the phrase 'will already be already provided' contains a duplicated 'already'. Please remove one for clarity.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
7. src/api/providers/openai-native.ts:33
  • Draft comment:
    Typo: In the comment on line 33, please change 'o1 doesnt support streaming, non-1 temp, or system prompt' to 'o1 doesn't support streaming, non-1 temp, or system prompt'.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
8. src/api/providers/openrouter.ts:72
  • Draft comment:
    Typo in comment: 'a image_url type message' should be 'an image_url type message'.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
9. src/core/webview/ClineProvider.ts:330
  • Draft comment:
    Typo: In the comment explaining the codicon font usage, 'css fileinto' should be updated to 'css file into' for clarity.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
10. src/core/webview/ClineProvider.ts:1720
  • Draft comment:
    Potential Typo: The global state key 'qwenApiLine' is used in getState() while the secret key uses 'qwenApiKey'. Consider verifying if 'qwenApiLine' is intended or if it should be 'qwenApiKey' to maintain consistency.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_PPNIES0QeEzoZdBI



@@ -755,6 +755,8 @@ const ChatTextArea = forwardRef<HTMLTextAreaElement, ChatTextAreaProps>(
return `${selectedProvider}:${apiConfiguration.lmStudioModelId}`
case "ollama":
return `${selectedProvider}:${apiConfiguration.ollamaModelId}`
case "llama.cpp":
return "undefined"
Contributor


In the switch for provider display names, the case for "llama.cpp" returns the literal string "undefined". Consider providing a meaningful default label instead of the literal "undefined".

Suggested change
return "undefined"
return "llama.cpp:default"

@saoudrizwan
Contributor

It seems like we're not passing in any tools to the requests – these are already defined in the system prompt, which allows Cline to work with models/APIs that don't support tool calling.

Not sure how much of a demand there is for llama.cpp as a provider, but perhaps we could limit the change to that? Feel free to re-open

@ochafik
Author

ochafik commented Feb 27, 2025

It seems like we're not passing in any tools to the requests – these are already defined in the system prompt, which allows Cline to work with models/APIs that don't support tool calling.

@saoudrizwan It does (great job btw!), but the prompt is very optimised for Claude (XML-heavy) and confuses smaller local models, especially when they've been fine-tuned to call tools with JSON-based syntaxes (and with tool descriptions in a format chosen by their Jinja template: typically just verbatim JSON schema, but sometimes the template converts the tool signatures to TypeScript, e.g. for functionary templates). Besides, it's not just XML vs. JSON: most models now use special tokens to signal tool call boundaries, so it's best to show them examples in the format they know.

In a nutshell, this PR:

  • Passes the tools in the query only for llama.cpp:
    • this allows triggering grammar-constrained tool calling (increasingly robust even for very small models at any temperature),
    • it ensures models are presented with tools in a familiar format.
  • Expresses the tool call examples (in the system prompt) using the native tool call style of the model.
    • This reinforces the model's tool-calling intuitions.
    • This makes it much more likely that the model's output triggers the grammar-constrained generation (specific to each model's tool call style format), which all but guarantees tool call schema compliance.
  • Ensures the rest of Cline is unmodified:
    • the tool call formatter for the examples defaults to the XML syntax,
    • the tool calls returned by llama.cpp are retrofitted into the expected XML syntax so Cline's parser feels at home (see the sketch below).
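
To make the retrofitting concrete, a hypothetical sketch. The OpenAI-style tool_calls shape is standard, but Cline's exact tag conventions are assumed here, not taken from this diff:

interface LlamaCppToolCall {
    function: { name: string; arguments: string } // arguments is a JSON-encoded string
}

// Retrofits a tool call into the XML-ish syntax Cline's parser expects,
// e.g. <read_file><path>src/main.ts</path></read_file>
function toolCallToXml(call: LlamaCppToolCall): string {
    const args = JSON.parse(call.function.arguments) as Record<string, unknown>
    const params = Object.entries(args)
        .map(([key, value]) => `<${key}>${typeof value === "string" ? value : JSON.stringify(value)}</${key}>`)
        .join("\n")
    return `<${call.function.name}>\n${params}\n</${call.function.name}>`
}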

Not sure how much of a demand there is for llama.cpp as a provider, but perhaps we could limit the change to that? Feel free to re-open

I only got disappointing results with the existing OpenAI-Compatible provider talking to llama.cpp.

But if you talk to Qwen 2.5 Coder 32B or even Phi 4 with the right prompt, they work wonders :-D

cc/ @ngxson @ggerganov

@ochafik
Author

ochafik commented Feb 27, 2025

Not sure how much of a demand there is for llama.cpp as a provider

Note that llama-server now also powers HuggingFace Inference Endpoints, so it's not just for local AI enthusiasts.

@DavyA

DavyA commented May 9, 2025

Any news? Is there a way to run Cline with llama.cpp?
