Crystal bindings for llama.cpp, a C/C++ implementation of LLaMA, Falcon, GPT-2, and other large language models.
Please check the LLAMA_VERSION file for the current compatible version of llama.cpp.
This project is under active development and may change rapidly.
- Low-level bindings to the llama.cpp C API
- High-level Crystal wrapper classes for easy usage
- Memory management for C resources
- Simple text generation interface
- Advanced sampling methods (Min-P, Typical, Mirostat, etc.)
- Batch processing for efficient token handling
- KV cache management for optimized inference
- State saving and loading
You need the llama.cpp shared library (libllama) available on your system.
```bash
LLAMA_VERSION=$(cat LLAMA_VERSION)
curl -L "https://github.com/ggml-org/llama.cpp/releases/download/${LLAMA_VERSION}/llama-${LLAMA_VERSION}-bin-ubuntu-x64.zip" -o llama.zip
unzip llama.zip
sudo cp build/bin/*.so /usr/local/lib/
sudo ldconfig
```
For macOS, replace `ubuntu-x64` with `macos-arm64` and `*.so` with `*.dylib`.
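For example, the macOS equivalent of the commands above might look like this (a sketch that assumes the release archive layout matches the Linux one):

```bash
# macOS (Apple Silicon): same steps with the macos-arm64 archive and .dylib files
LLAMA_VERSION=$(cat LLAMA_VERSION)
curl -L "https://github.com/ggml-org/llama.cpp/releases/download/${LLAMA_VERSION}/llama-${LLAMA_VERSION}-bin-macos-arm64.zip" -o llama.zip
unzip llama.zip
sudo cp build/bin/*.dylib /usr/local/lib/
```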
Alternative: Using `LLAMA_CPP_DIR`
If you prefer not to install system-wide, you can set the `LLAMA_CPP_DIR` environment variable:
```bash
export LLAMA_CPP_DIR=/path/to/llama.cpp
crystal build examples/simple.cr -o simple_example
LLAMA_CPP_DIR=/path/to/llama.cpp ./simple_example --model models/tiny_model.gguf
```
Build from source (advanced users)
```bash
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git checkout $(cat ../LLAMA_VERSION)
mkdir build && cd build
cmake .. && cmake --build . --config Release
sudo cmake --install . && sudo ldconfig
```
You'll need a model file in GGUF format. For testing, smaller quantized models (1-3B parameters) with Q4_K_M quantization are recommended. Popular options can be found among the many pre-quantized GGUF models published on Hugging Face.
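For instance, any pre-quantized GGUF file can be fetched with `curl`; the URL below is only a placeholder for whichever model you choose:

```bash
mkdir -p models
# Placeholder URL - substitute the GGUF model you actually want to test with
curl -L -o models/tiny_model.gguf \
  "https://huggingface.co/<user>/<repo>/resolve/main/<model>-Q4_K_M.gguf"
```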
Add the dependency to your `shard.yml`:

```yaml
dependencies:
  llama:
    github: kojix2/llama.cr
```

Then run `shards install`.
require "llama"
# Load a model
model = Llama::Model.new("/path/to/model.gguf")
# Create a context
context = model.context
# Generate text
response = context.generate("Once upon a time", max_tokens: 100, temperature: 0.8)
puts response
# Or use the convenience method
response = Llama.generate("/path/to/model.gguf", "Once upon a time")
puts response
require "llama"
model = Llama::Model.new("/path/to/model.gguf")
context = model.context
# Create a sampler chain with multiple sampling methods
chain = Llama::SamplerChain.new
chain.add(Llama::Sampler::TopK.new(40))
chain.add(Llama::Sampler::MinP.new(0.05, 1))
chain.add(Llama::Sampler::Temp.new(0.8))
chain.add(Llama::Sampler::Dist.new(42))
# Generate text with the custom sampler chain
result = context.generate_with_sampler("Write a short poem about AI:", chain, 150)
puts result
require "llama"
require "llama/chat"
model = Llama::Model.new("/path/to/model.gguf")
context = model.context
# Create a chat conversation
messages = [
Llama::ChatMessage.new("system", "You are a helpful assistant."),
Llama::ChatMessage.new("user", "Hello, who are you?")
]
# Generate a response
response = context.chat(messages)
puts "Assistant: #{response}"
# Continue the conversation
messages << Llama::ChatMessage.new("assistant", response)
messages << Llama::ChatMessage.new("user", "Tell me a joke")
response = context.chat(messages)
puts "Assistant: #{response}"
require "llama"
model = Llama::Model.new("/path/to/model.gguf")
# Create a context with embeddings enabled
context = model.context(embeddings: true)
# Get embeddings for text
text = "Hello, world!"
tokens = model.vocab.tokenize(text)
batch = Llama::Batch.get_one(tokens)
context.decode(batch)
embeddings = context.get_embeddings_seq(0)
puts "Embedding dimension: #{embeddings.size}"
```crystal
puts Llama.system_info

model = Llama::Model.new("/path/to/model.gguf")
puts Llama.tokenize_and_format(model.vocab, "Hello, world!", ids_only: true)
```
The `examples` directory contains sample code demonstrating various features:

- `simple.cr` - Basic text generation
- `chat.cr` - Chat conversations with models
- `tokenize.cr` - Tokenization and vocabulary features
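For example, the basic example can be run directly with Crystal (the model path is an assumption; set `LLAMA_CPP_DIR` as described above if libllama is not installed system-wide):

```bash
crystal run examples/simple.cr -- --model models/tiny_model.gguf
```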
See kojix2.github.io/llama.cr for full API docs.
- Llama::Model - Represents a loaded LLaMA model
- Llama::Context - Handles inference state for a model
- Llama::Vocab - Provides access to the model's vocabulary
- Llama::Batch - Manages batches of tokens for efficient processing
- Llama::KvCache - Controls the key-value cache for optimized inference
- Llama::State - Handles saving and loading model state
- Llama::SamplerChain - Combines multiple sampling methods
- Llama::Sampler::TopK - Keeps only the top K most likely tokens
- Llama::Sampler::TopP - Nucleus sampling (keeps tokens until cumulative probability exceeds P)
- Llama::Sampler::Temp - Applies temperature to logits
- Llama::Sampler::Dist - Samples from the final probability distribution
- Llama::Sampler::MinP - Keeps tokens with probability >= P * max_probability
- Llama::Sampler::Typical - Selects tokens based on their "typicality" (entropy)
- Llama::Sampler::Mirostat - Dynamically adjusts sampling to maintain target entropy
- Llama::Sampler::Penalties - Applies penalties to reduce repetition
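As a rough sketch of how the penalty and typicality samplers listed above might be combined in a chain (the constructor arguments shown are assumptions, not documented signatures; check the API docs for the exact parameters):

```crystal
require "llama"

model = Llama::Model.new("/path/to/model.gguf")
context = model.context

chain = Llama::SamplerChain.new
# Constructor arguments below are illustrative guesses
chain.add(Llama::Sampler::Penalties.new(64, 1.1, 0.0, 0.0)) # last_n, repeat, freq, presence
chain.add(Llama::Sampler::Typical.new(0.95, 1))
chain.add(Llama::Sampler::Temp.new(0.7))
chain.add(Llama::Sampler::Dist.new(42))

puts context.generate_with_sampler("Explain recursion in one paragraph:", chain, 120)
```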
See DEVELOPMENT.md for development guidelines.
Do you need commit rights?
- If you need commit rights to my repository or want to get admin rights and take over the project, please feel free to contact @kojix2.
- Many OSS projects become abandoned because only the founder has commit rights to the original repository.
- Fork it (https://github.com/kojix2/llama.cr/fork)
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request
This project is available under the MIT License. See the LICENSE file for more info.