Re-introduce the `llama` package #5034

jmorganca · 2024-06-13T21:06:18Z

This PR brings back the llama package, making it possible to call llama.cpp and ggml APIs from Go directly via CGo. This has a few advantages:

- C APIs can be called directly from Go without going through the previous "server" REST API
- On macOS, and for CPU builds on Linux and Windows, Ollama can be built without a go generate ./... step, making it easy to get up and running when hacking on parts of Ollama that don't require fast inference
- Faster build times for AVX, AVX2, CUDA, and ROCm (a full build of all runners takes <5 min on a fast CPU)
- No git submodule, making it easier to clone and build from source

This is a big PR, but much of it is vendor code except for:

- llama.go: CGo bindings
- example/: a simple example of running inference
- runner/: a subprocess server designed to replace the llm/ext_server package
- Makefile: a Makefile, kept as minimal as possible, that builds the runner package for different targets (cpu, avx, avx2, cuda, rocm)

The easiest way to try out the PR:

cd llama
make -j

This produces ollama_runner binaries for the current platform.

dhiltgen

Looks like a great foundation to iterate from


Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

When forking a cache entry, if no empty slots are available we evict the least recently used one and copy over the KV entries from the closest match. However, this copy does not overwrite existing values but only adds new ones. Therefore, we need to clear the old slot first.

This change fixes two issues:
- The KV cache fills up and runs out of space even though we think we are managing it correctly
- Performance gets worse over time as we use new cache entries that are not hot in the processor caches
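A minimal sketch of the clear-before-copy behavior described in this commit message, in plain Go. The `slot` type, `copyKV`, and `fork` are hypothetical names for illustration only; the real runner keeps KV data inside llama.cpp rather than in a Go map.

```go
package main

import "fmt"

// slot is a hypothetical stand-in for one cache entry. The map from
// position to token stands in for the KV cells of a sequence.
type slot struct {
	kv       map[int]int
	lastUsed int // logical clock value of the most recent use
	inUse    bool
}

// copyKV mimics a copy that only adds entries to dst and never removes
// entries that dst already holds.
func copyKV(dst, src *slot, n int) {
	for pos := 0; pos < n; pos++ {
		if tok, ok := src.kv[pos]; ok {
			dst.kv[pos] = tok
		}
	}
}

// fork reuses an empty slot if one exists, otherwise it evicts the least
// recently used slot. The target is cleared before the first n entries of
// src are copied in, so nothing from the previous owner survives.
func fork(slots []*slot, src *slot, n, now int) *slot {
	var target *slot
	for _, s := range slots {
		if s == src {
			continue
		}
		if !s.inUse {
			target = s
			break
		}
		if target == nil || s.lastUsed < target.lastUsed {
			target = s
		}
	}
	if target == nil {
		return nil
	}
	target.kv = make(map[int]int) // clear the old slot first
	copyKV(target, src, n)
	target.inUse = true
	target.lastUsed = now
	return target
}

func main() {
	src := &slot{kv: map[int]int{0: 1, 1: 2, 2: 3}, inUse: true, lastUsed: 5}
	old := &slot{kv: map[int]int{0: 9, 1: 9, 2: 9, 3: 9, 4: 9}, inUse: true, lastUsed: 1}
	dst := fork([]*slot{src, old}, src, 3, 6)
	fmt.Println(len(dst.kv)) // prints 3
}
```

If the `make(map[int]int)` line is dropped, the evicted slot keeps its two extra entries at positions 3 and 4. That is the kind of leftover the commit message blames for the cache filling up.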


This breaks up the monolithic Makefile for the Go-based runners into a set of utility files plus recursive Makefiles for the runners. Files whose names start with "Makefile" are buildable, while files that end with ".make" are utilities to include in other Makefiles. This reduces the number of nearly identical targets and helps set a pattern for future community contributions of new GPU runner architectures. When we are ready to switch over to the Go runners, these files should move to the top of the repo, and we should add targets for the main CLI, as well as helper "install" (put all the built binaries on the local system in a runnable state) and "dist" (generate the various tar/zip files for distribution) targets for local developer use.


Our integration with server.cpp implicitly disables prompt caching because it is not part of the JSON object being parsed. This change makes the Go runner behave the same way. Prompt caching has been seen to affect the results of text completions on certain hardware. The results are not wrong either way, but they are non-deterministic. However, embeddings seem to be affected even on hardware that does not show this behavior for completions. For now, it is best to maintain consistency with the existing behavior.
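For readers unfamiliar with the term, prompt caching here means reusing the work already done for the part of a new prompt that matches a previously processed one. The sketch below illustrates that idea only. `commonPrefix`, `tokensToProcess`, and the `cachePrompt` flag are hypothetical names, not the runner's actual API.

```go
package main

import "fmt"

// commonPrefix returns how many leading tokens the cached sequence and the
// new prompt share.
func commonPrefix(cached, prompt []int) int {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n] == prompt[n] {
		n++
	}
	return n
}

// tokensToProcess returns the tokens that still need to be evaluated.
// With caching disabled (the behavior this commit keeps), the whole prompt
// is evaluated again, even if a prefix was seen before.
func tokensToProcess(cached, prompt []int, cachePrompt bool) []int {
	if !cachePrompt {
		return prompt
	}
	return prompt[commonPrefix(cached, prompt):]
}

func main() {
	cached := []int{1, 2, 3, 4}
	prompt := []int{1, 2, 3, 9, 10}
	fmt.Println(len(tokensToProcess(cached, prompt, true)))  // prints 2
	fmt.Println(len(tokensToProcess(cached, prompt, false))) // prints 5
}
```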



* llama: doc and example clean up
* llama: Move new dockerfile into llama dir

Temporary home until we fully transition to the Go server


pdevine · 2024-10-07T22:03:44Z

docs/development.md

+- Git
+  - https://git-scm.com/download/win
+- GCC and Make.  There are multiple options on how to go about installing these tools on Windows.  We have verified the following, but others may work as well:  


I think Mingw was listed here before but was ripped out, so this feels a little strange.

The https://www.mingw-w64.org/downloads/ list just gives the user a bunch of downstream project options to download so it's ambiguous which one(s) work. We've been getting a steady trickle of users trying to set up Windows builds and struggling, so I wanted to be more explicit with the new instructions on a known toolchain that works.

I think that makes sense. My comment was more that the wording felt like it had changed (with the removal of mingw) so it was slightly awkward to read.

pdevine · 2024-10-07T22:06:15Z

envconfig/config.go

@@ -245,6 +247,7 @@ func AsMap() map[string]EnvVar {
 		"OLLAMA_ORIGINS":           {"OLLAMA_ORIGINS", Origins(), "A comma separated list of allowed origins"},
 		"OLLAMA_SCHED_SPREAD":      {"OLLAMA_SCHED_SPREAD", SchedSpread(), "Always schedule model across all GPUs"},
 		"OLLAMA_TMPDIR":            {"OLLAMA_TMPDIR", TmpDir(), "Location for temporary files"},
+		"OLLAMA_MULTIUSER_CACHE":   {"OLLAMA_MULTIUSER_CACHE", MultiUserCache(), "Optimize prompt caching for multi-user scenarios"},


does this need to eventually be added to cmd?

would it make more sense to just call it OLLAMA_MULTIUSER instead of giving too fine grained control over the setting? 99% of people aren't going to know whether to set this or not.

In general, my goal is for this to be a temporary flag that is eventually replaced by an implementation that is either general to all cases or is self tuning. It was intentionally left undocumented so that we aren't committing to it.

One of the problems with this type of config (besides users not knowing what to do) is that it will probably be hard to set appropriately in all but the most straightforward of cases. Real environments will have a mix of situations where different cache behaviors might be ideal. As a result, the setting is mostly useful for experimentation.

As a result, I'm a little bit hesitant to group this into a larger bucket of config since it likely makes it even more impossible to turn on and get the expected results.
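As a rough illustration of how an on/off flag such as OLLAMA_MULTIUSER_CACHE could be read from the environment, here is a small sketch. `parseBool` and `multiUserCache` are hypothetical helpers, and treating an unparseable non-empty value as enabled is an assumption of this sketch, not a statement about what envconfig does.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseBool interprets the raw value of a boolean environment variable.
// An empty value means the flag is off. Assumption: a non-empty value that
// strconv.ParseBool cannot parse is treated as "on".
func parseBool(raw string) bool {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return false
	}
	b, err := strconv.ParseBool(raw)
	if err != nil {
		return true
	}
	return b
}

// multiUserCache is a hypothetical accessor for the flag discussed above.
func multiUserCache() bool {
	return parseBool(os.Getenv("OLLAMA_MULTIUSER_CACHE"))
}

func main() {
	fmt.Println(multiUserCache())
}
```

Keeping the flag a plain boolean matches the stated intent that it is a temporary switch for experimentation rather than a tunable.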


pdevine · 2024-10-07T23:27:42Z

llama/llama.go

+	if !b.IsEmbedding() {
+		unsafe.Slice(b.c.token, b.batchSize)[b.c.n_tokens] = C.llama_token(token)
+	} else {
+		copy(unsafe.Slice((*float32)(b.c.embd), b.batchSize*b.embedSize)[int(b.c.n_tokens)*b.embedSize:], embed)


panic here if token is set?

pdevine · 2024-10-07T23:28:08Z

llama/llama.go

+// to include logits.
+func (b *Batch) Add(token int, embed []float32, pos int, seqIds []int, logits bool) {
+	if !b.IsEmbedding() {
+		unsafe.Slice(b.c.token, b.batchSize)[b.c.n_tokens] = C.llama_token(token)


panic here if embed is set?
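One way the two review suggestions above could look, sketched in plain Go without CGo. This `Batch` is a simplified stand-in, not the type in llama/llama.go. It returns an error instead of panicking, which is a choice made for this sketch, and the rule that a non-zero `embedSize` marks an embedding batch is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Batch is a simplified stand-in that stores tokens or embeddings in Go
// slices instead of C memory.
type Batch struct {
	embedSize int // assumption: non-zero means this is an embedding batch
	tokens    []int
	embeds    []float32
}

// IsEmbedding reports whether the batch carries embeddings rather than tokens.
func (b *Batch) IsEmbedding() bool { return b.embedSize != 0 }

// Add appends one entry and rejects input that does not match the batch kind.
// Token 0 can be a valid token ID, so a zero token in an embedding batch is
// taken to mean "not set"; this is an imperfect signal.
func (b *Batch) Add(token int, embed []float32) error {
	if b.IsEmbedding() {
		if token != 0 {
			return errors.New("token set on an embedding batch")
		}
		if len(embed) != b.embedSize {
			return fmt.Errorf("embedding has %d values, want %d", len(embed), b.embedSize)
		}
		b.embeds = append(b.embeds, embed...)
		return nil
	}
	if len(embed) != 0 {
		return errors.New("embedding set on a token batch")
	}
	b.tokens = append(b.tokens, token)
	return nil
}

func main() {
	tb := &Batch{}
	fmt.Println(tb.Add(42, nil))                  // prints <nil>
	fmt.Println(tb.Add(42, []float32{0.5}) != nil) // prints true
}
```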


dhiltgen approved these changes Jun 15, 2024

llama/ggml-cuda/alibi.cu

llama/llama.go

llama/llama.go

llama/llama.go

jmorganca commented Jun 24, 2024

llama/Makefile

jmorganca commented Jun 24, 2024

llama/Makefile

jmorganca commented Jun 24, 2024

llama/Makefile

jmorganca commented Jun 24, 2024

llama/llama.go

jmorganca commented Jun 24, 2024

llama/llama.go

dhiltgen force-pushed the jmorganca/llama branch from fccc94b to 2fe9202 Compare June 26, 2024 00:09

This was referenced Jun 27, 2024

llama: Support both old and new runners with a toggle with release build rigging #5287

Closed

Is it possible to start llama server through dynamic dependency library? #5278

Closed

dhiltgen mentioned this pull request Jul 29, 2024

Add API integration tests #5678

Closed

dhiltgen force-pushed the jmorganca/llama branch from 2fe9202 to 41bf8d9 Compare July 31, 2024 17:44

dhiltgen force-pushed the jmorganca/llama branch from e584f14 to a389024 Compare August 8, 2024 22:58

This was referenced Aug 9, 2024

Inference fails with "llama_get_logits_ith: invalid logits id 7, reason: no logits" #6259

Closed

llm decode error: 500 Internal Server Error - detokenize doesn't handle unicode characters from server.cpp properly on windows #6196

Closed

dhiltgen force-pushed the jmorganca/llama branch from 05329fd to ddc3e1d Compare August 21, 2024 23:57

dhiltgen mentioned this pull request Sep 6, 2024

expose slots data through API #6670

Closed

dhiltgen force-pushed the jmorganca/llama branch from bcee395 to f851a82 Compare September 6, 2024 20:26

dhiltgen force-pushed the jmorganca/llama branch from 4255088 to b69e402 Compare September 15, 2024 18:14

This was referenced Sep 21, 2024

Issues getting rocm support to compile on Gentoo #6857

Closed

Error running latest git pull on Pi #6894

Closed

dhiltgen mentioned this pull request Sep 28, 2024

llama: add compiler tags for cpu features #7009

Closed

jmorganca and others added 9 commits September 30, 2024 11:17

doc: explain golang objc linker warning (#6830)

3d602d7

llama: gather transitive dependencies for rocm for dist packaging (#6848)

57971c6

llama: don't create extraneous directories (#6988)

ae0c6f0

llama: Exercise the new build in CI (#6989)

4023e3a

Wire up some basic sanity testing in CI for the Go runner. GPU runners are not covered yet.

llama: Refine developer docs for Go server (#6842)

7ad6251

This enhances the documentation for development focusing on the new Go server. After we complete the transition further doc refinements can remove the "transition" discussion.

runner.go: Allocate batches for all sequences during init

3509b02

We should tell the model that we could have full batches for all sequences. We already do this when we allocate the batches but it was missed during initialization.

jessegross and others added 4 commits October 1, 2024 16:46

llm: Don't add BOS/EOS for tokenize requests

4bdf469

This is consistent with what server.cpp currently does. It affects things like token processing counts for embedding requests.

runner.go: Adjust debug log levels

d0f4ea8

Add system info printed at startup and quiet down noisier logging.

llama: fix compiler flag differences (#7082)

478a4f8

Adjust the flags for the new Go server to more closely match the generate flow

jessegross marked this pull request as ready for review October 7, 2024 17:37

jessegross approved these changes Oct 7, 2024


llama: refine developer docs (#7121)

cba565f