
Conversation

@gabe-l-hart (Contributor) commented on Sep 11, 2024

Special Note

Since this PR bumps llama.cpp past the tip of master (6026da52 as of this writing), it includes the recent changes that overhaul sampling and logging. I updated server.cpp so that it compiles and can run the models successfully, and I updated all of the patches to apply to the updated llama.cpp codebase.
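
For anyone checking out the branch, a quick way to confirm exactly which llama.cpp commit the submodule is pinned at (assuming the submodule lives at llm/llama.cpp, as in the testing steps below):

# Show the llama.cpp commit the submodule currently points at
git submodule status llm/llama.cpp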

Dependencies

UPDATE: This PR no longer has dependencies. The first llama.cpp PR has been merged to support granite, and given our hope to release soon, we'd like to get this merged without granitemoe support and add that in a follow-up PR.

UPDATE 2: Both granite and granitemoe are now supported in llama.cpp. I've rebased the PR to include them (and to pick up support for chameleon).

Original note (superseded by the updates above): This PR is dependent on two PRs in llama.cpp. Currently, the branch will not build since the submodule points to a commit on my fork and I have not changed the remote URL. Once the llama.cpp PRs are merged, I will update the submodule pointer to the mainline.

Description

This PR adds support for IBM's granite architecture. See the llama.cpp PRs for full details on the added architectures.

Testing

In order to test this while it's in draft, I did the following:

# Download the IBM Research experimental models (requires the huggingface-cli Python package)
huggingface-cli download ibm/PowerLM-3b --local-dir $HOME/models/powerlm-3b
huggingface-cli download ibm/PowerMoE-3b --local-dir $HOME/models/powermoe-3b

# Convert to GGUF using the latest version of llama.cpp (I'm doing it here in the submodule)
cd llm/llama.cpp
pip install -r requirements/requirements-convert_hf_to_gguf.txt
python convert_hf_to_gguf.py $HOME/models/powerlm-3b
python convert_hf_to_gguf.py $HOME/models/powermoe-3b
cd -

# Build the llama-quantize binary in the submodule
cd llm/build/darwin/arm64_static/
make llama-quantize -j
cd -

# Quantize with the locally built llama-quantize
# (the input is the converted GGUF file; the exact filename produced by the converter may differ)
./llm/build/darwin/arm64_static/bin/llama-quantize $HOME/models/powerlm-3b/ggml-model-f16.gguf Q4_K_M
./llm/build/darwin/arm64_static/bin/llama-quantize $HOME/models/powermoe-3b/ggml-model-f16.gguf Q4_K_M

# Import to ollama (finally!)
echo "FROM $HOME/models/powerlm-3b/ggml-model-Q4_K_M.gguf" > Modelfile.powerlm-3b
./ollama create -f Modelfile.powerlm-3b powerlm:3b
echo "FROM $HOME/models/powermoe-3b/ggml-model-Q4_K_M.gguf" > Modelfile.powermoe-3b
./ollama create -f Modelfile.powermoe-3b powermoe:3b
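
Once imported, a quick smoke test of the new models (a minimal sketch; assumes ./ollama serve is running in another terminal and uses the tags created above):

# Sanity-check the imported models with a short prompt
./ollama run powerlm:3b "Tell me a short story about a developer and their dog."
./ollama run powermoe:3b "Tell me a short story about a developer and their dog."
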
Old instructions for building ollama from my fork:

# Add my personal fork as a remote in the submodule
cd llm/llama.cpp
git remote add gabe https://github.com/gabe-l-hart/llama.cpp.git
git fetch gabe
cd -

# Generate and build like normal
go generate ./...
go build .

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch 2 times, most recently from 3866cb0 to e40b6fc on September 20, 2024
@gabe-l-hart marked this pull request as ready for review on September 20, 2024
@gabe-l-hart (Contributor, Author) commented:

@jmorganca Since the initial granite PR has been merged in llama.cpp, I've re-bumped llama.cpp in this PR to the tip of master. At this point, we'd like to move forward with getting this merged to support granite while we wait for granitemoe in llama.cpp, which I can then add in a separate PR.

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch 2 times, most recently from 37901c9 to 675fac9 on September 25, 2024
@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch from 675fac9 to 0cdeed0 on October 1, 2024
@gabe-l-hart (Contributor, Author) commented:

@jmorganca I've rebased this PR again to the latest tip of master in llama.cpp (3f1ae2e3). This picks up support for chameleon, which was added to llama.cpp since the last rebase.

I also added a script to help with the process of updating the patches when bumping llama.cpp. I'm happy to remove it, but thought it might be worth contributing in case anyone else runs into the need to rebase frequently when bumping llama.cpp to support a new model.
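
For anyone curious, the general shape of that workflow is roughly the following. This is a rough sketch only, not the actual script; the patch directory, file extension, throwaway branch name, and <new-upstream-commit> are placeholders:

# Rough sketch of the patch-update flow (placeholders only, not the actual helper script)
cd llm/llama.cpp
# 1. Check out the new upstream commit on a throwaway branch
git checkout <new-upstream-commit>
git checkout -b tmp/apply-patches
# 2. Apply the existing patches one at a time, committing each so conflicts can be resolved as they come up
for p in ../patches/*.diff; do
  git apply --3way "$p" && git commit -m "$(basename "$p")"
done
# 3. Regenerate the patch files from the resulting commits
git format-patch --no-signature -o ../patches/ <new-upstream-commit>
cd -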

@dhiltgen (Collaborator) commented on Oct 7, 2024

Thanks for taking the time to post a PR. I noticed you've made some changes to server.cpp, so I wanted to let you know that we're about to merge another PR (#5034) that begins replacing that code with a new Go-based equivalent. The goal is to add more unit testing and fix some long-standing stability bugs while preserving vision model support. If you need help rebasing your PR, don't hesitate to contact us by replying here.

@gabe-l-hart (Contributor, Author) commented:

Hi @dhiltgen, thanks for the heads-up! That PR looks like a really cool improvement. My PR was simply aimed at bumping llama.cpp to a point in its history that supports the granite/granitemoe architectures. Since there have been a number of upstream refactors since the commit currently checked into the submodule, I made a bunch of changes in server.cpp to account for them. I'll look over #5034, but my hope is that things will "just work" without the need for this PR at all after that.

@gabe-l-hart (Contributor, Author) commented:

Looking over #5034, I'm curious what the plan will be going forward for staying in sync with llama.cpp. In that PR, it looks like the library is fully vendored (as opposed to pulled in with a submodule and patches). Is the plan to move towards this kind of a vendored relationship going forward? Will there be a paved road for updating the vendored copy when changes are added to the upstream project?

As it stands, the version vendored in that PR does not reach the point in llama.cpp history that adds support for granite/granitemoe. I can take a whack at recreating this PR on top of the vendored copy in llama. I suspect we'll run into refactor conflicts similar to the ones I hit in this PR, so I'm curious whether you're already tackling the work to bump the vendored copy of llama.cpp.

@dhiltgen (Collaborator) commented on Oct 7, 2024

@gabe-l-hart we've temporarily paused updating llama.cpp to reduce churn as we work to bring #5034 across the finish line. Once that merges, we'll resume regular updates to pick up the latest upstream fixes and enhancements. Our server.cpp code had drifted a bit from upstream, which was making these updates more difficult, but the new Go equivalent should help keep the vendored code fresh.

@gabe-l-hart (Contributor, Author) commented:

Ok, that's great to hear! This PR was entirely about rectifying that drift relative to the tip of llama.cpp, so I'm glad to hear this should get easier in the future. I'll hold off on this PR until #5034 is resolved, then look to see if there's any additional work I can help with to get the granite architectures supported.

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch 5 times, most recently from 19f565c to ccb76c0 on October 14, 2024
@gabe-l-hart (Contributor, Author) commented:

@dhiltgen @jmorganca I think I've got everything working now with the new llama module!

Open Questions

  • Does the Context.SampleTokenGreedy method get used somewhere that I'm missing?
    • This one was a tough port because the llama_sampler_sample function hides the implementation of extracting the logits from the context.
    • The solution I have compiles, but I'm not sure how to actually test that it works
  • I added a helper script for updating all of the llama/patches, but one side effect is that the hashes in the diffs are a little off since they were generated from the local tmp branch I used to sequentially apply and diff the changes. Do you see this causing any problems down the road?
  • I couldn't fully trace why, but somewhere in the chain of #includes, the ggml-impl.h header got dropped, so I added a new patch to add it to llama.cpp. Do either of you have thoughts on where this might have dropped out and whether there's a cleaner way to avoid losing it?
  • From what I can tell, the sampling v2 refactor in llama.cpp fully removed support for Classifier-Free Guidance, which was previously there. I'm not familiar with what it actually does, and it looks like ollama was not using it, but I wanted to double-check that it's ok to reflect this in the runner's sampling code.
  • I don't have any non-Mac hardware, so I have no idea if I broke anything behind the various #ifdefs or other platform-specific compilation routes!

Testing

My workflow for testing:

1. Get the sample models converted and imported as above

2. Build the llama runner

cd llama
# NOTE: I need to add these flags to avoid version warnings on macOS 14.5
CGO_CFLAGS="-Wno-unguarded-availability-new -mmacosx-version-min=11.3" make -j

3. Run the runner

./build/darwin-arm64/runners/metal/ollama_llama_server -port 9090 -model path/to/model.gguf

4. Run a sample call

curl -s -X POST -H "Content-Type: application/json" -d '{"prompt": "This is the story of a developer and their dog.\n"}' http://localhost:9090/completion
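
If the sample call hangs or errors out, a couple of quick sanity checks (assuming the runner exposes a /health endpoint like the llama.cpp server it replaces):

# Confirm something is listening on the chosen port
lsof -i :9090
# Query the runner's health endpoint (assumption: /health is exposed as in the llama.cpp server)
curl -s http://localhost:9090/health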

@jessegross mentioned this pull request on Oct 15, 2024
@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch from 5b2ca19 to a0e19e4 on October 15, 2024
@gabe-l-hart (Contributor, Author) commented:

Thanks for the review @jessegross! I think I've addressed all the comments (will let you hit Resolve if you agree).

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch from 60477ca to aedf99b on October 16, 2024
@jessegross (Contributor) left a comment:

Can you also rebase on top of main? That should fix your macOS issue. I think you can also squash the patches at the same time; that will make it easier to rebase, and we don't need all the history once this has been merged.

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch from 45cce06 to a85207f on October 17, 2024
@gabe-l-hart and others added 3 commits on October 16, 2024
@gabe-l-hart (Contributor, Author) commented:

@jessegross Ok, 🏓 back to you! I'm happy to further rebase/adjust history as needed depending on your preference for how much branch context to keep. I know this one got pretty long!

@jessegross merged commit f2890a4 into ollama:main on October 17, 2024
20 checks passed
@gabe-l-hart deleted the IBMGraniteArchitectureSupport branch on October 17, 2024
@josiahbryan commented:

Does this support log probs in the API yet?

@MaciejMogilany pushed a commit to Maciej-Mogilany/ollama that referenced this pull request on Nov 12, 2024. The squashed commit message follows:
* fix(ext_server): Port llama.cpp sampling refactors to ext_server

This was a fairly large changeset. I closely followed the changes here:
ggml-org/llama.cpp@df270ef

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Bump llama.cpp to the latest master with `granite` support

This does not yet have granite MoE support, but that can come in a
follow up PR

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update solar patch for llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update the solar-pro patch for latest llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump to the latest master of llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches for latest bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama): Always run sync.sh from the right directory

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Update llama patches

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama)!: Rough sync with llama.cpp submodule

There are a number of changes that will need to be propagated to llama.go
before any of this works!

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Add a patch and update for missing ggml-impl.h include

This include is where the ggml_cgraph struct is defined. It is included in
many of the .c files to define the forward declaration in ggml.h. It seems
that with the subset of code included here, the import was somehow lost (or
out-of-order) when building, so adding this include to llama.cpp fixes the
missing definition.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Add missing log.cpp

This was added as part of the logging overhaul done in llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Overhaul use of sampling module for llama.cpp changes

The changes here reflect the changes made in the big llama.cpp sampling PR
ggml-org/llama.cpp#9294

The sampling functionality is now broken into the base interface
(llama_sampler) and the generation implementation (gpt_sampler). The
changes here reflect that. Since the sampling.h/sampling.cpp code uses C++
STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow Go to
access a pure-C interface.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix the impl of SampleTokenGreedy for new sampling

I don't think this method is currently used, so it could probably just be
removed so that all sampling goes through the GPT interface, but in the
interest of doing no harm, this should keep the method working as expected.

Branch: IBMGraniteArchitectureSupport

* fix(llama): Remove unused SampleTokenGreedy

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(sync): Remove bash-specific change to sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* chore(gofumpt): Format on llama.go to pass linting

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Fix missing <thread> include in ext_server

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove TODO about grammar_first

This feature was not used/needed previously so should be fine without
plumbing it through now.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Better naming for sampling wrapper and args

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix patch 05 to use new wrapper api and re-sync

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* runner: Flush pending responses before returning

If there are any pending responses (such as from potential stop
tokens) then we should send them back before ending the sequence.
Otherwise, we can be missing tokens at the end of a response.

Fixes ollama#6707

* fix(llama/sampling): Use gpt_sampler with a forward declaration

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove unnecessary patch for gguf impl header

This was caused by an earlier mistake in the embeddings patch that was
dereferencing the pointer instead of using the wrapper API.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Remove use of deprecated --log-disable flag

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>