
Conversation

@gabe-l-hart (Contributor) commented on Sep 11, 2024

Special Note

Since this PR bumps llama.cpp past the tip of master (6026da52 as of this writing), it includes the recent changes that overhaul sampling and logging. I updated server.cpp so that it compiles and can run the models successfully, and I updated all of the patches to apply to the updated llama.cpp codebase.
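
For anyone checking out the branch, a quick way to confirm exactly which llama.cpp commit the submodule is pinned at (assuming the submodule lives at llm/llama.cpp, as in the testing steps below):

# Show the llama.cpp commit the submodule currently points at
git submodule status llm/llama.cpp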

Dependencies

UPDATE: This PR no longer has dependencies. The first llama.cpp PR has been merged to support granite, and given our hope to release soon, we'd like to get this merged without granitemoe support and add that in a follow-up PR.

UPDATE 2: Both granite and granitemoe are now supported in llama.cpp. I've rebased the PR to include them (and to pick up support for chameleon).

Original note (superseded by the updates above): This PR is dependent on two PRs in llama.cpp. Currently, the branch will not build since the submodule points to a commit on my fork and I have not changed the remote URL. Once the llama.cpp PRs are merged, I will update the submodule pointer to the mainline.

Description

This PR adds support for IBM's granite architecture. See the llama.cpp PRs for full details on the added architectures.

Testing

In order to test this while it's in draft, I did the following:

# Download the IBM Research experimental models (requires the huggingface-cli Python package)
huggingface-cli download ibm/PowerLM-3b --local-dir $HOME/models/powerlm-3b
huggingface-cli download ibm/PowerMoE-3b --local-dir $HOME/models/powermoe-3b

# Convert to GGUF using the latest version of llama.cpp (I'm doing it here in the submodule)
cd llm/llama.cpp
pip install -r requirements/requirements-convert_hf_to_gguf.txt
python convert_hf_to_gguf.py $HOME/models/powerlm-3b
python convert_hf_to_gguf.py $HOME/models/powermoe-3b
cd -

# Build the llama-quantize binary in the submodule
cd llm/build/darwin/arm64_static/
make llama-quantize -j
cd -

# Quantize with the locally built llama-quantize
# (the input is the converted GGUF file; the exact filename produced by the converter may differ)
./llm/build/darwin/arm64_static/bin/llama-quantize $HOME/models/powerlm-3b/ggml-model-f16.gguf Q4_K_M
./llm/build/darwin/arm64_static/bin/llama-quantize $HOME/models/powermoe-3b/ggml-model-f16.gguf Q4_K_M

# Import to ollama (finally!)
echo "FROM $HOME/models/powerlm-3b/ggml-model-Q4_K_M.gguf" > Modelfile.powerlm-3b
./ollama create -f Modelfile.powerlm-3b powerlm:3b
echo "FROM $HOME/models/powermoe-3b/ggml-model-Q4_K_M.gguf" > Modelfile.powermoe-3b
./ollama create -f Modelfile.powermoe-3b powermoe:3b
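
Once imported, a quick smoke test of the new models (a minimal sketch; assumes ./ollama serve is running in another terminal and uses the tags created above):

# Sanity-check the imported models with a short prompt
./ollama run powerlm:3b "Tell me a short story about a developer and their dog."
./ollama run powermoe:3b "Tell me a short story about a developer and their dog."
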
Old instructions for building ollama from my fork:

# Add my personal fork as a remote in the submodule
cd llm/llama.cpp
git remote add gabe https://github.com/gabe-l-hart/llama.cpp.git
git fetch gabe
cd -

# Generate and build like normal
go generate ./...
go build .

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch 2 times, most recently from 3866cb0 to e40b6fc on September 20, 2024
@gabe-l-hart marked this pull request as ready for review on September 20, 2024
@gabe-l-hart (Contributor, Author) commented:

@jmorganca Since the initial granite PR has been merged in llama.cpp, I've re-bumped llama.cpp in this PR to the tip of master. At this point, we'd like to move forward with getting this merged to support granite while we wait for granitemoe in llama.cpp, which I can then add in a separate PR.

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch 2 times, most recently from 37901c9 to 675fac9 on September 25, 2024
@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch from 675fac9 to 0cdeed0 on October 1, 2024
@gabe-l-hart (Contributor, Author) commented:

@jmorganca I've rebased this PR again to the latest tip of master in llama.cpp (3f1ae2e3). This picks up support for chameleon, which was added to llama.cpp since the last rebase.

I also added a script to help with the process of updating the patches when bumping llama.cpp. I'm happy to remove it, but thought it might be worth contributing in case anyone else runs into the need to rebase frequently when bumping llama.cpp to support a new model.
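
For anyone curious, the general shape of that workflow is roughly the following. This is a rough sketch only, not the actual script; the patch directory, file extension, throwaway branch name, and <new-upstream-commit> are placeholders:

# Rough sketch of the patch-update flow (placeholders only, not the actual helper script)
cd llm/llama.cpp
# 1. Check out the new upstream commit on a throwaway branch
git checkout <new-upstream-commit>
git checkout -b tmp/apply-patches
# 2. Apply the existing patches one at a time, committing each so conflicts can be resolved as they come up
for p in ../patches/*.diff; do
  git apply --3way "$p" && git commit -m "$(basename "$p")"
done
# 3. Regenerate the patch files from the resulting commits
git format-patch --no-signature -o ../patches/ <new-upstream-commit>
cd -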

@dhiltgen (Collaborator) commented on Oct 7, 2024

Thanks for taking the time to post a PR. I noticed you've made some changes to server.cpp, so I wanted to let you know that we're about to merge another PR (#5034) that begins replacing that code with a new Go-based equivalent. The goal is to add more unit testing and fix some long-standing stability bugs while preserving vision model support. If you need help rebasing your PR, don't hesitate to contact us by replying here.

@gabe-l-hart (Contributor, Author) commented:

Hi @dhiltgen, thanks for the heads-up! That PR looks like a really cool improvement. My PR was simply aimed at bumping llama.cpp to a point in its history that supports the granite/granitemoe architectures. Since there have been a number of upstream refactors since the commit currently checked into the submodule, I made a bunch of changes in server.cpp to account for them. I'll look over #5034, but my hope is that things will "just work" without the need for this PR at all after that.

@gabe-l-hart (Contributor, Author) commented:

Looking over #5034, I'm curious what the plan will be going forward for staying in sync with llama.cpp. In that PR, it looks like the library is fully vendored (as opposed to pulled in with a submodule and patches). Is the plan to move towards this kind of a vendored relationship going forward? Will there be a paved road for updating the vendored copy when changes are added to the upstream project?

As it stands, the version vendored in that PR does not reach the point in llama.cpp history that adds support for granite/granitemoe. I can take a whack at recreating this PR on top of the vendored copy in llama. I suspect we'll run into refactor conflicts similar to the ones I hit in this PR, so I'm curious whether you're already tackling the work to bump the vendored copy of llama.cpp.

@dhiltgen (Collaborator) commented on Oct 7, 2024

@gabe-l-hart we've temporarily paused updating llama.cpp to reduce churn as we work to bring #5034 across the finish line. Once that merges, we'll resume regular updates to pick up the latest upstream fixes and enhancements. Our server.cpp code had drifted a bit from upstream, which was making these updates more difficult, but the new Go equivalent should help keep the vendored code fresh.

@gabe-l-hart (Contributor, Author) commented:

Ok, that's great to hear! This PR was entirely about rectifying that drift relative to the tip of llama.cpp, so I'm glad to hear this should get easier in the future. I'll hold off on this PR until #5034 is resolved, then look to see if there's any additional work I can help with to get the granite architectures supported.

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch 5 times, most recently from 19f565c to ccb76c0 on October 14, 2024
@gabe-l-hart (Contributor, Author) commented:

@dhiltgen @jmorganca I think I've got everything working now with the new llama module!

Open Questions

  • Does the Context.SampleTokenGreedy method get used somewhere that I'm missing?
    • This one was a tough port because the llama_sampler_sample function hides the implementation of extracting the logits from the context.
    • The solution I have compiles, but I'm not sure how to actually test that it works
  • I added a helper script for updating all of the llama/patches, but one side effect is that the hashes in the diffs are a little off since they were generated from the local tmp branch I used to sequentially apply and diff the changes. Do you see this causing any problems down the road?
  • I couldn't fully trace why, but somewhere in the chain of #includes, the ggml-impl.h header got dropped, so I added a new patch to add it to llama.cpp. Do either of you have thoughts on where this might have dropped out and whether there's a cleaner way to avoid losing it?
  • From what I can tell, the sampling v2 refactor in llama.cpp fully removed support for Classifier-Free Guidance, which was previously there. I'm not familiar with what it actually does, and it looks like ollama was not using it, but I wanted to double-check that it's ok to reflect this in the runner's sampling code.
  • I don't have any non-Mac hardware, so I have no idea if I broke anything behind the various #ifdefs or other platform-specific compilation routes!

Testing

My workflow for testing:

1. Get the sample models converted and imported as above

2. Build the llama runner

cd llama
# NOTE: I need to add these flags to avoid version warnings on macOS 14.5
CGO_CFLAGS="-Wno-unguarded-availability-new -mmacosx-version-min=11.3" make -j

3. Run the runner

./build/darwin-arm64/runners/metal/ollama_llama_server -port 9090 -model path/to/model.gguf

4. Run a sample call

curl -s -X POST -H "Content-Type: application/json" -d '{"prompt": "This is the story of a developer and their dog.\n"}' http://localhost:9090/completion
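
If the sample call hangs or errors out, a couple of quick sanity checks (assuming the runner exposes a /health endpoint like the llama.cpp server it replaces):

# Confirm something is listening on the chosen port
lsof -i :9090
# Query the runner's health endpoint (assumption: /health is exposed as in the llama.cpp server)
curl -s http://localhost:9090/health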

@jessegross mentioned this pull request on Oct 15, 2024
@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch from 5b2ca19 to a0e19e4 on October 15, 2024
@gabe-l-hart (Contributor, Author) commented:

Thanks for the review @jessegross! I think I've addressed all the comments (will let you hit Resolve if you agree).

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch from 60477ca to aedf99b on October 16, 2024
@jessegross (Contributor) left a comment:

Can you also rebase on top of main? That should fix your macOS issue. I think you can also squash the patches at the same time; that will make it easier to rebase, and we don't need all the history once this has been merged.

@gabe-l-hart force-pushed the IBMGraniteArchitectureSupport branch from 45cce06 to a85207f on October 17, 2024
@gabe-l-hart and others added 3 commits on October 16, 2024
@gabe-l-hart (Contributor, Author) commented:

@jessegross Ok, 🏓 back to you! I'm happy to further rebase/adjust history as needed depending on your preference for how much branch context to keep. I know this one got pretty long!

@jessegross merged commit f2890a4 into ollama:main on October 17, 2024
20 checks passed
@gabe-l-hart deleted the IBMGraniteArchitectureSupport branch on October 17, 2024
@josiahbryan commented:

Does this support log probs in the API yet?

@MaciejMogilany pushed a commit to Maciej-Mogilany/ollama that referenced this pull request on Nov 12, 2024. The squashed commit message follows:
* fix(ext_server): Port llama.cpp sampling refactors to ext_server

This was a fairly large changeset. I closely followed the changes here:
ggml-org/llama.cpp@df270ef

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Bump llama.cpp to the latest master with `granite` support

This does not yet have granite MoE support, but that can come in a
follow up PR

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update solar patch for llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update the solar-pro patch for latest llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump to the latest master of llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches for latest bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama): Always run sync.sh from the right directory

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Update llama patches

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama)!: Rough sync with llama.cpp submodule

There are a number of changes that will need to be propagated to llama.go
before any of this works!

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Add a patch and update for missing ggml-impl.h include

This include is where the ggml_cgraph struct is defined. It is included in
many of the .c files to define the forward declaration in ggml.h. It seems
that with the subset of code included here, the import was somehow lost (or
out-of-order) when building, so adding this include to llama.cpp fixes the
missing definition.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Add missing log.cpp

This was added as part of the logging overhaul done in llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Overhaul use of sampling module for llama.cpp changes

The changes here reflect the changes made in the big llama.cpp sampling PR
ggml-org/llama.cpp#9294

The sampling functionality is now broken into the base interface
(llama_sampler) and the generation implementation (gpt_sampler). The
changes here reflect that. Since the sampling.h/sampling.cpp code uses C++
STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow Go to
access a pure-C interface.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix the impl of SampleTokenGreedy for new sampling

I don't think this method is currently used, so it could probably just be
removed so that all sampling goes through the GPT interface, but in the
interest of doing no harm, this should keep the method working as expected.

Branch: IBMGraniteArchitectureSupport

* fix(llama): Remove unused SampleTokenGreedy

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(sync): Remove bash-specific change to sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* chore(gofumpt): Format on llama.go to pass linting

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Fix missing <thread> include in ext_server

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove TODO about grammar_first

This feature was not used/needed previously so should be fine without
plumbing it through now.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Better naming for sampling wrapper and args

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix patch 05 to use new wrapper api and re-sync

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* runner: Flush pending responses before returning

If there are any pending responses (such as from potential stop
tokens) then we should send them back before ending the sequence.
Otherwise, we can be missing tokens at the end of a response.

Fixes ollama#6707

* fix(llama/sampling): Use gpt_sampler with a forward declaration

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove unnecessary patch for gguf impl header

This was caused by an earlier mistake in the embeddings patch that was
dereferencing the pointer instead of using the wrapper API.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Remove use of deprecated --log-disable flag

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>