Vulkan support (replacing pull/5059) #9650


Open
wants to merge 50 commits into base: main

Conversation

grinco

@grinco grinco commented Mar 11, 2025

This pull request is based on #5059 and whyvl#7.

Tested on v0.5.13 on Linux. The image was built using the supplied Dockerfile, with the caveat that the release image base was bumped to 24.04 (from 20.04).

Build command:

docker buildx build --platform linux/amd64 ${OLLAMA_COMMON_BUILD_ARGS} -t grinco/ollama-amd-apu:vulkan .

Tested on an AMD Ryzen 7 8845HS with Radeon 780M Graphics, with ROCm disabled:

[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
time=2025-03-11T13:00:40.793Z level=INFO source=gpu.go:199 msg="vulkan: load libvulkan and libcap ok"
time=2025-03-11T13:00:40.877Z level=INFO source=gpu.go:421 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2025-03-11T13:00:40.878Z level=WARN source=amd_linux.go:443 msg="amdgpu detected, but no compatible rocm library found.  Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install"
time=2025-03-11T13:00:40.878Z level=WARN source=amd_linux.go:348 msg="unable to verify rocm library: no suitable rocm found, falling back to CPU"
time=2025-03-11T13:00:40.879Z level=INFO source=types.go:137 msg="inference compute" id=0 library=vulkan variant="" compute=1.3 driver=1.3 name="AMD Radeon Graphics (RADV GFX1103_R1)" total="15.6 GiB" available="15.6 GiB"
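
As a side note, a quick way to confirm that the Vulkan loader actually sees the GPU before starting the server is vulkaninfo; a minimal check, assuming the vulkan-tools package is available inside the container:

# The 780M should be listed with the same name as in the "inference compute"
# log line above (AMD Radeon Graphics (RADV GFX1103_R1)).
vulkaninfo --summary | grep -i devicename
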
 # ollama run phi4:14b
>>> /set verbose
Set 'verbose' mode.
>>> how's it going?
Hello! I'm here to help you with any questions or tasks you have. How can I assist you today? 😊

total duration:       3.341959745s
load duration:        18.165612ms
prompt eval count:    15 token(s)
prompt eval duration: 475ms
prompt eval rate:     31.58 tokens/s
eval count:           26 token(s)
eval duration:        2.846s
eval rate:            9.14 tokens/s
>>>

MingcongBai added commits to AOSC-Dev/aosc-os-abbs that referenced this pull request Apr 16, 2025
- Backport Vulkan backend support from ollama/ollama#9650.
- Track patches at AOSC-Tracking/ollama @ aosc/v0.6.5
  (HEAD: d75a173b8618a2ce35287663ffc6f75779e7b265, later 31a866457d350d17de839986c105312bcf8eb0e6).
@juls0730

This comment was marked as outdated.

@juls0730

juls0730 commented Apr 29, 2025

I can't seem to add an issue to the repo itself, so I will report my findings here.

I'm running an RX 580. The trouble starts when I increase the context window to around 20k: I get complete garbage responses. With mistral-nemo, I get, uh, whatever this is:

mistral-nemo nonsense
<SPECIAL_24>[AVAILABLE_TOOLS][MIDDLE]<SPECIAL_26><SPECIAL_15>[SUFFIX]<SPECIAL_32>[A<SPECIAL_24>[AVAILABLE_TOOLS][MIDDLE]<SPECIAL_26><SPECIAL_15>[SUFFIX]<SPECIAL_32>[AVAILABLE_TOOLS][TOOL_RESULTS]<SPECIAL_20><SPECIAL_16><SPECIAL_20><SPECIAL_25><SPECIAAILABLE_TOOLS][TOOL_RESULTS]<SPECIAL_20><SPECIAL_16><SPECIAL_20><SPECIAL_25><SPECIAL_27><SPECIAL_22><SPECIAL_38><SPECIAL_17><SPECIAL_25><unk><SPECIAL_25><SPECIAL_33>[S_27><SPECIAL_22><SPECIAL_38><SPECIAL_17><SPECIAL_25><unk><SPECIAL_25><SPECIAL_33>[SUFFIX]<unk><SPECIAL_39><SPECIAL_19>[SUFFIX]

And when running llama3.2:3b and llama3.2:3b-instruct-q4_K_S, I just get the letter "G" repeated a bunch. I'm running Ollama with the launch instructions provided for your Docker image for this PR:

docker run -d --restart=always \
        -v ollama:/root/.ollama \
        -p 11434:11434 --name ollama_vulkan-grinco \
        -v /opt/amdgpu/:/opt/amdgpu/ \
        --device /dev/kfd --device /dev/dri \
        --cap-add CAP_PERFMON \
        -e OLLAMA_FLASH_ATTENTION=1 \
        -e OLLAMA_KV_CACHE_TYPE="q8_0" \
        docker.io/grinco/ollama-amd-apu:vulkan

With a context below 20k it seems fine. I'm also able to use ROCm just fine by following #2453 (comment) (albeit slower than with Vulkan, but at least it's stable lmao).

@SergeyFilippov

[quoted from @juls0730's report above: garbage output at ~20k context on an RX 580 with OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0]

As far as I understand, it is currently not recommended to use flash attention and KV-cache quantization while running on Vulkan. There are multiple issues with this in the underlying llama.cpp.
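
Concretely, that means the same docker run as in the report above with the two -e flags dropped, so flash attention and KV-cache quantization stay at their defaults; a sketch, assuming nothing else about the setup changes:

docker run -d --restart=always \
        -v ollama:/root/.ollama \
        -p 11434:11434 --name ollama_vulkan-grinco \
        -v /opt/amdgpu/:/opt/amdgpu/ \
        --device /dev/kfd --device /dev/dri \
        --cap-add CAP_PERFMON \
        docker.io/grinco/ollama-amd-apu:vulkan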

@juls0730

juls0730 commented May 6, 2025

[quotes the report above and @SergeyFilippov's note that flash attention and KV-cache quantization are not recommended with Vulkan]

IIRC I tried running the model without KV quantization and flash attention and it still had issues, but I will try again and make sure. Update: that seemingly fixes the issue, thanks @SergeyFilippov!

@juls0730

juls0730 commented May 8, 2025

This Vulkan PR is pretty nice: loading models with a big context window seems faster, and the time to first token is much shorter. There is also a nice performance uplift compared to my previous solution (#2453 (comment)). I hope this gets more attention eventually.

@SergeyFilippov

Hi, @jmorganca,
I know you have a lot of work to do and defined priorities, but speaking on behalf of the part of the community that doesn't have CUDA or ROCm available: Vulkan is a groundbreaking improvement to LLM performance (as the owner of an RX 9070 series card, which has no ROCm support but 16 GB of VRAM and lots of FLOPS).

I do understand that Ollama's goal is simplicity for the user, and that maintaining multiple backends is a burden, but it would still be nice to know:

  1. Is there a chance of getting official Vulkan support in the future?
  2. If so, what can we do or improve to make it happen?
  3. What would it take to just hide this backend behind some experimental flag and provide the llama.cpp Vulkan engine "as-is"?

Thanks in advance.

P.S. It might make sense to pin the answer/statement about Vulkan somewhere, so these questions won't get asked over and over again.

@machiav3lli

machiav3lli commented May 15, 2025

I'd guess a major release like 0.7.0 would be a good point to integrate the community work on such a feature (and for the community to push for this where necessary).

@virajwad

virajwad commented Jul 1, 2025

Is this PR buildable on Windows in its current state?

If I try cmake -B build -DGGML_VULKAN=OFF, I can build, but if I try cmake -B build -DGGML_VULKAN=ON, I get the following error:

[screenshot of the build error]

Or could I please get some feedback if I'm building it wrong?
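
For what it's worth, the GGML_VULKAN=ON path in llama.cpp expects the Vulkan SDK (including the glslc shader compiler) to be discoverable by CMake, so a missing or unset VULKAN_SDK is a common cause of failures like this on Windows. A rough sketch of a build attempt from a PowerShell prompt, where the SDK path and version are purely illustrative:

# Assumes the LunarG Vulkan SDK is installed; the path/version below are examples only
$env:VULKAN_SDK = "C:\VulkanSDK\1.3.290.0"
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release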

@chilman408

I installed this as a Docker app on my Unraid server. This implementation is better than the ROCm version for my AMD 780M APU: I tried the ROCm version and it would work for a few minutes and then the GPU would crash. This version, using Vulkan, doesn't run into those "gpu hang" issues.

However, I'm getting errors when installing the latest DeepSeek or Qwen3 models where it says I need to update Ollama to use these models...

@juls0730

However, I'm getting errors when installing the latest DeepSeek or Qwen3 models where it says I need to update Ollama to use these models...

I believe this is because the PR is based on an old version of Ollama; there is no way to fix this short of updating the patch to work with the latest version of Ollama, which I think would be quite time consuming.

@grinco
Author

grinco commented Jul 23, 2025

However, I'm getting errors when installing the latest DeepSeek or Qwen3 models where it says I need to update Ollama to use these models...

I believe this is because the PR is based on an old version of Ollama; there is no way to fix this short of updating the patch to work with the latest version of Ollama, which I think would be quite time consuming.

Yes, that's the reason. This PR and its predecessor were submitted quite a while ago (4 months and 1 year ago, respectively), and keeping up with the main branch to fix breaking changes, without knowing if it will ever be merged, is not something I'll be investing time into.

@yeahdongcn
Contributor

Any chance of syncing with the upstream code for both Ollama and llama.cpp? Thanks.

@grinco
Author

grinco commented Aug 7, 2025

There are multiple conflicts that need to be resolved, some of them in the Go code, which is going to require someone familiar with the matter to have a look. The codebase has drifted too far from the Vulkan fork for me to address it without a considerable amount of effort spent understanding the codebase. Maybe someone more knowledgeable can contribute. I see some conversation going on in whyvl#7 (comment); however, it also seems to be stuck at an older version (v0.9.3). I personally don't see any value in putting more work into this until it merges, or until someone creates a Vulkan fork that they will maintain.

@yeahdongcn
Contributor

I've added Vulkan backend support in https://github.com/MooreThreads/ollama-musa (I'm maintaining an Ollama fork that supports MooreThreads GPUs), which is based on Ollama v0.11.4.

The latest multi-arch (amd64 and arm64) Docker image for the Vulkan backend is docker.io/mthreads/ollama:0.11.4-vulkan. I’ve tested it on MTGPU, and it works well.

I’d like to test this on a virtual machine with AMDGPU (via AMD Developer Cloud), but I couldn’t find the Vulkan ICD on that machine. I noticed you’ve tested it on AMDGPU, so I’m wondering if you could share any instructions or tips.

@chilman408

[quotes @yeahdongcn's comment above about the MooreThreads Vulkan image and the missing Vulkan ICD on AMD Developer Cloud]

Oh for sure, let me check it out!

@yeahdongcn
Contributor

yeahdongcn commented Aug 15, 2025

I’d like to test this on a virtual machine with AMDGPU (via AMD Developer Cloud), but I couldn’t find the Vulkan ICD on that machine. I noticed you’ve tested it on AMDGPU, so I’m wondering if you could share any instructions or tips.

I have installed vulkan-amdgpu, and it turns out the ICD file is now available:

vulkan-amdgpu/noble,now 25.10-2165406.24.04 amd64 [installed]
  AMDGPU Vulkan driver

But running vulkaninfo results in the following error:

root@0-4-35-gpu-mi300x1-192gb-devcloud-atl1:~# vulkaninfo
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Received return code -3 from call to vkCreateInstance in ICD /opt/amdgpu/lib/x86_64-linux-gnu/amdvlk64.so. Skipping this driver.
'DISPLAY' environment variable not set... skipping surface info
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu:    size      : 0 bytes
radv/amdgpu:    alignment : 0 bytes
radv/amdgpu:    domains   : 2
Segmentation fault (core dumped)
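
Not specific to this PR, but one generic way to narrow that down is to point the Vulkan loader at a single ICD at a time via the VK_ICD_FILENAMES environment variable (VK_DRIVER_FILES on newer loaders), so RADV and AMDVLK can be tested in isolation. A sketch, where the RADV JSON path is the usual Mesa location and may need adjusting:

# Mesa RADV only (usual Mesa install path; adjust to your system)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json vulkaninfo --summary

# ...and likewise with the ICD JSON that references the amdvlk64.so from the
# warning above, to test the AMDVLK driver on its own.

The 'DISPLAY' warning by itself should be harmless on a headless machine; vulkaninfo just skips the surface queries.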

@yeahdongcn
Contributor

yeahdongcn commented Aug 20, 2025

I just pushed the latest tag mthreads/ollama:0.11.5-vulkan (repo: https://github.com/MooreThreads/ollama-musa), which is based on the Ollama v0.11.5 code.

This has been tested on MooreThreads MTT S80, Intel Arc A770, and AMD 780M.
MooreThreads#22
whyvl#26
