server/sched.go: Added support for grouping GPUs (significant VRAM reduction, Power-Consumption + Low-PCIe-speed performance gains) #10678


Merged: 24 commits merged into ollama:main on Aug 11, 2025

Conversation

dan-and commented May 12, 2025

This patch modifies Ollama so that it can group GPUs to fit the requested model's memory, instead of the former algorithm of either using one GPU or distributing the model over all available GPUs.

Benefits:

  • Less (PCIe-)bus communication between GPUs, which matters especially when the links are not very fast.
  • Allows unallocated GPUs to enter power-saving mode.
  • Significantly reduces VRAM allocation when using more than 2 GPUs in a system (see below).
  • Because of the reduced memory allocation, you can run more models simultaneously.

How to use:

  • OLLAMA_SCHED_SPREAD=0 (or leave it unset) to use a single GPU if the model fits, and otherwise distribute the load evenly across all GPUs (old default behavior).
  • OLLAMA_SCHED_SPREAD=1 (or any other value) to always distribute the load evenly across all GPUs (same as the old behavior when the variable was set).
  • OLLAMA_SCHED_SPREAD=2 to keep the number of GPUs to a minimum and use only as many as needed to fit the model (see the scheduler sketch below).
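
For illustration only, the following minimal Go sketch shows how a scheduler could map these three settings onto a selection strategy; the gpuStrategy type and strategyFromEnv function are hypothetical names, not code from this PR or from Ollama's sched.go.

```go
// Hypothetical sketch, not Ollama's actual code: map the proposed
// OLLAMA_SCHED_SPREAD values onto a GPU-selection strategy.
package main

import (
	"fmt"
	"os"
)

type gpuStrategy int

const (
	singleThenSpread gpuStrategy = iota // unset/0: one GPU if the model fits, else all GPUs
	spreadAll                           // 1 (or other values): spread over every GPU
	packMinimal                         // 2: fewest GPUs whose combined VRAM fits the model
)

func strategyFromEnv() gpuStrategy {
	switch os.Getenv("OLLAMA_SCHED_SPREAD") {
	case "2":
		return packMinimal
	case "", "0":
		return singleThenSpread
	default:
		return spreadAll
	}
}

func main() {
	fmt.Println("strategy:", strategyFromEnv())
}
```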

Tests and Environment:

Like so many in the home-lab community, my Ollama server is built on a budget, which resulted in a low-power-optimized server with 7 GPUs (Nvidia RTX 3060, 12 GB) and a total of 72GB VRAM.

When using models which require around 30-40 GB, all GPUs have to run, which results in higher power consumption and a huge overhead in VRAM allocation. With this patch, only a few GPUs are in use, which lets all the other GPUs power down to around 3 watts.
Since the server (Ryzen 5500G / Asus Prime B450) is not enterprise grade, the PCIe bus is far slower, which also hurts utilization when too many GPUs are used for a single model.

Results of Memory-Allocation Overhead when spreading a model over more than 2 GPUs:

In this example, Ollama is configured with num_parallel=2 and num_ctx=8192.

Model: gemma3:4b - a2af6cc3eb7f - 3.3 GB
Overall required VRAM when spreading over:
1x GPU: 5.7 GiB
2x GPU: 7.8 GiB
3x GPU: 9.3 GiB
4x GPU: 10.8 GiB
5x GPU: 12.3 GiB
6x GPU: 13.9 GiB
7x GPU: 15.4 GiB

A more realistic situation is qwen3:30b - 0b28110b7a33 - 18 GB, which exceeds the VRAM of a single home-lab GPU.
3x GPU: 27.2 GiB
4x GPU: 30.1 GiB
5x GPU: 32.9 GiB
6x GPU: 35.8 GiB
7x GPU: 38.7 GiB
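
For a rough sense of scale (my arithmetic from the numbers above, not a figure reported in the PR): the overhead grows roughly linearly, at about (15.4 − 5.7) / (7 − 1) ≈ 1.6 GiB per additional GPU for gemma3:4b and (38.7 − 27.2) / (7 − 3) ≈ 2.9 GiB per additional GPU for qwen3:30b.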

Results of power consumption measured "at the wall" and eval rates:

Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD unset: 450 Watt with 27 fps
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=1: 450 Watt with 27 fps
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=2: 310 Watt with 81 fps

Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD unset: 393 Watt with 44 fps
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=1: 393 Watt with 44 fps
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=2: 320 Watt with 51 fps

(Yes, Qwen3:14b is far less efficient on my system than the slightly newer 30b variant.)

gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD unset: 250 Watt with 70 fps
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=1 : 389 Watt with 55 fps
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=2 : 250 Watt with 70 fps

(With num_ctx=8192, gemma3:4b also ended up on just 1 GPU with OLLAMA_SCHED_SPREAD=2, the same as with OLLAMA_SCHED_SPREAD unset, since one GPU is enough to fit the whole model.)

…uirements.

This can be helpful for systems with multiple GPUs which are not completely utilized.
Use OLLAMA_SCHED_SPREAD=0 (or keep it unset) to use a single GPU if the model fits, otherwise distribute the load evenly across all GPUs (old default behavior).
Use OLLAMA_SCHED_SPREAD=2 to keep the number of GPUs to a minimum and only use as many as needed to run the model.
Use OLLAMA_SCHED_SPREAD=1 (or set to anything else) to evenly distribute the load across all GPUs.
dan-and changed the title from "server/sched.go: Added support for grouping GPUs ( Power-Consumption + Low-PCIe Speed Performance gains)" to "server/sched.go: Added support for grouping GPUs ( Power-Consumption + Low-PCIe-speed performance gains)" on May 12, 2025

dan-and commented May 16, 2025

Is there anything I can do to get it reviewed?

dan-and changed the title from "server/sched.go: Added support for grouping GPUs ( Power-Consumption + Low-PCIe-speed performance gains)" to "server/sched.go: Added support for grouping GPUs ( significant VRAM reduction, Power-Consumption + Low-PCIe-speed performance gains)" on Jun 5, 2025

dan-and commented Jun 5, 2025

@rick-github, thanks for pointing out the VRAM-allocation effect of my patch; next to the power consumption, this is a huge benefit.

The examples above show how much VRAM overhead Ollama incurs when the model is spread over all available GPUs instead of grouping them.

@dan-and dan-and requested review from preechapon-cmd and Re2906 June 6, 2025 09:38

dan-and commented Jun 23, 2025

Hi @jmorganca and @mxyng

Do you see major issues in my patch that might prevent it from being included in the master branch? If so, please give me feedback so that I can update it accordingly.

Kind regards

jessegross commented:

I don't think that we need to introduce a new configuration setting for this. We can make it so that the default setting packs the model onto the minimum number of GPUs required and then OLLAMA_SCHED_SPREAD=1 splits it evenly over all GPUs.


dan-and commented Jul 9, 2025

Hi Jesse,
thanks for the feedback.
I will rearrange my patch so that it is the default behavior. Give me 2-3 days and it will be updated.


dan-and commented Jul 11, 2025

@jessegross
Updated to make it the default behavior, and OLLAMA_SCHED_SPREAD=1 will spread it evenly.

As part of my patch, I have introduced two slog.Info calls and one slog.Debug call. They helped me quite a bit in verifying that Ollama does the correct allocation with various system setups and configs. Should I keep them in the code, or would you suggest removing them or moving the slog.Info calls to slog.Debug to keep the log volume low?

Regards,
Daniel

PS: It works great on my 7 GPU system:

OLLAMA_SCHED_SPREAD=1

daniel@gpu:~/source/ollama_dan$ ollama run qwen3:4b --verbose "say hi!"
Thinking...
Okay, the user said "say hi!" so I need to respond with a friendly greeting. Let me make sure to keep it warm and welcoming. Maybe add an emoji to make it more lively. I should also ask how I can assist them today. Let me
check the tone—should be positive and eager to help. Alright, that should do it.
...done thinking.

Hello! 😊 How can I assist you today?

total duration: 11.976054545s
load duration: 10.255610156s
prompt eval count: 13 token(s)
prompt eval duration: 358.652306ms
prompt eval rate: 36.25 tokens/s
eval count: 88 token(s)
eval duration: 1.36080978s
eval rate: 64.67 tokens/s

daniel@gpu:~/source/ollama_dan$ ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
qwen3:4b 2bfd38a7daaf 20 GB 100% GPU 8192 29 minutes from now

OLLAMA_SCHED_SPREAD=0 or not set at all

daniel@gpu:~/source/ollama_dan$ ollama run qwen3:4b --verbose "say hi!"
Thinking...
Okay, the user said "say hi!" so I need to respond in a friendly and welcoming manner. Let me start by acknowledging their greeting. Maybe use an emoji to keep it light and approachable.
I should mention my name, Qwen, to establish identity. Then, offer assistance by asking how I can help them today. Keep it concise but open-ended so they feel comfortable to share their
needs. Make sure the tone is positive and eager to help. Let me put that all together.
...done thinking.

Hello! I'm Qwen, your helpful assistant. How can I assist you today? 😊

total duration: 5.474187439s
load duration: 3.6653167s
prompt eval count: 13 token(s)
prompt eval duration: 139.89265ms
prompt eval rate: 92.93 tokens/s
eval count: 125 token(s)
eval duration: 1.668211081s
eval rate: 74.93 tokens/s

daniel@gpu:~/source/ollama_dan$ ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
qwen3:4b 2bfd38a7daaf 7.3 GB 100% GPU 8192 29 minutes from now

jessegross commented:

Yes, I think you can remove the extra logging and the memory calculations that go into them. Please also retain existing comments and TODOs that aren't affected by your change.

Since you are iterating from smallest to largest number of GPUs, once you find a set that fits, you can just return that. You don't need to keep looking and find the best one - no later tries are going to use fewer GPUs.

I have some in-progress code that does something similar for a different code path, you can use it as a reference:

if !envconfig.SchedSpread() {
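
To make the early-return idea concrete, here is a rough sketch under an assumed gpuInfo type (a stand-in, not Ollama's discover.GpuInfo and not the code in this PR): iterate from the smallest subset upward and return as soon as the combined free VRAM covers the estimate.

```go
// Rough sketch of the reviewer's suggestion, with a hypothetical gpuInfo type.
package sched

type gpuInfo struct {
	ID         string
	FreeMemory uint64
}

// smallestFittingSubset returns the first (and therefore smallest) prefix of
// gpus whose combined free VRAM covers requiredVRAM, or nil if none fits.
// It assumes gpus is already ordered by preference, e.g. free VRAM descending.
func smallestFittingSubset(gpus []gpuInfo, requiredVRAM uint64) []gpuInfo {
	for n := 1; n <= len(gpus); n++ {
		var free uint64
		for _, g := range gpus[:n] {
			free += g.FreeMemory
		}
		if free >= requiredVRAM {
			return gpus[:n] // no larger subset can use fewer GPUs, so stop here
		}
	}
	return nil
}
```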

dan-and added 2 commits July 12, 2025 21:33
restored unsolved TODOs in sched.go
aligned logic with assignLayers function from server.go and reducing unnecessary gpu matchings

dan-and commented Jul 12, 2025

@jessegross
Thanks for the heads-up. I aligned the logic to be similar to your assignLayers concept. It's much cleaner now.

Still, my GPU selection is based on the GPU with the largest free VRAM at that moment. This will usually keep models on separate GPUs, which allows power savings while a model is idling.
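
As an illustration of that ordering (again with a hypothetical gpuInfo type rather than Ollama's actual GPU struct), sorting by free VRAM before building subsets could look roughly like this:

```go
// Sketch only: order GPUs by free VRAM, largest first, so new models land on
// the emptiest GPUs and leave the busy ones alone.
package sched

import "sort"

type gpuInfo struct {
	ID         string
	FreeMemory uint64
}

func byFreeMemoryDesc(gpus []gpuInfo) []gpuInfo {
	sorted := append([]gpuInfo(nil), gpus...) // copy; leave the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].FreeMemory > sorted[j].FreeMemory
	})
	return sorted
}
```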

I gave it a few rounds with 20 different models and sizes, and it works as intended.

The incorrectly removed TODOs are back in the code. Sorry for that.


dan-and commented Jul 25, 2025

@jessegross any additional comments or hints for me?

server/sched.go Outdated
var totalFreeMemory uint64
for _, g := range gpuSubset {
    totalFreeMemory += g.FreeMemory
}
jessegross commented:

I think this calculation of totalFreeMemory is unused.

It would be good to keep the existing log message about loading the model though - you can just print something similar to the current multi-GPU case, where it doesn't display the free memory. i.e.
slog.Info("new model will fit in available VRAM, loading", "model", req.model.ModelPath, "library", sgl[0].Library, "parallel", p, "required", format.HumanBytes2(estimatedVRAM))

server/sched.go Outdated
@@ -795,13 +798,14 @@ func pickBestFullFitByLibrary(req *LlmRequest, f *ggml.GGML, gpus discover.GpuIn
// - if multiple Libraries, see if any single GPU in any Library will fit
// - try subsets of GPUs instead of just falling back to 1 or all in a family

// Now try all the GPUs
// Now try all the GPUS (OLLAMA_SCHED_SPREAD is set)
for _, p := range numParallelToTry {
jessegross commented:

This whole block can be an else on if !envconfig.SchedSpread() {. No point in going through a loop only to check a constant condition in the middle.

I would also keep the existing log message here.
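
Put differently, a simplified sketch of the suggested structure (reusing the hypothetical smallestFittingSubset helper from the earlier sketch; the real code branches on envconfig.SchedSpread() rather than a plain bool, and the function name here is made up):

```go
// Simplified sketch: decide once whether to spread, instead of checking the
// setting inside the subset loop.
func pickGPUs(gpus []gpuInfo, estimatedVRAM uint64, spread bool) []gpuInfo {
	if !spread {
		// Default: pack onto the fewest GPUs that fit the estimate.
		return smallestFittingSubset(gpus, estimatedVRAM)
	}
	// Spread mode (OLLAMA_SCHED_SPREAD set): use every GPU and split layers evenly.
	return gpus
}
```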

dan-and commented:

Thanks again, @jessegross. I re-added the existing log messages and replaced the if statement. I had overlooked that leftover from the development versions, when I still had all 3 variants included.

Hope it is fine for you now.

Cheers

jessegross left a comment:

This version looks good, thank you for your patience. I did make one change to your branch for the sake of reducing iterations - I removed the totalFreeMemory calculation, as this was unused. Hope you don't mind.

@jessegross jessegross merged commit ea7657b into ollama:main Aug 11, 2025
8 checks passed
rick-github pushed a commit to rick-github/ollama that referenced this pull request Aug 20, 2025
rick-github pushed a commit to rick-github/ollama that referenced this pull request Aug 20, 2025