server/sched.go: Added support for grouping GPUs (significant VRAM reduction, Power-Consumption + Low-PCIe-speed performance gains) #10678


Merged: 24 commits merged into ollama:main on Aug 11, 2025

Conversation

dan-and commented May 12, 2025

This patch modifies Ollama so that it can group GPUs to fit the requested model's memory, instead of the former algorithm of either using one GPU or distributing the model over all available GPUs.

Benefits:

  • Less (PCIe-)bus communication between GPUs, which matters especially when the links are not very fast.
  • Allows unallocated GPUs to enter power-saving mode.
  • Significantly reduces VRAM allocation when using more than 2 GPUs in a system (see below).
  • Because of the reduced memory allocation, you can run more models simultaneously.

How to use:

  • OLLAMA_SCHED_SPREAD=0 (or leave it unset) to use a single GPU if the model fits, and otherwise distribute the load evenly across all GPUs (old default behavior).
  • OLLAMA_SCHED_SPREAD=1 (or any other value) to always distribute the load evenly across all GPUs (same as the old behavior when the variable was set).
  • OLLAMA_SCHED_SPREAD=2 to keep the number of GPUs to a minimum and use only as many as needed to fit the model (see the scheduler sketch below).
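
For illustration only, the following minimal Go sketch shows how a scheduler could map these three settings onto a selection strategy; the gpuStrategy type and strategyFromEnv function are hypothetical names, not code from this PR or from Ollama's sched.go.

```go
// Hypothetical sketch, not Ollama's actual code: map the proposed
// OLLAMA_SCHED_SPREAD values onto a GPU-selection strategy.
package main

import (
	"fmt"
	"os"
)

type gpuStrategy int

const (
	singleThenSpread gpuStrategy = iota // unset/0: one GPU if the model fits, else all GPUs
	spreadAll                           // 1 (or other values): spread over every GPU
	packMinimal                         // 2: fewest GPUs whose combined VRAM fits the model
)

func strategyFromEnv() gpuStrategy {
	switch os.Getenv("OLLAMA_SCHED_SPREAD") {
	case "2":
		return packMinimal
	case "", "0":
		return singleThenSpread
	default:
		return spreadAll
	}
}

func main() {
	fmt.Println("strategy:", strategyFromEnv())
}
```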

Tests and Environment:

Like so many in the home-lab community, my Ollama server is built on a budget, which resulted in a low-power-optimized server with 7 GPUs (Nvidia RTX 3060, 12 GB) and a total of 72GB VRAM.

When using models which require around 30-40 GB, all GPUs have to run, which results in higher power consumption and a huge overhead in VRAM allocation. With this patch, only a few GPUs are in use, which lets all the other GPUs power down to around 3 watts.
Since the server (Ryzen 5500G / Asus Prime B450) is not enterprise grade, the PCIe bus is far slower, which also hurts utilization when too many GPUs are used for a single model.

Results of Memory-Allocation Overhead when spreading a model over more than 2 GPUs:

In this example, Ollama is configured with num_parallel=2 and num_ctx=8192.

Model: gemma3:4b - a2af6cc3eb7f - 3.3 GB
Overall required VRAM when spreading over:
1x GPU: 5.7 GiB
2x GPU: 7.8 GiB
3x GPU: 9.3 GiB
4x GPU: 10.8 GiB
5x GPU: 12.3 GiB
6x GPU: 13.9 GiB
7x GPU: 15.4 GiB

A more realistic situation is qwen3:30b - 0b28110b7a33 - 18 GB, which exceeds the VRAM of a single home-lab GPU.
3x GPU: 27.2 GiB
4x GPU: 30.1 GiB
5x GPU: 32.9 GiB
6x GPU: 35.8 GiB
7x GPU: 38.7 GiB
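
For a rough sense of scale (my arithmetic from the numbers above, not a figure reported in the PR): the overhead grows roughly linearly, at about (15.4 − 5.7) / (7 − 1) ≈ 1.6 GiB per additional GPU for gemma3:4b and (38.7 − 27.2) / (7 − 3) ≈ 2.9 GiB per additional GPU for qwen3:30b.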

Results of power consumption measured "at the wall" and eval rates:

Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD unset: 450 Watt with 27 fps
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=1: 450 Watt with 27 fps
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=2: 310 Watt with 81 fps

Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD unset: 393 Watt with 44 fps
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=1: 393 Watt with 44 fps
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=2: 320 Watt with 51 fps

(Yes, Qwen3:14b is far less efficient on my system than the slightly newer 30b variant.)

gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD unset: 250 Watt with 70 fps
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=1 : 389 Watt with 55 fps
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=2 : 250 Watt with 70 fps

(With num_ctx=8192, gemma3:4b also ended up on just 1 GPU with OLLAMA_SCHED_SPREAD=2, the same as with OLLAMA_SCHED_SPREAD unset, since one GPU is enough to fit the whole model.)

…uirements.

This can be helpful for systems with multiple GPUs which are not completely utilized.
Use OLLAMA_SCHED_SPREAD=0 (or keep it unset) to use a single GPU if the model fits, otherwise distribute the load evenly across all GPUs (old default behavior).
Use OLLAMA_SCHED_SPREAD=2 to keep the number of GPUs to a minimum and only use as many as needed to run the model.
Use OLLAMA_SCHED_SPREAD=1 (or set to anything else) to evenly distribute the load across all GPUs.
dan-and changed the title from "server/sched.go: Added support for grouping GPUs ( Power-Consumption + Low-PCIe Speed Performance gains)" to "server/sched.go: Added support for grouping GPUs ( Power-Consumption + Low-PCIe-speed performance gains)" on May 12, 2025

dan-and commented May 16, 2025

Is there anything I can do to get it reviewed?

dan-and changed the title from "server/sched.go: Added support for grouping GPUs ( Power-Consumption + Low-PCIe-speed performance gains)" to "server/sched.go: Added support for grouping GPUs ( significant VRAM reduction, Power-Consumption + Low-PCIe-speed performance gains)" on Jun 5, 2025

dan-and commented Jun 5, 2025

@rick-github, thanks for pointing out the VRAM-allocation effect of my patch; next to the power consumption, this is a huge benefit.

The examples above show how much VRAM overhead Ollama incurs when the model is spread over all available GPUs instead of grouping them.

@dan-and dan-and requested review from preechapon-cmd and Re2906 June 6, 2025 09:38

dan-and commented Jun 23, 2025

Hi @jmorganca and @mxyng

Do you see major issues in my patch that might prevent it from being included in the master branch? If so, please give me feedback so that I can update it accordingly.

Kind regards

jessegross commented:

I don't think that we need to introduce a new configuration setting for this. We can make it so that the default setting packs the model onto the minimum number of GPUs required and then OLLAMA_SCHED_SPREAD=1 splits it evenly over all GPUs.


dan-and commented Jul 9, 2025

Hi Jesse,
thanks for the feedback.
I will rearrange my patch so that it is the default behavior. Give me 2-3 days and it will be updated.


dan-and commented Jul 11, 2025

@jessegross
Updated to make it the default behavior, and OLLAMA_SCHED_SPREAD=1 will spread it evenly.

As part of my patch, I have introduced two slog.Info calls and one slog.Debug call. They helped me quite a bit in verifying that Ollama does the correct allocation with various system setups and configs. Should I keep them in the code, or would you suggest removing them or moving the slog.Info calls to slog.Debug to keep the log volume low?

Regards,
Daniel

PS: It works great on my 7 GPU system:

OLLAMA_SCHED_SPREAD=1

daniel@gpu:~/source/ollama_dan$ ollama run qwen3:4b --verbose "say hi!"
Thinking...
Okay, the user said "say hi!" so I need to respond with a friendly greeting. Let me make sure to keep it warm and welcoming. Maybe add an emoji to make it more lively. I should also ask how I can assist them today. Let me
check the tone—should be positive and eager to help. Alright, that should do it.
...done thinking.

Hello! 😊 How can I assist you today?

total duration: 11.976054545s
load duration: 10.255610156s
prompt eval count: 13 token(s)
prompt eval duration: 358.652306ms
prompt eval rate: 36.25 tokens/s
eval count: 88 token(s)
eval duration: 1.36080978s
eval rate: 64.67 tokens/s

daniel@gpu:~/source/ollama_dan$ ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
qwen3:4b 2bfd38a7daaf 20 GB 100% GPU 8192 29 minutes from now

OLLAMA_SCHED_SPREAD=0 or not set at all

daniel@gpu:~/source/ollama_dan$ ollama run qwen3:4b --verbose "say hi!"
Thinking...
Okay, the user said "say hi!" so I need to respond in a friendly and welcoming manner. Let me start by acknowledging their greeting. Maybe use an emoji to keep it light and approachable.
I should mention my name, Qwen, to establish identity. Then, offer assistance by asking how I can help them today. Keep it concise but open-ended so they feel comfortable to share their
needs. Make sure the tone is positive and eager to help. Let me put that all together.
...done thinking.

Hello! I'm Qwen, your helpful assistant. How can I assist you today? 😊

total duration: 5.474187439s
load duration: 3.6653167s
prompt eval count: 13 token(s)
prompt eval duration: 139.89265ms
prompt eval rate: 92.93 tokens/s
eval count: 125 token(s)
eval duration: 1.668211081s
eval rate: 74.93 tokens/s

daniel@gpu:~/source/ollama_dan$ ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
qwen3:4b 2bfd38a7daaf 7.3 GB 100% GPU 8192 29 minutes from now

jessegross commented:

Yes, I think you can remove the extra logging and the memory calculations that go into them. Please also retain existing comments and TODOs that aren't affected by your change.

Since you are iterating from smallest to largest number of GPUs, once you find a set that fits, you can just return that. You don't need to keep looking and find the best one - no later tries are going to use fewer GPUs.

I have some in-progress code that does something similar for a different code path, you can use it as a reference:

if !envconfig.SchedSpread() {
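
To make the early-return idea concrete, here is a rough sketch under an assumed gpuInfo type (a stand-in, not Ollama's discover.GpuInfo and not the code in this PR): iterate from the smallest subset upward and return as soon as the combined free VRAM covers the estimate.

```go
// Rough sketch of the reviewer's suggestion, with a hypothetical gpuInfo type.
package sched

type gpuInfo struct {
	ID         string
	FreeMemory uint64
}

// smallestFittingSubset returns the first (and therefore smallest) prefix of
// gpus whose combined free VRAM covers requiredVRAM, or nil if none fits.
// It assumes gpus is already ordered by preference, e.g. free VRAM descending.
func smallestFittingSubset(gpus []gpuInfo, requiredVRAM uint64) []gpuInfo {
	for n := 1; n <= len(gpus); n++ {
		var free uint64
		for _, g := range gpus[:n] {
			free += g.FreeMemory
		}
		if free >= requiredVRAM {
			return gpus[:n] // no larger subset can use fewer GPUs, so stop here
		}
	}
	return nil
}
```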

dan-and added 2 commits July 12, 2025 21:33
restored unsolved TODOs in sched.go
aligned logic with assignLayers function from server.go and reducing unnecessary gpu matchings

dan-and commented Jul 12, 2025

@jessegross
Thanks for the heads-up. I aligned the logic to be similar to your assignLayers concept. It's much cleaner now.

Still, my GPU selection is based on the GPU with the largest free VRAM at that moment. This will usually keep models on separate GPUs, which allows power savings while a model is idling.
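
As an illustration of that ordering (again with a hypothetical gpuInfo type rather than Ollama's actual GPU struct), sorting by free VRAM before building subsets could look roughly like this:

```go
// Sketch only: order GPUs by free VRAM, largest first, so new models land on
// the emptiest GPUs and leave the busy ones alone.
package sched

import "sort"

type gpuInfo struct {
	ID         string
	FreeMemory uint64
}

func byFreeMemoryDesc(gpus []gpuInfo) []gpuInfo {
	sorted := append([]gpuInfo(nil), gpus...) // copy; leave the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].FreeMemory > sorted[j].FreeMemory
	})
	return sorted
}
```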

I gave it a few rounds with 20 different models and sizes, and it works as intended.

The incorrectly removed TODOs are back in the code. Sorry for that.


dan-and commented Jul 25, 2025

@jessegross any additional comments or hints for me?

server/sched.go Outdated
var totalFreeMemory uint64
for _, g := range gpuSubset {
    totalFreeMemory += g.FreeMemory
}
jessegross commented:

I think this calculation of totalFreeMemory is unused.

It would be good to keep the existing log message about loading the model though - you can just print something similar to the current multi-GPU case, where it doesn't display the free memory. i.e.
slog.Info("new model will fit in available VRAM, loading", "model", req.model.ModelPath, "library", sgl[0].Library, "parallel", p, "required", format.HumanBytes2(estimatedVRAM))

server/sched.go Outdated
@@ -795,13 +798,14 @@ func pickBestFullFitByLibrary(req *LlmRequest, f *ggml.GGML, gpus discover.GpuIn
// - if multiple Libraries, see if any single GPU in any Library will fit
// - try subsets of GPUs instead of just falling back to 1 or all in a family

// Now try all the GPUs
// Now try all the GPUS (OLLAMA_SCHED_SPREAD is set)
for _, p := range numParallelToTry {
jessegross commented:

This whole block can be an else on if !envconfig.SchedSpread() {. No point in going through a loop only to check a constant condition in the middle.

I would also keep the existing log message here.
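
Put differently, a simplified sketch of the suggested structure (reusing the hypothetical smallestFittingSubset helper from the earlier sketch; the real code branches on envconfig.SchedSpread() rather than a plain bool, and the function name here is made up):

```go
// Simplified sketch: decide once whether to spread, instead of checking the
// setting inside the subset loop.
func pickGPUs(gpus []gpuInfo, estimatedVRAM uint64, spread bool) []gpuInfo {
	if !spread {
		// Default: pack onto the fewest GPUs that fit the estimate.
		return smallestFittingSubset(gpus, estimatedVRAM)
	}
	// Spread mode (OLLAMA_SCHED_SPREAD set): use every GPU and split layers evenly.
	return gpus
}
```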

dan-and commented:

Thanks again, @jessegross. I re-added the existing log messages and replaced the if statement. I had overlooked that leftover from the development versions, when I still had all 3 variants included.

Hope it is fine for you now.

Cheers

jessegross left a comment:

This version looks good, thank you for your patience. I did make one change to your branch for the sake of reducing iterations - I removed the totalFreeMemory calculation, as this was unused. Hope you don't mind.

@jessegross jessegross merged commit ea7657b into ollama:main Aug 11, 2025
8 checks passed
rick-github pushed a commit to rick-github/ollama that referenced this pull request Aug 20, 2025
rick-github pushed a commit to rick-github/ollama that referenced this pull request Aug 20, 2025