server/sched.go: Added support for grouping GPUs (significant VRAM reduction, Power-Consumption + Low-PCIe-speed performance gains) #10678
Conversation
…uirements. This can be helpful for systems with multiple GPUs which are not completely utilized. Use OLLAMA_SCHED_SPREAD=0 (or keep it unset) to use only one GPU if possible and otherwise evenly distribute the load across all GPUs (old default behavior). Use OLLAMA_SCHED_SPREAD=2 to keep the number of GPUs to a minimum and only use as many as needed to run the model. Use OLLAMA_SCHED_SPREAD=1 (or set it to anything else) to evenly distribute the load across all GPUs (old default behavior).
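For illustration, here is a minimal Go sketch of how the three OLLAMA_SCHED_SPREAD values described above could be interpreted; spreadMode and schedSpreadMode are assumed names for this sketch, not Ollama's actual envconfig API:

```go
package main

import (
	"fmt"
	"os"
)

// spreadMode mirrors the three documented settings of OLLAMA_SCHED_SPREAD.
type spreadMode int

const (
	spreadDefault spreadMode = iota // unset or "0": one GPU if possible, otherwise spread over all
	spreadAll                       // "1" (or any other value): spread evenly over all GPUs
	spreadMinimal                   // "2": use only as many GPUs as needed to fit the model
)

// schedSpreadMode is an illustrative helper, not Ollama's actual envconfig code.
func schedSpreadMode() spreadMode {
	switch os.Getenv("OLLAMA_SCHED_SPREAD") {
	case "", "0":
		return spreadDefault
	case "2":
		return spreadMinimal
	default:
		return spreadAll
	}
}

func main() {
	fmt.Println("scheduling mode:", schedSpreadMode())
}
```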
Is there anything I can do to get it reviewed?
…lways calculated the overall memory requirement for all GPUs, which led to not fully utilized
@rick-github, thanks for pointing me to the VRAM allocation effect of my patch, as this is, next to the power consumption, a huge benefit. Here are examples of how much VRAM overhead Ollama consumes when spreading the model over all available GPUs instead of grouping them.
Hi @jmorganca and @mxyng, do you see major issues in my patch which may prevent including it in the master branch? If so, please give me feedback so that I can update accordingly. Kind regards
I don't think that we need to introduce a new configuration setting for this. We can make it so that the default setting packs the model onto the minimum number of GPUs required and then OLLAMA_SCHED_SPREAD=1 splits it evenly over all GPUs.
Hi Jesse,
New algorithm for finding the smallest number of GPUs required is the new default.
@jessegross As part of my patch, I have introduced two slog.Info and one slog.Debug calls into the code. They helped me quite a bit in figuring out whether Ollama is doing the correct allocation with various system setups and configs. Should I keep them in the code, or would you suggest removing them or downgrading the slog.Info calls to slog.Debug to keep the amount of logging low? Regards, PS: It works great on my 7-GPU system: OLLAMA_SCHED_SPREAD=1
OLLAMA_SCHED_SPREAD=0 or not set at all
Yes, I think you can remove the extra logging and the memory calculations that go into them. Please also retain existing comments and TODOs that aren't affected by your change. Since you are iterating from smallest to largest number of GPUs, once you find a set that fits, you can just return that. You don't need to keep looking and find the best one - no later tries are going to use fewer GPUs. I have some in-progress code that does something similar for a different code path, you can use it as a reference: Line 777 in f4c2e62
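For illustration, here is a minimal, self-contained sketch of the approach described above (try subset sizes from smallest to largest and return the first set that fits); gpuInfo and pickSmallestFit are assumed names, not the actual sched.go code or the referenced in-progress code:

```go
package main

import (
	"fmt"
	"sort"
)

type gpuInfo struct {
	ID         string
	FreeMemory uint64
}

// pickSmallestFit tries subset sizes from 1 upwards and returns the first
// subset whose combined free memory fits the requirement. Because sizes are
// tried in ascending order, the first hit already uses the fewest GPUs, so
// there is no need to keep searching for a "better" one.
func pickSmallestFit(gpus []gpuInfo, required uint64) []gpuInfo {
	// Prefer GPUs with the most free memory so each subset size is tried
	// with the best candidates first.
	sort.Slice(gpus, func(i, j int) bool { return gpus[i].FreeMemory > gpus[j].FreeMemory })
	for n := 1; n <= len(gpus); n++ {
		var free uint64
		for _, g := range gpus[:n] {
			free += g.FreeMemory
		}
		if free >= required {
			return gpus[:n]
		}
	}
	return nil // the model does not fit on the available GPUs
}

func main() {
	gpus := []gpuInfo{{"0", 8 << 30}, {"1", 12 << 30}, {"2", 12 << 30}}
	chosen := pickSmallestFit(gpus, 18<<30)
	fmt.Println("GPUs used:", len(chosen)) // 2: the two 12 GiB cards
}
```

Picking GPUs by largest free memory first also tends to keep different models on separate GPUs, which matters for the power savings discussed below.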
restored unsolved TODOs in sched.go; aligned logic with the assignLayers function from server.go, reducing unnecessary GPU matchings
@jessegross Still, my GPU selection is based on the GPU with the most free VRAM at that moment. This will usually keep different models on separate GPUs, allowing power savings when a model is idling. I gave it a few rounds with 20 different models and sizes, and it works as intended. The incorrectly removed TODOs are back in the code. Sorry for that.
@jessegross any additional comments and hints for me?
server/sched.go
Outdated
var totalFreeMemory uint64
for _, g := range gpuSubset {
	totalFreeMemory += g.FreeMemory
}
I think this calculation of `totalFreeMemory` is unused.
It would be good to keep the existing log message about loading the model though - you can just print something similar to the current multi-GPU case, where it doesn't display the free memory. i.e.
slog.Info("new model will fit in available VRAM, loading", "model", req.model.ModelPath, "library", sgl[0].Library, "parallel", p, "required", format.HumanBytes2(estimatedVRAM))
server/sched.go
Outdated
@@ -795,13 +798,14 @@ func pickBestFullFitByLibrary(req *LlmRequest, f *ggml.GGML, gpus discover.GpuIn
 // - if multiple Libraries, see if any single GPU in any Library will fit
 // - try subsets of GPUs instead of just falling back to 1 or all in a family

-// Now try all the GPUs
+// Now try all the GPUS (OLLAMA_SCHED_SPREAD is set)
 for _, p := range numParallelToTry {
This whole block can be an else on `if !envconfig.SchedSpread() {`. No point in going through a loop only to check a constant condition in the middle. I would also keep the existing log message here.
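A minimal sketch of the suggested control flow, assuming a boolean spread flag in place of envconfig.SchedSpread() and a simplified fit check (not the actual pickBestFullFitByLibrary code):

```go
package main

import "fmt"

type gpu struct {
	ID         string
	FreeMemory uint64
}

// fits reports whether the combined free memory of gpus covers the requirement.
func fits(gpus []gpu, required uint64) bool {
	var free uint64
	for _, g := range gpus {
		free += g.FreeMemory
	}
	return free >= required
}

// pickGPUs packs onto the smallest prefix of gpus by default; only when
// spread is true does it consider all GPUs at once.
func pickGPUs(gpus []gpu, required uint64, spread bool) []gpu {
	if !spread {
		// Default path: smallest subset that fits (see the earlier sketch).
		for n := 1; n <= len(gpus); n++ {
			if fits(gpus[:n], required) {
				return gpus[:n]
			}
		}
		return nil
	}
	// OLLAMA_SCHED_SPREAD set: spread over all GPUs if they fit together.
	if fits(gpus, required) {
		return gpus
	}
	return nil
}

func main() {
	gpus := []gpu{{"0", 12 << 30}, {"1", 12 << 30}, {"2", 12 << 30}}
	fmt.Println(len(pickGPUs(gpus, 20<<30, false))) // 2: packed onto two GPUs
	fmt.Println(len(pickGPUs(gpus, 20<<30, true)))  // 3: spread over all GPUs
}
```

Branching on the flag once, up front, avoids looping over all GPU combinations only to re-check the same constant condition inside the loop.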
Thanks again, @jessegross, I readded the existing log messages and replaced the if statement. I had overlooked that leftover from the development versions, when I had all 3 variants included.
Hope it is fine for you now.
Cheers
Readded existing log messages for both variants.
This version looks good, thank you for your patience. I did make one change to your branch for the sake of reducing iterations - I removed the `totalFreeMemory` calculation, as this was unused. Hope you don't mind.
This patch modifies Ollama to allow grouping GPUs to memory-fit the requested model, instead of the former algorithm of using either one GPU or distributing the model over all available GPUs.
Benefits:
- Lower amount of (PCIe-)bus communication between GPUs, especially when they are not very high speed
- Allows unallocated GPUs to enter power-saving mode
- Significantly reduces VRAM allocation when using more than 2 GPUs in a system
- Due to the reduced memory allocation, you can run more models simultaneously
How to use:
Tests and Environment:
Like so many in the home-lab community, my Ollama server is built on a budget, which resulted in a low-power optimized server with 7 GPUs (Nvidia RTX 3060 - 12GB) with a total of 72GB VRAM.
When using models which require around 30-40 GB, all GPUs have to run, which results in higher power consumption and a huge overhead in VRAM allocation. With this patch, only a few GPUs are in use, and all other GPUs power down to around 3 watts.
Since the server (Ryzen 5500G / Asus Prime B450) is not enterprise grade, the PCIe bus is far slower, which also hurts performance when spreading a single model across too many GPUs.
Results of Memory-Allocation Overhead when spreading a model over more than 2 GPUs:
In this example, Ollama is configured with num_parallel=2 and num_ctx=8192.
Model: gemma3:4b - a2af6cc3eb7f - 3.3 GB
Overall required VRAM when spreading over:
1x GPU: 5.7 GiB
2x GPU: 7.8 GiB
3x GPU: 9.3 GiB
4x GPU: 10.8 GiB
5x GPU: 12.3 GiB
6x GPU: 13.9 GiB
7x GPU: 15.4 GiB
A more realistic situation is qwen3:30b - 0b28110b7a33 - 18 GB, which exceeds the VRAM of a single home-lab GPU.
3x GPU: 27.2 GiB
4x GPU: 30.1 GiB
5x GPU: 32.9 GiB
6x GPU: 35.8 GiB
7x GPU: 38.7 GiB
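A quick back-of-the-envelope calculation on the gemma3:4b figures above (illustrative only, not part of the patch) shows the marginal VRAM cost of each additional GPU:

```go
package main

import "fmt"

func main() {
	// gemma3:4b with num_parallel=2, num_ctx=8192 (figures from above):
	// 5.7 GiB on 1 GPU vs. 15.4 GiB spread over 7 GPUs.
	onOne, onSeven := 5.7, 15.4
	perExtraGPU := (onSeven - onOne) / 6.0
	fmt.Printf("≈ %.1f GiB of extra VRAM per additional GPU\n", perExtraGPU) // ≈ 1.6 GiB
}
```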
Results of Power consumption measured "at the wall" and eval rates
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD unset: 450 Watt with 27 fps
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=1: 450 Watt with 27 fps
Qwen3:14b - bdbd181c33f2 - OLLAMA_SCHED_SPREAD=2: 310 Watt with 81 fps
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD unset: 393 Watt with 44 fps
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=1: 393 Watt with 44 fps
Qwen3:30b - 0b28110b7a33 - OLLAMA_SCHED_SPREAD=2: 320 Watt with 51 fps
(Yes, Qwen3:14b is far more inefficient on my system than the slightly younger 30b variant)
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD unset: 250 Watt with 70 fps
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=1: 389 Watt with 55 fps
gemma3:4b - a2af6cc3eb7f - OLLAMA_SCHED_SPREAD=2: 250 Watt with 70 fps
(gemma3:4b with a num_ctx of 8192 also ended up using just one GPU with OLLAMA_SCHED_SPREAD=2, similar to the unset OLLAMA_SCHED_SPREAD case, since one GPU is enough to fit the whole model)