Description
Several issues have come up that a grouping behaviour could address:
- [Feature Request] Add configuration to never unload a model #99
- [Feature Request] Add ability to configure which models can be loaded together #96
This is sort of the other side of what llama-swap was originally designed for: keeping some models loaded while swapping others. After giving this some thought, I believe adding groups could address these new requests.
Some design requirements and constraints for building this feature:
- Do not break current configuration files (as much as possible)
- Users are responsible for resolving configuration conflicts
- Model IDs are globally unique
- Groups can have swapping disabled. Default is swap: true
- Groups may be exclusive. An exclusive group forces other groups to unload. Default is exclusive: true
- Groups may be persistent. A persistent group cannot be unloaded by other, exclusive groups. Default is persistent: false
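The three group settings and their defaults can be sketched as a small data structure. This is only an illustration, not llama-swap's implementation: the `Group` class and `parse_groups` helper are hypothetical names, and only the field names and default values come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One entry under the proposed groups key (hypothetical class name)."""
    members: list[str] = field(default_factory=list)
    swap: bool = True         # default from the proposal: swapping enabled
    exclusive: bool = True    # default from the proposal: other groups unload
    persistent: bool = False  # default from the proposal: can be unloaded


def parse_groups(raw: dict) -> dict[str, Group]:
    """Build Group objects from an already-parsed groups mapping.

    Any setting left out of the mapping falls back to its default.
    """
    return {name: Group(**(settings or {})) for name, settings in raw.items()}


groups = parse_groups({
    "G1": {"members": ["m1", "m2"]},  # relies on all defaults
    "G5": {"swap": False, "exclusive": False, "persistent": True,
           "members": ["m10", "m11"]},
})
print(groups["G1"])
print(groups["G5"])
```

A group that sets nothing but its members behaves like today's llama-swap: swapping and exclusive.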
This is what the configuration would look like:
# the models definitions are unchanged.
models:
  m1:
    ...
  m2:
    ...
  m3:
    ...
  ...

# introduction of a groups top level key:
groups:
  G1:
    swap: true
    exclusive: true
    members:
      - m1
      - m2
  G2:
    swap: false
    exclusive: true
    members:
      - m3
      - m4
  G3:
    swap: true
    exclusive: false
    members:
      - m5
      - m6
      - m7
  G4:
    swap: false
    exclusive: false
    members:
      - m8
      - m9
  G5:
    swap: false
    exclusive: false
    persistent: true
    members:
      - m10
      - m11
In the above configuration:
- G1 will run m1 OR m2. It will cause other groups to unload.
- G2 will run m3 AND m4. It will cause other groups to unload.
- G3 will run m5 OR m6 OR m7. It will NOT affect other groups.
- G4 will run m8 AND m9. It will NOT affect other groups.
- G5 will run m10 AND m11. It will NOT affect other groups. It is NOT affected by other groups. This keeps a set of models always loaded. The only way to unload these models is to restart llama-swap or call the /unload endpoint.
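The rules above can be simulated in a few lines. This is a rough sketch rather than the planned implementation: the `Group` class, `group_of`, and `request` are hypothetical names, and it assumes that members of a swap: false group are loaded on demand and then stay loaded next to each other.

```python
from dataclasses import dataclass


@dataclass
class Group:
    members: list
    swap: bool = True
    exclusive: bool = True
    persistent: bool = False


# Mirrors the example configuration above.
GROUPS = {
    "G1": Group(["m1", "m2"], swap=True, exclusive=True),
    "G2": Group(["m3", "m4"], swap=False, exclusive=True),
    "G3": Group(["m5", "m6", "m7"], swap=True, exclusive=False),
    "G4": Group(["m8", "m9"], swap=False, exclusive=False),
    "G5": Group(["m10", "m11"], swap=False, exclusive=False, persistent=True),
}


def group_of(model):
    """Return the name of the group that lists the model as a member."""
    for name, group in GROUPS.items():
        if model in group.members:
            return name
    raise KeyError(model)


def request(model, running):
    """Return the set of running models after `model` is requested."""
    name = group_of(model)
    group = GROUPS[name]
    keep = set()
    for other in running:
        other_name = group_of(other)
        if other_name == name:
            # Same group: siblings survive only when swapping is disabled.
            if not group.swap or other == model:
                keep.add(other)
        elif not group.exclusive or GROUPS[other_name].persistent:
            # Other groups survive a non-exclusive request, and
            # persistent groups survive any request.
            keep.add(other)
    keep.add(model)
    return keep


running = set()
for m in ["m10", "m5", "m8", "m9", "m6", "m3", "m4", "m1"]:
    running = request(m, running)
    print(f"{m:>3} -> {sorted(running)}")
```

In this sequence, requesting m3 (exclusive G2) clears the G3 and G4 models but leaves the persistent m10 in place, and the final request for m1 leaves only m1 and m10 running.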
What about models that are not members of groups?
There is a default and hidden group that is essentially:
groups:
  (default):
    swap: true
    exclusive: true
    members: [ all models not in a group ]
Setting swap: true and exclusive: true matches the current behaviour of llama-swap: only one model runs at a time.
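As a sketch of how the hidden default group could be derived, the members are simply the configured models that no explicit group claims. The helper name `default_group_members` is hypothetical, and the inputs are assumed to be the already-parsed models and groups sections.

```python
def default_group_members(models, groups):
    """Return the models that are not listed in any explicit group."""
    grouped = {m for g in groups.values() for m in g.get("members", [])}
    # Preserve the order in which the models were defined.
    return [m for m in models if m not in grouped]


models = ["m1", "m2", "m3", "m4"]
groups = {"G1": {"swap": True, "exclusive": True, "members": ["m1", "m2"]}}

default_group = {
    "swap": True,        # same as current llama-swap behaviour
    "exclusive": True,
    "members": default_group_members(models, groups),
}
print(default_group)
```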
What about profiles?
With this, the profiles feature, which has caused a lot of confusion, can be removed. A profile was an attempt to keep multiple models loaded at the same time using a prefix. The same result is possible using a G2 or G4 style group. This will break rule 1, however, as the complex profile code will be removed.
This setting:
profiles:
  coding:
    - qwen-coder-32B
    - qwen-coder-3090-FIM

is now replaced with:

groups:
  coding:
    swap: false
    exclusive: true
    members:
      - qwen-coder-32B
      - qwen-coder-3090-FIM
There is no longer a need to prefix the model name with the profile name. Models can be requested by just their identifier.
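For anyone migrating an existing configuration by hand, the rewrite shown above is mechanical. The following sketch assumes the profiles section has already been parsed into a dict. The function name `profiles_to_groups` is hypothetical and no such migration tool is part of the proposal.

```python
def profiles_to_groups(profiles):
    """Turn each profile into a G2 style group (swap: false, exclusive: true)."""
    return {
        name: {"swap": False, "exclusive": True, "members": list(members)}
        for name, members in profiles.items()
    }


profiles = {"coding": ["qwen-coder-32B", "qwen-coder-3090-FIM"]}
groups = profiles_to_groups(profiles)
print(groups)
```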
What is the "users are responsible for resolving configuration conflicts" rule about?
This exists because there is a sea of complexity and possible issues as users mix and match hardware, operating systems, inference servers, etc. llama-swap follows a unix foot gun philosophy. While the configuration is designed to be simple, it can quickly grow in complexity. As complexity grows, so does the expectation that users know what they are doing. It's also there to protect my own time and sanity.