Introduce "Groups" Design #107

@mostlygeek

There have been several issues that introducing a grouping behaviour could address.

This is sort of the other side of what llama-swap was originally designed for: keeping some models loaded while swapping others. After giving this some thought, I believe adding groups could address the new requests.

Some design requirements and constraints for building this feature:

  1. Do not break current configuration files (as much as possible)
  2. Users are responsible for resolving configuration conflicts
  3. Model IDs are globally unique
  4. Groups can have swapping disabled. Default is swap: true
  5. Groups may be exclusive. They force other groups to unload. Default is exclusive: true
  6. Groups may be persistent. This prevents other, exclusive groups from unloading them. Default is persistent: false
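
As a rough illustration of requirements 3 to 6, the group settings and their defaults could be modelled like this. It is a hypothetical Python sketch, not llama-swap code: the names `Group` and `check_unique_members` are made up for the example, and it assumes requirement 3 also means a model ID may appear in at most one group.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical in-memory form of one entry under `groups:`."""
    members: list[str] = field(default_factory=list)
    swap: bool = True         # requirement 4: swapping is on by default
    exclusive: bool = True    # requirement 5: unloads other groups by default
    persistent: bool = False  # requirement 6: not protected by default


def check_unique_members(groups: dict[str, Group]) -> None:
    """Raise ValueError if a model ID is listed more than once.

    Assumption: a model belongs to at most one group.
    """
    owner: dict[str, str] = {}
    for group_name, group in groups.items():
        for model_id in group.members:
            if model_id in owner:
                raise ValueError(
                    f"model {model_id!r} is listed in {owner[model_id]!r} "
                    f"and again in {group_name!r}"
                )
            owner[model_id] = group_name
```

Under requirement 2, a check like this would only report the conflict; fixing the configuration is left to the user.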

This is what the configuration would look like:

# the model definitions are unchanged
models:  
    m1: 
        ... 
    m2: 
        ... 
    m3:
        ... 
    .
    . ... 

# introduction of a groups top level key: 
groups: 
    G1:
        swap: true
        exclusive: true
        members: 
            - m1
            - m2
    G2:
        swap: false
        exclusive: true
        members:
            - m3
            - m4
    G3:
        swap: true
        exclusive: false
        members:
            - m5
            - m6
            - m7
    G4:
        swap: false
        exclusive: false
        members:
            - m8
            - m9
    G5:
        swap: false
        exclusive: false
        persistent: true
        members:
            - m10
            - m11

In the above configuration:

  • G1 will run m1 OR m2. It will cause other groups to unload.
  • G2 will run m3 AND m4. It will cause other groups to unload.
  • G3 will run m5 OR m6 OR m7. It will NOT affect other groups.
  • G4 will run m8 AND m9. It will NOT affect other groups.
  • G5 will run m10 AND m11. It will NOT affect other groups. It is NOT affected by other groups. This keeps a set of models always loaded. The only way to unload these models is to restart llama-swap or call the /unload endpoint.
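
The behaviour described in the list above can be sketched as a function that decides which running models get unloaded when a model is requested. This is a minimal Python sketch under my reading of the three flags, not the actual implementation; `Group` and `models_to_unload` are hypothetical names, and the sketch only computes unloads (whether a swap: false group loads all its members at once is left open).

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical in-memory form of one entry under `groups:`."""
    members: list[str]
    swap: bool = True
    exclusive: bool = True
    persistent: bool = False


def models_to_unload(groups: dict[str, Group], running: set[str], requested: str) -> set[str]:
    """Return the running model IDs to unload before `requested` is served."""
    target_name = None
    for name, group in groups.items():
        if requested in group.members:
            target_name = name
            break
    if target_name is None:
        raise KeyError(f"model {requested!r} is not a member of any group")
    target = groups[target_name]

    unload: set[str] = set()
    # swap: true -> only one member of the group runs at a time
    if target.swap:
        unload |= {m for m in target.members if m in running and m != requested}
    # exclusive: true -> every other group is unloaded, unless it is persistent
    if target.exclusive:
        for name, group in groups.items():
            if name != target_name and not group.persistent:
                unload |= {m for m in group.members if m in running}
    return unload
```

With the example configuration, requesting m3 while m1, m5, m8 and m10 are running would unload m1, m5 and m8 in this sketch; m10 stays loaded because G5 is persistent.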

What about models that are not members of groups?

There is a default and hidden group that is essentially:

groups: 
    (default):
        swap: true
        exclusive: true
        members: [ all models not in a group ]

Setting swap: true and exclusive: true is the current behaviour of llama-swap: only one model runs at a time.
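
One way the hidden group could be derived from a parsed configuration is sketched below. This is only an illustration in Python using plain dicts, as a YAML parser would return them; the function name `with_default_group` is an assumption, and the "(default)" key is taken from the snippet above.

```python
def with_default_group(models: dict, groups: dict) -> dict:
    """Return a copy of `groups` with a "(default)" group added.

    The default group collects every model that no explicit group lists.
    """
    grouped = {
        model_id
        for group in groups.values()
        for model_id in group.get("members", [])
    }
    # keep the order in which models appear in the `models:` section
    ungrouped = [model_id for model_id in models if model_id not in grouped]

    result = dict(groups)
    result["(default)"] = {"swap": True, "exclusive": True, "members": ungrouped}
    return result
```

With no `groups:` key at all, every model lands in the default group, which would match today's one-model-at-a-time behaviour.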

What about profiles?

With this, the profiles feature, which has caused a lot of confusion, can be removed. A profile was an attempt to keep multiple models loaded at the same time using a prefix. The same result is possible with a G2 or G4 style group. I will break rule 1, however, as the complex profile code will be removed.

This setting:

profiles:
  coding:
      - qwen-coder-32B
      - qwen-coder-3090-FIM

is now replaced with:

groups:
  coding:
    swap: false
    exclusive: true
    members:
      - qwen-coder-32B
      - qwen-coder-3090-FIM

There is no longer a need to prefix the model name with the profile name. Models can be requested by just their identifier.
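
For anyone migrating an existing configuration, the mapping from profiles to groups shown above is mechanical. A small Python sketch of it, assuming the `profiles:` section has been parsed into a dict of lists (`profiles_to_groups` is a hypothetical helper, not part of llama-swap):

```python
def profiles_to_groups(profiles: dict) -> dict:
    """Convert a parsed `profiles:` section into an equivalent `groups:` section.

    Each profile becomes a group with swap: false and exclusive: true,
    matching the replacement shown above.
    """
    return {
        name: {"swap": False, "exclusive": True, "members": list(members)}
        for name, members in profiles.items()
    }
```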

What is rule 2 about?

This exists because there is a sea of complexity and possible issues as users mix and match hardware, operating systems, inference servers, etc. llama-swap follows a Unix foot-gun philosophy. While the configuration is designed to be simple, it can quickly grow in complexity. As complexity grows, so does the expectation that users know what they are doing. It's also there to protect my own time and sanity.
