
[Feature] Proposal: Releasing SGLang memory when idle #2583


Description

@fzyzcjy

Proposal 1: Release KV cache when engine is idle

When using SGLang for generation inside a training pipeline (such as PPO), SGLang currently holds a lot of GPU memory during the HuggingFace model forward/backward phase even though it is not using it. It would be great to make SGLang use as little memory as possible while it is idle.

Example use cases:

  • Suppose we run OpenRLHF on 8xH100. Currently we may allocate 4xH100 for vLLM/SGLang and another 4xH100 for the HF model (thanks @zhaochenyang20 for providing this usage scenario).
    • If we make SGLang use little memory when idle, we can run the same experiment on half the number of GPUs (4xH100) by putting the SGLang engines on the same GPUs as the HF models.
  • Suppose we run PPO on 1xH100 for a 7B model with Adam offloading (thanks @zhaochenyang20 for providing this usage scenario). Then policy (7Bx2) + critic (7Bx2) + ref (7Bx2) + reward (7Bx2) already take 56GB. The current SGLang additionally needs 7Bx2 for weights plus some memory for the KV cache, so it may not be easy to fit everything into the 80GB card.
    • If we implement proposal 1 and proposal 2, we will have roughly 24GB of room for the HF model forward/backward, and the same 24GB of room for SGLang to do generation, since the two no longer need the memory at the same time. (We may have even more room if we quantize the ref & reward models, though it is not clear whether that would work.)
  • Suppose we run OpenRLHF on 1x4090 for a 0.5B model. Memory is similarly tight, just like the 1xH100 & 7B case.
    • If the proposals are successfully implemented, we may be able to run in such scenarios as well.

One potential memory optimization is to release the KV cache (see the sketch below):

  • When the training pipeline does not need SGLang (e.g. while doing HF model forward/backward in PPO), put SGLang into a "paused" mode, and later "resume" it when we need SGLang to do generation again.
  • When SGLang enters "paused" mode, release the KV cache (link to hacky experiment) by simply deleting the tensors.
  • When SGLang later "resumes", re-create the KV cache tensors.

I will open a PR for this as soon as I have some time (hopefully soon).
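A minimal sketch of the pause/resume idea, assuming a simple per-layer list of KV tensors; PausableKVCache and its pause/resume methods are hypothetical names for illustration, not existing SGLang APIs or its real memory pool:

```python
import torch


class PausableKVCache:
    """Hypothetical holder for KV cache buffers that can drop and re-create them."""

    def __init__(self, num_layers: int, shape, dtype=torch.bfloat16, device="cuda"):
        self.num_layers = num_layers
        self.shape = shape
        self.dtype = dtype
        self.device = device
        self.k_buffers = None
        self.v_buffers = None
        self.resume()

    def pause(self):
        # Drop the KV cache tensors so the memory returns to the allocator
        # and can be used by the HF forward/backward pass on the same GPU.
        self.k_buffers = None
        self.v_buffers = None
        torch.cuda.empty_cache()

    def resume(self):
        # Re-create the KV cache tensors before the next generation phase.
        self.k_buffers = [
            torch.empty(self.shape, dtype=self.dtype, device=self.device)
            for _ in range(self.num_layers)
        ]
        self.v_buffers = [
            torch.empty(self.shape, dtype=self.dtype, device=self.device)
            for _ in range(self.num_layers)
        ]


# Usage in a PPO-style loop (illustrative):
# cache = PausableKVCache(num_layers=32, shape=(100_000, 8, 128))
# cache.pause()   # before HF model forward/backward
# cache.resume()  # before the next SGLang generate
```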

Proposal 2: Release model weights when engine is paused

Another part of the memory occupied by SGLang is the model weights. Thus one potential solution (sketched below) is:

  • When SGLang is paused, delete the model weights (e.g. maybe via model.to('meta'), not tested) to release memory.
  • When SGLang is resumed, recreate empty model weights (e.g. via model.to_empty(device='cuda')).
  • Then, users should call update_weight to provide new weights to SGLang.
    • This is not extra overhead, because in many RLHF pipelines we already need to call update_weight before a generate so that generation uses the latest weights instead of outdated ones.
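A minimal sketch of the weight release/restore step, assuming a plain torch.nn.Module; the pause_weights/resume_weights helper names are hypothetical, and (as noted above) using to('meta') for this purpose is untested:

```python
import torch


def pause_weights(model: torch.nn.Module) -> None:
    # Move parameters and buffers to the meta device; their CUDA storage is
    # freed, while the module structure (shapes, dtypes) is preserved.
    model.to("meta")
    torch.cuda.empty_cache()


def resume_weights(model: torch.nn.Module) -> None:
    # Re-allocate uninitialized storage on the GPU. The values are garbage
    # until the user calls update_weight with the latest trained weights.
    model.to_empty(device="cuda")
```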

Proposal 3: Update SGLang model weights when on same GPU

Currently, when we do update_weight to copy the HF model weights into the SGLang model weights, it seems we use the torch broadcast operation. However, when users put the HuggingFace model and the SGLang model on the same GPU, it may be possible to use a more lightweight mechanism that avoids the overhead of broadcast.

To be more specific:

  • Initialization
    • Users provide their HF model to the SGLang Engine.
    • SGLang shares the tensors of this model with the SGLang runtime process.
  • Weight update
    • Users trigger an "update weights from the previously provided HF model" operation.
    • The SGLang runtime process reads the aforementioned tensors to update the SGLang model weights.

This is just a rough draft and there are more details to work out. For example, if the tensor objects in the HF model can change (e.g. be re-allocated), we may need to send the new tensors across processes again.
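A toy sketch of the idea, assuming torch.multiprocessing-style CUDA IPC sharing of tensors between the trainer process and a stand-in "runtime" process; the real SGLang runtime and its update_weight path are of course more involved, and runtime_process here is purely illustrative:

```python
import torch
import torch.multiprocessing as mp


def runtime_process(shared_weights, request_queue):
    # Stand-in for the SGLang runtime: it owns its own copy of the weights
    # and refreshes them from the trainer's shared tensors on request.
    engine_weight = torch.zeros_like(shared_weights["w"])
    while True:
        msg = request_queue.get()
        if msg == "update_weight":
            # Same-GPU device-to-device copy; no torch.distributed broadcast.
            engine_weight.copy_(shared_weights["w"])
            torch.cuda.synchronize()
        elif msg == "stop":
            break


if __name__ == "__main__":
    mp.set_start_method("spawn")
    # "HF model" weight living in the trainer process.
    hf_weight = torch.randn(1024, 1024, device="cuda")
    # CUDA tensors passed through torch.multiprocessing are shared via
    # IPC handles, so the runtime sees the same memory without a copy.
    shared = {"w": hf_weight}
    q = mp.Queue()
    proc = mp.Process(target=runtime_process, args=(shared, q))
    proc.start()

    # Trainer updates its weights in place, then simply asks the runtime
    # to re-read them.
    hf_weight.add_(1.0)
    q.put("update_weight")
    q.put("stop")
    proc.join()
```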

Related: #2542
cc @zhaochenyang20
