Conversation

@fzyzcjy (Collaborator) commented Dec 28, 2024

Related: #2542 and #2583

Outdated Content

The test will currently fail because it relies on LD_PRELOAD (to intercept cudaMalloc and cudaFree and change their logic). If the general approach looks good, I will update this PR further to handle this part (e.g., by setting LD_PRELOAD automatically when creating the backend process), roughly as sketched below.
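
A rough sketch of that automation (the hook-library location, package layout, and helper names here are assumptions, not the final design):

```python
import glob
import os
import subprocess
import sys


def _find_hook_library() -> str:
    """Locate the compiled torch_memory_saver hook .so (assumed install layout)."""
    import torch_memory_saver  # assumed importable after `make reinstall`
    pkg_root = os.path.dirname(os.path.dirname(torch_memory_saver.__file__))
    candidates = glob.glob(os.path.join(pkg_root, "torch_memory_saver_cpp*.so"))
    if not candidates:
        raise FileNotFoundError("torch_memory_saver hook library not found")
    return candidates[0]


def launch_backend(argv):
    """Spawn the backend process with LD_PRELOAD injected automatically."""
    env = os.environ.copy()
    hook = _find_hook_library()
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = hook if not existing else hook + ":" + existing
    return subprocess.Popen([sys.executable, *argv], env=env)
```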

How to execute it

Suppose this branch of SGLang is checked out at /path/to/sglang. Then, inside sglang's docker container, execute the following:

# install torch_memory_saver (currently from source; it will be on pip later)
git clone https://github.com/fzyzcjy/torch_memory_saver
(cd torch_memory_saver && make reinstall)

cd /path/to/sglang
PYTHONPATH=$(pwd)/python LD_PRELOAD=/sgl-workspace/torch_memory_saver/torch_memory_saver_cpp.cpython-310-x86_64-linux-gnu.so python3 test/srt/test_release_gpu_occupation.py

Expected results are as follows. The x-axis is time and the red curve is memory consumption. The low-memory window in the middle is caused by temporarily releasing the KV cache memory.

[Figure: GPU memory usage over time, showing a temporary drop while the KV cache is released.]

What's changed

Though the PR seems large, most of it is boilerplate.

Core:

  • Wrap allocations inside with primary_memory_saver.region(): TokenToKVPool.k_buffers/v_buffers, ModelRunner.model, ReqToTokenPool.req_to_token
  • Call primary_memory_saver.pause()/.resume() in scheduler.py, from Scheduler.release_gpu_occupation/resume_gpu_occupation (see the sketch after this list)
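
A minimal sketch of the pattern (assuming the TorchMemorySaver class from the torch_memory_saver README; primary_memory_saver below stands for the shared instance used in the PR, and the tensor shape is illustrative):

```python
import torch
from torch_memory_saver import TorchMemorySaver  # assumed public class

primary_memory_saver = TorchMemorySaver()

# Allocations made inside region() go through the intercepted cudaMalloc,
# so their physical memory can later be released and re-acquired.
with primary_memory_saver.region():
    k_buffer = torch.empty((1024, 64, 128), dtype=torch.float16, device="cuda")

primary_memory_saver.pause()   # physical memory released; contents discarded
# ... other workloads may use the freed GPU memory here ...
primary_memory_saver.resume()  # memory re-backed at the same virtual addresses
```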

Others:

  • Add Engine.release_gpu_occupation/resume_gpu_occupation: this requires adding several request structs such as ReleaseGPUOccupationReqInput, plus forwarding boilerplate in Engine/TokenizerManager (see the usage sketch after this list)
  • Add a base class BaseCausalLM: this requires changing every model's parent class.
  • Add tests
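
A hypothetical end-to-end usage of the new Engine methods (assuming Engine is importable as sgl.Engine; the model path and sampling params are illustrative):

```python
import sglang as sgl

engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    enable_memory_saver=True,
)

print(engine.generate("Hello", {"max_new_tokens": 8}))

engine.release_gpu_occupation()  # KV cache and weights freed for other jobs
# ... e.g. run a training step of a co-located job ...
engine.resume_gpu_occupation()   # re-acquire the memory; since contents were
                                 # discarded, weights must be re-synced before
                                 # generating again
```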

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@merrymercy (Contributor) left a comment:

Almost there! Just one final comment.

@@ -590,6 +596,7 @@ def init_memory_pool(
max_context_len=self.model_config.context_len + 4,
device=self.device,
use_records=False,
enable_memory_saver=self.server_args.enable_memory_saver,
Contributor:

See this comment #2630 (comment)

@fzyzcjy (Collaborator, Author) commented Jan 13, 2025:

Oops, I misunderstood your question. I am worried it may not be safe, because in the future we may have tensors inside the req pool that need to be preserved across a memory release. (torch_memory_saver does not offload contents to the CPU; it simply discards them, in order to be faster.)

But anyway, it seems to be OK today (I will check), so I can update it if needed. From a quick skim, I worry that BaseTokenToKVPool.free_slots may be one such tensor that should not be released. I am too tired now to run any experiments, though.
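
To illustrate the concern (a sketch under the same assumed TorchMemorySaver API as above): state tensors would have to live outside the region, or be rebuilt after resume, because contents are discarded rather than offloaded.

```python
import torch
from torch_memory_saver import TorchMemorySaver  # assumed, as above

saver = TorchMemorySaver()

# Disposable data (e.g. the KV cache) is safe inside the region:
with saver.region():
    kv_data = torch.empty((1024, 128), dtype=torch.float16, device="cuda")

# Allocator state such as free_slots should stay outside the region:
free_slots = torch.arange(1024, dtype=torch.int64, device="cuda")

saver.pause()   # kv_data's physical memory is released; contents lost
saver.resume()  # kv_data re-backed with undefined contents; free_slots intact
```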

@merrymercy merged commit 923f518 into sgl-project:main on Jan 13, 2025
15 checks passed
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025