📌 Pin liger-kernel and vLLM #2952


Merged: 2 commits, Feb 24, 2025

Conversation

qgallouedec (Member)

liger-kernel: v0.5.3 introduced a bug, see linkedin/Liger-Kernel#586
vLLM: starting from v0.7.3, training hangs while gathering. Reported in the vLLM Slack:


Hi there!

I wanted to see how it would speed things up with GRPO, but a subsequent gather seems to hang with 0.7.3 (it's not the case with 0.7.2): any idea why?

# demo_vllm.py
from unittest.mock import patch
from accelerate.utils import gather_object
from accelerate import Accelerator
from vllm import LLM


def main():
    accelerator = Accelerator()
    if accelerator.is_main_process:
        # vLLM is not compatible with accelerate. So we need to patch it to make sure we can (1) place the vLLM model
        # on the desired device (world_size_patch) and (2) avoid a test that is not designed for our setting (profiling_patch).
        world_size_patch = patch("torch.distributed.get_world_size", return_value=1)
        profiling_patch = patch("vllm.worker.worker.Worker._assert_memory_footprint_increased_during_profiling", return_value=None)
        with world_size_patch, profiling_patch:
            LLM(model="Qwen/Qwen2.5-1.5B", device="cuda:7")

    # When using vLLM, the main process is responsible for loading the model weights. This can cause process
    # desynchronization and seems to lead to DeepSpeed hanging during initialization. To prevent this, we
    # synchronize all processes after vLLM has been fully initialized.
    accelerator.wait_for_everyone()

    prompts_text = ["Some text"]
    gather_object(prompts_text)  # it hangs here
    print("after gather")


if __name__ == "__main__":
    main()
accelerate launch --num_processes 7 demo_vllm.py
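The two reported bugs above pin down the affected version ranges. As a minimal sketch of that logic (the exact bounds are assumptions drawn from this PR: liger-kernel 0.5.3 is buggy per linkedin/Liger-Kernel#586, and vLLM hangs on gather from 0.7.3 onward; the helper names here are hypothetical, not part of TRL):

```python
def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '0.7.3' into (0, 7, 3) for tuple comparison."""
    return tuple(int(part) for part in v.split("."))


def is_known_bad(package: str, version: str) -> bool:
    """Return True if the given version matches one of the bugs reported in this PR."""
    v = parse_version(version)
    if package == "liger-kernel":
        return v == (0, 5, 3)   # kernel bug introduced in 0.5.3
    if package == "vllm":
        return v >= (0, 7, 3)   # gather hangs starting from 0.7.3
    return False
```

In practice the fix in this PR is the equivalent version specifiers in the package metadata rather than a runtime check.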

@qgallouedec qgallouedec changed the title Pin liger-kernel and vLLM 📌 Pin liger-kernel and vLLM Feb 24, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec qgallouedec merged commit 45ccdef into main Feb 24, 2025
13 of 14 checks passed
@qgallouedec qgallouedec deleted the pin-liger-kernel branch February 24, 2025 23:34
qgallouedec added a commit that referenced this pull request Feb 25, 2025
* pin liger-kernel

* style
kashif pushed a commit that referenced this pull request Feb 27, 2025
* pin liger-kernel

* style
This was referenced Feb 27, 2025
jhinpan pushed a commit to jhinpan/trl-jin that referenced this pull request Mar 12, 2025
yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025