📌 Pin liger-kernel and vLLM #2952


Merged: 2 commits, Feb 24, 2025

Conversation

qgallouedec (Member)

liger-kernel: v0.5.3 introduced a bug, see linkedin/Liger-Kernel#586
vLLM: starting from v0.7.3, training hangs while gathering. Reported in the vLLM Slack:


Hi there!

I wanted to see how it would speed things up with GRPO, but a subsequent gather seems to hang with 0.7.3 (it's not the case with 0.7.2): any idea why?

# demo_vllm.py
from unittest.mock import patch
from accelerate.utils import gather_object
from accelerate import Accelerator
from vllm import LLM


def main():
    accelerator = Accelerator()
    if accelerator.is_main_process:
        # vLLM is not compatible with accelerate. So we need to patch it to make sure we can (1) place the vLLM model
        # on the desired device (world_size_patch) and (2) avoid a test that is not designed for our setting (profiling_patch).
        world_size_patch = patch("torch.distributed.get_world_size", return_value=1)
        profiling_patch = patch("vllm.worker.worker.Worker._assert_memory_footprint_increased_during_profiling", return_value=None)
        with world_size_patch, profiling_patch:
            LLM(model="Qwen/Qwen2.5-1.5B", device="cuda:7")

    # When using vLLM, the main process is responsible for loading the model weights. This can cause process
    # desynchronization and seems to lead to DeepSpeed hanging during initialization. To prevent this, we
    # synchronize all processes after vLLM has been fully initialized.
    accelerator.wait_for_everyone()

    prompts_text = ["Some text"]
    gather_object(prompts_text)  # it hangs here
    print("after gather")


if __name__ == "__main__":
    main()
accelerate launch --num_processes 7 demo_vllm.py
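The two reported bugs above pin down the affected version ranges. As a minimal sketch of that logic (the exact bounds are assumptions drawn from this PR: liger-kernel 0.5.3 is buggy per linkedin/Liger-Kernel#586, and vLLM hangs on gather from 0.7.3 onward; the helper names here are hypothetical, not part of TRL):

```python
def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '0.7.3' into (0, 7, 3) for tuple comparison."""
    return tuple(int(part) for part in v.split("."))


def is_known_bad(package: str, version: str) -> bool:
    """Return True if the given version matches one of the bugs reported in this PR."""
    v = parse_version(version)
    if package == "liger-kernel":
        return v == (0, 5, 3)   # kernel bug introduced in 0.5.3
    if package == "vllm":
        return v >= (0, 7, 3)   # gather hangs starting from 0.7.3
    return False
```

In practice the fix in this PR is the equivalent version specifiers in the package metadata rather than a runtime check.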

@qgallouedec qgallouedec changed the title Pin liger-kernel and vLLM 📌 Pin liger-kernel and vLLM Feb 24, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec qgallouedec merged commit 45ccdef into main Feb 24, 2025
13 of 14 checks passed
@qgallouedec qgallouedec deleted the pin-liger-kernel branch February 24, 2025 23:34
qgallouedec added a commit that referenced this pull request Feb 25, 2025
* pin liger-kernel

* style
kashif pushed a commit that referenced this pull request Feb 27, 2025
* pin liger-kernel

* style
This was referenced Feb 27, 2025
jhinpan pushed a commit to jhinpan/trl-jin that referenced this pull request Mar 12, 2025
yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025