[XPU] Enable external_launcher to serve as an executor via torchrun #21021
Conversation
Signed-off-by: chzhang <chaojun.zhang@intel.com>
Code Review
This pull request enables `torchrun` for XPU workers by specifying the correct process group for the `all_reduce` warmup call. The change is correct for multi-GPU scenarios. However, I've identified a critical issue where this could lead to a crash in single-GPU environments. I've provided a suggestion to make the `all_reduce` call conditional on the world size to ensure robustness.
```python
torch.distributed.all_reduce(torch.zeros(1).xpu(),
                             group=get_world_group().device_group)
```
This `all_reduce` call for warming up oneCCL is necessary for multi-GPU setups, but it will likely fail in a single-GPU environment (`world_size=1`), especially if `torch.distributed` is not initialized. The warmup is not needed for a single device. To prevent crashes in single-GPU deployments, this operation should be conditional on `self.parallel_config.world_size > 1`.
Suggested change:

```diff
-torch.distributed.all_reduce(torch.zeros(1).xpu(),
-                             group=get_world_group().device_group)
+if self.parallel_config.world_size > 1:
+    torch.distributed.all_reduce(torch.zeros(1).xpu(),
+                                 group=get_world_group().device_group)
```
The reason we introduced this is that there are known hang issues in previous oneccl/torch-ccl releases. This should be fixed by recent low-level updates; torch 2.8 is confirmed to fix it.
So I think we can keep this for now, and we should remove this workaround when we upgrade to torch 2.8.
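For context, a minimal sketch of keeping the warmup with the reviewer's world-size guard and a removal note; the helper name is illustrative, not the PR's actual code:

```python
import torch
import torch.distributed
from vllm.distributed.parallel_state import get_world_group


def warm_up_oneccl(world_size: int) -> None:
    # Older oneccl/torch-ccl releases could hang on the first collective, so
    # a tiny all_reduce is issued up front as a workaround.
    # TODO: remove once the torch 2.8 upgrade lands, where the hang is fixed.
    if world_size > 1:  # skip on single-device deployments
        torch.distributed.all_reduce(torch.zeros(1).xpu(),
                                     group=get_world_group().device_group)
```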
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
LGTM. cc @ys950902 Please take a look.
@simon-mo @DarkLight1337, could you help review this PR?
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Purpose
torchrun pre-initializes the distributed environment and binds the default process group to the Gloo backend, which causes `all_reduce` on an XPU tensor to fail. Fix by specifying the group as the device group returned by `get_world_group()`.
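In other words, the warmup collective has to target the device communicator rather than the torchrun-created default (Gloo) group. A minimal sketch of the idea, assuming the call sits in the XPU worker's device initialization (the helper name and import path are assumptions, not the PR's exact code):

```python
import torch
import torch.distributed
from vllm.distributed.parallel_state import get_world_group


def warm_up_xpu_communicator() -> None:
    """Run a tiny all_reduce on an XPU tensor through the device group.

    Under torchrun the default process group uses Gloo, which cannot reduce
    XPU tensors; get_world_group().device_group is the XPU-capable
    (oneCCL-backed) communicator, so the collective is routed there instead.
    """
    torch.distributed.all_reduce(torch.zeros(1).xpu(),
                                 group=get_world_group().device_group)
```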
Test Plan
export VLLM_WORKER_MULTIPROC_METHOD=spawn
cd tests/
torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
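For reference, a minimal sketch of what a torchrun-driven check like this typically does with the external launcher backend; the model name and prompt are placeholders, not the actual contents of `test_torchrun_example.py`:

```python
# Each torchrun rank builds its own LLM; with the external_launcher backend
# vLLM reuses the process group that torchrun already initialized.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",                        # placeholder model
    tensor_parallel_size=4,                           # matches --nproc-per-node=4
    distributed_executor_backend="external_launcher",
)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
for out in outputs:
    print(out.outputs[0].text)
```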
Test Result
(Optional) Documentation Update