feat: optimize refit by reducing set of IPC handles sent to each device #634

ZhiyuLi-Nvidia · 2025-07-10T02:39:12Z

What does this PR do ?

Optimization for refitting:

Avoid duplicate IPC handle passing across Ray: this removes one linear factor in TP scaling during refitting. Instead of broadcasting all IPC handles to every worker, send only the ones each device needs. The argument-serialization overhead is now constant with respect to TP size.
- Saw a ~25% gain in refitting speed at small scale; the gain is expected to be much larger at large TP sizes.
- In dsv3 with 64 TP, observed a 3.5x speedup (from 420s to 120s) with this change.
~~avoid waiting for update_weights_from_ipc_handles for better overlap:~~
- ~~avoid unnecessary waiting time for better overlap ~8% gain in refitting speed in small scale~~
- Reverted because of unsafe memory control.
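
A minimal sketch of the filtering idea described above, not the code in this PR. The handle layout (device UUID mapped to a list of name/handle pairs) and the function names `broadcast_payloads`, `filtered_payloads`, and `total_entries` are assumptions made for illustration only.

```python
from typing import Any, Dict, List

# Hypothetical shape: device UUID -> list of (tensor name, IPC handle) pairs.
IpcHandles = Dict[str, List[Any]]


def broadcast_payloads(all_handles: IpcHandles, worker_devices: List[List[str]]) -> List[IpcHandles]:
    """Broadcast style: every worker receives every device's handles."""
    return [all_handles for _ in worker_devices]


def filtered_payloads(all_handles: IpcHandles, worker_devices: List[List[str]]) -> List[IpcHandles]:
    """Filtered style: each worker receives only the handles for its own devices."""
    return [
        {uuid: all_handles[uuid] for uuid in devices if uuid in all_handles}
        for devices in worker_devices
    ]


def total_entries(payloads: List[IpcHandles]) -> int:
    """Count per-device entries across all payloads, a rough proxy for serialization work."""
    return sum(len(payload) for payload in payloads)


if __name__ == "__main__":
    tp_size = 4
    all_handles = {f"GPU-{i}": [(f"weight_{i}", f"handle_{i}")] for i in range(tp_size)}
    worker_devices = [[f"GPU-{i}"] for i in range(tp_size)]
    print(total_entries(broadcast_payloads(all_handles, worker_devices)))  # 16
    print(total_entries(filtered_payloads(all_handles, worker_devices)))   # 4
```

In this toy setup each worker's payload shrinks from `tp_size` entries to one, which is the "one linear factor" removed from the per-worker argument size.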

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

- Make sure you read and followed Contributor guidelines
- Did you write any new necessary tests?
- Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
- Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

nemo_rl/models/generation/vllm.py

ZhiyuLi-Nvidia · 2025-07-12T00:54:00Z

This change is compatible with async_llm.
~~TODO: fix conflicts in async LLM~~

guyueh1

LGTM

nemo_rl/models/generation/vllm.py

yuki-97 · 2025-07-15T03:33:24Z

@ZhiyuLi-Nvidia report_device_id is already supported in asyncLLM, see report_device_id_async.

ZhiyuLi-Nvidia · 2025-07-15T05:22:18Z

> @ZhiyuLi-Nvidia report_device_id is already supported in asyncLLM, see report_device_id_async.

Yep, I have seen this implementation. I had some issues calling it correctly within VllmGenerationWorker:

- report_device_id_async is now used in VllmGeneration along with worker_group functions like run_all_workers_single_data.
- I want to call it and wait for the result within the VllmGenerationWorker scope. I have tried, but I either get future/coroutine objects back or fail to call the async report_device_id_async function correctly. I still need some time to investigate.

Let me know if you have any suggestions.
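
For illustration only, here is the general Python behaviour behind the "returning future/coroutine objects" symptom. The `report_device_id_async` below is a hypothetical stand-in that returns a fake device id; it is not the vLLM or NeMo RL implementation, and the helper names are assumptions.

```python
import asyncio


async def report_device_id_async() -> str:
    """Hypothetical stand-in for the worker's async method; returns a fake device id."""
    await asyncio.sleep(0)  # yield to the event loop once, as real async work would
    return "GPU-0"


def call_without_waiting():
    """Calling an async function only creates a coroutine object; nothing runs yet."""
    return report_device_id_async()


def call_and_wait() -> str:
    """Drive the coroutine to completion from synchronous code and return its result."""
    return asyncio.run(report_device_id_async())


if __name__ == "__main__":
    pending = call_without_waiting()
    print(asyncio.iscoroutine(pending))  # True
    pending.close()  # avoid a "never awaited" warning for the unused coroutine
    print(call_and_wait())  # GPU-0
```

Note that `asyncio.run` raises a `RuntimeError` when an event loop is already running in the current thread, so this pattern only applies when the caller is plain synchronous code.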

…ce (#634) Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

…ce (NVIDIA-NeMo#634) Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com> Signed-off-by: Jialei Chen <jialeic@google.com>

…ce (#634) Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>

…ce (NVIDIA-NeMo#634) Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>

…ce (NVIDIA-NeMo#634) Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>

ZhiyuLi-Nvidia requested review from yfw, parthchadha, yuki-97 and guyueh1 July 10, 2025 02:39

ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch 2 times, most recently from f9e29be to 98f2b88 Compare July 10, 2025 06:32

yuki-97 reviewed Jul 10, 2025

nemo_rl/models/generation/vllm.py (outdated, resolved)

nemo_rl/models/generation/vllm.py (outdated, resolved)

guyueh1 reviewed Jul 10, 2025

nemo_rl/models/generation/vllm.py (resolved)

guyueh1 previously approved these changes Jul 10, 2025

terrykong changed the title ~~feat: refitting optimization~~ feat: optimize refit by reducing set of IPC handles sent to each device Jul 10, 2025

ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch from a3b950c to 97b4679 Compare July 10, 2025 23:16

yuki-97 previously approved these changes Jul 10, 2025

parthchadha previously approved these changes Jul 10, 2025

terrykong enabled auto-merge July 11, 2025 00:33

terrykong added this pull request to the merge queue Jul 11, 2025

yfw previously approved these changes Jul 11, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 11, 2025

ZhiyuLi-Nvidia dismissed stale reviews from yfw, parthchadha, yuki-97, and guyueh1 via d8d87e0 July 11, 2025 23:25

ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch from d8d87e0 to 21cb80d Compare July 11, 2025 23:26

guyueh1 previously approved these changes Jul 12, 2025

ZhiyuLi-Nvidia dismissed guyueh1’s stale review via d0f4e35 July 14, 2025 07:48

ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch from 21cb80d to d0f4e35 Compare July 14, 2025 07:48

terrykong added the r0.3.0 Release r0.3.0 label Jul 14, 2025

yfw reviewed Jul 14, 2025

nemo_rl/models/generation/vllm.py (resolved)

yfw previously approved these changes Jul 14, 2025

guyueh1 previously approved these changes Jul 14, 2025

ZhiyuLi-Nvidia dismissed stale reviews from guyueh1 and yfw via ff05a1a July 15, 2025 06:56

terrykong approved these changes Jul 15, 2025

terrykong enabled auto-merge July 15, 2025 21:14

ZhiyuLi-Nvidia and others added 4 commits July 15, 2025 23:16

feat: optimize refit by reducing set of IPC handles sent to each device

56aac87

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

fix lint

4ae37a0

fix lint Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

add TODO

c70e031

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

feat: add post_init to support get device_id in async vllm init (#668)

22f18b9

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch from ff05a1a to 22f18b9 Compare July 15, 2025 23:17

terrykong added this pull request to the merge queue Jul 15, 2025

Merged via the queue into main with commit 22eea9b Jul 16, 2025
13 of 14 checks passed

terrykong deleted the zhiyul/refit_optimization branch July 16, 2025 01:35
