Skip to content

Conversation

ZhiyuLi-Nvidia
Copy link
Contributor

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia commented Jul 10, 2025

What does this PR do ?

Optimization for refitting:

  • avoid duplicate ipc passing across ray: it removes one linearity factor in TP scaling in refitting. Instead of broadcast all ipc handles just sent the necessary ones. The overhead in argument serialization is constant to TP scaling now.
    • see ~25% gain in refitting speed in small scale and expect to be much more impactful with with large TP size.
    • in dsv3 w/ 64 tp. Found 3.5x speed (from 420s to 120s) with the change.
  • avoid waiting for update_weights_from_ipc_handles for better overlap:
    • avoid unnecessary waiting time for better overlap ~8% gain in refitting speed in small scale
    • reverted given unsafe memory control

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch 2 times, most recently from f9e29be to 98f2b88 Compare July 10, 2025 06:32
guyueh1
guyueh1 previously approved these changes Jul 10, 2025
@terrykong terrykong changed the title feat: refitting optimization feat: optimize refit by reducing set of IPC handles sent to each device Jul 10, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch from a3b950c to 97b4679 Compare July 10, 2025 23:16
yuki-97
yuki-97 previously approved these changes Jul 10, 2025
parthchadha
parthchadha previously approved these changes Jul 10, 2025
@terrykong terrykong enabled auto-merge July 11, 2025 00:33
@terrykong terrykong added this pull request to the merge queue Jul 11, 2025
yfw
yfw previously approved these changes Jul 11, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 11, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia dismissed stale reviews from yfw, parthchadha, yuki-97, and guyueh1 via d8d87e0 July 11, 2025 23:25
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch from d8d87e0 to 21cb80d Compare July 11, 2025 23:26
@ZhiyuLi-Nvidia
Copy link
Contributor Author

ZhiyuLi-Nvidia commented Jul 12, 2025

This change is compatible to async_llm.
TODO: fix conflicts in async LLM

guyueh1
guyueh1 previously approved these changes Jul 12, 2025
Copy link
Contributor

@guyueh1 guyueh1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch from 21cb80d to d0f4e35 Compare July 14, 2025 07:48
@terrykong terrykong added the r0.3.0 Release r0.3.0 label Jul 14, 2025
yfw
yfw previously approved these changes Jul 14, 2025
guyueh1
guyueh1 previously approved these changes Jul 14, 2025
@yuki-97
Copy link
Contributor

yuki-97 commented Jul 15, 2025

@ZhiyuLi-Nvidia report_device_id is already supported in asyncLLM, see report_device_id_async.

@ZhiyuLi-Nvidia
Copy link
Contributor Author

@ZhiyuLi-Nvidia report_device_id is already supported in asyncLLM, see report_device_id_async.

Yeap. I have seen this implementation. I had some issues to correctly call it within VllmGenerationWorker:

  • report_device_id_async is now used in VllmGeneration along with worker_group functions like run_all_workers_single_data
  • I want to call and wait for the result within the VllmGenerationWorker scope. I have tried but either returning future/coroutine objects or not correctly call the async report_device_id_async function. Still need sometime to investigating.

Let me know if you have any suggestions

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia dismissed stale reviews from guyueh1 and yfw via ff05a1a July 15, 2025 06:56
@terrykong terrykong enabled auto-merge July 15, 2025 21:14
ZhiyuLi-Nvidia and others added 4 commits July 15, 2025 23:16
fix lint

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/refit_optimization branch from ff05a1a to 22f18b9 Compare July 15, 2025 23:17
@terrykong terrykong added this pull request to the merge queue Jul 15, 2025
Merged via the queue into main with commit 22eea9b Jul 16, 2025
13 of 14 checks passed
@terrykong terrykong deleted the zhiyul/refit_optimization branch July 16, 2025 01:35
ZhiyuLi-Nvidia added a commit that referenced this pull request Jul 21, 2025
…ce (#634)

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
jialei777 pushed a commit to jialei777/nemo-rl that referenced this pull request Jul 23, 2025
…ce (NVIDIA-NeMo#634)

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
…ce (#634)

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request Jul 30, 2025
…ce (NVIDIA-NeMo#634)

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
…ce (NVIDIA-NeMo#634)

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
…ce (NVIDIA-NeMo#634)

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025
…ce (NVIDIA-NeMo#634)

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
r0.3.0 Release r0.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants