Contributor

@guyueh1 guyueh1 commented Jul 1, 2025

What does this PR do?

During refit, pack the weights in each bucket so that only one IPC call is made per bucket, instead of one call per weight.
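To illustrate the idea, here is a minimal sketch of packing one bucket into a single flat buffer plus offset metadata. It is not the code from this PR: `pack_bucket`, `unpack_bucket`, and the `Meta` layout are hypothetical names, and plain `array` buffers stand in for GPU tensors and IPC handles.

```python
from array import array
from typing import Dict, List, Tuple

# Hypothetical metadata entry: (weight name, start index, element count).
Meta = Tuple[str, int, int]


def pack_bucket(weights: Dict[str, array]) -> Tuple[array, List[Meta]]:
    """Concatenate every weight in a bucket into one flat float32 buffer."""
    packed = array("f")
    meta: List[Meta] = []
    for name, values in weights.items():
        # Record where this weight starts before appending it.
        meta.append((name, len(packed), len(values)))
        packed.extend(values)
    return packed, meta


def unpack_bucket(packed: array, meta: List[Meta]) -> Dict[str, array]:
    """Recover each weight from the packed buffer using the recorded offsets."""
    return {name: packed[start:start + count] for name, start, count in meta}
```

With this layout the sender would share one buffer (and its metadata) per bucket, and the receiver would slice the individual weights back out, so the number of transfers scales with the number of buckets rather than the number of weights.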

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
guyueh1 and others added 3 commits July 3, 2025 08:53
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@parthchadha parthchadha added the CI:L0 Run doctests and unit tests label Jul 3, 2025
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jul 4, 2025
Contributor

yuki-97 commented Jul 4, 2025

It seems there's a flaky unit test (#610); I'll look into it.

@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jul 4, 2025
Contributor

yuki-97 commented Jul 4, 2025

Hi @guyueh1, I think you have some refit timing numbers with and without your PR. Could you paste them in the PR?

Contributor Author

guyueh1 commented Jul 7, 2025

@yuki-666 A reference data point: for a deepseek-v3 refit, the total time for update_weights_from_ipc_handles is reduced by 1.8x.

@terrykong terrykong added this pull request to the merge queue Jul 7, 2025
Merged via the queue into main with commit adb9e61 Jul 8, 2025
21 of 23 checks passed
@terrykong terrykong deleted the guyueh/feat_refit_reduce_ipc_calls branch July 8, 2025 01:36
jialei777 pushed a commit to jialei777/nemo-rl that referenced this pull request Jul 23, 2025
…A-NeMo#589)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
…A-NeMo#589)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>