Skip to content

Conversation

yuki-97
Copy link
Contributor

@yuki-97 yuki-97 commented Jul 10, 2025

Separate the refit process changes from #613.

What does this PR do ?

  1. Update refit process, prepare refit info ahead of time to avoid repeated sending metadata from train side to inference side each time.
  2. Add NRL_REFIT_BUFFER_MEMORY_RATIO and update the default value from 10% to 20%.
  3. Found the dtype of some tensors (e.g. e_score_correction_bias) will change during training, have some special handle with it, and refit_param_info_mcore is not cached for now because of this.

Test Result

convergence

dtensor mcore mcore w/ packing (dsv3 w/ 64 tp)
image image image

time cost
In mcore w/ packing (dsv3 w/ 64 tp)

  1. Speedup from 420s to 105s with preparing metainfo.
  2. Speedup from 105s to 55s with updating refit buffer memory ratio.

*The ~20s overhead is due to offload.
image

Refit Process Changes

Colocated

Previous

  • Each time in refit
    1. prepare_weights_for_ipc in train side.
    2. get_weights_ipc_handles in train side and update_weights_from_ipc_handles in inference side.

Now

  • Init
    1. prepare_refit_info in train side.
  • Each time in refit
    1. prepare_weights_for_ipc in train side.
    2. get_weights_ipc_handles in train side and update_weights_from_ipc_handles in inference side.

Non-colocated

Previous

  • Init
    1. init_collective in both train and inference side.
  • Each time in refit
    1. prepare_info_for_collective in train side.
    2. broadcast_weights_for_collective in train side and update_weights_from_collective in inference side.

Now

  • Init
    1. init_collective in both train and inference side.
    2. prepare_refit_info in both train and inference side.
  • Each time in refit
    1. broadcast_weights_for_collective in train side and update_weights_from_collective in inference side.

@ZhiyuLi-Nvidia
Copy link
Contributor

@yuki-666
just curious, does the refit_info correspond to the policy model (like DTensor or Mcore) or generation model (vllm one)?
At first glance, look like refit_info in the change is relevant to the policy model. We probably need the generation model (vllm one) to avoid duplicated meta data transfer in collective_rpc.
Do you know if we can extract the dtype and shape without actually converting the params (from policy model to the generation one).

@yuki-97 yuki-97 force-pushed the yukih/prepare-refit-info branch from 61abcf9 to de56749 Compare July 11, 2025 04:29
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jul 11, 2025
@yuki-97 yuki-97 force-pushed the yukih/prepare-refit-info branch from de56749 to f57e799 Compare July 11, 2025 06:20
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 11, 2025
@yuki-97 yuki-97 force-pushed the yukih/prepare-refit-info branch from f57e799 to fc4d64e Compare July 11, 2025 07:21
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 11, 2025
@yuki-97
Copy link
Contributor Author

yuki-97 commented Jul 11, 2025

Do you know if we can extract the dtype and shape without actually converting the params (from policy model to the generation one).

Yup, I added it in ebb874a.

ZhiyuLi-Nvidia
ZhiyuLi-Nvidia previously approved these changes Jul 11, 2025
Copy link
Contributor

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @yuki-666. LGTM!

@yuki-97 yuki-97 force-pushed the yukih/prepare-refit-info branch from 9c0e833 to e9b22fd Compare July 13, 2025 11:08
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 13, 2025
@yuki-97 yuki-97 changed the title feat: prepare refit info feat: optimize refit by preparing refit info ahead of time Jul 14, 2025
@yuki-97 yuki-97 marked this pull request as ready for review July 14, 2025 06:46
yuki-97 added 8 commits July 15, 2025 20:39
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
…mcore for speedup

Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/prepare-refit-info branch from c685241 to f152f61 Compare July 16, 2025 03:40
yfw
yfw previously approved these changes Jul 16, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@terrykong terrykong added this pull request to the merge queue Jul 16, 2025
Merged via the queue into main with commit 8f7d71e Jul 16, 2025
13 of 14 checks passed
@terrykong terrykong deleted the yukih/prepare-refit-info branch July 16, 2025 21:50
ZhiyuLi-Nvidia pushed a commit that referenced this pull request Jul 21, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
jialei777 pushed a commit to jialei777/nemo-rl that referenced this pull request Jul 23, 2025
…Mo#638)

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
yaoyu-33 added a commit that referenced this pull request Jul 28, 2025
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request Jul 30, 2025
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025
…Mo#638)

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI:L1 Run doctests, unit tests, and functional tests r0.3.0 Release r0.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants