feat: optimize refit by preparing refit info ahead of time #638
Yup, I added it in ebb874a.
Thank you @yuki-666. LGTM!
Signed-off-by: Yuki Huang <yukih@nvidia.com>
…mcore for speedup Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
…Mo#638) Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
)" This reverts commit 8f7d71e
…Mo#638) Signed-off-by: Yuki Huang <yukih@nvidia.com>
…Mo#638) Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>
Separate the refit process changes from #613.
What does this PR do?
e_score_correction_bias will change during training, so it is handled specially, and refit_param_info_mcore is not cached for now because of this.
Test Result
convergence
time cost
In mcore w/ packing (dsv3 w/ 64 tp)
*The ~20s overhead is due to offload.

Refit Process Changes

Colocated

Previous
1. prepare_weights_for_ipc on the train side.
2. get_weights_ipc_handles on the train side and update_weights_from_ipc_handles on the inference side.

Now
1. prepare_refit_info on the train side.
2. prepare_weights_for_ipc on the train side.
3. get_weights_ipc_handles on the train side and update_weights_from_ipc_handles on the inference side.

Non-colocated

Previous
1. init_collective on both the train and inference sides.
2. prepare_info_for_collective on the train side.
3. broadcast_weights_for_collective on the train side and update_weights_from_collective on the inference side.

Now
1. init_collective on both the train and inference sides.
2. prepare_refit_info on both the train and inference sides.
3. broadcast_weights_for_collective on the train side and update_weights_from_collective on the inference side.
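
As a rough illustration of the colocated "Now" ordering, here is a toy Python sketch. It is not the code in this PR: ToyTrainWorker, ToyInferenceWorker, set_refit_info, the method signatures, and the use of plain lists in place of tensors and IPC handles are all assumptions made for illustration. It also assumes the refit info is handed to the inference side once at setup. Only the method names prepare_refit_info, prepare_weights_for_ipc, get_weights_ipc_handles, and update_weights_from_ipc_handles come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrainWorker:
    """Hypothetical stand-in for the train side; not the project's real class."""
    weights: dict  # name -> list of floats (stands in for tensors)
    calls: list = field(default_factory=list)

    def prepare_refit_info(self):
        # One-time step: describe every parameter (here: name -> element count).
        self.calls.append("prepare_refit_info")
        return {name: len(values) for name, values in self.weights.items()}

    def prepare_weights_for_ipc(self):
        # Per-refit step: decide which parameters go out this round.
        self.calls.append("prepare_weights_for_ipc")
        return list(self.weights)

    def get_weights_ipc_handles(self, names):
        # A real system would return IPC handles; the toy returns copies.
        self.calls.append("get_weights_ipc_handles")
        return {name: list(self.weights[name]) for name in names}


@dataclass
class ToyInferenceWorker:
    """Hypothetical stand-in for the inference side."""
    refit_info: dict = field(default_factory=dict)
    weights: dict = field(default_factory=dict)

    def set_refit_info(self, info):  # hypothetical helper name
        self.refit_info = info

    def update_weights_from_ipc_handles(self, handles):
        for name, values in handles.items():
            # Check the payload against the metadata received ahead of time.
            assert len(values) == self.refit_info[name]
            self.weights[name] = values


def refit_colocated(train, inference):
    """Steps 2 and 3 of the colocated 'Now' flow; step 1 already ran at setup."""
    names = train.prepare_weights_for_ipc()
    handles = train.get_weights_ipc_handles(names)
    inference.update_weights_from_ipc_handles(handles)


train = ToyTrainWorker(weights={"w": [1.0, 2.0], "b": [0.5]})
inference = ToyInferenceWorker()
inference.set_refit_info(train.prepare_refit_info())  # step 1, done once

for step in range(2):
    train.weights["w"] = [float(step), 2.0]  # pretend a training step ran
    refit_colocated(train, inference)
```

In this toy, the metadata is computed once while the per-refit path only moves values. That is the kind of saving the PR title describes, under the assumptions stated above.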