@ashors1 ashors1 commented Jun 26, 2025

What does this PR do?

When overlap param gather is enabled, Mcore performs a single param all-gather after each backward pass, before the first forward step that follows. This is a problem when that first forward pass uses the reference model parameters: the all-gather then runs on the reference parameters rather than on the parameters of the model being trained. The fix is to disable overlap param gather while running the reference model forward pass and re-enable it afterwards.
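
The disable/re-enable pattern described above can be sketched as a context manager. This is only an illustration under stated assumptions, not the code in this PR: `ToyModel`, its `overlap_param_gather` flag, and the `disable_overlap_param_gather` / `enable_overlap_param_gather` methods are hypothetical stand-ins, and the actual toggle exposed by Mcore is not shown here.

```python
from contextlib import contextmanager


class ToyModel:
    """Hypothetical stand-in for the wrapped model; not an Mcore class."""

    def __init__(self):
        self.overlap_param_gather = True
        self.log = []  # records (weights used, overlap state) per forward

    def disable_overlap_param_gather(self):
        # Hypothetical toggle; the real mechanism is an assumption here.
        self.overlap_param_gather = False

    def enable_overlap_param_gather(self):
        self.overlap_param_gather = True

    def forward(self, weights):
        self.log.append((weights, self.overlap_param_gather))


@contextmanager
def overlap_param_gather_disabled(model):
    """Turn overlap off inside the block, then restore the previous state."""
    was_enabled = model.overlap_param_gather
    if was_enabled:
        model.disable_overlap_param_gather()
    try:
        yield model
    finally:
        # Restore even if the reference forward pass raises.
        if was_enabled:
            model.enable_overlap_param_gather()


model = ToyModel()
with overlap_param_gather_disabled(model):
    model.forward("reference")  # reference forward: overlap is off
model.forward("policy")         # training forward: overlap is back on
```

Restoring the flag in a `finally` block means a failure during the reference forward pass should not leave overlap param gather disabled for later training steps.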

Issues

Closes #552

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines.
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build, and test the docs.

ashors1 added 2 commits June 26, 2025 09:26
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1 ashors1 requested a review from terrykong June 26, 2025 17:09
@ashors1 ashors1 changed the title from "Fix overlap param gather" to "fix: fix overlap param gather" Jun 26, 2025
@ashors1 ashors1 marked this pull request as ready for review June 27, 2025 00:06
@ashors1 ashors1 requested a review from SahilJain314 June 27, 2025 00:06
@terrykong terrykong added this pull request to the merge queue Jun 27, 2025
Merged via the queue into main with commit bb30ecd Jun 28, 2025
13 of 16 checks passed
@terrykong terrykong deleted the ashors/fix_overlap_param_gather branch June 28, 2025 02:31
xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request Jun 28, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request Jun 30, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request Jun 30, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Xuehan <xxman@google.com>
xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request Jun 30, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Xuehan <xxman@google.com>
therealnaveenkamal pushed a commit to therealnaveenkamal/RL that referenced this pull request Jul 7, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request Jul 14, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
Signed-off-by: ashors1 <ashors@nvidia.com>
Development

Successfully merging this pull request may close these issues.

Fix overlap param gather + distributed optimizer in Megatron path
2 participants