feat: Add Megatron-LM based training #517

Merged

SahilJain314 merged 57 commits into main from sahilj/megatron_base

Jun 17, 2025

Contributor

SahilJain314 commented Jun 13, 2025

No description provided.

SahilJain314 and others added 3 commits

June 13, 2025 15:06


          Added Megatron

b9c08ef

Co-authored-by: Sahil Jain <sahilj@nvidia.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          bugfix

1cb40d7

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Lint and remove sequence packing artifacts

c086e59

Signed-off-by: Sahil Jain <sahilj@nvidia.com>

SahilJain314 force-pushed the sahilj/megatron_base branch from f5f6c6a to c086e59

June 13, 2025 22:19

SahilJain314 added 17 commits

June 13, 2025 15:25


          Updated megatron setup func

cc92acd

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Updated uv lock

fb237d6

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Updated submodules

d293e61

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Brought in auto refit buffer size detection

af6753e

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          kwargs ray bug

3efe6ba

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          kwargs ray bug

0ebb26e

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          kwargs ray bug

4e0807a

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          kwargs ray bug

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          kwargs ray bug

45ba946

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Removed redundant test

e96e593

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Assert no CP

e8ad9dd

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          typefix

51e9cbf

Signed-off-by: Sahil Jain <sahilj@nvidia.com>

doc

b26083f

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Update megatron

c50e0ee

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Attempt fix checkpoint hang

6ebf0d2

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Attempt fix checkpoint hang

8b2e9f3

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Attempt fix checkpoint hang

e5c1598

Signed-off-by: Sahil Jain <sahilj@nvidia.com>

terrykong reviewed

nemo_rl/models/generation/vllm.py

terrykong reviewed

nemo_rl/models/megatron/converters/common.py (outdated)

SahilJain314 added 7 commits

June 13, 2025 23:46


          Attempt fix checkpoint hang

6ea5c1b

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Attempt fix checkpoint hang

19d5358

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Attempt fix checkpoint hang

7e125a8

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Added Simple megatron loss

c139a7d

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          tests

55eb8c7

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Updated async queue closure

3b5c84c

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Loss tensor

7e85181

Signed-off-by: Sahil Jain <sahilj@nvidia.com>

SahilJain314 added CI:L1 labels

SahilJain314 temporarily deployed to nemo-ci with GitHub Actions (inactive)

June 16, 2025 21:13

SahilJain314 added 3 commits

June 16, 2025 15:16


          disable param overlap gather

a046383

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          Disable expandable

c304ad1

Signed-off-by: Sahil Jain <sahilj@nvidia.com>


          lint

09d4470

Signed-off-by: Sahil Jain <sahilj@nvidia.com>

terrykong previously approved these changes

Contributor Author

SahilJain314 commented Jun 17, 2025

Reverted expandable_segments due to #522

SahilJain314 added this pull request to the merge queue

parthchadha reviewed

nemo_rl/models/megatron/community_import.py

nemo_rl/models/megatron/refit_utils.py (outdated)

nemo_rl/models/policy/megatron_policy_worker.py

nemo_rl/models/policy/megatron_policy_worker.py (outdated)

github-merge-queue bot removed this pull request from the merge queue due to failed status checks


          raised timeout for ci

5fda2ed

Signed-off-by: Sahil Jain <sahilj@nvidia.com>

SahilJain314 dismissed terrykong’s stale review via

5fda2ed

June 17, 2025 06:59

github-actions bot added the CI label

SahilJain314 enabled auto-merge

June 17, 2025 07:00

terrykong previously approved these changes


          comments

14e3120

Signed-off-by: Sahil Jain <sahilj@nvidia.com>

SahilJain314 dismissed terrykong’s stale review via

14e3120

June 17, 2025 07:07

terrykong approved these changes

SahilJain314 added this pull request to the merge queue

Merged via the queue into main with commit ab622eb

13 of 14 checks passed

SahilJain314 deleted the sahilj/megatron_base branch

June 17, 2025 09:18

parthchadha reviewed

examples/configs/grpo_math_1B_megatron.yaml

+                  enabled: true
+                  empty_unused_memory_level: 0
+                  activation_checkpointing: false
+                  converter_type: "Qwen2ForCausalLM"

Contributor

parthchadha Jun 17, 2025

Why do users need to provide this? Can we infer from the model name? If not, please add a comment saying how to set this for a new model.
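One way the inference asked about here could look, as a minimal sketch only. It assumes the model ships a Hugging Face style config whose `architectures` list names the model class. The helper `infer_converter_type` and the `SUPPORTED_CONVERTERS` set are hypothetical and are not part of this PR; the only converter name taken from the PR is `Qwen2ForCausalLM`.

```python
from typing import Optional

# Hypothetical registry: "Qwen2ForCausalLM" is the only converter name that
# appears in this PR's example config. A real registry would list more.
SUPPORTED_CONVERTERS = {"Qwen2ForCausalLM"}


def infer_converter_type(hf_config: dict, explicit: Optional[str] = None) -> str:
    """Pick a converter name, preferring an explicit user setting.

    hf_config is assumed to be the parsed config.json of a Hugging Face
    model, which normally carries an "architectures" list.
    """
    if explicit is not None:
        # A value set in YAML always wins over inference.
        return explicit
    for name in hf_config.get("architectures") or []:
        if name in SUPPORTED_CONVERTERS:
            return name
    raise ValueError(
        "Could not infer converter_type; set it explicitly in the config."
    )


if __name__ == "__main__":
    print(infer_converter_type({"architectures": ["Qwen2ForCausalLM"]}))
```

With a fallback like this, `converter_type` would only need to appear in YAML for models whose architecture is missing from the registry.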

examples/configs/grpo_math_1B_megatron.yaml

+                    enabled: false #coming soon
+                    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
+                    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
+                    algorithm: "modified_ffd"

Contributor

parthchadha Jun 17, 2025

What are the other options? Let's add a description of these here.
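For context on the `algorithm` field: the name `modified_ffd` suggests a variant of first-fit decreasing (FFD) bin packing. The sketch below shows classic FFD only, not the modified variant this config refers to, whose details are not given in this thread. The function name and the reading of `capacity` as a per-microbatch token budget (such as `train_mb_tokens`) are assumptions made for illustration.

```python
from typing import List


def first_fit_decreasing(lengths: List[int], capacity: int) -> List[List[int]]:
    """Pack sequence indices into bins whose total length is <= capacity.

    Classic FFD: visit sequences from longest to shortest and place each
    one into the first bin that still has room, opening a new bin if none does.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []   # sequence indices per bin
    remaining: List[int] = []    # free tokens per bin
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"sequence {i} (length {n}) exceeds capacity {capacity}")
        for b, free in enumerate(remaining):
            if n <= free:
                bins[b].append(i)
                remaining[b] -= n
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([i])
            remaining.append(capacity - n)
    return bins


if __name__ == "__main__":
    # Five sequences packed under a budget of 8 tokens per bin.
    print(first_fit_decreasing([5, 3, 7, 2, 4], capacity=8))
    # prints [[2], [0, 1], [4, 3]]
```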

examples/configs/grpo_math_1B_megatron.yaml

+                    overlap_param_gather: false
+                    average_in_collective: true
+                    use_custom_fsdp: false
+                    data_parallel_sharding_strategy: "optim_grads_params"

Contributor

parthchadha Jun 17, 2025

What are the other options? If there are no other options (or we don't expect folks to use them), I would prefer not to expose this in the YAML file.

examples/configs/grpo_math_1B_megatron.yaml

+                  temperature: 1.0
+                  top_p: 1.0
+                  top_k: null
+                  vllm_cfg:

Contributor

parthchadha Jun 17, 2025

Can you remove the args that are not being overridden from the 1B config?
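A rough sketch of how the redundant keys could be found, assuming both configs are already loaded as nested dictionaries. The helper `minimal_overrides` is hypothetical, and the sample values are illustrative rather than copied from the actual 1B config. Keys that exist only in the base are ignored.

```python
def minimal_overrides(base: dict, child: dict) -> dict:
    """Return only the entries of child that differ from base, recursively."""
    out = {}
    for key, value in child.items():
        if key not in base:
            out[key] = value
        elif isinstance(value, dict) and isinstance(base[key], dict):
            nested = minimal_overrides(base[key], value)
            if nested:  # keep a section only if something inside it changed
                out[key] = nested
        elif value != base[key]:
            out[key] = value
    return out


if __name__ == "__main__":
    # Illustrative values only.
    base = {
        "generation": {"temperature": 1.0, "top_p": 1.0, "top_k": None},
        "logger": {"wandb": {"project": "grpo-dev", "name": "base_1B"}},
    }
    child = {
        "generation": {"temperature": 1.0, "top_p": 1.0, "top_k": None},
        "logger": {"wandb": {"project": "grpo-dev", "name": "megatron_1B"}},
    }
    print(minimal_overrides(base, child))
    # prints {'logger': {'wandb': {'name': 'megatron_1B'}}}
```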

examples/configs/grpo_math_1B_megatron.yaml

+                monitor_gpus: false  # If true, will monitor GPU usage and log to wandb and/or tensorboard
+                wandb:
+                  project: "grpo-dev"
+                  name: "sj_megatron_1B"

Contributor

parthchadha Jun 17, 2025

Remove the sj prefix.

nemo_rl/models/policy/megatron_policy_worker.py

+                        The logprob of input token i is specified at position i in the output logprobs tensor.
+                      """
+                      no_grad = torch.no_grad()
+                      no_grad.__enter__()

Contributor

parthchadha Jun 17, 2025

What's the reason for not using a context manager here?
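To illustrate the concern behind this question: calling `__enter__()` by hand leaves the state changed if an exception fires before the matching `__exit__()`, while a `with` block always restores it. The sketch below uses a standard-library stand-in, `NoGradStub`, which is hypothetical and only mimics the enter/exit behaviour of a flag-toggling context manager such as `torch.no_grad()`. The diff excerpt above does not show where the PR calls `__exit__`, so this is a generic illustration and not a claim about the PR's code.

```python
class NoGradStub:
    """Stand-in for a context manager that disables a global flag."""

    enabled = True  # plays the role of "gradients are enabled"

    def __enter__(self):
        self._previous = NoGradStub.enabled
        NoGradStub.enabled = False
        return self

    def __exit__(self, exc_type, exc, tb):
        NoGradStub.enabled = self._previous
        return False  # do not swallow exceptions


def manual_enter(fail: bool) -> None:
    ctx = NoGradStub()
    ctx.__enter__()
    if fail:
        raise RuntimeError("forward pass failed")  # __exit__ is never reached
    ctx.__exit__(None, None, None)


def with_block(fail: bool) -> None:
    with NoGradStub():
        if fail:
            raise RuntimeError("forward pass failed")  # __exit__ still runs


def flag_after_failure(fn) -> bool:
    """Run fn with a simulated failure and report the flag afterwards."""
    NoGradStub.enabled = True
    try:
        fn(fail=True)
    except RuntimeError:
        pass
    return NoGradStub.enabled


if __name__ == "__main__":
    print(flag_after_failure(manual_enter))  # prints False: state leaked
    print(flag_after_failure(with_block))    # prints True: state restored
```

If manual entry is needed because the scope spans several code paths, wrapping the body in `try`/`finally` with the `__exit__` call in the `finally` clause gives the same guarantee.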

parthchadha pushed a commit that referenced this pull request


          feat: Add Megatron-LM based training (#517)

8985d65

Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com>

This was linked to issues Jun 24, 2025

Megatron Training + vLLM inference #47

Closed

Megatron Training + In-framework inference #48

Closed

YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request


          feat: Add Megatron-LM based training (NVIDIA-NeMo#517)

0d88be8

Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>

KiddoZhu pushed a commit that referenced this pull request


          feat: Add Megatron-LM based training (#517)

b58e9b5

Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>

SahilJain314 mentioned this pull request

Nd parallelism mesh description for replicating and sharding data #124

Closed

Labels

CI:L1 CI documentation