Conversation

SahilJain314 (Contributor): No description provided.

SahilJain314 and others added 3 commits June 13, 2025 15:06
Co-authored-by: Sahil Jain <sahilj@nvidia.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
@SahilJain314 force-pushed the sahilj/megatron_base branch from f5f6c6a to c086e59 on June 13, 2025 22:19
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
@SahilJain314 added the CI:L1 (Run doctests, unit tests, and functional tests) label Jun 16, 2025
terrykong previously approved these changes Jun 17, 2025
@SahilJain314 (Contributor, Author):
Reverted expandable_segments due to #522

@SahilJain314 SahilJain314 added this pull request to the merge queue Jun 17, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 17, 2025
@github-actions github-actions bot added the CI Relating to CI label Jun 17, 2025
@SahilJain314 SahilJain314 enabled auto-merge June 17, 2025 07:00
terrykong previously approved these changes Jun 17, 2025
@SahilJain314 SahilJain314 added this pull request to the merge queue Jun 17, 2025
Merged via the queue into main with commit ab622eb Jun 17, 2025
13 of 14 checks passed
@SahilJain314 SahilJain314 deleted the sahilj/megatron_base branch June 17, 2025 09:18
enabled: true
empty_unused_memory_level: 0
activation_checkpointing: false
converter_type: "Qwen2ForCausalLM"
Contributor:
Why do users need to provide this? Can we infer it from the model name? If not, please add a comment explaining how to set it for a new model.
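If inference is feasible, one option is to default this from the architecture that Hugging Face checkpoints already record in their config.json. A hypothetical sketch (the helper name and behavior are assumptions for illustration, not existing NeMo-RL API):

```python
import json

# Hypothetical helper (not existing NeMo-RL API): default converter_type to
# the architecture recorded in the checkpoint's Hugging Face config.json.
def infer_converter_type(hf_config, override=None):
    if override is not None:
        return override  # an explicit yaml setting wins
    architectures = hf_config.get("architectures", [])
    if len(architectures) != 1:
        raise ValueError(
            "Cannot infer converter_type from config.json; set it explicitly."
        )
    return architectures[0]

# Qwen2 checkpoints ship {"architectures": ["Qwen2ForCausalLM"]} in config.json.
cfg = json.loads('{"architectures": ["Qwen2ForCausalLM"]}')
print(infer_converter_type(cfg))  # Qwen2ForCausalLM
```

A lookup like this would make the yaml key optional for known checkpoints while still allowing an override for new models.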

enabled: false # coming soon
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
algorithm: "modified_ffd"
Contributor:
What are the other options? Let's add a description of them here.
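For reference, plain first-fit-decreasing (FFD) packs sequences longest-first into the first bucket with enough room. A generic sketch of that baseline (illustrative only; the modified_ffd variant used here may differ):

```python
# Illustrative first-fit-decreasing (FFD) bin packing of sequence lengths.
# This is the textbook baseline, not the actual "modified_ffd" implementation.
def ffd_pack(seq_lens, bin_capacity):
    """Pack sequence lengths into bins holding at most bin_capacity tokens."""
    bins = []       # each bin is a list of sequence lengths
    bin_space = []  # remaining token capacity per bin
    # Sort longest-first, then place each sequence in the first bin it fits.
    for length in sorted(seq_lens, reverse=True):
        for i, space in enumerate(bin_space):
            if length <= space:
                bins[i].append(length)
                bin_space[i] -= length
                break
        else:
            bins.append([length])
            bin_space.append(bin_capacity - length)
    return bins

packed = ffd_pack([700, 300, 500, 200, 100], bin_capacity=1000)
print(packed)  # [[700, 300], [500, 200, 100]]
```

Documenting how modified_ffd departs from this baseline (and what other algorithm values are accepted) would answer the question above.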

overlap_param_gather: false
average_in_collective: true
use_custom_fsdp: false
data_parallel_sharding_strategy: "optim_grads_params"
Contributor:
What are the other options? If there are none (or we don't expect folks to use the others), I would prefer not to expose this in the YAML file.

temperature: 1.0
top_p: 1.0
top_k: null
vllm_cfg:
Contributor:
Can you remove the args that are not being overridden from the 1B config?
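On the sampling knobs above (temperature, top_p, top_k): a minimal reference for how they typically combine, sketched in plain Python (illustrative only; vLLM's actual implementation differs):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, top_k=None, rng=random):
    """Sample one token id from raw logits with temperature / top-k / top-p."""
    # Temperature rescales logits, then softmax turns them into probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda p: p[1],
        reverse=True,
    )
    # top_k: keep only the k most likely tokens (null/None disables it).
    if top_k is not None:
        probs = probs[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving tokens and draw one.
    z = sum(p for _, p in kept)
    r = rng.random() * z
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With temperature 1.0, top_p 1.0, and top_k null, every filter is a no-op and sampling follows the raw softmax distribution, which is why these are the natural defaults.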

monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
name: "sj_megatron_1B"
Contributor:
Remove the sj prefix.

The logprob of input token i is specified at position i in the output logprobs tensor.
"""
no_grad = torch.no_grad()
no_grad.__enter__()
Contributor:
What's the reason not to use a context manager here?

parthchadha pushed a commit that referenced this pull request Jun 17, 2025
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request Jul 14, 2025
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Labels: CI:L1 (Run doctests, unit tests, and functional tests), CI (Relating to CI), documentation (Improvements or additions to documentation)

Successfully merging this pull request may close these issues:

Megatron Training + In-framework inference
Megatron Training + vLLM inference

4 participants