feat: Add Megatron-LM based training #517
Conversation
Co-authored-by: Sahil Jain <sahilj@nvidia.com> Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Sahil Jain <sahilj@nvidia.com>
f5f6c6a to c086e59
Reverted expandable_segments due to #522
enabled: true
empty_unused_memory_level: 0
activation_checkpointing: false
converter_type: "Qwen2ForCausalLM"
Why do users need to provide this? Can we infer from the model name? If not, please add a comment saying how to set this for a new model.
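To illustrate what inferring this value could look like, here is a minimal sketch. It assumes the checkpoint ships a Hugging Face-style config whose `architectures` list names the model class; the helper `infer_converter_type` and the example dict are hypothetical and not part of this PR.

```python
from typing import Optional


def infer_converter_type(hf_config: dict, override: Optional[str] = None) -> str:
    """Hypothetical helper: choose converter_type from a parsed model config.

    An explicit override (the YAML value) wins; otherwise fall back to the
    single entry of the config's "architectures" list.
    """
    if override is not None:
        return override
    architectures = hf_config.get("architectures") or []
    if len(architectures) != 1:
        # Ambiguous or missing: ask the user to set converter_type explicitly.
        raise ValueError(
            f"Cannot infer converter_type from architectures={architectures!r}"
        )
    return architectures[0]


# Stand-in for a parsed config.json (values are illustrative).
example_config = {"architectures": ["Qwen2ForCausalLM"], "model_type": "qwen2"}
inferred = infer_converter_type(example_config)
```

With a fallback like this, the YAML key would only be needed for models whose config does not name a single architecture.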
enabled: false # coming soon
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
algorithm: "modified_ffd"
What are the other options? Let's add a description of these here.
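For context on what a description could cover, the sketch below shows plain first-fit decreasing (FFD) bin packing of sequence lengths into a token budget. It is not the `modified_ffd` variant from this PR, whose modifications are not shown in the excerpt. The budget mirrors the `train_mb_tokens` product from the config; the numeric values and the function name are assumptions.

```python
def first_fit_decreasing(seq_lens, capacity):
    """Pack sequences into bins holding at most `capacity` tokens each.

    Returns a list of bins; each bin is a list of indices into `seq_lens`.
    """
    bins = []  # indices assigned to each bin
    free = []  # remaining token capacity of each bin
    # Visit sequences from longest to shortest.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    for i in order:
        length = seq_lens[i]
        if length > capacity:
            raise ValueError(f"sequence {i} ({length} tokens) exceeds capacity")
        for b, room in enumerate(free):
            if length <= room:  # first bin with enough room wins
                bins[b].append(i)
                free[b] -= length
                break
        else:
            # No existing bin fits: open a new one.
            bins.append([i])
            free.append(capacity - length)
    return bins


# Hypothetical values standing in for the policy.* settings.
max_total_sequence_length = 512
train_micro_batch_size = 4
train_mb_tokens = max_total_sequence_length * train_micro_batch_size

seq_lens = [500, 1200, 300, 900, 700, 100]
packed = first_fit_decreasing(seq_lens, train_mb_tokens)
```

Here the six sequences fit into two bins of 2000 and 1700 tokens, both under the 2048-token budget.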
overlap_param_gather: false
average_in_collective: true
use_custom_fsdp: false
data_parallel_sharding_strategy: "optim_grads_params"
What are the other options? If there are no options (or we don't expect folks to use other options), I would prefer not exposing this in yaml file.
temperature: 1.0
top_p: 1.0
top_k: null
vllm_cfg:
Can you remove the args that are not being overridden from 1b config?
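One way to find which arguments are safe to drop is to diff the derived config against the base one. The sketch below does this for nested dicts; `minimal_overrides`, the example keys under `vllm_cfg`, and all values are hypothetical, and it does not handle keys that exist only in the base config.

```python
def minimal_overrides(base: dict, full: dict) -> dict:
    """Return only the entries of `full` that differ from `base`."""
    out = {}
    for key, value in full.items():
        if key not in base:
            out[key] = value  # new key: must be kept
        elif isinstance(value, dict) and isinstance(base[key], dict):
            nested = minimal_overrides(base[key], value)
            if nested:  # keep a section only if something inside changed
                out[key] = nested
        elif value != base[key]:
            out[key] = value  # changed scalar: must be kept
    return out


# Illustrative stand-ins for the 1B base config and a derived config.
base_1b = {
    "generation": {
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": None,
        "vllm_cfg": {"tensor_parallel_size": 1},  # hypothetical key
    }
}
derived = {
    "generation": {
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": None,
        "vllm_cfg": {"tensor_parallel_size": 2},
    }
}
overrides = minimal_overrides(base_1b, derived)
```

Everything absent from `overrides` duplicates the base config and could be removed from the derived YAML.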
monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
wandb:
project: "grpo-dev"
name: "sj_megatron_1B"
Remove sj prefix.
The logprob of input token i is specified at position i in the output logprobs tensor.
"""
no_grad = torch.no_grad()
no_grad.__enter__()
What's the reason for not using a context manager here?
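To make the concern concrete, the sketch below compares a manual `__enter__()` call with a `with` block when the body raises. It uses a small stand-in class, `NoGradLike`, rather than `torch.no_grad` so it runs without PyTorch; the class, the `STATE` dict, and the helper functions are all hypothetical. The diff excerpt shows only the `__enter__()` call, so how the PR pairs it with `__exit__()` is not assumed here.

```python
STATE = {"grad_enabled": True}  # stand-in for global autograd state


class NoGradLike:
    """Hypothetical stand-in for a context manager such as torch.no_grad()."""

    def __enter__(self):
        self.previous = STATE["grad_enabled"]
        STATE["grad_enabled"] = False
        return self

    def __exit__(self, exc_type, exc, tb):
        STATE["grad_enabled"] = self.previous
        return False  # do not swallow exceptions


def manual_enter(fail: bool) -> None:
    ctx = NoGradLike()
    ctx.__enter__()
    if fail:
        raise RuntimeError("forward pass failed")  # __exit__ is never reached
    ctx.__exit__(None, None, None)


def with_block(fail: bool) -> None:
    with NoGradLike():
        if fail:
            raise RuntimeError("forward pass failed")  # __exit__ still runs


def state_after(fn) -> bool:
    """Run fn with a simulated failure and report the resulting state."""
    STATE["grad_enabled"] = True
    try:
        fn(True)
    except RuntimeError:
        pass
    return STATE["grad_enabled"]


leaked = state_after(manual_enter)   # state stays disabled after the error
restored = state_after(with_block)   # state is restored despite the error
```

If the manual form is needed because the enter and exit points sit in different branches, wrapping the body in `try`/`finally` gives the same guarantee as the `with` block.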
Signed-off-by: Sahil Jain <sahilj@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com> Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Parth Chadha <pchadha@nvidia.com>
No description provided.