@ETOgaosion (Collaborator) commented on May 17, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Allow overriding the transformer config to enable custom Megatron features such as a variable PP layer distribution, with CI tests. This is needed for larger MoE models such as Qwen3 MoE (94 layers) or DeepSeek-V3 (61 layers), whose layer counts do not divide evenly across most pipeline-parallel sizes.

We will first fix the e2e_prime CI by using fused kernels.

Note that the imbalanced PP layer distribution is currently only compatible with dist_ckpt load and save; direct HuggingFace load/save is not supported.

Also, other Megatron arguments can now be passed through the launch scripts.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Breaking APIs:

```py
class MegatronWorker(Worker):
    def _init_hf_config_and_tf_config(self, model_path, dtype, override_model_config, override_transformer_config):

# and the models building
```

```yaml
  actor:
    megatron:
      override_transformer_config: {} # common transformer config for all models
```

To avoid having to pass the same transformer config arguments for every model, the other models reuse the actor's config, so it only needs to be specified once.
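As a rough sketch of how such an override might be merged (hypothetical helper name and simplified logic, not the exact code in this PR), the dict supplied via `override_transformer_config` can simply take precedence over the defaults derived from the HuggingFace config before the `TransformerConfig` is constructed:

```py
from dataclasses import fields

from megatron.core.transformer import TransformerConfig


def merge_transformer_config(base_kwargs: dict, override_transformer_config: dict) -> TransformerConfig:
    """Illustrative sketch: `base_kwargs` stands in for the defaults the worker
    derives from the HuggingFace config; any key passed via
    +actor_rollout_ref.actor.megatron.override_transformer_config.* wins."""
    merged = {**base_kwargs, **override_transformer_config}
    # Fail fast on typos: reject keys that TransformerConfig does not define.
    valid = {f.name for f in fields(TransformerConfig)}
    unknown = set(merged) - valid
    if unknown:
        raise ValueError(f"Unknown TransformerConfig fields: {sorted(unknown)}")
    return TransformerConfig(**merged)
```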

Usage Example

```bash
run_ppo_trainer_megatron.sh \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=13 \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=11
```
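For illustration (hypothetical model sizes, not necessarily the CI setup): with these overrides a 24-layer model on 2 pipeline stages is split 13/11, and a 36-layer model on 3 stages is split 13/12/11, with the remaining layers distributed evenly over the middle stages.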

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: Megatron
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@ETOgaosion changed the title to drop [BREAKING] on May 17, 2025, then changed it back to re-add [BREAKING] the same day
@vermouth1992 requested a review from ccclyu on May 17, 2025
@ccclyu self-assigned this on May 17, 2025
@ETOgaosion marked this pull request as draft on May 18, 2025
@ETOgaosion marked this pull request as ready for review on May 18, 2025
@ETOgaosion force-pushed the megatron_auto_set_first_last_pp_stage_layers branch from 2bae721 to 211220e on May 20, 2025
"""
Create a base TransformerConfig with common parameters across different model architectures.
TODO: (ycl) use dataclass or converter config?

Args:
hf_config: HuggingFace model configuration
dtype: Data type for the model
**kwargs: Additional parameters to override defaults
override: Additional parameters to override defaults

Collaborator:
change to override_transformer_config_kwargs

@@ -76,6 +76,7 @@ actor_rollout_ref:
use_dist_checkpointing: False
dist_checkpointing_path: null
seed: 1
override_transformer_config: {} # common transformer config for all models

Collaborator:
Can we make the comment more straightforward, like "additional transformer configs to override, such as num_layers_in_first_pipeline_stage"?

@@ -708,6 +712,7 @@ def tensor_generator():

obj_spec_output = [None] * mpu.get_pipeline_model_parallel_world_size()
torch.distributed.all_gather_object(object_list=obj_spec_output, obj=meta_info, group=mpu.get_pipeline_model_parallel_group())
torch.distributed.all_gather_object(object_list=obj_spec_output, obj=meta_info, group=mpu.get_pipeline_model_parallel_group())

Collaborator:
is this a duplicated line?

num_query_groups_per_partition = model_config.num_key_value_heads // mpu.get_tensor_model_parallel_world_size()
for chunk in infer_param.chunk(num_query_groups_per_partition):
split_size = [kv_size_per_tp * num_q_per_kv // num_query_groups_per_partition, kv_size_per_tp // num_query_groups_per_partition, kv_size_per_tp // num_query_groups_per_partition]

Collaborator:
nit: duplicated line (maybe caused by conflict resolution?)

@@ -720,6 +725,7 @@ def tensor_generator():
except StopIteration:
cur_name, cur_tensor = None, None
cur_name = normalize_model_name(name, cur_pp_rank, scan_vpp_idx, pp_size, vpp_size, model_config.num_hidden_layers)
cur_name = normalize_model_name(name, cur_pp_rank, scan_vpp_idx, pp_size, vpp_size, model_config.num_hidden_layers)

Collaborator:
duplicated line

@@ -740,6 +746,8 @@ def tensor_generator():
infer_params = [torch.empty_like(broad_pp_tensor) for _ in range(all_gather_group_size)]
torch.distributed.all_gather(infer_params, broad_pp_tensor, group=mpu.get_tensor_model_parallel_group())
infer_params = default_tp_concat_fn(layer_name_mapping, cur_name, broad_pp_tensor, infer_params, model_config, convert_qkv_gate_up_by_simple_split)
torch.distributed.all_gather(infer_params, broad_pp_tensor, group=mpu.get_tensor_model_parallel_group())
infer_params = default_tp_concat_fn(layer_name_mapping, cur_name, broad_pp_tensor, infer_params, model_config, convert_qkv_gate_up_by_simple_split)

Collaborator:
duplicated line here.

@@ -748,3 +755,97 @@ def tensor_generator():
converted_names, converted_params = weight_converter.convert_param(cur_name, infer_params)

yield from zip(converted_names, converted_params)


def get_transformer_layer_offset(pipeline_rank, vp_rank, config: TransformerConfig):

Collaborator:
Can we have a high-level description of the differences from Megatron's original function?
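For context, a minimal sketch (using the override field names from this PR's config, not the implementation actually added) of how the offset computation can differ from stock Megatron once the first and last stages carry custom layer counts; virtual pipeline parallelism is ignored for brevity:

```py
def layer_offset_with_uneven_stages(pipeline_rank: int, config) -> int:
    """Illustrative only: `config` is assumed to expose num_layers,
    pipeline_model_parallel_size, and the two optional override fields."""
    total = config.num_layers
    pp_size = config.pipeline_model_parallel_size
    first = getattr(config, "num_layers_in_first_pipeline_stage", None)
    last = getattr(config, "num_layers_in_last_pipeline_stage", None)

    if first is None and last is None:
        # Stock behaviour: layers are split evenly across all stages.
        return pipeline_rank * (total // pp_size)

    # Uneven split: first/last stages keep their configured counts and the
    # remaining layers are spread evenly over the middle stages.
    first = first if first is not None else total // pp_size
    last = last if last is not None else total // pp_size
    per_middle = (total - first - last) // max(pp_size - 2, 1)

    if pipeline_rank == 0:
        return 0
    return first + (pipeline_rank - 1) * per_middle
```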

@ETOgaosion (Collaborator, Author):

Thanks a lot for the review and your suggestions, all resolved~

@vermouth1992 merged commit 1cfa2be into volcengine:main on May 22, 2025
36 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request May 22, 2025
SwordFaith added a commit to SwordFaith/verl that referenced this pull request May 22, 2025
ETOgaosion pushed a commit to ETOgaosion/verl that referenced this pull request May 24, 2025
ETOgaosion pushed a commit to SwordFaith/verl that referenced this pull request May 24, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025