[Megatron][BREAKING] Allow override of transformer config to enable custom megatron features like variable PP layers distribution, with CI tests #1555
Conversation
Force-pushed from 2bae721 to 211220e
""" | ||
Create a base TransformerConfig with common parameters across different model architectures. | ||
TODO: (ycl) use dataclass or converter config? | ||
|
||
Args: | ||
hf_config: HuggingFace model configuration | ||
dtype: Data type for the model | ||
**kwargs: Additional parameters to override defaults | ||
override: Additional parameters to override defaults |
change to override_transformer_config_kwargs
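Whatever the parameter ends up being called, the intent is that user-supplied keys win over the common defaults. Below is a minimal sketch of that merge, assuming plain dicts; the helper name and the example values are hypothetical, not the PR's code.

```python
# Hypothetical helper (not the PR's code): build kwargs for a Megatron-style
# TransformerConfig from common defaults plus user overrides. Later sources win.
def build_transformer_config_kwargs(common_defaults: dict,
                                    override_transformer_config_kwargs: dict) -> dict:
    overlapping = common_defaults.keys() & override_transformer_config_kwargs.keys()
    if overlapping:
        print(f"overriding defaults for: {sorted(overlapping)}")
    return {**common_defaults, **override_transformer_config_kwargs}

kwargs = build_transformer_config_kwargs(
    {"num_layers": 94, "hidden_size": 4096, "num_attention_heads": 64},
    {"num_layers_in_first_pipeline_stage": 13, "num_layers_in_last_pipeline_stage": 11},
)
print(kwargs)
# The merged dict would then be passed on, e.g. TransformerConfig(**kwargs).
```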
@@ -76,6 +76,7 @@ actor_rollout_ref:
use_dist_checkpointing: False
dist_checkpointing_path: null
seed: 1
override_transformer_config: {} # common transformer config for all models
Can we make the comment more straightforward, like "additional transformer configs to override, such as num_layers_in_first_pipeline_stage"?
verl/utils/megatron_utils.py (Outdated)
@@ -708,6 +712,7 @@ def tensor_generator():

obj_spec_output = [None] * mpu.get_pipeline_model_parallel_world_size()
torch.distributed.all_gather_object(object_list=obj_spec_output, obj=meta_info, group=mpu.get_pipeline_model_parallel_group())
torch.distributed.all_gather_object(object_list=obj_spec_output, obj=meta_info, group=mpu.get_pipeline_model_parallel_group())
is this a duplicated line?
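For context on what the line does: each pipeline rank contributes its layer-spec meta info and receives everyone else's. Below is a toy, single-process illustration of the same `all_gather_object` pattern; the gloo process group with world_size=1 is purely for demonstration.

```python
# Toy, single-process illustration of the all_gather_object pattern above:
# every rank contributes its meta info and ends up with the full list.
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1)

meta_info = {"pp_rank": dist.get_rank(), "layer_names": ["decoder.layers.0", "decoder.layers.1"]}
obj_spec_output = [None] * dist.get_world_size()
dist.all_gather_object(object_list=obj_spec_output, obj=meta_info)
print(obj_spec_output)  # each rank now sees every rank's meta info

dist.destroy_process_group()
```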
verl/utils/megatron_utils.py (Outdated)
num_query_groups_per_partition = model_config.num_key_value_heads // mpu.get_tensor_model_parallel_world_size()
for chunk in infer_param.chunk(num_query_groups_per_partition):
    split_size = [kv_size_per_tp * num_q_per_kv // num_query_groups_per_partition, kv_size_per_tp // num_query_groups_per_partition, kv_size_per_tp // num_query_groups_per_partition]
nit: duplicated line (maybe caused by conflict resolution?)
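For readers puzzling over the split sizes: with grouped-query attention, each TP rank holds `num_query_groups_per_partition` KV groups, and each group's chunk of the fused QKV weight splits into `num_q_per_kv * head_dim` query rows plus `head_dim` rows each for K and V. A toy numeric check follows, with made-up sizes and an assumed group-major QKV layout (not taken from verl's code).

```python
# Toy check of the split sizes above, with made-up sizes and an assumed
# group-major fused-QKV layout: each group holds num_q_per_kv query heads,
# then one K head, then one V head.
import torch

head_dim = 4
num_q_per_kv = 4                      # query heads per KV head (GQA ratio)
num_query_groups_per_partition = 2    # KV heads held by this TP rank
kv_size_per_tp = num_query_groups_per_partition * head_dim
hidden = 3                            # dummy second dimension

rows_per_group = (num_q_per_kv + 2) * head_dim
infer_param = torch.randn(num_query_groups_per_partition * rows_per_group, hidden)

split_size = [
    kv_size_per_tp * num_q_per_kv // num_query_groups_per_partition,  # 16 query rows per group
    kv_size_per_tp // num_query_groups_per_partition,                 # 4 key rows per group
    kv_size_per_tp // num_query_groups_per_partition,                 # 4 value rows per group
]
for chunk in infer_param.chunk(num_query_groups_per_partition):
    q, k, v = chunk.split(split_size)
    print(q.shape, k.shape, v.shape)  # torch.Size([16, 3]) torch.Size([4, 3]) torch.Size([4, 3])
```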
verl/utils/megatron_utils.py (Outdated)
@@ -720,6 +725,7 @@ def tensor_generator():
except StopIteration:
    cur_name, cur_tensor = None, None
cur_name = normalize_model_name(name, cur_pp_rank, scan_vpp_idx, pp_size, vpp_size, model_config.num_hidden_layers)
cur_name = normalize_model_name(name, cur_pp_rank, scan_vpp_idx, pp_size, vpp_size, model_config.num_hidden_layers)
duplicated line
verl/utils/megatron_utils.py (Outdated)
@@ -740,6 +746,8 @@ def tensor_generator():
infer_params = [torch.empty_like(broad_pp_tensor) for _ in range(all_gather_group_size)]
torch.distributed.all_gather(infer_params, broad_pp_tensor, group=mpu.get_tensor_model_parallel_group())
infer_params = default_tp_concat_fn(layer_name_mapping, cur_name, broad_pp_tensor, infer_params, model_config, convert_qkv_gate_up_by_simple_split)
torch.distributed.all_gather(infer_params, broad_pp_tensor, group=mpu.get_tensor_model_parallel_group())
infer_params = default_tp_concat_fn(layer_name_mapping, cur_name, broad_pp_tensor, infer_params, model_config, convert_qkv_gate_up_by_simple_split)
duplicated line here.
@@ -748,3 +755,97 @@ def tensor_generator():
converted_names, converted_params = weight_converter.convert_param(cur_name, infer_params)

yield from zip(converted_names, converted_params)


def get_transformer_layer_offset(pipeline_rank, vp_rank, config: TransformerConfig):
can we have a high-level description of the differences with megatron's original function?
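For anyone following along, here is a rough sketch of what such an offset computation has to do beyond Megatron's stock even split: honor `num_layers_in_first_pipeline_stage` / `num_layers_in_last_pipeline_stage` and spread the remaining layers evenly over the middle stages. The body below is illustrative only (the interleaved virtual pipeline case is ignored and the config class is a stand-in); it is not the PR's implementation.

```python
# Rough, illustrative sketch (not the PR's code): global index of the first
# transformer layer owned by a pipeline rank when the first/last stages may
# hold a different number of layers than the middle stages.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PPLayoutConfig:  # minimal stand-in for the relevant TransformerConfig fields
    num_layers: int
    pipeline_model_parallel_size: int
    num_layers_in_first_pipeline_stage: Optional[int] = None
    num_layers_in_last_pipeline_stage: Optional[int] = None

def get_transformer_layer_offset(pipeline_rank: int, vp_rank: Optional[int], config: PPLayoutConfig) -> int:
    assert not vp_rank, "sketch ignores interleaved (virtual) pipeline schedules"
    pp = config.pipeline_model_parallel_size
    first = config.num_layers_in_first_pipeline_stage
    last = config.num_layers_in_last_pipeline_stage
    if first is None and last is None:
        return pipeline_rank * (config.num_layers // pp)  # stock even split
    if last is not None and pipeline_rank == pp - 1:
        return config.num_layers - last
    # Layers not pinned to the first/last stage are split evenly over the rest.
    middle_stages = pp - (first is not None) - (last is not None)
    per_middle_stage = (config.num_layers - (first or 0) - (last or 0)) // middle_stages
    if first is not None:
        return 0 if pipeline_rank == 0 else first + (pipeline_rank - 1) * per_middle_stage
    return pipeline_rank * per_middle_stage

cfg = PPLayoutConfig(num_layers=26, pipeline_model_parallel_size=4,
                     num_layers_in_first_pipeline_stage=5,
                     num_layers_in_last_pipeline_stage=5)
print([get_transformer_layer_offset(r, 0, cfg) for r in range(4)])  # [0, 5, 13, 21]
```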
Thanks a lot for the review and your suggestions, all resolved~
Checklist Before Starting
- [ ] Search for similar PR(s).

What does this PR do?
Allow overriding the transformer config to enable custom Megatron features such as a variable PP layer distribution, with CI tests. This is needed for larger MoE models with 94 layers (Qwen3 MoE) or 61 layers (DeepSeek V3).
We will first fix the e2e_prime CI by using fused kernels.
Note that the imbalanced PP layer distribution is currently only compatible with dist_ckpt load and save; direct HuggingFace load/save is not supported.
Other Megatron arguments can also be passed through scripts.

High-Level Design

Specific Changes

API
Breaking APIs:
```py
class MegatronWorker(Worker):
    def _init_hf_config_and_tf_config(self, model_path, dtype, override_model_config, override_transformer_config):
        ...  # and the model building
```
```yaml
actor:
  megatron:
    override_transformer_config: {} # common transformer config for all models
```
To avoid entering the same transformer config arguments repeatedly, the other models reuse the actor's config, so it only needs to be provided once.

Usage Example
```bash
run_ppo_trainer_megatron.sh \
    +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=13 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=11
```

Test

Additional Info.
- Issue Number: Fixes issue # or discussion # if any.
- Training: Megatron
- Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting
- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
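Finally, a hypothetical migration sketch for the breaking `_init_hf_config_and_tf_config` signature above; the config objects and the commented-out call site are illustrative, not verl's actual wiring.

```python
# Hypothetical migration sketch -- argument values and the call site are
# illustrative, not taken from verl's trainer code.
import torch
from omegaconf import OmegaConf

override_model_config = OmegaConf.create({})
override_transformer_config = OmegaConf.create(
    {"num_layers_in_first_pipeline_stage": 13, "num_layers_in_last_pipeline_stage": 11}
)

# worker._init_hf_config_and_tf_config(
#     "path/to/hf_model",           # model_path
#     torch.bfloat16,               # dtype
#     override_model_config,
#     override_transformer_config,  # new argument introduced by this PR
# )
```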