
Conversation

ShareLer (Contributor) commented May 30, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Fix the Megatron model merger.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

  • Fix the get-rank method to support TP-only setups.
  • Fix state_dict keys after conversion.
  • Add MLA/MoE conversion support.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

Tested with Qwen3-8B and Qwen2.5-7B.

Additional Info.

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

Signed-off-by: ShareLer <ShareLe@163.com>
vermouth1992 requested a review from ETOgaosion May 30, 2025 08:49
ETOgaosion (Collaborator) commented May 31, 2025

@ShareLer Thanks a lot for helping us to fix this, it helps a lot~

Could you briefly point out what causes the vLLM inference failure? It seems this also involved a lot of refactoring. Is it caused by missing parameter transfers?

Signed-off-by: ShareLer <ShareLe@163.com>
ShareLer (Contributor, Author) commented Jun 1, 2025

@ShareLer Thanks a lot for helping us to fix this, it helps a lot~

Could you briefly point out what causes the vLLM inference failure? It seems this also involved a lot of refactoring. Is it caused by missing parameter transfers?

Three main reasons:

  1. The converted layer names are not rewritten in _merge_state_dicts().
    The converted ckpt contains model.decoder.layers.xxx, but it should be model.layers.xxx.
    This is also the cause of the failure in the mentioned issue.

  2. The qkv names in the attention layer are wrong after conversion.
    linear_qkv is converted to linear_q/linear_k/linear_v, but it should become q_proj/k_proj/v_proj.

  3. The weight of the final output layer is handled incorrectly.
    When is_value_model=False, the output_layer is a ColumnParallelLinear, but the weights from the different TP ranks were not merged in _merge_state_dicts().
    (My first submission ignored the value-model case, which caused the CI failure; it has just been fixed.)
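
For illustration, here is a rough sketch of the kinds of fixes these three points imply. The helper names and rename table below are made up for this example; the actual PR drives the renaming from the script's params_mapping.

```python
import torch

# (1) + (2): rewrite converted Megatron-style keys into HF-style names.
# Illustrative only; the real script also remaps module prefixes such as
# "self_attention" -> "self_attn" via its params_mapping table.
RENAMES = {
    "model.decoder.layers.": "model.layers.",  # drop the extra "decoder." segment
    "linear_q.": "q_proj.",                    # attention projections must become
    "linear_k.": "k_proj.",                    # q_proj / k_proj / v_proj
    "linear_v.": "v_proj.",
}

def fix_converted_key(key: str) -> str:
    for old, new in RENAMES.items():
        key = key.replace(old, new)
    return key

def merge_column_parallel(tp_shards: list[torch.Tensor]) -> torch.Tensor:
    # (3): a ColumnParallelLinear (e.g. output_layer when is_value_model=False)
    # is sharded along the output dimension, so merging the TP shards means
    # concatenating them along dim 0.
    return torch.cat(tp_shards, dim=0)
```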

eric-haibin-lin (Collaborator) left a comment


Is it possible to add a test that reproduces the issue?

ShareLer (Contributor, Author) commented Jun 1, 2025

Is it possible to add a test that reproduces the issue?

You can reproduce this problem very simply with the CI script (e.g. the job 'e2e_ppo_trainer_megatron-qwen3' in e2e_ppo_trainer_megatron.yml) by changing the merger's command option:
change the test operation in python scripts/model_merger.py test --backend megatron to merge.

The previous CI runs showed no problems because the test and merge options use different logic:
Both first obtain the merged weights through the _merge_state_dicts() method; at this point the state_dicts still contains the three problems described in the previous reply.
In the merge option, this problematic state_dicts is saved directly as the final ckpt, whereas the test option used in CI corrects the problematic layer names (removing the decoder segment and fixing the qkv names) before comparing, so the bug was never caught.
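
As a toy illustration of that divergence (hypothetical snippet, not the script's actual code):

```python
# Keys as produced by _merge_state_dicts() before this PR (still Megatron-style):
merged = {"model.decoder.layers.0.self_attention.linear_qkv.weight": None}

def correct_layer_names(sd):
    # roughly what the test option did before comparing against the reference model
    # (the real test path also splits/renames the qkv weights)
    return {k.replace("model.decoder.layers.", "model.layers."): v for k, v in sd.items()}

# merge option: `merged` was saved as-is, so the checkpoint kept the broken keys.
# test option (used in CI): the corrected names were compared, so CI stayed green.
print(list(merged))                       # ['model.decoder.layers.0. ...']
print(list(correct_layer_names(merged)))  # ['model.layers.0. ...']
```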

ETOgaosion (Collaborator) commented Jun 3, 2025

The converted layer names are not rewritten in _merge_state_dicts().
The converted ckpt contains model.decoder.layers.xxx, but it should be model.layers.xxx.
This is also the cause of the failure in the mentioned issue.

@ShareLer Could you add some assertions to check whether the naming prefix is valid, as model_merger expects?

Or could it be made more robust when extracting elements from the name string, e.g. fetching the layer_index from the [-3] index of the split string?
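
For example, a more position-independent way to pull the layer index out of a key might look like this (illustrative sketch, not code from this PR):

```python
def get_layer_index(param_name: str) -> int:
    # look for the segment right after "layers" instead of assuming a fixed
    # position such as split(".")[-3]
    parts = param_name.split(".")
    return int(parts[parts.index("layers") + 1])

# works for both naming schemes seen in this thread:
assert get_layer_index("model.decoder.layers.7.self_attention.linear_qkv.weight") == 7
assert get_layer_index("model.layers.7.self_attn.q_proj.weight") == 7
```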

ShareLer (Contributor, Author) commented Jun 3, 2025

The converted layer names are not rewritten in _merge_state_dicts().
The converted ckpt contains model.decoder.layers.xxx, but it should be model.layers.xxx.
This is also the cause of the failure in the mentioned issue.

@ShareLer Could you add some assertions to check whether the naming prefix is valid, as model_merger expects?

Or could it be made more robust when extracting elements from the name string, e.g. fetching the layer_index from the [-3] index of the split string?

Sorry, I don't quite understand what you mean.
Do you mean that we should perform some checks on the keys in merged_state_dict before saving the merged ckpt to ensure its validity?

@@ -444,28 +485,28 @@ def _merge_state_dicts(self, model_state_dict_lst: list[list[dict]], tp_size: in
print("skip lm_head and reward_head loading because of tie_word_embeddings")
continue

Collaborator commented:

Maybe here we could add a check on the mcore key to verify that it is valid for _replace_name to work on?

ShareLer (Contributor, Author) commented:

Yes, we can add an assert on the result of _replace_name to ensure it is not None; since the layer transformation relationship is defined in self.params_mapping, the result should normally never be None.
Do you think this is feasible?
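
For example (hypothetical sketch; the exact signature of _replace_name in model_merger.py may differ):

```python
def replace_name_or_fail(replace_name, mcore_name, params_mapping):
    # wrap the _replace_name-style lookup with the suggested not-None assertion
    hf_name = replace_name(mcore_name, params_mapping)
    assert hf_name is not None, (
        f"cannot map Megatron parameter '{mcore_name}' to a HuggingFace name; "
        "params_mapping probably needs a new entry"
    )
    return hf_name
```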

ETOgaosion (Collaborator) commented Jun 3, 2025

Maybe I didn't make myself clear (qaq). I can help add some checks and warnings~

ShareLer and others added 3 commits June 3, 2025 16:56
Signed-off-by: ShareLer <ShareLe@163.com>
Signed-off-by: ShareLer <ShareLe@163.com>
ShareLer (Contributor, Author) commented Jun 5, 2025

@ETOgaosion Hi, I have added additional checks for the embedding layer and output_layer on top of your code.

ETOgaosion (Collaborator) commented:

@ShareLer Thanks for helping, I was about to test on my machine~

ETOgaosion merged commit cc9bc3f into volcengine:main Jun 9, 2025
33 of 34 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 10, 2025
### Additional Info.

- **Issue Number**: Fixes issue volcengine#1757

---------

Signed-off-by: ShareLer <ShareLe@163.com>
Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn>