[mcore] option to use dist checkpoint #1030
Merged
mcore dist checkpointing is a parallel-invariant weight format: you can save and load under arbitrary parallel settings, e.g. save with tp2pp2 and load with tp4pp1.
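The parallel-invariance idea can be illustrated with a minimal sketch (plain Python, not the actual mcore `dist_checkpointing` API): each shard is saved together with its global offset, so a load under a different tensor-parallel layout can reassemble the full tensor and re-slice it for the new ranks.

```python
# Illustrative sketch of parallel-invariant sharding; the helper names
# (save_shards, load_shard) are hypothetical, not mcore's API.

def save_shards(full_weight, tp_size):
    """Split a 1-D 'weight' into tp_size shards with global-offset metadata."""
    size = len(full_weight) // tp_size
    return [
        {"offset": r * size, "data": full_weight[r * size:(r + 1) * size]}
        for r in range(tp_size)
    ]

def load_shard(shards, rank, tp_size):
    """Reassemble the full tensor from saved shards, then take this rank's slice."""
    full = [0] * sum(len(s["data"]) for s in shards)
    for s in shards:
        full[s["offset"]:s["offset"] + len(s["data"])] = s["data"]
    size = len(full) // tp_size
    return full[rank * size:(rank + 1) * size]

# Save with tp2, load with tp4: shard boundaries differ but values agree.
weight = list(range(8))
saved = save_shards(weight, tp_size=2)
print(load_shard(saved, rank=1, tp_size=4))  # [2, 3]
```

The offset metadata is what makes the format layout-independent: no rank needs to know the parallel configuration used at save time.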
This PR introduces an option to use dist checkpoints with the mcore backend. It is disabled by default for backward compatibility, but future support for mcore MoE models and VLM models will only work with dist ckpt enabled, for an easier implementation.
Before this PR, when initializing the actor and critic workers, each GPU would load the entire huggingface weights and then re-shard them into the correct mcore model state dict, making the procedure slow and complicated.
With this PR, we convert the hf weights to a dist ckpt with offline scripts, and each GPU loads only its own part from the dist ckpt. Loading is faster and no online resharding is needed.
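The offline-convert / partial-load flow can be sketched as follows (file layout and helper names are illustrative, not those of the actual `converter_hf_to_mcore.py`): the converter writes one shard file per rank ahead of time, and at startup each rank opens only its own file.

```python
# Hedged sketch: hypothetical one-file-per-shard layout, not verl's real format.
import json
import os
import tempfile

def convert_hf_to_dist_ckpt(hf_state, ckpt_dir, tp_size):
    """Offline step: split each hf weight into tp_size shard files."""
    for name, weight in hf_state.items():
        size = len(weight) // tp_size
        for rank in range(tp_size):
            path = os.path.join(ckpt_dir, f"{name}.tp{rank}.json")
            with open(path, "w") as f:
                json.dump(weight[rank * size:(rank + 1) * size], f)

def load_rank_shard(ckpt_dir, name, rank):
    """Online step: each rank reads only its own shard file."""
    with open(os.path.join(ckpt_dir, f"{name}.tp{rank}.json")) as f:
        return json.load(f)

ckpt_dir = tempfile.mkdtemp()
convert_hf_to_dist_ckpt({"w": [1, 2, 3, 4]}, ckpt_dir, tp_size=2)
print(load_rank_shard(ckpt_dir, "w", rank=1))  # [3, 4]
```

Because the expensive split happens once offline, startup time scales with the size of one shard rather than the full huggingface checkpoint.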
When loading Qwen2-7B-Instruct for the critic worker, the loading time is reduced from 109s to 25s, a 4.36x speedup.

The `converter_hf_to_mcore.py` in this version uses the existing online resharding function to convert weights; it should be refactored for better efficiency and for MoE/VLM models. Thanks to #998 for the optimization of loading the hf weights on GPU 0 only.
Future TODO:
- `megatron_checkpoint_manager.py` with dist ckpt
- `model_merger.py`