Conversation

ISEEKYAN
Contributor

mcore dist checkpointing is a parallel-invariant weight format: you can save and load under arbitrary parallel settings, e.g. save in tp2pp2 and load in tp4pp1.
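For readers unfamiliar with the format, here is a minimal sketch of the save/load round trip with Megatron-Core's `dist_checkpointing` API; the `model` object and checkpoint path are placeholders:

```python
# Minimal sketch: save under one parallel layout, load under another.
# Assumes an initialized Megatron-Core model-parallel setup and an mcore
# model instance `model`; the path is a placeholder.
from megatron.core import dist_checkpointing

CKPT_DIR = "/shared/dist_ckpt/my_model"

# Save (e.g. in a tp2pp2 job): each rank writes only the shards it owns.
dist_checkpointing.save(model.sharded_state_dict(), CKPT_DIR)

# Load (e.g. in a later tp4pp1 job): the sharded_state_dict describes what
# each rank needs, so each rank reads only its own shards.
state_dict = dist_checkpointing.load(model.sharded_state_dict(), CKPT_DIR)
model.load_state_dict(state_dict)
```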

This PR introduces an option to use dist checkpointing with the mcore backend. It is disabled by default for backward compatibility, but to keep the implementation simple, future support for mcore MoE and VLM models will work only with dist ckpt enabled.
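Enabling it from the command line might look like the following; the config keys here are an assumption, so check the megatron worker config of your verl version:

```bash
# Hypothetical config keys -- verify against your verl version's megatron config.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.actor.megatron.dist_checkpointing_path=/shared/dist_ckpt/my_model \
    critic.megatron.use_dist_checkpointing=True \
    critic.megatron.dist_checkpointing_path=/shared/dist_ckpt/my_model
```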

Before this PR, when initializing the actor and critic workers, each GPU loaded the entire huggingface weights and then re-sharded them into the correct mcore model state dict, making the procedure slow and complicated.
With this PR, we convert the hf weights to a dist ckpt with an offline script, and each GPU loads only its own shards from the dist ckpt. Loading is faster and no online resharding is needed.
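The offline conversion step might be invoked like this; the flag names are illustrative, see the script's help for the real interface:

```bash
# Illustrative flags -- consult scripts/converter_hf_to_mcore.py --help.
python scripts/converter_hf_to_mcore.py \
    --hf_model_path Qwen/Qwen2-7B-Instruct \
    --output_path /shared/dist_ckpt/qwen2-7b-instruct
```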

When loading `Qwen2-7B-Instruct` for the critic worker, the loading time dropped from 109s to 25s, a 4.36x speedup.

The `converter_hf_to_mcore.py` in this version reuses the existing online resharding function to convert weights; it should be refactored for better efficiency and for MoE/VLM models.
Thanks to #998 for the optimization of loading the hf weights only on GPU 0.
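Roughly, the current converter does the following; the helper name below is a hypothetical stand-in for verl's existing resharding code, not its real API:

```python
# Rough sketch of the current converter's flow; load_hf_into_mcore is a
# hypothetical stand-in for verl's existing online resharding helper.
import torch
from transformers import AutoModelForCausalLM
from megatron.core import dist_checkpointing

def convert_hf_to_dist_ckpt(hf_path: str, mcore_model, out_dir: str) -> None:
    # 1. Load the full huggingface weights once (only on GPU 0 in the real
    #    script, thanks to the #998 optimization).
    hf_state = AutoModelForCausalLM.from_pretrained(
        hf_path, torch_dtype=torch.bfloat16
    ).state_dict()

    # 2. Reshard the hf tensors into the mcore model -- this is the existing
    #    "online resharding" step the converter reuses.
    load_hf_into_mcore(mcore_model, hf_state)  # hypothetical helper

    # 3. Save the mcore model as a parallel-invariant dist checkpoint.
    dist_checkpointing.save(mcore_model.sharded_state_dict(), out_dir)
```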

Future TODO:

  • refactor the converter for efficiency
  • support converting MoE models
  • support converting VLM models
  • re-design `megatron_checkpoint_manager.py` with dist ckpt
  • implement a converter from mcore dist ckpt back to hf (`model_merger.py`)
  • add docs and example scripts

tool to convert weights from hf to mcore dist ckpt
@ETOgaosion ETOgaosion merged commit d4cae44 into volcengine:main Apr 13, 2025
22 of 23 checks passed