Every development and optimization to veRL must consider larger models in distributed environments.
GPTModel integration alignment with FSDP and raw mcore models
It is very important to ensure that algorithm precision aligns across backends. Currently, mcore can still diverge from FSDP in some circumstances.
- GPTModel and TransformerEngine should first be integrated into veRL.
- We may need a better way to support three backends in the coming version; once GPTModel is stable, we can remove the raw mcore models.
- A calibration tool is needed to align the different backends, able to compare data from the three systems at different running steps.
- The root cause of alignment issues shall be found, and the problem fixed through the tool.
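The calibration tool described above could be sketched as follows. This is a minimal illustration, not veRL code: it assumes each backend dumps named tensors (here plain lists of floats) per training step, and compares backends pairwise to flag where numerics diverge beyond a tolerance.

```python
# Hypothetical cross-backend calibration sketch: each backend dumps named
# tensors per step; we compare backends pairwise and report divergences.

def compare_dumps(dumps, atol=1e-5):
    """dumps: {backend_name: {step: {tensor_name: list_of_floats}}}.
    Returns (step, tensor_name, backend_a, backend_b, max_abs_diff)
    tuples for every pair that diverges beyond atol."""
    backends = sorted(dumps)
    reports = []
    for a_i, a in enumerate(backends):
        for b in backends[a_i + 1:]:
            for step in sorted(set(dumps[a]) & set(dumps[b])):
                common = set(dumps[a][step]) & set(dumps[b][step])
                for name in sorted(common):
                    xs, ys = dumps[a][step][name], dumps[b][step][name]
                    diff = max(abs(x - y) for x, y in zip(xs, ys))
                    if diff > atol:
                        reports.append((step, name, a, b, diff))
    return reports
```

Running this over dumps from FSDP, raw mcore, and GPTModel at the same steps would localize the first step and tensor at which the backends drift apart.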
Checkpoints for GPTModel and Megatron dist_checkpointing
- Need to use dist_checkpointing in mcore to accelerate checkpoint saving and loading.
- Support weights of GPTModel-format models.
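The core idea behind dist_checkpointing can be illustrated with a toy sketch (plain pickle files standing in for Megatron's sharded tensor format, which this does not reproduce): each rank writes only its own shard in parallel, instead of one rank gathering and serializing the full state dict.

```python
import os
import pickle

# Toy illustration of sharded checkpointing: one file per rank, written
# in parallel, merged (or resharded) at load time.

def save_sharded(state_shards, ckpt_dir):
    """state_shards: {rank: {param_name: values}}; one file per rank."""
    os.makedirs(ckpt_dir, exist_ok=True)
    for rank, shard in state_shards.items():
        with open(os.path.join(ckpt_dir, f"shard_{rank}.pkl"), "wb") as f:
            pickle.dump(shard, f)

def load_merged(ckpt_dir):
    """Merge all shards into one state dict (real resharding would remap
    shards to the new parallel layout here)."""
    merged = {}
    for fname in sorted(os.listdir(ckpt_dir)):
        with open(os.path.join(ckpt_dir, fname), "rb") as f:
            merged.update(pickle.load(f))
    return merged
```

The win is that save/load time scales with the per-rank shard size rather than the full model size, which is what makes large-model checkpointing tractable.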
Fused Kernels to improve Megatron MFU
Currently some kernels are memory-bound, making it hard to increase the batch size and sequence length. The entropy loss calculation is one example.
- These kernels will be integrated into the raw models soon.
- How to patch these kernels into GPTModel needs further study.
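To show why the entropy loss is memory-bound, here is a pure-Python sketch (not the veRL kernel) of the same memory-saving idea a fused kernel exploits: the entropy of a softmax is computed row by row in chunks, so the full probability matrix over the vocabulary is never materialized.

```python
import math

def softmax_entropy(row):
    """Entropy of softmax(row), computed with the max-subtraction trick:
    H = log Z - (1/Z) * sum_i e^{x_i - m} * (x_i - m)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return math.log(z) - sum(e * (x - m) for e, x in zip(exps, row)) / z

def chunked_entropy(logits, chunk_size=2):
    """Mean token entropy over a batch of logit rows, processed in chunks
    so peak memory stays at chunk_size rows instead of the whole batch."""
    total, n = 0.0, 0
    for start in range(0, len(logits), chunk_size):
        for row in logits[start:start + chunk_size]:
            total += softmax_entropy(row)
            n += 1
    return total / n
```

A fused GPU kernel does the same reduction inside one pass over the logits, avoiding the `batch x seq x vocab` probability tensor entirely.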
Support CPU memory offloading to run larger models
- Megatron should support parameter/gradient/optimizer-state load/offload; these functions need to be checked and enabled.
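The offload pattern can be sketched as follows. This is a hypothetical illustration (plain dicts and lists standing in for GPU buffers, names invented here): buffers are moved to a host-side store between phases and restored before the next compute phase that needs them.

```python
# Hypothetical sketch of param/grad/optimizer offloading: buffers are
# evicted to host memory when a phase ends and reloaded before the next
# phase that needs them, freeing "device" memory in between.

class OffloadManager:
    def __init__(self):
        self._host_store = {}

    def offload(self, device_state, keys):
        """Move the named buffers off the 'device' into host memory."""
        for k in keys:
            self._host_store[k] = device_state.pop(k)

    def reload(self, device_state, keys):
        """Bring the named buffers back onto the 'device'."""
        for k in keys:
            device_state[k] = self._host_store.pop(k)
```

In post-training this matters because rollout (generation) and training alternate, so the optimizer state and gradients can live in CPU memory during generation, leaving GPU memory for the inference engine.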
Token balance between PP workloads
- Make PP workloads more balanced to reduce pipeline execution bubbles.
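One standard formulation of this balancing problem, sketched below under the assumption that per-layer (or per-chunk) costs are known integer token counts: split the cost sequence into contiguous pipeline stages so the heaviest stage, and hence the pipeline bubble, is as small as possible (binary search on the stage capacity).

```python
# Sketch of token balancing across PP stages: minimize the maximum
# contiguous-segment sum over `stages` segments via binary search.

def min_max_partition(costs, stages):
    """costs: per-layer/per-chunk integer token costs, in pipeline order.
    Returns the minimal achievable max-stage load."""
    def segments_needed(cap):
        count, cur = 1, 0
        for c in costs:
            if cur + c > cap:          # start a new stage
                count, cur = count + 1, c
            else:
                cur += c
        return count

    lo, hi = max(costs), sum(costs)    # bounds on the optimal capacity
    while lo < hi:
        mid = (lo + hi) // 2
        if segments_needed(mid) <= stages:
            hi = mid                   # feasible: try a tighter cap
        else:
            lo = mid + 1
    return lo
```

A lower max-stage load directly shrinks the bubble, since every pipeline stage waits on the slowest one.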
The MoE or cross-machine TP communication optimization
- Both Megatron and SGLang/vLLM shall support the cross-node MoE parts of the model, and their efficiency should be checked.
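The traffic pattern being optimized here is the MoE all-to-all dispatch, which can be modeled with a toy sketch (no communication performed, names invented here): tokens routed to experts living on other ranks are bucketed per destination rank, and those buckets are exactly what crosses the network.

```python
# Toy model of the MoE dispatch step behind cross-node all-to-all:
# bucket each token by the rank that hosts its routed expert.

def build_alltoall_buckets(token_expert_ids, experts_per_rank, world_size):
    """Return, for each destination rank, the token indices it receives.
    Assumes experts are laid out contiguously: rank r hosts experts
    [r * experts_per_rank, (r + 1) * experts_per_rank)."""
    buckets = [[] for _ in range(world_size)]
    for tok_idx, expert in enumerate(token_expert_ids):
        dest = expert // experts_per_rank
        buckets[dest].append(tok_idx)
    return buckets
```

Skewed bucket sizes mean skewed cross-node traffic, which is why both the training backend and the inference engine need this path checked for efficiency.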
Optimization
- Profile the whole post-training process to discover its computation and memory bottlenecks.