[TODO List] Towards Deepseek 671B RLHF with mcore 0.11.0 #825

@ETOgaosion

Every development and optimization effort in veRL must account for larger models running in distributed environments.

GPTModel integration alignment with FSDP and raw mcore models

It is important that the numerical results of the different backends align with each other. Currently, mcore can still diverge from FSDP in some circumstances.

  • GPTModel and TransformerEngine should first be integrated into veRL successfully.
  • We may need a better way to support three backends in the coming version. Once GPTModel is stable, we can remove the raw mcore models.
  • A calibration tool is needed to align the backends; it should be able to compare data from the three systems at different running steps.
  • The root cause of the alignment issues should be found and fixed with the help of the tool.
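As a rough illustration of what the calibration tool described above could do, the sketch below compares per-step dumps from two backends and reports the first tensor that diverges. It is not veRL code: the function name `first_divergence`, the dump layout (one dict per step, mapping a tensor name to a flat list of floats), and the tolerances are all assumptions made for the example. A real tool would operate on framework tensors rather than Python lists.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical dump layout: one dict per training step, name -> flattened values.
StepDump = Dict[str, Sequence[float]]


def first_divergence(
    reference: List[StepDump],
    candidate: List[StepDump],
    rtol: float = 1e-3,
    atol: float = 1e-5,
) -> Optional[Tuple[int, str, float]]:
    """Return (step, tensor name, max abs diff) of the first mismatch, or None."""
    for step, (ref, cand) in enumerate(zip(reference, candidate)):
        for name in sorted(ref):
            if name not in cand:
                continue  # only compare tensors dumped by both backends
            a, b = ref[name], cand[name]
            if len(a) != len(b):
                raise ValueError(f"shape mismatch for {name!r} at step {step}")
            mismatch = any(
                not math.isclose(x, y, rel_tol=rtol, abs_tol=atol)
                for x, y in zip(a, b)
            )
            if mismatch:
                max_abs = max(abs(x - y) for x, y in zip(a, b))
                return step, name, max_abs
    return None


# Example: the two backends agree at step 0 and diverge in "logits" at step 1.
fsdp_dump = [
    {"logits": [1.0, 2.0], "loss": [0.5]},
    {"logits": [1.0, 2.0], "loss": [0.4]},
]
mcore_dump = [
    {"logits": [1.0, 2.0000001], "loss": [0.5]},
    {"logits": [1.0, 2.5], "loss": [0.4]},
]
result = first_divergence(fsdp_dump, mcore_dump)
```

Reporting the earliest step and tensor name, rather than only a final loss gap, is what makes it possible to narrow a divergence down to a specific layer or operation.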

Checkpoints for GPTModel and Megatron dist_checkpointing

  • Use dist_checkpointing in mcore to speed up checkpoint saving and loading.
  • Support model weights in the GPTModel format.

Fused Kernels to improve Megatron MFU

Some kernels are currently memory-bound, which makes it hard to increase the batch size and sequence length. The entropy loss calculation is one example.

  • These fused kernels will be integrated into the raw models soon.
  • How to patch these kernels into GPTModel needs further study.
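To make the entropy example above concrete, here is a minimal sketch of the idea a fused kernel can exploit: the entropy of a softmax distribution can be computed in a single streaming pass over the logits, using H = m + log(s) - t/s with running max m, s = Σ exp(z - m) and t = Σ exp(z - m)·z, so the full probability vector never needs to be materialized. This is plain Python for readability and is not the kernel planned for veRL; the function names are hypothetical, and a real implementation would be a GPU kernel operating on logit tensors.

```python
import math
from typing import Iterable, Sequence


def streaming_entropy(logits: Iterable[float]) -> float:
    """Entropy of softmax(logits) in one pass, without storing probabilities."""
    it = iter(logits)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("logits must be non-empty")
    running_max = first
    sum_exp = 1.0        # s = sum of exp(z - running_max)
    sum_exp_z = first    # t = sum of exp(z - running_max) * z
    for z in it:
        if z > running_max:
            # Rescale the accumulators to the new maximum (online softmax trick).
            scale = math.exp(running_max - z)
            sum_exp *= scale
            sum_exp_z *= scale
            running_max = z
        w = math.exp(z - running_max)
        sum_exp += w
        sum_exp_z += w * z
    return running_max + math.log(sum_exp) - sum_exp_z / sum_exp


def naive_entropy(logits: Sequence[float]) -> float:
    """Reference version that materializes the full probability vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

The streaming form keeps only three scalars per row of logits, which is the property that lets a fused kernel avoid writing intermediate softmax and log-softmax tensors to memory.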

Support CPU memory offloading to run larger models

  • Megatron should support loading and offloading of parameters, gradients, and optimizer states; the existing functions need to be checked and enabled.

Token balance between PP workloads

  • Balance the token counts across PP workloads to reduce pipeline bubbles.
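One possible way to approach this, assuming the imbalance comes from variable-length sequences being grouped into micro-batches, is to assign sequences to micro-batches greedily by token count (longest first, always into the currently lightest micro-batch). The sketch below is only an illustration of that heuristic; `balance_by_tokens` is a hypothetical helper, not an existing veRL or Megatron function.

```python
import heapq
from typing import List, Sequence


def balance_by_tokens(seq_lens: Sequence[int], num_bins: int) -> List[List[int]]:
    """Greedily assign sequence indices to bins so token totals are close."""
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    bins: List[List[int]] = [[] for _ in range(num_bins)]
    # Min-heap of (current token total, bin index): the lightest bin pops first.
    heap = [(0, b) for b in range(num_bins)]
    # Place the longest sequences first; short ones then fill the gaps.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    for i in order:
        total, b = heapq.heappop(heap)
        bins[b].append(i)
        heapq.heappush(heap, (total + seq_lens[i], b))
    return bins


# Example: eight sequences split into two micro-batches.
lens = [8, 7, 6, 5, 4, 3, 2, 1]
assignment = balance_by_tokens(lens, 2)
totals = [sum(lens[i] for i in group) for group in assignment]
```

The greedy heuristic is not optimal in general, but it is cheap and keeps the per-micro-batch token totals close, which is the quantity that determines how evenly the pipeline stages are loaded.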

MoE and cross-machine TP communication optimization

  • Both Megatron and SGLang/vLLM should support the cross-node MoE part of the model, and its efficiency should be checked.

Optimization

  • Profile the whole post-training process to find its compute and memory bottlenecks.
