Every development and optimization to veRL must consider larger models in distributed environments.
GPTModel integration alignment with FSDP and raw mcore models
It is very important to ensure that algorithm precision aligns across backends. Currently, mcore can still diverge from FSDP in some circumstances.
- GPTModel and TransformerEngine should first be integrated into veRL.
- We may need a better way to support three backends in the coming version; once GPTModel is stable, we can remove the raw mcore models.
- A calibration tool is needed to align the different backends, able to compare data from the three systems at different running steps.
- The root cause of alignment issues shall be found, and the problem fixed through the tool.
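The calibration tool described above could be sketched as follows. This is a minimal illustration, not veRL code: it assumes each backend dumps named tensors (here plain lists of floats) per training step, and compares backends pairwise to flag where numerics diverge beyond a tolerance.

```python
# Hypothetical cross-backend calibration sketch: each backend dumps named
# tensors per step; we compare backends pairwise and report divergences.

def compare_dumps(dumps, atol=1e-5):
    """dumps: {backend_name: {step: {tensor_name: list_of_floats}}}.
    Returns (step, tensor_name, backend_a, backend_b, max_abs_diff)
    tuples for every pair that diverges beyond atol."""
    backends = sorted(dumps)
    reports = []
    for a_i, a in enumerate(backends):
        for b in backends[a_i + 1:]:
            for step in sorted(set(dumps[a]) & set(dumps[b])):
                common = set(dumps[a][step]) & set(dumps[b][step])
                for name in sorted(common):
                    xs, ys = dumps[a][step][name], dumps[b][step][name]
                    diff = max(abs(x - y) for x, y in zip(xs, ys))
                    if diff > atol:
                        reports.append((step, name, a, b, diff))
    return reports
```

Running this over dumps from FSDP, raw mcore, and GPTModel at the same steps would localize the first step and tensor at which the backends drift apart.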
Checkpoints for GPTModel and Megatron dist_checkpointing
- Need to use dist_checkpointing in mcore to accelerate checkpoint saving and loading.
- Support weights of GPTModel-format models.
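The core idea behind dist_checkpointing can be illustrated with a toy sketch (plain pickle files standing in for Megatron's sharded tensor format, which this does not reproduce): each rank writes only its own shard in parallel, instead of one rank gathering and serializing the full state dict.

```python
import os
import pickle

# Toy illustration of sharded checkpointing: one file per rank, written
# in parallel, merged (or resharded) at load time.

def save_sharded(state_shards, ckpt_dir):
    """state_shards: {rank: {param_name: values}}; one file per rank."""
    os.makedirs(ckpt_dir, exist_ok=True)
    for rank, shard in state_shards.items():
        with open(os.path.join(ckpt_dir, f"shard_{rank}.pkl"), "wb") as f:
            pickle.dump(shard, f)

def load_merged(ckpt_dir):
    """Merge all shards into one state dict (real resharding would remap
    shards to the new parallel layout here)."""
    merged = {}
    for fname in sorted(os.listdir(ckpt_dir)):
        with open(os.path.join(ckpt_dir, fname), "rb") as f:
            merged.update(pickle.load(f))
    return merged
```

The win is that save/load time scales with the per-rank shard size rather than the full model size, which is what makes large-model checkpointing tractable.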
Fused Kernels to improve Megatron MFU
Currently some kernels are memory-bound, making it hard to increase the batch size and sequence length. The entropy loss calculation is one example.
- These kernels will be integrated into the raw models soon.
- How to patch these kernels into GPTModel needs further study.
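To show why the entropy loss is memory-bound, here is a pure-Python sketch (not the veRL kernel) of the same memory-saving idea a fused kernel exploits: the entropy of a softmax is computed row by row in chunks, so the full probability matrix over the vocabulary is never materialized.

```python
import math

def softmax_entropy(row):
    """Entropy of softmax(row), computed with the max-subtraction trick:
    H = log Z - (1/Z) * sum_i e^{x_i - m} * (x_i - m)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return math.log(z) - sum(e * (x - m) for e, x in zip(exps, row)) / z

def chunked_entropy(logits, chunk_size=2):
    """Mean token entropy over a batch of logit rows, processed in chunks
    so peak memory stays at chunk_size rows instead of the whole batch."""
    total, n = 0.0, 0
    for start in range(0, len(logits), chunk_size):
        for row in logits[start:start + chunk_size]:
            total += softmax_entropy(row)
            n += 1
    return total / n
```

A fused GPU kernel does the same reduction inside one pass over the logits, avoiding the `batch x seq x vocab` probability tensor entirely.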
Support CPU memory offloading to run larger models
- Megatron should support parameter/gradient/optimizer-state load/offload; these functions need to be checked and enabled.
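The offload pattern can be sketched as follows. This is a hypothetical illustration (plain dicts and lists standing in for GPU buffers, names invented here): buffers are moved to a host-side store between phases and restored before the next compute phase that needs them.

```python
# Hypothetical sketch of param/grad/optimizer offloading: buffers are
# evicted to host memory when a phase ends and reloaded before the next
# phase that needs them, freeing "device" memory in between.

class OffloadManager:
    def __init__(self):
        self._host_store = {}

    def offload(self, device_state, keys):
        """Move the named buffers off the 'device' into host memory."""
        for k in keys:
            self._host_store[k] = device_state.pop(k)

    def reload(self, device_state, keys):
        """Bring the named buffers back onto the 'device'."""
        for k in keys:
            device_state[k] = self._host_store.pop(k)
```

In post-training this matters because rollout (generation) and training alternate, so the optimizer state and gradients can live in CPU memory during generation, leaving GPU memory for the inference engine.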
Token balance between PP workloads
- Make PP workloads more balanced to reduce pipeline execution bubbles.
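One standard formulation of this balancing problem, sketched below under the assumption that per-layer (or per-chunk) costs are known integer token counts: split the cost sequence into contiguous pipeline stages so the heaviest stage, and hence the pipeline bubble, is as small as possible (binary search on the stage capacity).

```python
# Sketch of token balancing across PP stages: minimize the maximum
# contiguous-segment sum over `stages` segments via binary search.

def min_max_partition(costs, stages):
    """costs: per-layer/per-chunk integer token costs, in pipeline order.
    Returns the minimal achievable max-stage load."""
    def segments_needed(cap):
        count, cur = 1, 0
        for c in costs:
            if cur + c > cap:          # start a new stage
                count, cur = count + 1, c
            else:
                cur += c
        return count

    lo, hi = max(costs), sum(costs)    # bounds on the optimal capacity
    while lo < hi:
        mid = (lo + hi) // 2
        if segments_needed(mid) <= stages:
            hi = mid                   # feasible: try a tighter cap
        else:
            lo = mid + 1
    return lo
```

A lower max-stage load directly shrinks the bubble, since every pipeline stage waits on the slowest one.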
The MoE or cross-machine TP communication optimization
- Both Megatron and SGLang/vLLM shall support the cross-node MoE parts of the model, and their efficiency should be checked.
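The traffic pattern being optimized here is the MoE all-to-all dispatch, which can be modeled with a toy sketch (no communication performed, names invented here): tokens routed to experts living on other ranks are bucketed per destination rank, and those buckets are exactly what crosses the network.

```python
# Toy model of the MoE dispatch step behind cross-node all-to-all:
# bucket each token by the rank that hosts its routed expert.

def build_alltoall_buckets(token_expert_ids, experts_per_rank, world_size):
    """Return, for each destination rank, the token indices it receives.
    Assumes experts are laid out contiguously: rank r hosts experts
    [r * experts_per_rank, (r + 1) * experts_per_rank)."""
    buckets = [[] for _ in range(world_size)]
    for tok_idx, expert in enumerate(token_expert_ids):
        dest = expert // experts_per_rank
        buckets[dest].append(tok_idx)
    return buckets
```

Skewed bucket sizes mean skewed cross-node traffic, which is why both the training backend and the inference engine need this path checked for efficiency.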
Optimization
- Profile the whole post-training process to discover its computation and memory bottlenecks.