The focus for Megatron Core MoE in Q3-Q4 2025 is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
Model Support
- ✅ DeepSeek
  - ✅ DeepSeek-V2
  - ✅ DeepSeek-V3, including MTP
- ✅ Qwen
  - ✅ Qwen2-57B-A14B
  - ✅ Qwen3-30B-A3B
  - ✅ Qwen3-235B-A22B
- ✅ Mixtral
  - ✅ Mixtral-8x7B
  - ✅ Mixtral-8x22B
Core MoE Functionality
- ✅ Token dropless MoE (dMoE) - Advanced routing without token dropping
- ✅ Top-K Router with flexible K selection
- ✅ Load balancing losses for expert utilization optimization (see the router sketch after this list)
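As a rough illustration of what the Top-K router and load-balancing loss compute, here is a minimal PyTorch sketch. It is not the Megatron-Core router implementation; the class and argument names (`SimpleTopKRouter`, `aux_loss_coeff`) are placeholders, and the loss follows the common Switch-Transformer-style formulation.

```python
# Minimal sketch of a top-k MoE router with an auxiliary load-balancing loss.
# Illustrative only; names are hypothetical and do not mirror Megatron-Core's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTopKRouter(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int, aux_loss_coeff: float = 1e-2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k
        self.num_experts = num_experts
        self.aux_loss_coeff = aux_loss_coeff

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [num_tokens, hidden_size]
        logits = self.gate(hidden_states)                       # [tokens, experts]
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = torch.topk(probs, self.top_k, dim=-1)

        # Load-balancing auxiliary loss (Switch-Transformer style):
        # fraction of tokens dispatched to each expert * mean router probability per expert.
        with torch.no_grad():
            dispatch_mask = F.one_hot(topk_idx, self.num_experts).sum(dim=1).float()
        tokens_per_expert = dispatch_mask.mean(dim=0)           # dispatch fraction per expert
        mean_probs = probs.mean(dim=0)                          # mean gate probability per expert
        aux_loss = self.aux_loss_coeff * self.num_experts * torch.sum(tokens_per_expert * mean_probs)
        return topk_probs, topk_idx, aux_loss
```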
Advanced Parallelism
- ✅ Expert Parallel (EP) with 3D parallelism integration (an expert-placement sketch follows this list)
- ✅ Full parallelism combo: EP + DP + TP + PP + SP support
- ✅ Context Parallel (CP) for long sequence MoE training
- ✅ Parallel Folding: heterogeneous parallelism mappings for efficient large-scale MoE model training
- ✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
- ✅ Flexible Asymmetric Virtual Pipeline Parallelism with Custom Pipeline Layout (--pipeline-model-parallel-layout)
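For intuition on how Expert Parallel (EP) shards experts, the sketch below computes which global expert indices a given EP rank hosts under contiguous sharding. It is purely illustrative; Megatron-Core manages this through its own parallel-state utilities, and `local_expert_ids` is a hypothetical helper.

```python
# Minimal sketch of how expert parallelism (EP) partitions experts across ranks.
# Hypothetical helper, assuming contiguous sharding and even divisibility.
def local_expert_ids(num_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Return the global expert indices hosted on a given EP rank."""
    assert num_experts % ep_size == 0, "experts must divide evenly across EP ranks"
    experts_per_rank = num_experts // ep_size
    start = ep_rank * experts_per_rank
    return list(range(start, start + experts_per_rank))

# Example: 64 experts on EP=8 -> rank 3 hosts experts 24..31.
print(local_expert_ids(num_experts=64, ep_size=8, ep_rank=3))
```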
Performance Optimizations
- ✅ Memory-efficient token permutation
- ✅ Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
- ✅ MLA TP Support with MoE parallel folding
- ✅ GroupedGEMM and gradient accumulation (GA) fusion
- ✅ DP/PP/TP Communication Overlapping
- ✅ Overlapped Shared Expert execution
- ✅ Router Fusion optimizations
- ✅ Token (un)permutation fusion kernels (a reference permute/unpermute sketch follows this list)
- ✅ cuDNN MLA kernel optimizations for Hopper and Blackwell
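The token (un)permutation fusion above replaces the kind of sort-based permute/unpermute shown in this plain-PyTorch reference. The helper names are hypothetical and top-1 routing is assumed for brevity; the fused kernels perform the same reordering without the intermediate copies.

```python
# Minimal sketch of sort-based token (un)permutation around expert computation.
import torch

def permute_tokens(tokens: torch.Tensor, expert_idx: torch.Tensor):
    """Group tokens by their assigned expert.
    tokens: [num_tokens, hidden], expert_idx: [num_tokens] (top-1 for simplicity)."""
    sort_order = torch.argsort(expert_idx, stable=True)
    return tokens[sort_order], sort_order

def unpermute_tokens(permuted: torch.Tensor, sort_order: torch.Tensor):
    """Restore the original token order after expert computation."""
    restored = torch.empty_like(permuted)
    restored[sort_order] = permuted
    return restored

tokens = torch.randn(6, 4)
expert_idx = torch.tensor([2, 0, 1, 0, 2, 1])
permuted, order = permute_tokens(tokens, expert_idx)
assert torch.equal(unpermute_tokens(permuted, order), tokens)
```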
Hardware & Precision Support
- ✅ DeepEP support for H100 and B200
- ✅ GroupedGEMM including FP8/MXFP8 support
- ✅ FP8 training with blockwise/deepseek/mxfp8 recipes (a blockwise scaling sketch follows this list)
- ✅ FP8 weights with BF16 optimizer states
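To make the blockwise FP8 recipe concrete, here is a minimal quantize/dequantize sketch with one scale per block. Actual training goes through Transformer Engine recipes rather than code like this; the 128-element block size and helper names are assumptions, and it requires a PyTorch build with `torch.float8_e4m3fn`.

```python
# Minimal sketch of blockwise FP8 quantization: one scale per 128-element block.
# Illustrative only; real FP8 training uses Transformer Engine's recipes.
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in float8_e4m3fn

def blockwise_fp8_quantize(x: torch.Tensor, block_size: int = 128):
    """Quantize a 1-D tensor to FP8 with one scale per block (assumed block size)."""
    assert x.ndim == 1 and x.numel() % block_size == 0
    blocks = x.view(-1, block_size)
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax                       # per-block scale factor
    q = (blocks * scale).to(torch.float8_e4m3fn)      # quantized payload
    return q, scale

def blockwise_fp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) / scale).view(-1)

x = torch.randn(1024)
q, scale = blockwise_fp8_quantize(x)
err = (blockwise_fp8_dequantize(q, scale) - x).abs().max().item()
print(f"max abs dequantization error: {err:.4f}")
```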
Developer Experience
- ✅ MoE Model Zoo with pre-training performance best practices
- ✅ Distributed Checkpointing for MoE models
- ✅ Upcycling Support for low-cost model scaling (see the upcycling sketch after this list)
- ✅ MCore2HF Converter in Megatron-Bridge
- ✅ Layer-wise logging for detailed monitoring
- ✅ Runtime Upcycling capabilities
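As a sketch of what upcycling means in practice, the snippet below turns a dense MLP into N experts by copying its weights, which is the basic idea behind sparse upcycling. It is not the Megatron-Core upcycling tool, and the helper name is hypothetical.

```python
# Minimal sketch of "upcycling" a dense MLP into an MoE layer by replicating
# the dense weights into every expert. Hypothetical helper, for illustration only.
import copy
import torch.nn as nn

def upcycle_dense_mlp(dense_mlp: nn.Module, num_experts: int) -> nn.ModuleList:
    """Create num_experts experts, each initialized as a copy of the dense MLP."""
    return nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])

dense = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
experts = upcycle_dense_mlp(dense, num_experts=8)
# A router (see the top-k sketch above) then selects which copies each token uses;
# training gradually differentiates the experts from their shared initialization.
```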
Next Release Roadmap (MCore v0.14)
Memory Enhancements
- 🚀 FP8 support for Fine-grained Recomputations - Precision + memory efficiency
Advanced Functionality
- Router fusions - optimized fused router kernels
- MLA CP and Packed Sequence Support (Context Parallel) - MLA with context parallelism
- MTP CP Support - MTP and DeepSeek-V3 with CP
- 🚀 CUDA Graph Enhancement (see the capture/replay sketch after this list)
  - Flexible Virtual Pipeline Parallel Layout Support
  - FP8 operations
  - More graphable scopes, such as the MoE router and dispatcher preprocessing
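For readers unfamiliar with what a "graphable scope" buys, the sketch below captures a toy router gate into a CUDA graph using PyTorch's standard capture/replay pattern (warm-up on a side stream, static input/output buffers). It only hints at the idea; Megatron-Core's integration handles scoping, FP8, and training-mode capture, and the shapes here are arbitrary.

```python
# Minimal sketch of CUDA graph capture/replay for a router gate (inference-style).
# Requires a CUDA-capable GPU; Megatron-Core's actual integration is more involved.
import torch
import torch.nn as nn

if torch.cuda.is_available():
    gate = nn.Linear(1024, 64, bias=False).cuda().eval()
    static_input = torch.randn(4096, 1024, device="cuda")

    # Warm up on a side stream (required before capture).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        with torch.no_grad():
            for _ in range(3):
                _ = gate(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into a graph with static in/out buffers.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        with torch.no_grad():
            static_output = gate(static_input)

    # Replay: copy new data into the static input buffer, then launch the whole
    # captured kernel sequence with a single graph launch (no per-kernel CPU overhead).
    static_input.copy_(torch.randn_like(static_input))
    graph.replay()
    topk_probs, topk_idx = torch.topk(static_output, k=8, dim=-1)
```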
Communication Optimization
- 🚀 1F1B EP A2A Overlap - hiding expert-parallel all-to-all communication behind the 1F1B pipeline schedule
Bug Fix
- MTP issues with VPP and TP - fixed a low-impact correctness issue with TP and a compatibility issue with VPP
Ongoing Long-term Features
- E2E Performance optimization for DeepSeek-V3, Qwen-3 and other fine-grained MoEs
- Super long sequences (>128K) for DeepSeek-V3, Qwen-3 and other fine-grained MoEs
- Sync-Free MoE
- Full-iteration CUDA Graph MoE
- A more flexible MTP placement strategy when using VPP
- CPU Overhead reduction for better host utilization
- Custom FSDP - NVIDIA-optimized FSDP implementation with full expert parallel support
- MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
- CPU Offloading
- Enhanced EP communication kernels for GB200
Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
This roadmap reflects the collective efforts of NVIDIA and our collaborators.
Credits: MCore MoE Team and @sbhavani
Labels: roadmap, moe, call-for-contribution
Milestone: MCore v0.14