
[ROADMAP] Megatron Core MoE Q3-Q4 2025 Roadmap #1729

@yanring

Description

The focus for Megatron Core MoE in Q3-Q4 2025 is comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This roadmap is tentative and subject to change.

Model Support

  • DeepSeek
    • ✅ DeepSeek-V2
    • ✅ DeepSeek-V3, including MTP
  • Qwen
    • ✅ Qwen2-57B-A14B
    • ✅ Qwen3-30B-A3B
    • ✅ Qwen3-235B-A22B
  • Mixtral
    • ✅ Mixtral-8x7B
    • ✅ Mixtral-8x22B

Core MoE Functionality

  • Token dropless MoE (dMoE) - Advanced routing without token dropping
  • Top-K Router with flexible K selection
  • Load balancing losses for expert utilization optimization (router and loss sketched after this list)
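
The Top-K router and load-balancing loss above reduce to a small amount of tensor math. Below is a minimal PyTorch sketch of top-k selection with a Switch-style auxiliary balancing loss; the function name, shapes, and the `aux_coeff` default are illustrative assumptions, not MCore's `TopKRouter` implementation.

```python
import torch
import torch.nn.functional as F

def topk_route(hidden: torch.Tensor, router_weight: torch.Tensor,
               k: int, aux_coeff: float = 1e-2):
    """hidden: [num_tokens, hidden_size], router_weight: [num_experts, hidden_size]."""
    logits = hidden @ router_weight.t()                    # [tokens, experts]
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_ids = torch.topk(probs, k, dim=-1)    # flexible K selection

    # Auxiliary load-balancing loss: penalize the product of the fraction of
    # routed slots each expert receives and its mean router probability; the
    # loss is minimized when expert utilization is uniform.
    num_experts = router_weight.size(0)
    dispatch = F.one_hot(topk_ids, num_experts).float().sum(dim=(0, 1))
    dispatch = dispatch / (hidden.size(0) * k)             # fraction of slots per expert
    importance = probs.mean(dim=0)                         # mean router prob per expert
    aux_loss = aux_coeff * num_experts * torch.sum(dispatch * importance)
    return topk_probs, topk_ids, aux_loss

# e.g. route 16 tokens across 8 experts with k=2
probs, ids, aux = topk_route(torch.randn(16, 64), torch.randn(8, 64), k=2)
```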

Advanced Parallelism

  • Expert Parallel (EP) with 3D parallelism integration
  • Full parallelism combo: EP + DP + TP + PP + SP support
  • Context Parallel (CP) for long sequence MoE training
  • Parallel Folding - heterogeneous parallelism mappings for efficient large-scale MoE model training (see the sketch after this list)
  • Distributed Optimizer for MoE (ZeRO-1 equivalent)
  • Flexible Asymmetric Virtual Pipeline Parallelism with Custom Pipeline Layout (--pipeline-model-parallel-layout)
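
To make the parallel-folding idea above concrete: attention and MoE layers run on the same GPUs but group them differently, with EP folded into the attention-side data-parallel dimension. The sketch below does only the sizing arithmetic, assuming expert TP equals attention TP and no CP; it is a simplification for illustration, not MCore's actual process-group construction.

```python
def folded_moe_groups(world_size: int, tp: int, pp: int, ep: int) -> dict:
    """Sizing arithmetic for folding EP into the data-parallel dimension."""
    assert world_size % (tp * pp) == 0
    dp = world_size // (tp * pp)          # attention-side data-parallel size
    assert dp % ep == 0, "EP must divide the data-parallel dimension"
    expert_dp = dp // ep                  # MoE-side data-parallel size
    return {"dp": dp, "ep": ep, "expert_dp": expert_dp}

# e.g. 1024 GPUs with TP=2, PP=8 gives DP=64; folding in EP=16 leaves EDP=4 for the experts
print(folded_moe_groups(1024, tp=2, pp=8, ep=16))
```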

Performance Optimizations

  • Memory Efficient token permutation
  • Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
  • MLA TP Support with MoE parallel folding
  • GroupedGEMM and gradient accumulation (GA) fusion
  • DP/PP/TP Communication Overlapping
  • Overlapped Shared Expert execution
  • Router Fusion optimizations
  • Token (un)permutation Fusion kernels (see the permutation sketch after this list)
  • MLA Kernel Optimization for Hopper and Blackwell from cuDNN
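
The two permutation items above (memory-efficient permutation and the fused (un)permutation kernels) revolve around one layout transform: sort tokens so each expert sees a contiguous slice, the layout GroupedGEMM consumes, then scatter results back to the original token order. Here is an eager-mode PyTorch sketch with top-1 routing for simplicity; the helper names are mine, not the fused MCore kernels.

```python
import torch

def permute(tokens: torch.Tensor, expert_ids: torch.Tensor):
    """tokens: [num_tokens, hidden]; expert_ids: [num_tokens] (top-1 routing)."""
    _, order = torch.sort(expert_ids, stable=True)   # group tokens by expert
    return tokens[order], order

def unpermute(permuted: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(permuted)
    out[order] = permuted                            # scatter rows back to original positions
    return out

tokens = torch.randn(8, 4)
expert_ids = torch.tensor([2, 0, 1, 2, 0, 1, 1, 0])
grouped, order = permute(tokens, expert_ids)         # expert-contiguous layout
assert torch.equal(unpermute(grouped, order), tokens)
```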

Hardware & Precision Support

  • DeepEP support for H100 and B200
  • GroupedGEMM including FP8/MXFP8 support
  • FP8 training with blockwise/deepseek/mxfp8 recipes (blockwise scaling sketched after this list)
  • FP8 weights with BF16 optimizer states
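
As a numerical illustration of the blockwise recipe above, the sketch below applies per-block amax scaling into the E4M3 range and dequantizes back. The 1x128 block size and the use of `torch.float8_e4m3fn` (available in recent PyTorch) are assumptions for illustration; this is not Transformer Engine's recipe implementation.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def blockwise_fp8_quant(x: torch.Tensor, block: int = 128):
    """x: [rows, cols] with cols divisible by `block`; returns fp8 data + per-block scales."""
    rows, cols = x.shape
    xb = x.view(rows, cols // block, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = E4M3_MAX / amax                                   # one scale per 1x128 block
    q = (xb * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def blockwise_fp8_dequant(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() / scale).view(q.size(0), -1)

x = torch.randn(4, 256)
q, s = blockwise_fp8_quant(x)
print((blockwise_fp8_dequant(q, s) - x).abs().max())          # small round-trip error
```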

Developer Experience

  • MoE Model Zoo with pre-training performance best practices
  • Distributed Checkpointing for MoE models
  • Upcycling Support for low-cost model scaling (see the sketch after this list)
  • MCore2HF Converter in Megatron-Bridge
  • Layer-wise logging for detailed monitoring
  • Runtime Upcycling capabilities
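
Upcycling, referenced above, initializes an MoE model from a trained dense checkpoint: the dense FFN weights are replicated into every expert while the router is trained from scratch. A minimal sketch under that assumption, using a generic `nn.Sequential` MLP rather than Megatron's checkpoint format:

```python
import copy
import torch.nn as nn

def upcycle_dense_mlp(dense_mlp: nn.Module, num_experts: int) -> nn.ModuleList:
    """Replicate one trained dense FFN into `num_experts` identical expert FFNs."""
    return nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])

dense = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
experts = upcycle_dense_mlp(dense, num_experts=8)   # the router starts untrained
```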

Next Release Roadmap (MCore v0.14)

Memory Enhancements

  • 🚀 FP8 support for Fine-grained Recomputations - Precision + memory efficiency

Advanced Functionality

  • Router Fusion - optimized fused router kernels
  • MLA CP and Packed Sequence Support - MLA with context parallelism and packed sequences
  • MTP CP Support - MTP and DeepSeek-V3 with CP
  • 🚀 CUDA Graph Enhancement (basic capture/replay sketched after this list)
    • Flexible Virtual Pipeline Parallel Layout Support
    • FP8 operations
    • More graphable scopes like MoE router and dispatcher preprocessing
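
For context on what the CUDA Graph items build on, here is a minimal capture-and-replay sketch using only stock PyTorch APIs (side-stream warm-up, `torch.cuda.CUDAGraph`, the `torch.cuda.graph` context, and `replay`). It is not MCore's CUDA-graph integration or its graphable scopes, and it needs a CUDA device to run.

```python
import torch

assert torch.cuda.is_available(), "CUDA graphs require a GPU"
layer = torch.nn.Linear(1024, 1024, device="cuda")
static_in = torch.randn(16, 1024, device="cuda")

# Warm up on a side stream before capture, as recommended by the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = layer(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = layer(static_in)   # kernel launches are recorded once

static_in.copy_(torch.randn(16, 1024, device="cuda"))
g.replay()                          # replays the recorded work with minimal launch overhead
print(static_out.shape)
```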

Communication Optimization

  • 🚀 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with the 1F1B Pipeline Schedule

Bug Fix

  • MTP issues with VPP and TP - Fixed a low-impact correctness issue with TP and a compatibility issue with VPP

Ongoing Long-term Features

  • E2E Performance optimization for DeepSeek-V3, Qwen3 and other fine-grained MoEs
  • Super long sequences (>128K) for DeepSeek-V3, Qwen3 and other fine-grained MoEs
  • Sync-Free MoE
  • Full-Iter cudaGraph MoE
  • A more flexible MTP placement strategy when using VPP
  • CPU Overhead reduction for better host utilization
  • Custom FSDP - NVIDIA-optimized FSDP implementation with full expert parallel support
  • MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
  • CPU Offloading
  • Enhanced EP communication Kernel for GB200

Call for Community Contributions

  • Model implementations - Additional MoE model variants
  • Performance testing - Performance tests across different platforms and workloads
  • Documentation and tutorials - Best practices and optimization guides

This roadmap reflects the collective efforts of NVIDIA and our collaborators.

Credits: MCore MoE Team and @sbhavani

Labels: roadmap, moe, call-for-contribution
Milestone: MCore v0.14
