The focus for Megatron Core MoE in Q3-Q4 2025 is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
Model Support
- ✅ DeepSeek
  - ✅ DeepSeek-V2
  - ✅ DeepSeek-V3, including MTP
- ✅ Qwen
  - ✅ Qwen2-57B-A14B
  - ✅ Qwen3-30B-A3B
  - ✅ Qwen3-235B-A22B
- ✅ Mixtral
  - ✅ Mixtral-8x7B
  - ✅ Mixtral-8x22B
Core MoE Functionality
- ✅ Token dropless MoE (dMoE) - Advanced routing without token dropping
- ✅ Top-K Router with flexible K selection
- ✅ Load balancing losses for expert utilization optimization (see the router sketch after this list)
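As a rough illustration of what the Top-K router and load-balancing loss compute, here is a minimal PyTorch sketch. It is not the Megatron-Core router implementation; the class and argument names (`SimpleTopKRouter`, `aux_loss_coeff`) are placeholders, and the loss follows the common Switch-Transformer-style formulation.

```python
# Minimal sketch of a top-k MoE router with an auxiliary load-balancing loss.
# Illustrative only; names are hypothetical and do not mirror Megatron-Core's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTopKRouter(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int, aux_loss_coeff: float = 1e-2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k
        self.num_experts = num_experts
        self.aux_loss_coeff = aux_loss_coeff

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [num_tokens, hidden_size]
        logits = self.gate(hidden_states)                       # [tokens, experts]
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = torch.topk(probs, self.top_k, dim=-1)

        # Load-balancing auxiliary loss (Switch-Transformer style):
        # fraction of tokens dispatched to each expert * mean router probability per expert.
        with torch.no_grad():
            dispatch_mask = F.one_hot(topk_idx, self.num_experts).sum(dim=1).float()
        tokens_per_expert = dispatch_mask.mean(dim=0)           # dispatch fraction per expert
        mean_probs = probs.mean(dim=0)                          # mean gate probability per expert
        aux_loss = self.aux_loss_coeff * self.num_experts * torch.sum(tokens_per_expert * mean_probs)
        return topk_probs, topk_idx, aux_loss
```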
Advanced Parallelism
- ✅ Expert Parallel (EP) with 3D parallelism integration (an expert-placement sketch follows this list)
- ✅ Full parallelism combo: EP + DP + TP + PP + SP support
- ✅ Context Parallel (CP) for long sequence MoE training
- ✅ Parallel Folding: heterogeneous parallelism mappings for efficient large-scale MoE model training
- ✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
- ✅ Flexible Asymmetric Virtual Pipeline Parallelism with Custom Pipeline Layout (--pipeline-model-parallel-layout)
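For intuition on how Expert Parallel (EP) shards experts, the sketch below computes which global expert indices a given EP rank hosts under contiguous sharding. It is purely illustrative; Megatron-Core manages this through its own parallel-state utilities, and `local_expert_ids` is a hypothetical helper.

```python
# Minimal sketch of how expert parallelism (EP) partitions experts across ranks.
# Hypothetical helper, assuming contiguous sharding and even divisibility.
def local_expert_ids(num_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Return the global expert indices hosted on a given EP rank."""
    assert num_experts % ep_size == 0, "experts must divide evenly across EP ranks"
    experts_per_rank = num_experts // ep_size
    start = ep_rank * experts_per_rank
    return list(range(start, start + experts_per_rank))

# Example: 64 experts on EP=8 -> rank 3 hosts experts 24..31.
print(local_expert_ids(num_experts=64, ep_size=8, ep_rank=3))
```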
Performance Optimizations
- ✅ Memory-efficient token permutation
- ✅ Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
- ✅ MLA TP Support with MoE parallel folding
- ✅ GroupedGEMM and gradient accumulation (GA) fusion
- ✅ DP/PP/TP Communication Overlapping
- ✅ Overlapped Shared Expert execution
- ✅ Router Fusion optimizations
- ✅ Token (un)permutation fusion kernels (a reference permute/unpermute sketch follows this list)
- ✅ cuDNN MLA kernel optimizations for Hopper and Blackwell
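The token (un)permutation fusion above replaces the kind of sort-based permute/unpermute shown in this plain-PyTorch reference. The helper names are hypothetical and top-1 routing is assumed for brevity; the fused kernels perform the same reordering without the intermediate copies.

```python
# Minimal sketch of sort-based token (un)permutation around expert computation.
import torch

def permute_tokens(tokens: torch.Tensor, expert_idx: torch.Tensor):
    """Group tokens by their assigned expert.
    tokens: [num_tokens, hidden], expert_idx: [num_tokens] (top-1 for simplicity)."""
    sort_order = torch.argsort(expert_idx, stable=True)
    return tokens[sort_order], sort_order

def unpermute_tokens(permuted: torch.Tensor, sort_order: torch.Tensor):
    """Restore the original token order after expert computation."""
    restored = torch.empty_like(permuted)
    restored[sort_order] = permuted
    return restored

tokens = torch.randn(6, 4)
expert_idx = torch.tensor([2, 0, 1, 0, 2, 1])
permuted, order = permute_tokens(tokens, expert_idx)
assert torch.equal(unpermute_tokens(permuted, order), tokens)
```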
Hardware & Precision Support
- ✅ DeepEP support for H100 and B200
- ✅ GroupedGEMM including FP8/MXFP8 support
- ✅ FP8 training with blockwise/deepseek/mxfp8 recipes (a blockwise scaling sketch follows this list)
- ✅ FP8 weights with BF16 optimizer states
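To make the blockwise FP8 recipe concrete, here is a minimal quantize/dequantize sketch with one scale per block. Actual training goes through Transformer Engine recipes rather than code like this; the 128-element block size and helper names are assumptions, and it requires a PyTorch build with `torch.float8_e4m3fn`.

```python
# Minimal sketch of blockwise FP8 quantization: one scale per 128-element block.
# Illustrative only; real FP8 training uses Transformer Engine's recipes.
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in float8_e4m3fn

def blockwise_fp8_quantize(x: torch.Tensor, block_size: int = 128):
    """Quantize a 1-D tensor to FP8 with one scale per block (assumed block size)."""
    assert x.ndim == 1 and x.numel() % block_size == 0
    blocks = x.view(-1, block_size)
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax                       # per-block scale factor
    q = (blocks * scale).to(torch.float8_e4m3fn)      # quantized payload
    return q, scale

def blockwise_fp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) / scale).view(-1)

x = torch.randn(1024)
q, scale = blockwise_fp8_quantize(x)
err = (blockwise_fp8_dequantize(q, scale) - x).abs().max().item()
print(f"max abs dequantization error: {err:.4f}")
```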
Developer Experience
- ✅ MoE Model Zoo with pre-training performance best practices
- ✅ Distributed Checkpointing for MoE models
- ✅ Upcycling Support for low-cost model scaling (see the upcycling sketch after this list)
- ✅ MCore2HF Converter in Megatron-Bridge
- ✅ Layer-wise logging for detailed monitoring
- ✅ Runtime Upcycling capabilities
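As a sketch of what upcycling means in practice, the snippet below turns a dense MLP into N experts by copying its weights, which is the basic idea behind sparse upcycling. It is not the Megatron-Core upcycling tool, and the helper name is hypothetical.

```python
# Minimal sketch of "upcycling" a dense MLP into an MoE layer by replicating
# the dense weights into every expert. Hypothetical helper, for illustration only.
import copy
import torch.nn as nn

def upcycle_dense_mlp(dense_mlp: nn.Module, num_experts: int) -> nn.ModuleList:
    """Create num_experts experts, each initialized as a copy of the dense MLP."""
    return nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])

dense = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
experts = upcycle_dense_mlp(dense, num_experts=8)
# A router (see the top-k sketch above) then selects which copies each token uses;
# training gradually differentiates the experts from their shared initialization.
```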
Next Release Roadmap (MCore v0.14)
Memory Enhancements
- 🚀 FP8 support for Fine-grained Recomputations - Precision + memory efficiency
Advanced Functionality
- Router fusions - optimized fused router kernels
- MLA CP and Packed Sequence Support (Context Parallel) - MLA with context parallelism
- MTP CP Support - MTP and DeepSeek-V3 with CP
- 🚀 CUDA Graph Enhancement (see the capture/replay sketch after this list)
  - Flexible Virtual Pipeline Parallel Layout Support
  - FP8 operations
  - More graphable scopes, such as the MoE router and dispatcher preprocessing
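For readers unfamiliar with what a "graphable scope" buys, the sketch below captures a toy router gate into a CUDA graph using PyTorch's standard capture/replay pattern (warm-up on a side stream, static input/output buffers). It only hints at the idea; Megatron-Core's integration handles scoping, FP8, and training-mode capture, and the shapes here are arbitrary.

```python
# Minimal sketch of CUDA graph capture/replay for a router gate (inference-style).
# Requires a CUDA-capable GPU; Megatron-Core's actual integration is more involved.
import torch
import torch.nn as nn

if torch.cuda.is_available():
    gate = nn.Linear(1024, 64, bias=False).cuda().eval()
    static_input = torch.randn(4096, 1024, device="cuda")

    # Warm up on a side stream (required before capture).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        with torch.no_grad():
            for _ in range(3):
                _ = gate(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into a graph with static in/out buffers.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        with torch.no_grad():
            static_output = gate(static_input)

    # Replay: copy new data into the static input buffer, then launch the whole
    # captured kernel sequence with a single graph launch (no per-kernel CPU overhead).
    static_input.copy_(torch.randn_like(static_input))
    graph.replay()
    topk_probs, topk_idx = torch.topk(static_output, k=8, dim=-1)
```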
Communication Optimization
- 🚀 1F1B EP A2A Overlap - hiding expert-parallel all-to-all communication behind the 1F1B pipeline schedule
Bug Fix
- MTP issues with VPP and TP - fixed a low-impact correctness issue with TP and a compatibility issue with VPP
Ongoing Long-term Features
- E2E Performance optimization for DeepSeek-V3, Qwen-3 and other fine-grained MoEs
- Super long sequences (>128K) for DeepSeek-V3, Qwen-3 and other fine-grained MoEs
- Sync-Free MoE
- Full-iteration CUDA Graph MoE
- A more flexible MTP placement strategy when using VPP
- CPU Overhead reduction for better host utilization
- Custom FSDP - NVIDIA-optimized FSDP implementation with full expert parallel support
- MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
- CPU Offloading
- Enhanced EP communication kernels for GB200
Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
This roadmap reflects the collective efforts of NVIDIA and our collaborators.
Credits: MCore MoE Team and @sbhavani
Labels: roadmap, moe, call-for-contribution
Milestone: MCore v0.14