The verl team would like to invite the community to help build the infrastructure needed to efficiently scale up to models of DeepSeek scale, and to develop recipes that reproduce DeepSeek-R1 results for the broader open-source and research community. You're encouraged to join the Slack channel for discussions on the sub-topics below.
Evals
Tasks: https://github.com/volcengine/verl/pull/777/files
- add an evaluation script to reproduce ds-r1 results on several benchmarks from the ds-r1 checkpoint:
  - GPQA Diamond (English)
  - LiveCodeBench (code)
  - SWE-bench Verified (code)
  - CNMO 2024 (math)
Notes:
- refer to examples/data_preprocess for data preprocessing examples (see the preprocessing sketch after these notes)
- refer to trainer/main_generation.py for generation (evaluation code is still missing)
- if possible, upload the preprocessed dataset to Hugging Face
- start with ds distilled models for quick verification before scaling up to 671b
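For reference, here is a minimal preprocessing sketch in the style of the scripts under examples/data_preprocess: it writes a parquet file with a chat-style prompt and a rule-based reward_model field. The dataset id, column names, and output path are placeholders (not part of this issue); check the existing scripts for the exact schema verl expects.

```python
# Hypothetical preprocessing sketch; dataset id and column names are placeholders.
import argparse
import os

import datasets


def make_map_fn(split):
    def process(example, idx):
        return {
            "data_source": "gpqa_diamond",  # tag consumed by the rule-based reward fn
            "prompt": [{"role": "user", "content": example["question"]}],
            "ability": "science",
            "reward_model": {"style": "rule", "ground_truth": example["answer"]},
            "extra_info": {"split": split, "index": idx},
        }

    return process


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default="~/data/gpqa_diamond")
    args = parser.parse_args()
    local_dir = os.path.expanduser(args.local_dir)
    os.makedirs(local_dir, exist_ok=True)

    # Placeholder dataset id; substitute the actual benchmark being preprocessed.
    ds = datasets.load_dataset("some_org/gpqa_diamond", split="test")
    ds = ds.map(make_map_fn("test"), with_indices=True)
    ds.to_parquet(os.path.join(local_dir, "test.parquet"))
```

Starting from the distilled models (last note above) keeps this loop cheap while the scoring logic is being validated.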
Training engine
Tasks:
- verify GPTModel with mcore v0.11; there is an experimental integration of the critic and actor models with the GPTModel class (see the smoke-test sketch after this task list)
- verify the checkpoint manager with GPTModel
- Investigate the convergence issue of seq packing when micro bsz > 1 @GAIR-NLP
- Mcore ds-v3 perf tuning: run ds-v3 with mcore on a range of seqlens and GPU counts, and tune n-d parallelism / memory configs
- Mcore ds-v3 convergence verification: to ensure the correctness of the mcore ds-v3 implementation, run a smaller version of ds with the same model architecture:
  - obtain a medium-sized MoE pretrained ckpt
  - compare mcore convergence against FSDP with the medium-sized MoE
- Verify mcore expert parallelism correctness ([megatron] support megatron expert parallel #1467)
- Optimize ds-v3 long-context throughput with the FSDP backend for faster FSDP experiment iterations (e.g. Liger alignment loss, recompute strategy, ring attention)
- Run 671b (Add DeepSeek 671B GRPO example #1771)
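As a starting point for the GPTModel verification task above, here is a minimal single-GPU smoke-test sketch. It assumes mcore v0.11's GPTModel / TransformerConfig / get_gpt_layer_local_spec APIs; the tiny shapes, vocab size, and launch command are illustrative only, and the real verification should go through verl's actor/critic workers and checkpoint manager.

```python
# Minimal mcore GPTModel smoke test (launch with: torchrun --nproc_per_node=1 smoke_test.py).
# Shapes and vocab size below are arbitrary placeholders.
import torch
from megatron.core import parallel_state
from megatron.core.models.gpt import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
from megatron.core.transformer.transformer_config import TransformerConfig

torch.distributed.init_process_group(backend="nccl")
parallel_state.initialize_model_parallel(tensor_model_parallel_size=1, pipeline_model_parallel_size=1)
model_parallel_cuda_manual_seed(1234)

config = TransformerConfig(num_layers=2, hidden_size=128, num_attention_heads=4)
model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=32000,
    max_sequence_length=2048,
).cuda()

batch, seq = 2, 16
input_ids = torch.randint(0, 32000, (batch, seq), device="cuda")
position_ids = torch.arange(seq, device="cuda").unsqueeze(0).expand(batch, -1)
# Megatron mask convention: True marks positions that must NOT be attended to (future tokens).
attention_mask = torch.triu(
    torch.ones((batch, 1, seq, seq), dtype=torch.bool, device="cuda"), diagonal=1
)
logits = model(input_ids, position_ids, attention_mask)
print("logits:", logits.shape)  # expect (batch, seq, vocab) with post_process=True and no labels
```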
Notes:
- To make the perf tuning more faithful, it's better to develop a benchmark script that includes the alignment losses (CE loss, CE + entropy loss); a sketch follows these notes
- The GAIR NLP team is looking into GPTModel sequence packing support when micro_bsz > 1
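A possible shape for that benchmark script, as a hedged sketch: time a forward/backward pass that includes CE plus an entropy term, sweeping the sequence length. The model id, batch/seq sizes, and entropy coefficient are placeholders; a real version would run through verl's FSDP/mcore workers so the measured configs match training.

```python
# Hypothetical micro-benchmark: tokens/sec for forward+backward including CE and entropy losses.
# Model id, batch/seq sizes, and the entropy coefficient are placeholders.
import time

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
).cuda()
model.gradient_checkpointing_enable()
model.train()


def step(batch_size: int, seqlen: int, entropy_coeff: float = 1e-3) -> float:
    input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seqlen), device="cuda")
    torch.cuda.synchronize()
    t0 = time.time()
    logits = model(input_ids).logits
    logp = F.log_softmax(logits[:, :-1].float(), dim=-1)
    ce = F.nll_loss(logp.flatten(0, 1), input_ids[:, 1:].flatten())  # next-token CE
    entropy = -(logp.exp() * logp).sum(-1).mean()                    # mean per-token entropy
    (ce - entropy_coeff * entropy).backward()                        # CE minus entropy bonus
    torch.cuda.synchronize()
    return batch_size * seqlen / (time.time() - t0)  # tokens/sec for this step


for seqlen in (4096, 8192, 16384):
    model.zero_grad(set_to_none=True)
    print(f"seqlen={seqlen}: {step(1, seqlen):.0f} tok/s")
```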
Data & recipe
Tasks:
Math and scientific sources:
- Further improve the dataset used for math RL, maybe based on DAPO 17k and other open source ones
Code:
- Build a baseline code recipe in the verl main repo, using small models such as Llama or Qwen-7B
- Curate a dataset for code RL training, starting with existing open-source ones (a rule-based reward sketch follows the notes below)
Notes:
- Provide reproducible commands and logs in https://verl.readthedocs.io/en/latest/experiment/ppo.html
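For the code recipe above, a rule-based reward along the following lines is one option, shown here as a hedged sketch: extract the model's code block and run the problem's tests in a subprocess. The function name, its signature, and the assumption that ground_truth holds an assert-based test script are illustrative, not verl's fixed reward-manager interface.

```python
# Hypothetical rule-based reward for code RL; interface and test format are assumptions.
import re
import subprocess
import sys
import tempfile

FENCE = "`" * 3  # triple backtick, spelled out to keep this snippet fence-safe
CODE_RE = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def compute_score(solution_str: str, ground_truth: str, timeout: float = 10.0) -> float:
    """Return 1.0 if the extracted code passes the assert-based tests, else 0.0."""
    match = CODE_RE.search(solution_str)
    if match is None:
        return 0.0
    program = match.group(1) + "\n\n" + ground_truth  # append the test script
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

A production version would need proper sandboxing and resource limits rather than a bare subprocess.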
Inference Engine
Tasks:
- verify multi-node TP inference (see the sketch below)
- support multi-node EP/PP inference
- sharding manager support with mcore v0.11 + latest version of inference engines
Related TODOs: #825
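For the multi-node TP verification, something like the following vLLM-over-Ray sanity check is one possible starting point; the model id and parallel sizes are placeholders, and in verl the same path should ultimately be exercised through the rollout sharding manager rather than a standalone script.

```python
# Hypothetical multi-node tensor-parallel sanity check with vLLM on a Ray cluster.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",       # or a distilled checkpoint for a quick check
    tensor_parallel_size=16,                # e.g. spans two 8-GPU nodes
    distributed_executor_backend="ray",     # requires a Ray cluster covering both nodes
    trust_remote_code=True,
)
out = llm.generate(["1 + 1 = "], SamplingParams(max_tokens=16, temperature=0.0))
print(out[0].outputs[0].text)
```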