Replies: 4 comments
-
Currently verl adopts DeepSpeed Ulysses for long-context training. Ulysses should be natively compatible with Ascend NPU since it relies on all-to-all communication.
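To make the all-to-all point concrete, here is a single-process sketch (plain Python, not verl or DeepSpeed code; `P`, `SEQ`, `HEADS`, and the helper name are illustrative) of the layout swap that the Ulysses all-to-all performs: before attention each rank holds a sequence shard for all heads, and after the collective each rank holds the full sequence for a head slice.

```python
# Single-process sketch of the layout swap DeepSpeed Ulysses performs
# via all-to-all (illustrative only; not verl or DeepSpeed code).
P = 2            # sequence-parallel world size
SEQ, HEADS = 4, 4

# Before attention: rank r holds sequence shard r for ALL attention heads.
# Each entry is a (token_position, head) pair standing in for a tensor slice.
before = [[[(r * (SEQ // P) + t, h) for h in range(HEADS)]
           for t in range(SEQ // P)]
          for r in range(P)]

def ulysses_all_to_all(shards):
    """Exchange head chunks between ranks: afterwards rank r holds the
    FULL sequence but only heads [r*HEADS//P, (r+1)*HEADS//P)."""
    chunk = HEADS // P
    out = []
    for r in range(P):                      # receiving rank
        rows = []
        for s in range(P):                  # sending rank (sequence shard s)
            for t in range(SEQ // P):
                rows.append(shards[s][t][r * chunk:(r + 1) * chunk])
        out.append(rows)
    return out

after = ulysses_all_to_all(before)
print(after[0][0])   # rank 0, seq position 0, heads 0-1 -> [(0, 0), (0, 1)]
```

Because the collective only moves tensor chunks between ranks, it does not depend on any kernel specific to one accelerator, which is why Ulysses is expected to port cleanly to hardware with a working all-to-all.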
-
Hi hiyouga, thanks for your reminder.😊 DeepSpeed Ulysses support in verl is included in the GRPO algorithm for Qwen-series models. Next we will verify it and publish the results, so stay tuned.
-
Q3 roadmap: #2171
-
The environment has been configured successfully, but the run got stuck at `WARNING:2025-07-11 11:28:53,278:Waiting for register center actor aTAAB0_register_center to be ready. Elapsed time: 0 seconds out of 300 seconds.` May I ask what the possible reason might be?
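A common cause of this kind of timeout is Ray networking: the worker nodes cannot reach the head node's address/port, so actors never register. As a first check, the reachability test below uses only the Python standard library; the host and port are placeholders you must replace with your cluster's actual Ray head address (this is a generic diagnostic, not a verl-specific fix).

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace with the address/port you passed to `ray start --head` —
# the values below are placeholders for illustration.
print(port_reachable("127.0.0.1", 6379))
```

If this returns `False` from a worker node, check firewalls and that the head node is bound to an interface the workers can route to, before digging into verl itself.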
-
Contents
Native support for verl on Ascend NPU has attracted the attention of some developers. This roadmap tracks the progress of native support; everyone is welcome to join the discussion.
Quick Start
document: ascend_quick_start.rst
Plan
- Dependencies (Q1, done)
  - transformers
  - ray
  - FSDP worker
  - vLLM-ascend v0.7.3
  - (Some features have been temporarily circumvented and are marked in the Q2 plan)
- Q2 Plan
  - `--use_remove_padding`
  - megatron/mindspeed worker (for NPU, megatron ≈ mindspeed)
  - Release accuracy comparison results: modify the default config as little as possible to keep the accuracy.
  - Ease of use: `flash-attn` is not supported on Ascend NPU, so we need to replace it with `torch_npu.npu_fusion_attention`. [Temporary solution] NPU support SDPA: huggingface/transformers#35165
- Long-term Planning
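For context on the `--use_remove_padding` item, here is a minimal sketch of the idea in plain Python (not verl's actual implementation): pad tokens are dropped and cumulative sequence lengths (`cu_seqlens`) are recorded, which is the flattened layout that variable-length attention kernels consume.

```python
# Conceptual sketch of remove-padding / sequence packing:
# drop pad tokens and record cumulative sequence lengths (cu_seqlens).
# Pure-Python illustration, not verl code.
def remove_padding(batch, attention_mask):
    """batch: list of token-id lists; attention_mask: parallel 0/1 lists.
    Returns (flat_tokens, cu_seqlens)."""
    flat, cu_seqlens = [], [0]
    for tokens, mask in zip(batch, attention_mask):
        kept = [t for t, m in zip(tokens, mask) if m]
        flat.extend(kept)
        cu_seqlens.append(cu_seqlens[-1] + len(kept))
    return flat, cu_seqlens

batch = [[5, 7, 0, 0], [3, 4, 6, 0]]   # two sequences padded to length 4
mask  = [[1, 1, 0, 0], [1, 1, 1, 0]]
flat, cu = remove_padding(batch, mask)
print(flat, cu)   # [5, 7, 3, 4, 6] [0, 2, 5]
```

Skipping the pad positions entirely is what makes this a throughput win for batches with uneven sequence lengths, which is why the flag matters for the Q2 work above.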