- Can we support the Pangu model in the future?
1 reply
- Hi, in addition to the features mentioned above, we are looking forward to seeing the following supported in verl on Ascend devices:
  - Inference backend
  - Training backend & specific features
  - Models & algorithms
  - Others
4 replies
- I used vllm-ascend to run GRPO with multi-turn function calling, but I found that the tool is never called. Is NPU + multi-turn function calling not supported? Thanks!

  ```bash
  # run on 4xH100
  # make sure your current working directory is the root of the project
  set -x
  export HYDRA_FULL_ERROR=1
  export VLLM_USE_V1=1
  ulimit -n 65535

  PROJECT_DIR="$(pwd)"
  CONFIG_PATH="$PROJECT_DIR/examples/sglang_multiturn/config"

  python3 -u -m verl.trainer.main_ppo \
      --config-path="$CONFIG_PATH" \
      --config-name='gsm8k_multiturn_grpo' \
      algorithm.adv_estimator=grpo \
      data.train_batch_size=128 \
      data.max_prompt_length=1024 \
      data.max_response_length=1024 \
      data.filter_overlong_prompts=True \
      data.truncation='error' \
      data.return_raw_chat=True \
      actor_rollout_ref.model.path=/vllm-workspace/Qwen2.5-3B-Instruct \
      actor_rollout_ref.model.use_remove_padding=True \
      actor_rollout_ref.actor.use_torch_compile=False \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.actor.ppo_mini_batch_size=64 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
      actor_rollout_ref.actor.use_kl_loss=True \
      actor_rollout_ref.actor.kl_loss_coef=0.001 \
      actor_rollout_ref.actor.kl_loss_type=low_var_kl \
      actor_rollout_ref.actor.entropy_coeff=0 \
      actor_rollout_ref.model.enable_gradient_checkpointing=True \
      actor_rollout_ref.actor.fsdp_config.param_offload=False \
      actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.mode=async \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
      actor_rollout_ref.rollout.n=8 \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
      actor_rollout_ref.ref.fsdp_config.param_offload=True \
      algorithm.use_kl_in_reward=False \
      trainer.val_before_train=True \
      trainer.critic_warmup=0 \
      trainer.logger='["console", "swanlab"]' \
      trainer.project_name='gsm8k_async_rl_0904' \
      trainer.experiment_name='qwen2.5-3b_function_rm-gsm8k-async-vllm-ascend-multi-w-tool-verify-n16-4cards_0904' \
      trainer.n_gpus_per_node=4 \
      trainer.nnodes=1 \
      trainer.save_freq=10 \
      trainer.test_freq=10 \
      trainer.total_training_steps=100 \
      actor_rollout_ref.actor.ppo_max_token_len_per_gpu=8192 \
      actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=8192 \
      actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=8192 \
      critic.ppo_max_token_len_per_gpu=8192 \
      critic.forward_max_token_len_per_gpu=8192 \
      data.train_files=$HOME/data/gsm8k/train.parquet \
      data.val_files=$HOME/data/gsm8k/test.parquet \
      actor_rollout_ref.rollout.multi_turn.enable=True \
      actor_rollout_ref.rollout.multi_turn.format=hermes \
      actor_rollout_ref.rollout.multi_turn.tool_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml" \
      actor_rollout_ref.rollout.multi_turn.max_user_turns=1 \
      trainer.device=npu "$@"
  ```
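When debugging "the tool is never called", a first step is to check whether the model's rollouts contain any tool-call markup at all. In the hermes format, a call is typically wrapped in `<tool_call>...</tool_call>` tags containing a JSON object with `name` and `arguments` fields. The helper below is a minimal sketch, not part of verl; the tool name `calc_gsm8k_reward` in the example is only illustrative.

```python
import json
import re

# Matches hermes-style tool calls: a JSON object wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return the parsed tool-call dicts found in a generated response."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # the model emitted malformed JSON for this call
    return calls

# Example: scan one rollout string (tool name here is hypothetical).
response = (
    "Let me compute that.\n"
    "<tool_call>\n"
    '{"name": "calc_gsm8k_reward", "arguments": {"answer": "72"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(response))
# -> [{'name': 'calc_gsm8k_reward', 'arguments': {'answer': '72'}}]
```

If this extractor finds nothing in the saved rollouts, the model is not attempting tool use at all (a prompting/template issue); if it finds calls that are never executed, the problem is more likely in the rollout backend's tool dispatch on NPU.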
1 reply
- Unfinished tasks from Q2:
  - megatron/mindspeed worker (for NPU, megatron ≈ mindspeed); Q2 roadmap: #900

  New features in Q3:
  - FSDP2 worker