
✨ Agentic Reinforced Policy Optimization


If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

📣 Latest News

  • [Aug 11, 2025]: A brief introduction to ARPO is available on platforms such as X, WeChat, Zhihu, YouTube, and Xiaohongshu.
  • [July 29, 2025]: 🔥 We are honored to be featured as 🤗 HuggingFace Daily Paper #1 and Weekly Paper #1.
  • [July 29, 2025]: 📄 Our paper is now available on arXiv and Hugging Face daily paper.
  • [July 25, 2025]: 🔥 We released all ARPO model checkpoints (3B–14B) and datasets (SFT, RL, evaluation). Check out the 🤗ARPO Collection here. We will keep updating it!
  • [July 25, 2025]: 🚀 Full codebase released. ARPO supports multi-tool agentic RL training for Qwen2.5, Qwen3, and Llama3 models, with extensive tool-call acceleration and memory optimization during RL training.


💡 Overview

We propose Agentic Reinforced Policy Optimization (ARPO), an agentic RL algorithm tailored for training multi-turn LLM-based agents. The core principle of ARPO is to encourage the policy model to adaptively branch its sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors (an illustrative sketch of this branching rule is given after the figure notes below).

[Figure: token-entropy analysis after tool-call feedback (left) and overall results across 13 benchmarks (right)]
  • In the figure (left), the initial tokens generated by the LLM after receiving each round of tool-call feedback consistently exhibit high entropy. This indicates that external tool calls introduce significant uncertainty into the LLM's reasoning process.

  • In the figure (right), we validate ARPO's performance across 13 datasets. Notably, Qwen3-14B trained with ARPO excels at Pass@5, achieving 61.2% on GAIA and 24.0% on HLE, while requiring only about half the tool calls of GRPO during training.
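To make the branching idea concrete, here is a minimal, illustrative Python sketch of how a rollout could decide how strongly to branch after a tool call. The combination rule and the example numbers are assumptions for illustration only; the actual logic lives in verl_arpo_entropy/verl/workers/rollout and is controlled by the BRANCH_PROBABILITY and Entropy_weight settings in the training script further below.

# Illustrative sketch only, NOT the verl implementation.
import math

def token_entropy(probs):
    """Shannon entropy of a single next-token distribution."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def branch_probability_for_step(step_probs, base_entropy,
                                branch_probability=0.5, entropy_weight=0.2):
    """Combine a base branching probability with the normalized entropy increase
    observed in the first tokens generated after a tool call. The defaults mirror
    BRANCH_PROBABILITY and Entropy_weight in the training script; the exact
    combination rule here is an assumption."""
    step_h = sum(token_entropy(p) for p in step_probs) / max(len(step_probs), 1)
    delta = max(step_h - base_entropy, 0.0) / max(base_entropy, 1e-6)
    return min(1.0, branch_probability + entropy_weight * delta)

# Example: entropy jumps right after tool feedback, so the branch probability saturates.
high_entropy = [0.25, 0.25, 0.25, 0.25]
low_entropy = [0.97, 0.01, 0.01, 0.01]
print(branch_probability_for_step([high_entropy] * 8, token_entropy(low_entropy)))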

🏃 Quick Start

Reproducing ARPO involves three steps: cold-start fine-tuning (optional), ARPO training, and evaluation. Each step is explained in detail below.

❄️ Cold-Start SFT Stage (Optional)

This stage helps you reproduce our experimental results. If you want to run RL from scratch, you can skip it.

1. Environment Setup

In this step, we describe how to run the cold-start SFT stage using the LLaMA-Factory repository. First, set up the environment as follows:

# Clone the ARPO repository (which includes LLaMA-Factory)
git clone https://github.com/dongguanting/ARPO
cd ARPO/LLaMA-Factory

# Create a new conda environment
conda create -n sft python=3.10
conda activate sft

# Install dependencies
pip install -r requirements.txt

2. Fine-Tuning Model

  1. Download the SFT dataset from 🤗ARPO-SFT-54K, save it as LLaMA-Factory-main/data/final_sft_edition9.json, and register the dataset in dataset_info.json (a registration sketch is given at the end of this subsection).

  2. Configure Training

Update the training YAML in LLaMA-Factory/arpo_train_sft/ with the following content:

### model
model_name_or_path: <your_model_path>
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: ../examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset_dir: dataset_info
dataset: <your_dataset>
template: qwen
cutoff_len: 15000
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: <your_output_dir>
logging_steps: 10
save_steps: 2000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 7.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

Also, update the output directory in arpo_train_sft/sft_train.sh:

# Output directory
OUTPUT_DIR="<your_output_dir>"

After filling in this information, you can fine-tune the model with the following command:

bash arpo_train_sft/sft_train.sh
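As referenced in step 1 above, the dataset also needs an entry in data/dataset_info.json. Below is a hypothetical sketch of that registration step; the entry name arpo_sft_54k and the sharegpt-style column mapping are assumptions, so adjust them to the actual fields inside final_sft_edition9.json.

# Hypothetical sketch: register the SFT file in LLaMA-Factory's data/dataset_info.json.
import json
from pathlib import Path

info_path = Path("data/dataset_info.json")
info = json.loads(info_path.read_text())
info["arpo_sft_54k"] = {                       # entry name is an assumption
    "file_name": "final_sft_edition9.json",
    "formatting": "sharegpt",                  # assumed conversation-style format
    "columns": {"messages": "conversations"},  # adjust to the real field names
}
info_path.write_text(json.dumps(info, indent=2, ensure_ascii=False))
print("Registered datasets:", list(info))

Whatever name you register here is what goes into the dataset: field of the training YAML above.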

🔥 ARPO Stage

In this stage, we load the cold-start model for ARPO training. Our RL training builds on the ReCall and verl frameworks.

1. Environment Setup

You can set up the additional RL environment as follows:

# create env
conda create -n arpo python==3.10
conda activate arpo

# install torch & flash-attn
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

# install RL basic env
cd arpo

# This is our frozen RL environment file; install from it as a supplement or use it to check your setup.
pip install -r requirements.txt

2. Preparation

2.1 Data Preparation

In our paper, we provide two types of training & validation datasets to verify the effectiveness of ARPO (a quick inspection sketch follows the list):

  1. Reasoning and Knowledge Dataset: This dataset is used to test the benchmarks listed in Table 1.

    • train_10k.parquet: Contains 10K samples for mathematical and knowledge reasoning.
    • test.parquet: Comprises 300 test samples from 8 datasets: AIME24, AIME25, MATH500, GSM8K, HotpotQA, 2Wiki, Musique, and Bamboogle.
  2. Deep Search Dataset: This dataset is used to test the benchmarks listed in Table 2.

    • hard_search.parquet: Contains 1K samples, including 800 samples from simpledeepsearch and 200 samples from webdancer.
    • gaia_test.parquet / hle_test.parquet: Contain test samples from GAIA and Humanity's Last Exam (HLE).
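Before launching training, you may want to sanity-check a data file. Below is a minimal pandas sketch; the file path is an assumption, and the exact columns depend on the released parquet, but the prompt field should match PROMPT_KEY ("prompt") in the training script.

# Optional sanity check of an RL data file (path and columns may differ on your machine).
import pandas as pd

df = pd.read_parquet("rl_datasets/train_10k.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.iloc[0].to_dict())  # inspect one full sample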

2.2 API Key Configuration

Our search tool uses Bright Data (a third-party Bing search API without the retirement risk of the official Bing API). Before starting training, please replace the API key and zone in the following files: ARPO/scripts/config/ppo_trainer_dr.yaml and ARPO/scripts/config/ppo_trainer.yaml.

Additionally, replace the API key and zone in /verl_arpo_entropy/verl/workers/rollout/tools/config_example.yaml, as shown below:

Tool API configuration YAML (config_example.yaml):
tools:
  # General tool configuration
  call_limit: 3  # Maximum number of tool calls allowed per sample
  max_workers: 64  # Maximum number of threads for concurrent tool execution
  timeout: 120  # Tool execution timeout (seconds)
  retry_count: 3  # Number of retry attempts for tool execution failures
  verbose_logging: true  # Enable detailed logging
  fail_on_error: false  # Throw an exception if tool loading fails
  
  # Tool instance definitions
  tool_instances:
    python:  
      class_path: verl.workers.rollout.tools.python_tool.PythonTool  # Tool class path
      params:  # Tool-specific parameters
        conda_path: /path/to/conda
        conda_env: verl
    
    search:
      class_path: verl.workers.rollout.tools.search_tool.BingSearchTool
      params:
        api_key: <your_API_key>  # Replace with your Bright Data API key
        zone: <your_zone>  # Replace with your Bright Data zone
        max_results: 10
        result_length: 1000
        location: cn

Make sure to replace <your_API_key> and <your_zone> with your actual Bright Data API key and zone so that the search tool can perform searches during training.
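As a quick optional check, the snippet below parses the tool config and verifies that the placeholders were replaced. The path follows the file mentioned above; adjust it to wherever your config actually lives.

# Optional: confirm the tool config parses and the placeholders have been replaced.
import yaml

with open("verl_arpo_entropy/verl/workers/rollout/tools/config_example.yaml") as f:
    cfg = yaml.safe_load(f)

search_params = cfg["tools"]["tool_instances"]["search"]["params"]
assert "<your" not in str(search_params.get("api_key", "")), "Bright Data API key still unset"
print("Configured tools:", list(cfg["tools"]["tool_instances"]))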


3. ARPO RL Training

We have open-sourced a series of ARPO scripts in the /ARPO/scripts/ directory, including configurations for 7B, 8B, and 14B models. Below is an example of how to set up and run ARPO training. Make sure to replace placeholders like <your_path_to_ARPO>, <your_model_path>, and <your_checkpoint_save_dir> with your actual paths.

Training script (example):
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
PARENT_DIR="$(dirname "$SCRIPT_DIR")"
cd "$PARENT_DIR"
echo "Switched to parent directory: $PARENT_DIR"


# ============================ Environment Setup ============================
# Set basic environment variables
export PYTHONUNBUFFERED=1            
export HYDRA_FULL_ERROR=1           
export VLLM_ATTENTION_BACKEND=XFORMERS 
export VERL_LOGGING_LEVEL=DEBUG
export MKL_SERVICE_FORCE_INTEL=1    
export MKL_THREADING_LAYER=GNU       
export RAY_memory_usage_threshold=0.8  
export RAY_memory_monitor_refresh_ms=0 


# Set Python path
export PYTHONPATH="<your_path_to_ARPO>"/verl_arpo_entropy:$PYTHONPATH

# ============================ Basic Configuration ============================
# Experiment name and project
PROJECT_NAME="reasoning_tasks" # Modify experiment group
EXPERIMENT_NAME="ARPO_global_16_init_8_beam_2_random_0_arpo_0.2_entropy" # Modify experiment name

# Configuration file path
CONFIG_PATH="<your_path_to_ARPO>/scripts/config" # Modify the absolute path of the config folder, relative path is not recommended
CONFIG_NAME="ppo_trainer.yaml"

# Distributed training settings
NNODES=1                            
N_GPUS_PER_NODE=8                   

# ============================ Data Configuration ============================
# Data parameters
PROMPT_KEY="prompt"                 # Prompt field name
TRAIN_BATCH_SIZE=128                # Training batch size
PPO_MINI_BATCH_SIZE=16              # PPO mini-batch size
MAX_PROMPT_LENGTH=1536              # Maximum prompt length
MAX_RESPONSE_LENGTH=4096            # Maximum response length

# Data file paths
TRAIN_FILES="<your_path_to_ARPO>/rl_datasets/train.parquet" # Modify training data path
VALID_FILES="<your_path_to_ARPO>/rl_datasets/valid.parquet" # Modify validation data path

# ============================ Model Configuration ============================
# Actor model path
ACTOR_MODEL_PATH="<your_model_path>" # Modify training model path

# ============================ Rollout Configuration ==========================
# Rollout settings
ROLLOUT_NAME="vllm"                 # Use vllm engine
ROLLOUT_MODE="sync_with_tool"       # Synchronous mode with tool support
ROLLOUT_N=16                         # Number of responses generated per sample
INITIAL_ROLLOUTS=8                 # Initial rollout number
BEAM_SIZE=2                        # Beam size
BRANCH_PROBABILITY=0.5             # Branch probability
Entropy_weight=0.2                 # Entropy weight for adaptive branching
ENABLE_MULTI_TURN=True             # Multi-turn rollout flag referenced below (undefined in the original excerpt; True is an assumed default)
# ============================ Rollout Tools Configuration ==========================
SEARCH_CACHE_PATH="<your_path_to_ARPO>/search_cache/search_cache.json" # Modify

# ============================ Reward Model Configuration ==========================
# Reward model settings
REWARD_MANAGER="naive"              # Reward manager type
CUSTOM_REWARD_FUNCTION_PATH="<your_path_to_ARPO>/verl_arpo_entropy/verl/utils/reward_score/deep_research.py" # Modify reward function path
CUSTOM_REWARD_FUNCTION_NAME="compute_score"

# ============================ Training Configuration ============================
# Training parameters
TOTAL_EPOCHS=2                      # Total training epochs
SAVE_FREQ=5                        # Save frequency
TEST_FREQ=5                        # Test frequency

# ============================ Path Configuration ============================
# Save path
SAVE_PATH="<your_checkpoint_save_dir>/${EXPERIMENT_NAME}" # Modify save path
ROLLOUT_SAVE_PATH="${SAVE_PATH}/rollout"

# ============================ WandB Configuration ============================
# WandB settings
WANDB_API_KEY="<your_wandb_key>" # Modify your wandb key
SEARCH_CLASS_PATH="verl.workers.agent.tools.search_tool.BingSearchTool"
# ============================ Preparation ============================
# Login to WandB (if API key is provided)
if [ "$WANDB_API_KEY" != "" ]; then
    wandb login --relogin $WANDB_API_KEY
    export WANDB_DIR=${SAVE_PATH}
fi

# Create save directory
if [ ! -d "$SAVE_PATH" ]; then
    mkdir -p $SAVE_PATH
fi

# Create rollout save directory
if [ ! -d "$ROLLOUT_SAVE_PATH" ]; then
    mkdir -p $ROLLOUT_SAVE_PATH
fi

# ============================ Start Training ============================
python3 -m verl.trainer.main_ppo \
    --config-path=$CONFIG_PATH \
    --config-name=$CONFIG_NAME \
    algorithm.adv_estimator=grpo \
    algorithm.kl_ctrl.kl_coef=0.0 \
    data.train_files=${TRAIN_FILES} \
    data.val_files=${VALID_FILES} \
    data.prompt_key=${PROMPT_KEY} \
    data.train_batch_size=${TRAIN_BATCH_SIZE} \
    data.max_prompt_length=${MAX_PROMPT_LENGTH} \
    data.max_response_length=${MAX_RESPONSE_LENGTH} \
    actor_rollout_ref.model.path=${ACTOR_MODEL_PATH} \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$((2*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=${ROLLOUT_NAME} \
    actor_rollout_ref.rollout.mode=${ROLLOUT_MODE} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.n=${ROLLOUT_N} \
    actor_rollout_ref.rollout.initial_rollouts=${INITIAL_ROLLOUTS} \
    actor_rollout_ref.rollout.beam_size=${BEAM_SIZE} \
    actor_rollout_ref.rollout.branch_probability=${BRANCH_PROBABILITY} \
    actor_rollout_ref.rollout.entropy_weight=${Entropy_weight} \
    actor_rollout_ref.rollout.tools.tool_instances.search.params.cache_file=${SEARCH_CACHE_PATH} \
    actor_rollout_ref.rollout.tools.tool_instances.search.class_path=${SEARCH_CLASS_PATH} \
    actor_rollout_ref.rollout.multi_turn.enable=${ENABLE_MULTI_TURN} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    reward_model.reward_manager=${REWARD_MANAGER} \
    custom_reward_function.path=${CUSTOM_REWARD_FUNCTION_PATH} \
    custom_reward_function.name=${CUSTOM_REWARD_FUNCTION_NAME} \
    trainer.critic_warmup=0 \
    trainer.logger="[console, wandb]" \
    trainer.project_name=${PROJECT_NAME} \
    trainer.experiment_name=${EXPERIMENT_NAME} \
    trainer.n_gpus_per_node=${N_GPUS_PER_NODE} \
    trainer.nnodes=${NNODES} \
    trainer.save_freq=${SAVE_FREQ} \
    trainer.test_freq=${TEST_FREQ} \
    trainer.total_epochs=${TOTAL_EPOCHS} \
    trainer.default_local_dir=${SAVE_PATH} \
    trainer.val_before_train=False \
    trainer.rollout_data_dir=${ROLLOUT_SAVE_PATH} \
    hydra.run.dir=${SAVE_PATH}/outputs 2>&1 | tee ${SAVE_PATH}/run.log

You can then run the following script to start training:

cd ./ARPO/scripts/
bash ARPO_7B_Reasoning_1node.sh

For a trained RL checkpoint, you can convert the weights to Hugging Face format as follows:

bash ./ARPO/merge_ckpt/convert_checkpoint_from_verl_to_hf_qwen3.sh
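As an optional sanity check, you can confirm that the converted checkpoint loads with transformers. The path is a placeholder; the dtype and device settings below are just reasonable defaults.

# Optional: verify the converted Hugging Face checkpoint loads correctly.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "<your_converted_hf_checkpoint>"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto", device_map="auto")
print(model.config.architectures, model.num_parameters())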

✅ ARPO Evaluation

If you have already trained a model, you can follow the process below to evaluate its tool-integrated reasoning (TIR) capability. Alternatively, you can download our checkpoints from the 🤗ARPO-Huggingface-Collection for direct testing. This guide walks you through setting up two separate environments:

  • One for the vLLM inference service (vllm_env)
  • One for the evaluation pipeline (evaluation)

1. Setup vLLM Inference Environment

# Step into the vllm_scripts directory
cd evaluation/vllm_scripts

# Create a dedicated conda environment for vLLM
conda create -n vllm_env python=3.10
conda activate vllm_env

# Install dependencies (edit as needed)
pip install -r requirements.txt

Edit the following launch scripts with your own model paths and names:

In vllm_launch_reasoning_model_cuda4-7.sh:

MODEL_PATH="<path/to/your/reasoning_model_checkpoint>"
MODEL_NAME="your_model_name"

For summarization models (choose one):

MODEL_PATH="<path/to/your/summarization_model_checkpoint>"
MODEL_NAME="your_summarization_model_name"

Launch the vLLM services:

# Start the reasoning model
bash vllm_launch_reasoning_model_cuda4-7.sh

# Start the summarization model (choose one)
bash vllm_launch_summarize_model_cuda0-3_<your_model>.sh

2. Setup Evaluation Environment

# Create a separate environment for evaluation
conda create -n evaluation python=3.10
conda activate evaluation

# Install required packages
cd evaluation
pip install -r requirements.txt

3. Configure and Run Evaluation

Edit the infer_local_sds.sh script with the following values:

# Activate your Conda environment manually if 'conda' is not available in the shell
source <path/to/your/conda>/bin/activate
conda activate <your_env_name>

# Datasets to evaluate — uncomment the ones you want to include:
# Options: aime24, aime25, math500, gsm8k, math, webwalker, hotpotqa, 2wiki, bamboogle, musique, hle, gaia, SimpleQA, xbench
data_names=(
    "hle"
    "gaia"
)

# Required parameters to update:
EXP_NAME="<your_exp_name>"                   # Name of this experiment run
MODEL_PATH="<your_model_path>"               # Path to the reasoning model
OUTPUT_PATH="<your_output_path>"             # Directory to save outputs
CONDA_PATH="<your_conda_path>"               # Path to your Conda installation
CONDA_ENV="<your_env_name>"                  # Name of your Conda environment
BING_API_KEY="<your_bing_search_api_key>"    # Bing Search API key
BING_ZONE="<your_bing_zone>"                 # Bing API zone
SUMM_MODEL_PATH="<your_summarization_model_path>"  # Path to summarization model checkpoints

For Bing API usage, please refer to Bright Data.

Run the evaluation:

bash evaluation/infer_local_sds.sh

🔸 For Chinese datasets like xbench, use infer_local_sds_cn.sh instead.

4. Calculate Metrics

After generating inference results, you can use a large model such as Qwen2.5-72B-Instruct, with its stronger understanding capabilities, as a judge to evaluate them.

First, use the vLLM environment to start the evaluation model:

bash evaluation/deploy_qwen2.5_72B_instruct.sh

In that script, make sure to update the vllm serve command with your own model path:

# Activate your Conda environment manually if 'conda' is not available in the shell
source <path/to/your/conda>/bin/activate
conda activate <your_env_name>

vllm serve <your_model_path> \
  --served-model-name Qwen2.5-72B-Instruct \
  --max-model-len 32768 \
  --tensor_parallel_size 4 \
  --gpu-memory-utilization 0.75 \
  --quantization gptq \
  --port 8001
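Once the judge model is up, you can quickly check that the endpoint responds. vLLM exposes an OpenAI-compatible API, so the official openai client works; the served model name and port follow the command above, and the prompt is purely illustrative.

# Quick connectivity check for the judge model served by vLLM.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)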

Before running the evaluation script, update the following line in evaluate_passk.sh to specify the output directory:

OUTPUT_DIR="<your_result_directory>"

Then, run the evaluation script to calculate metrics:

bash evaluation/evaluate_passk.sh
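For reference, the standard unbiased Pass@k estimator (Chen et al., 2021) is sketched below; whether evaluate_passk.sh uses exactly this estimator is an assumption.

# Unbiased pass@k: 1 - C(n - c, k) / C(n, k), where n is the number of samples
# per question and c the number of correct ones.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(16, 4, 5), 4))  # e.g., 16 rollouts per question, 4 correct, Pass@5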

📄 Citation

If you find this work helpful, please cite our paper:

@article{dong2025arpo,
  title={Agentic Reinforced Policy Optimization},
  author={Dong, Guanting and Mao, Hangyu and Ma, Kai and Bao, Licheng and Chen, Yifei and Wang, Zhongyuan and Chen, Zhongxia and Du, Jiazhen and Wang, Huiyang and Zhang, Fuzheng and others},
  journal={arXiv preprint arXiv:2507.19849},
  year={2025}
}
@article{dong2025tool,
  title={Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning},
  author={Dong, Guanting and Chen, Yifei and Li, Xiaoxi and Jin, Jiajie and Qian, Hongjin and Zhu, Yutao and Mao, Hangyu and Zhou, Guorui and Dou, Zhicheng and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:2505.16410},
  year={2025}
}

🤝 Acknowledgements

This training implementation builds upon Tool-Star, Llama Factory, verl and ReCall. For evaluation, we rely on WebThinker, HIRA, WebSailor, Search-o1, and FlashRAG. The Python interpreter design references ToRA and ToRL, while our models are trained using Qwen2.5. We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.

📄 License

This project is released under the MIT License.

📞 Contact

For any questions or feedback, please reach out to us at dongguanting@ruc.edu.cn.

Star History

Star History Chart
