Readme (Tutorial)

1. Preparation: Training Hyperparameters

It is recommended to prepare the hyperparameters for training in advance.

Below is an example:

set -e -x

export MODEL_PATH=/your/path/to/huggingface.co/Qwen/Qwen3-4B
export REWARD_MODEL_PATH=/your/path/to/huggingface.co/Qwen/QwQ-32B
export RESULT_DIR=/your/path/to/results/rl_factory/your_result_dir

python3 -m verl.trainer.main_ppo\
    algorithm.adv_estimator=grpo\
    data.train_files=data/nq_search/train.parquet\
    data.val_files=data/nq_search/test.parquet\
    data.train_batch_size=128\
    data.max_prompt_length=4096\
    data.max_response_length=512\
    actor_rollout_ref.model.path=$MODEL_PATH\
    actor_rollout_ref.model.use_remove_padding=True\
    actor_rollout_ref.model.enable_gradient_checkpointing=True\
    actor_rollout_ref.actor.optim.lr=1e-6\
    actor_rollout_ref.actor.ppo_mini_batch_size=32\
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16\
    actor_rollout_ref.actor.use_kl_loss=True\
    actor_rollout_ref.actor.kl_loss_coef=0.001\
    actor_rollout_ref.actor.kl_loss_type=low_var_kl\
    actor_rollout_ref.actor.fsdp_config.param_offload=True\
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True\
    actor_rollout_ref.actor.state_masking=True\
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16\
    actor_rollout_ref.rollout.tensor_model_parallel_size=1\
    actor_rollout_ref.rollout.name=vllm\
    actor_rollout_ref.rollout.gpu_memory_utilization=0.75\
    actor_rollout_ref.rollout.n=4\
    actor_rollout_ref.rollout.max_turns=2\
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16\
    actor_rollout_ref.ref.fsdp_config.param_offload=False\
    actor_rollout_ref.rollout.enforce_eager=False\
    actor_rollout_ref.rollout.free_cache_engine=False\
    actor_rollout_ref.env.name=search\
    actor_rollout_ref.env.mcp_mode=stdio\
    actor_rollout_ref.env.tool_manager=qwen3\
    actor_rollout_ref.env.enable_thinking=False\
    actor_rollout_ref.env.config_path=envs/configs/mcp_tools.pydata\
    actor_rollout_ref.env.use_process_reward=False\
    reward_rollout.if_use_reward_rollout=False\
    reward_rollout.rollout.tensor_model_parallel_size=4\
    reward_rollout.rollout.gpu_memory_utilization=0.65\
    reward_rollout.rollout.model_name=$REWARD_MODEL_PATH\
    reward_rollout.rollout.free_cache_engine=False\
    reward_rollout.rollout.response_length=2048\
    reward_model.reward_manager=parallel\
    algorithm.kl_ctrl.kl_coef=0.001\
    trainer.critic_warmup=0\
    trainer.logger=['tensorboard']\
    trainer.project_name='GRPO_search'\
    trainer.experiment_name='search_with_thinking'\
    trainer.n_gpus_per_node=8\
    trainer.nnodes=1\
    trainer.val_before_train=False\
    trainer.default_local_dir=$RESULT_DIR\
    trainer.default_hdfs_dir=null\
    trainer.save_freq=20\
    trainer.test_freq=10\
    trainer.total_epochs=5 $@ 2>&1 | tee grpo.log

2. Adjusting Evaluation Parameters

Adjust the parameters in main_eval.sh according to the hyperparameters used during training.

Below is an example:

set -e -x
FILE="$(pwd)/verl/utils/reward_score/search.py"
FUNCTION_NAME="compute_score"

export MODEL_PATH='your/path/to/Qwen/Qwen3-8B'
export REWARD_MODEL_PATH=/your/path/to/huggingface.co/Qwen/QwQ-32B
export TEST_DATA='your/path/to/data/hotpot/test.parquet'
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_evaluate\
    data.val_files=$TEST_DATA\
    data.val_batch_size=2048\
    data.max_prompt_length=4096\
    data.max_response_length=512\
    actor_rollout_ref.model.path=$MODEL_PATH\
    actor_rollout_ref.model.use_remove_padding=True\
    actor_rollout_ref.model.enable_gradient_checkpointing=True\
    actor_rollout_ref.actor.ppo_mini_batch_size=256\
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=32\
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32\
    actor_rollout_ref.rollout.tensor_model_parallel_size=1\
    actor_rollout_ref.rollout.name=vllm\
    actor_rollout_ref.rollout.gpu_memory_utilization=0.9\
    actor_rollout_ref.rollout.max_turns=2\
    actor_rollout_ref.rollout.val_kwargs.temperature=0\
    actor_rollout_ref.rollout.val_kwargs.top_k=-1\
    actor_rollout_ref.rollout.val_kwargs.top_p=1\
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32\
    actor_rollout_ref.env.name=search\
    actor_rollout_ref.env.mcp_mode=stdio\
    actor_rollout_ref.env.tool_manager=null\
    actor_rollout_ref.env.enable_thinking=False\
    actor_rollout_ref.env.config_path=envs/configs/mcp_tools.pydata\
    reward_rollout.if_use_reward_rollout=False\
    reward_rollout.rollout.tensor_model_parallel_size=4\
    reward_rollout.rollout.gpu_memory_utilization=0.75\
    reward_rollout.rollout.model_name=$REWARD_MODEL_PATH\
    reward_rollout.rollout.free_cache_engine=False\
    reward_rollout.rollout.response_length=2048\
    reward_model.reward_manager=parallel\
    trainer.logger=['tensorboard']\
    trainer.project_name='GRPO_search'\
    trainer.experiment_name='search_with_thinking'\
    trainer.n_gpus_per_node=8\
    trainer.nnodes=1\
    trainer.val_only=True\
    trainer.default_local_dir=ckpt\
    trainer.default_hdfs_dir=null $@ 2>&1 | tee grpo.log

3. Evaluation Results

The evaluation metrics are logged to TensorBoard. In addition, the inputs and outputs for the test data are saved in both CSV and JSON format in the directory specified by trainer.default_local_dir.

Example output:

{
  "input": "system # Tools You may call one or more functions to assist with the user query. ...",
  "output": "<tool_call> {\"name\": \"search-query_rag\", \"arguments\": {\"query\": \"when is the next step season 3 coming out\", \"topk\": 3}} </tool_call>user <tool_response> ... </tool_response> assistant <think> </think> <answer> March 16, 2015 </answer>",
  "score": 1,
  "data_source": "nq",
  "batch_index": 2,
  "sample_index": 748
}
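
The fields of these records can be described with a small typed schema. Below is a minimal sketch; the class name and field types are assumptions inferred from the example above, and the exact field set may vary between versions:

from typing import TypedDict

class EvalRecord(TypedDict):
    """One record from the saved JSON evaluation output (assumed schema)."""
    input: str         # full prompt, including the system prompt and tool definitions
    output: str        # model rollout, including tool calls and the final <answer>
    score: float       # reward returned by the scoring function (1 = correct here)
    data_source: str   # dataset the sample came from, e.g. "nq"
    batch_index: int   # index of the evaluation batch
    sample_index: int  # index of the sample within the dataset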

4. Customizing the Reward Function

If the reward function used during evaluation differs from the one used during training, you can modify it in the envs directory, e.g. in the search.py file of the search demo. The scoring function receives a boolean argument if_val, so you only need to add the new scoring branch for evaluation, as sketched below.
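
Below is a minimal sketch of such a function. The signature, the answer extraction, and the scoring rules are assumptions for illustration; match them to the actual compute_score interface in your search.py:

# Hypothetical reward function with separate training/evaluation scoring paths.
import re

def compute_score(solution_str, ground_truth, if_val=False):
    """Return a scalar reward; if_val=True selects the evaluation-time scoring."""
    match = re.search(r"<answer>(.*?)</answer>", solution_str, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    reference = str(ground_truth).strip().lower()
    if if_val:
        # Evaluation: strict exact match against the reference answer.
        return 1.0 if answer.lower() == reference else 0.0
    # Training: looser shaping, e.g. small partial credit for emitting any answer.
    if not answer:
        return 0.0
    return 1.0 if answer.lower() == reference else 0.1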

If the evaluation logic is too complex to express there, you can also compute scores offline from the output JSON files, for example as in the sketch below.
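
The following sketch aggregates the mean score per data source from the saved records. The file name, and the assumption that the file contains a JSON list of records like the one shown above, are placeholders; point it at the JSON file written to trainer.default_local_dir:

# Hypothetical offline scoring script for the saved evaluation records.
import json
from collections import defaultdict

def summarize(json_path):
    with open(json_path) as f:
        records = json.load(f)  # assumed: a list of records like the example above
    totals, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        totals[rec["data_source"]] += rec["score"]
        counts[rec["data_source"]] += 1
    for source in sorted(totals):
        mean = totals[source] / counts[source]
        print(f"{source}: mean score = {mean:.4f} over {counts[source]} samples")

if __name__ == "__main__":
    summarize("path/to/your/eval_output.json")  # replace with the actual output file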
