Readme (Tutorial)

1. Preparation: Training Hyperparameters

It is recommended to prepare the hyperparameters for training in advance.

Below is an example:

set -e -x

export MODEL_PATH=/your/path/to/huggingface.co/Qwen/Qwen3-4B
export REWARD_MODEL_PATH=/your/path/to/huggingface.co/Qwen/QwQ-32B
export RESULT_DIR=/your/path/to/results/rl_factory/your_result_dir

python3 -m verl.trainer.main_ppo\
    algorithm.adv_estimator=grpo\
    data.train_files=data/nq_search/train.parquet\
    data.val_files=data/nq_search/test.parquet\
    data.train_batch_size=128\
    data.max_prompt_length=4096\
    data.max_response_length=512\
    actor_rollout_ref.model.path=$MODEL_PATH\
    actor_rollout_ref.model.use_remove_padding=True\
    actor_rollout_ref.model.enable_gradient_checkpointing=True\
    actor_rollout_ref.actor.optim.lr=1e-6\
    actor_rollout_ref.actor.ppo_mini_batch_size=32\
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16\
    actor_rollout_ref.actor.use_kl_loss=True\
    actor_rollout_ref.actor.kl_loss_coef=0.001\
    actor_rollout_ref.actor.kl_loss_type=low_var_kl\
    actor_rollout_ref.actor.fsdp_config.param_offload=True\
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True\
    actor_rollout_ref.actor.state_masking=True\
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16\
    actor_rollout_ref.rollout.tensor_model_parallel_size=1\
    actor_rollout_ref.rollout.name=vllm\
    actor_rollout_ref.rollout.gpu_memory_utilization=0.75\
    actor_rollout_ref.rollout.n=4\
    actor_rollout_ref.rollout.max_turns=2\
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16\
    actor_rollout_ref.ref.fsdp_config.param_offload=False\
    actor_rollout_ref.rollout.enforce_eager=False\
    actor_rollout_ref.rollout.free_cache_engine=False\
    actor_rollout_ref.env.name=search\
    actor_rollout_ref.env.mcp_mode=stdio\
    actor_rollout_ref.env.tool_manager=qwen3\
    actor_rollout_ref.env.enable_thinking=False\
    actor_rollout_ref.env.config_path=envs/configs/mcp_tools.pydata\
    actor_rollout_ref.env.use_process_reward=False\
    reward_rollout.if_use_reward_rollout=False\
    reward_rollout.rollout.tensor_model_parallel_size=4\
    reward_rollout.rollout.gpu_memory_utilization=0.65\
    reward_rollout.rollout.model_name=$REWARD_MODEL_PATH\
    reward_rollout.rollout.free_cache_engine=False\
    reward_rollout.rollout.response_length=2048\
    reward_model.reward_manager=parallel\
    algorithm.kl_ctrl.kl_coef=0.001\
    trainer.critic_warmup=0\
    trainer.logger=['tensorboard']\
    trainer.project_name='GRPO_search'\
    trainer.experiment_name='search_with_thinking'\
    trainer.n_gpus_per_node=8\
    trainer.nnodes=1\
    trainer.val_before_train=False\
    trainer.default_local_dir=$RESULT_DIR\
    trainer.default_hdfs_dir=null\
    trainer.save_freq=20\
    trainer.test_freq=10\
    trainer.total_epochs=5 $@ 2>&1 | tee grpo.log

2. Adjusting Evaluation Parameters

Adjust the parameters in main_eval.sh according to the hyperparameters used during training.

Below is an example:

set -e -x
FILE="$(pwd)/verl/utils/reward_score/search.py"
FUNCTION_NAME="compute_score"

export MODEL_PATH='your/path/to/Qwen/Qwen3-8B'
export REWARD_MODEL_PATH=/your/path/to/huggingface.co/Qwen/QwQ-32B
export TEST_DATA='your/path/to/data/hotpot/test.parquet'
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_evaluate\
    data.val_files=$TEST_DATA\
    data.val_batch_size=2048\
    data.max_prompt_length=4096\
    data.max_response_length=512\
    actor_rollout_ref.model.path=$MODEL_PATH\
    actor_rollout_ref.model.use_remove_padding=True\
    actor_rollout_ref.model.enable_gradient_checkpointing=True\
    actor_rollout_ref.actor.ppo_mini_batch_size=256\
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=32\
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32\
    actor_rollout_ref.rollout.tensor_model_parallel_size=1\
    actor_rollout_ref.rollout.name=vllm\
    actor_rollout_ref.rollout.gpu_memory_utilization=0.9\
    actor_rollout_ref.rollout.max_turns=2\
    actor_rollout_ref.rollout.val_kwargs.temperature=0\
    actor_rollout_ref.rollout.val_kwargs.top_k=-1\
    actor_rollout_ref.rollout.val_kwargs.top_p=1\
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32\
    actor_rollout_ref.env.name=search\
    actor_rollout_ref.env.mcp_mode=stdio\
    actor_rollout_ref.env.tool_manager=null\
    actor_rollout_ref.env.enable_thinking=False\
    actor_rollout_ref.env.config_path=envs/configs/mcp_tools.pydata\
    reward_rollout.if_use_reward_rollout=False\
    reward_rollout.rollout.tensor_model_parallel_size=4\
    reward_rollout.rollout.gpu_memory_utilization=0.75\
    reward_rollout.rollout.model_name=$REWARD_MODEL_PATH\
    reward_rollout.rollout.free_cache_engine=False\
    reward_rollout.rollout.response_length=2048\
    reward_model.reward_manager=parallel\
    trainer.logger=['tensorboard']\
    trainer.project_name='GRPO_search'\
    trainer.experiment_name='search_with_thinking'\
    trainer.n_gpus_per_node=8\
    trainer.nnodes=1\
    trainer.val_only=True\
    trainer.default_local_dir=ckpt\
    trainer.default_hdfs_dir=null $@ 2>&1 | tee grpo.log

3. Evaluation Results

The evaluation metrics are logged to TensorBoard. In addition, the inputs and outputs for the test data are saved in both CSV and JSON format in the directory specified by trainer.default_local_dir.

Example output:

{
  "input": "system # Tools You may call one or more functions to assist with the user query. ...",
  "output": "<tool_call> {\"name\": \"search-query_rag\", \"arguments\": {\"query\": \"when is the next step season 3 coming out\", \"topk\": 3}} </tool_call>user <tool_response> ... </tool_response> assistant <think> </think> <answer> March 16, 2015 </answer>",
  "score": 1,
  "data_source": "nq",
  "batch_index": 2,
  "sample_index": 748
}
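
The fields of these records can be described with a small typed schema. Below is a minimal sketch; the class name and field types are assumptions inferred from the example above, and the exact field set may vary between versions:

from typing import TypedDict

class EvalRecord(TypedDict):
    """One record from the saved JSON evaluation output (assumed schema)."""
    input: str         # full prompt, including the system prompt and tool definitions
    output: str        # model rollout, including tool calls and the final <answer>
    score: float       # reward returned by the scoring function (1 = correct here)
    data_source: str   # dataset the sample came from, e.g. "nq"
    batch_index: int   # index of the evaluation batch
    sample_index: int  # index of the sample within the dataset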

4. Customizing the Reward Function

If the reward function used during evaluation differs from the one used during training, you can modify it in the envs directory, e.g. in the search.py file of the search demo. The scoring function receives a boolean argument if_val, so you only need to add the new scoring branch for evaluation, as sketched below.
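
Below is a minimal sketch of such a function. The signature, the answer extraction, and the scoring rules are assumptions for illustration; match them to the actual compute_score interface in your search.py:

# Hypothetical reward function with separate training/evaluation scoring paths.
import re

def compute_score(solution_str, ground_truth, if_val=False):
    """Return a scalar reward; if_val=True selects the evaluation-time scoring."""
    match = re.search(r"<answer>(.*?)</answer>", solution_str, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    reference = str(ground_truth).strip().lower()
    if if_val:
        # Evaluation: strict exact match against the reference answer.
        return 1.0 if answer.lower() == reference else 0.0
    # Training: looser shaping, e.g. small partial credit for emitting any answer.
    if not answer:
        return 0.0
    return 1.0 if answer.lower() == reference else 0.1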

If the evaluation logic is too complex to express there, you can also compute scores offline from the output JSON files, for example as in the sketch below.
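
The following sketch aggregates the mean score per data source from the saved records. The file name, and the assumption that the file contains a JSON list of records like the one shown above, are placeholders; point it at the JSON file written to trainer.default_local_dir:

# Hypothetical offline scoring script for the saved evaluation records.
import json
from collections import defaultdict

def summarize(json_path):
    with open(json_path) as f:
        records = json.load(f)  # assumed: a list of records like the example above
    totals, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        totals[rec["data_source"]] += rec["score"]
        counts[rec["data_source"]] += 1
    for source in sorted(totals):
        mean = totals[source] / counts[source]
        print(f"{source}: mean score = {mean:.4f} over {counts[source]} samples")

if __name__ == "__main__":
    summarize("path/to/your/eval_output.json")  # replace with the actual output file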
