Integrate f-divergence to DPO #1339
Closed
Related issue: #1259
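
For context: this appears to follow the f-DPO formulation ("Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints", Wang et al., 2023), which replaces the reverse-KL penalty in the DPO objective with a general f-divergence. Writing $u = \pi_\theta(y \mid x) / \pi_{\mathrm{ref}}(y \mid x)$, the pairwise loss becomes

$$
\mathcal{L}_{f\text{-DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(\beta f'(u_w) - \beta f'(u_l)\big)\Big],
$$

where the choice of $f$ selects the divergence. The runs below compare three choices of $f$.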
reverse-kl (current default)


command: examples/scripts/dpo.py --model_name_or_path=gpt2 --per_device_train_batch_size 4 --max_steps 1000 --learning_rate 1e-5 --gradient_accumulation_steps 1 --logging_steps 10 --eval_steps 500 --output_dir=dpo_anthropic_hh --warmup_steps 150 --report_to wandb --logging_first_step --no_remove_unused_columns
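
(Reverse KL is the $f(u) = u \log u$ case: $f'(u) = 1 + \log u$, and the constant cancels in the pairwise difference, so the loss above reduces to the standard DPO logit $\beta(\log u_w - \log u_l)$ — which should be why this baseline run needs no new flags.)
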
alpha-divergence w/ alpha=0.5
command: examples/scripts/dpo.py --model_name_or_path=gpt2 --per_device_train_batch_size 4 --max_steps 1000 --learning_rate 1e-5 --gradient_accumulation_steps 1 --logging_steps 10 --eval_steps 500 --output_dir=dpo_anthropic_hh --warmup_steps 150 --report_to wandb --logging_first_step --no_remove_unused_columns --f_divergence_type alpha_divergence --f_alpha_divergence_coef 0.5
https://wandb.ai/open_source/huggingface/runs/b943bky2?workspace=user-1485840691
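
For anyone checking the math on this run: a minimal sketch of the alpha-divergence branch, assuming the convention $f'(u) = (1 - u^{-\alpha})/\alpha$ applied to per-sequence log-ratios (the function name and tensor plumbing are illustrative, not the code in this PR):

```python
import torch

def alpha_divergence_logits(chosen_logratios: torch.Tensor,
                            rejected_logratios: torch.Tensor,
                            alpha: float = 0.5) -> torch.Tensor:
    """Sketch of the f-DPO preference logit f'(u_w) - f'(u_l) for the
    alpha-divergence, with f'(u) = (1 - u**(-alpha)) / alpha and each
    logratios tensor holding log(pi_theta / pi_ref) per sequence."""
    f_chosen = (1 - torch.exp(-alpha * chosen_logratios)) / alpha
    f_rejected = (1 - torch.exp(-alpha * rejected_logratios)) / alpha
    return f_chosen - f_rejected  # scaled by beta before the sigmoid loss
```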


js-divergence
command: examples/scripts/dpo.py --model_name_or_path=gpt2 --per_device_train_batch_size 4 --max_steps 1000 --learning_rate 1e-5 --gradient_accumulation_steps 1 --logging_steps 10 --eval_steps 500 --output_dir=dpo_anthropic_hh --warmup_steps 150 --report_to wandb --logging_first_step --no_remove_unused_columns --f_divergence_type js_divergence
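
Finally, a sketch of how the new options might be used programmatically, assuming the CLI flags map to same-named `DPOTrainer` keyword arguments (check the merged signature before relying on this; the toy dataset and hyperparameters are placeholders):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Toy preference data; DPOTrainer expects prompt/chosen/rejected columns.
train_dataset = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": [" 4."],
    "rejected": [" 5."],
})

training_args = TrainingArguments(
    output_dir="dpo_anthropic_hh",
    per_device_train_batch_size=4,
    max_steps=10,
    remove_unused_columns=False,  # mirrors --no_remove_unused_columns
)

# f_divergence_type / f_alpha_divergence_coef are assumed here to be the
# trainer-level counterparts of the new CLI flags introduced by this PR.
trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    f_divergence_type="alpha_divergence",
    f_alpha_divergence_coef=0.5,
)
trainer.train()
```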