(cc @syrn1k, author of #1593)

In the paper, they seem to recommend α = 0.6, τ = 512:

<img width="902" alt="Screenshot 2024-08-28 at 17 58 11" src="https://github.com/user-attachments/assets/16016b14-ed80-4090-a243-bedfd0446c70">
<img width="902" alt="Screenshot 2024-08-28 at 17 58 29" src="https://github.com/user-attachments/assets/a39aff74-9d61-4c50-bd3d-ede61482043e">

while in trl, the current defaults are α = 0.9, τ = 64:

https://github.com/huggingface/trl/blob/10f70fa3337826ffb8c2e0eb0de00051ea53563b/trl/trainer/dpo_config.py#L143-L144
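For anyone skimming, here is a rough sketch of what the two settings change. It assumes α is the weight given to the policy in a soft update of the reference model (`ref ← α·policy + (1 − α)·ref`) and τ is the number of steps between updates. The helper names and the list-of-floats "weights" are hypothetical stand-ins for illustration, not trl code.

```python
# Hypothetical stand-in for model weights: plain lists of floats.
def soft_update(ref, policy, alpha):
    """Return alpha * policy + (1 - alpha) * ref, element-wise (assumed update rule)."""
    return [alpha * p + (1.0 - alpha) * r for r, p in zip(ref, policy)]


def maybe_sync(step, ref, policy, alpha, tau):
    """Mix the policy into the reference every `tau` steps; otherwise return it unchanged."""
    if step > 0 and step % tau == 0:
        return soft_update(ref, policy, alpha)
    return ref


PAPER = {"alpha": 0.6, "tau": 512}  # values quoted from the paper
TRL = {"alpha": 0.9, "tau": 64}     # trl values quoted in this comment

ref, policy = [0.0], [1.0]
# At step 64 the trl settings have already moved the reference most of the way
# to the policy, while the paper settings have not touched it yet.
ref_trl = maybe_sync(64, ref, policy, TRL["alpha"], TRL["tau"])
ref_paper = maybe_sync(64, ref, policy, PAPER["alpha"], PAPER["tau"])
```

Under these assumptions the trl values sync the reference 8× more often and pull it harder toward the policy each time. Until the defaults are settled, the paper's values can presumably be set explicitly on `DPOConfig` through the two fields at the linked lines (I believe these are `ref_model_mixup_alpha` and `ref_model_sync_steps`).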