Direct Preference Optimization #530
Conversation
Thanks a LOT @maxjeblick
Very high quality PR, and it works flawlessly on many different setups and datasets that I tested.
One thing that we should change in the future is the dataset import: currently, only the default settings for causal modeling can be set during import, so one always needs to change them when starting an experiment in e.g. DPO training.
Also, rewards are logged but never displayed (only when using Neptune) -> a potentially good new feature, as you mentioned.
While it is still not easy to get better results than with standard fine-tuning, DPO is far more user friendly, so I am rooting for fully replacing RLHF (PPO) with it in a subsequent PR.
One minor change is needed:
- The Rejected Answer column is missing a tooltip.
experiment_name: str = field(default_factory=generate_experiment_name)
_parent_experiment: str = ""
# 7b model may be unstable (NaN loss)
llm_backbone: str = "h2oai/h2ogpt-4096-llama2-13b-chat"
Nitpick:
Maybe we should replace it with an "h2ogpt-gm" fine-tune that already uses the same prompting style.
Do you have a specific model in mind? I could also change the default prompt style values.
Yes, probably just change the default values to something that works, such as mistralai/Mistral-7B-Instruct-v0.1 and its prompting style.
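For concreteness, the change could look roughly like the following (a sketch only; the prompt-related field names and template strings are assumptions for illustration, not taken from this PR):

# Illustrative default change, not the actual diff in this PR.
llm_backbone: str = "mistralai/Mistral-7B-Instruct-v0.1"
# Default prompt style roughly matching the Mistral instruct format (assumed field names).
text_prompt_start: str = "[INST] "
text_answer_separator: str = " [/INST]"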
This PR adds DPO (https://github.com/eric-mitchell/direct-preference-optimization) as a new problem type.
Also, IPO can be selected via the associated loss function.
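For readers unfamiliar with the two objectives, below is a minimal sketch of how selecting between a DPO and an IPO loss can work, assuming per-sequence log-probabilities of the chosen and rejected answers have already been computed under the policy and the frozen reference model. Function and argument names are illustrative and not necessarily those used in this PR.

import torch
import torch.nn.functional as F

def preference_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float = 0.1,
    loss_type: str = "dpo",
):
    # Log-ratios of policy vs. reference model for the chosen and rejected answers.
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps
    logits = chosen_logratios - rejected_logratios

    if loss_type == "dpo":
        # Standard DPO objective: -log sigmoid(beta * logits).
        losses = -F.logsigmoid(beta * logits)
    elif loss_type == "ipo":
        # IPO objective: squared deviation from the 1/(2*beta) margin.
        losses = (logits - 1.0 / (2.0 * beta)) ** 2
    else:
        raise ValueError(f"Unknown loss_type: {loss_type}")

    # Implicit rewards, useful for logging (cf. the reward logging mentioned above).
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return losses.mean(), chosen_rewards, rejected_rewards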
Apart from adding the new problem type, the following changes have been made:
Possible follow-up work: