Direct Preference Optimization #530
Conversation
Thanks a LOT @maxjeblick
Very high quality PR, and it works flawlessly on many different setups and datasets that I tested.
One thing that we should change in the future is the dataset import: currently, only the default settings for causal modeling can be set during import, so one always needs to change them when starting an experiment in e.g. DPO training.
Also, rewards are logged but never displayed (only when using Neptune) -> a potentially good new feature, as you mentioned.
While it is still not easy to get better results than with standard fine-tuning, DPO is far more user friendly, so I am rooting for fully replacing RLHF (PPO) with it in a subsequent PR.
One minor change is needed:
- The Rejected Answer column is missing a tooltip.
experiment_name: str = field(default_factory=generate_experiment_name)
_parent_experiment: str = ""
# 7b model may be unstable (NaN loss)
llm_backbone: str = "h2oai/h2ogpt-4096-llama2-13b-chat"
Nitpick:
Maybe we should replace it with an "h2ogpt-gm" fine-tune that already uses the same prompting style.
Do you have a specific model in mind? I could also change the default prompt style values.
Yes, probably just change the default values to something that works, such as mistralai/Mistral-7B-Instruct-v0.1 and its prompting style.
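For concreteness, the change could look roughly like the following (a sketch only; the prompt-related field names and template strings are assumptions for illustration, not taken from this PR):

# Illustrative default change, not the actual diff in this PR.
llm_backbone: str = "mistralai/Mistral-7B-Instruct-v0.1"
# Default prompt style roughly matching the Mistral instruct format (assumed field names).
text_prompt_start: str = "[INST] "
text_answer_separator: str = " [/INST]"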
This PR adds DPO (https://github.com/eric-mitchell/direct-preference-optimization) as a new problem type.
Also, IPO can be selected via the associated loss function.
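For readers unfamiliar with the two objectives, below is a minimal sketch of how selecting between a DPO and an IPO loss can work, assuming per-sequence log-probabilities of the chosen and rejected answers have already been computed under the policy and the frozen reference model. Function and argument names are illustrative and not necessarily those used in this PR.

import torch
import torch.nn.functional as F

def preference_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float = 0.1,
    loss_type: str = "dpo",
):
    # Log-ratios of policy vs. reference model for the chosen and rejected answers.
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps
    logits = chosen_logratios - rejected_logratios

    if loss_type == "dpo":
        # Standard DPO objective: -log sigmoid(beta * logits).
        losses = -F.logsigmoid(beta * logits)
    elif loss_type == "ipo":
        # IPO objective: squared deviation from the 1/(2*beta) margin.
        losses = (logits - 1.0 / (2.0 * beta)) ** 2
    else:
        raise ValueError(f"Unknown loss_type: {loss_type}")

    # Implicit rewards, useful for logging (cf. the reward logging mentioned above).
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return losses.mean(), chosen_rewards, rejected_rewards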
Apart from adding the new problem type, the following changes have been made:
Possible follow-up work: