Bug in example DPO script in dataloading

Since the example [DPO script](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py) uses [hh-rlhf dataset](https://huggingface.co/datasets/trl-internal-testing/hh-rlhf-trl-style) in OpenAI messages format, the loading in the script [here](https://github.com/huggingface/trl/blob/e823458a6a793778b959a1c134cd2ee3eaa9a9bd/examples/scripts/dpo.py#L148) seems incorrect:

```
    def process(row):
        row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
        row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
        return row
```

since it adds all messages to both chosen and rejected. But it also ignores the prompt template for the prompt.
If my understanding is correct the right process function would be

```
    def process(row):
        # we should extract the final turn of messages to define chosen/rejected responses and keep the rest as prompt
        prompt_messages = row["chosen"][:-1]
        chosen_messages = row["chosen"][-1:]
        rejected_messages = row["rejected"][-1:]

        row["prompt"] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
        row["chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
        row["rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
        return row
```
As far as i see only the answer is expected in the chosen / rejected parts in the DPO trainer.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bug in example DPO script in dataloading #1541

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bug in example DPO script in dataloading #1541

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions