generated from fastai/nbdev_template
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Closed
Labels
🏋 DPORelated to DPORelated to DPO
Description
Since the example DPO script uses hh-rlhf dataset in OpenAI messages format, the loading in the script here seems incorrect:
def process(row):
row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
return row
since it adds all messages to both chosen and rejected. But it also ignores the prompt template for the prompt.
If my understanding is correct the right process function would be
def process(row):
# we should extract the final turn of messages to define chosen/rejected responses and keep the rest as prompt
prompt_messages = row["chosen"][:-1]
chosen_messages = row["chosen"][-1:]
rejected_messages = row["rejected"][-1:]
row["prompt"] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
row["chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
row["rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
return row
As far as i see only the answer is expected in the chosen / rejected parts in the DPO trainer.
oKatanaaa, Yuancheng-Xu, wxjiao and AIR-hl
Metadata
Metadata
Assignees
Labels
🏋 DPORelated to DPORelated to DPO