
Unnecessary breaking change in SFTTrainer._prepare_dataset from 0.19.0 compared to 0.18.2 #3641

@jannisborn

Description

Reproduction

In #3572, @qgallouedec simplified the processing of conversational data.
However, the change also alters how the trainer interacts with the tokenizer: it switches from item access on the tokenizer output (processed["input_ids"]) to attribute access (processed.input_ids), where processed is whatever the tokenizer returns. The tokenizer is not necessarily under the library's control, since it is user-provided and may be custom.
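
For illustration, here is a minimal sketch of the failure mode (the tokenizer class below is a hypothetical stand-in for a user-defined tokenizer, not TRL or transformers code): a processing class that returns a plain dict supports item access but not attribute access.

# Hypothetical user-provided tokenizer returning a plain dict (toy tokenization).
class MyCustomTokenizer:
    def __call__(self, text):
        return {"input_ids": list(range(len(text.split())))}

processing_class = MyCustomTokenizer()
processed = processing_class(text="hello world")

processed["input_ids"]  # item access: works on a plain dict
processed.input_ids     # attribute access: raises AttributeError on a plain dict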

Is this an intentional breaking change? If so, why? It forces users to write their tokenizers so that they return a BatchEncoding rather than a plain dict.

This PR was merged between releases 0.18.2 and 0.19.0.

I am referring to this line:

prompt_ids = processing_class(text=example["prompt"]).input_ids
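
If the change is intentional, one possible user-side workaround (a sketch, assuming a custom tokenizer like the hypothetical one above) would be to wrap the returned dict in transformers.BatchEncoding, which supports both item and attribute access:

from transformers import BatchEncoding

# Same hypothetical tokenizer, now returning a BatchEncoding instead of a plain dict.
class MyCustomTokenizer:
    def __call__(self, text):
        ids = list(range(len(text.split())))  # toy tokenization
        return BatchEncoding({"input_ids": ids})

processed = MyCustomTokenizer()(text="some prompt")
assert processed["input_ids"] == processed.input_ids  # both access styles now work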

System Info

trl v0.19.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete

Labels

🏋 SFT (Related to SFT), 🐛 bug (Something isn't working)
