Reproduction
The code worked fine before updating. After a binary search over the commits to identify which one introduced this issue, I found the culprit to be commit 5b9236d.

For now, I'm working around the problem by pinning the last known-good commit:

```
pip install -U git+https://github.com/huggingface/trl.git@82d12eb75103821cd4af1978e99b1026a90ac67d
```

Installing either the latest git revision or 0.15.0 breaks the code.
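While the pin is in place, a small runtime guard can fail fast if an environment accidentally picks up one of the broken releases. This is only a sketch: the 0.15.0 cutoff reflects the bisect result above, and it assumes the `packaging` library is available (it usually ships alongside pip):

```python
# Sketch: refuse to run against the TRL releases identified as broken above.
# Assumes the regression spans 0.15.0 and later until a fix lands upstream.
from importlib.metadata import version
from packaging.version import Version

if Version(version("trl")) >= Version("0.15.0"):
    raise RuntimeError(
        "This trl release breaks pre-tokenized datasets in SFTTrainer; "
        "pin commit 82d12eb75103821cd4af1978e99b1026a90ac67d instead."
    )
```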
Minimal code to reproduce the error:
```python
import torch
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import Dataset

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pre-tokenized dataset: note there is no "text" column.
dataset = Dataset.from_list([
    {
        "input_ids": torch.zeros(2, dtype=torch.int32),
        "attention_mask": torch.zeros(2, dtype=torch.int8),
        "labels": torch.zeros(2, dtype=torch.int32),
    }
    for _ in range(16)
])
print(f"{dataset=}")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    use_cache=False,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

# Configure LoRA adapters
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    logging_steps=1,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

# Start training
trainer.model.print_trainable_parameters()
trainer.train()
```
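For anyone hitting the same error on 0.15.x, one avenue worth trying is to tell the trainer not to prepare the dataset at all, since it is already tokenized. The sketch below reuses `model` and `dataset` from the script above and assumes that `SFTConfig` (TRL's configuration class for `SFTTrainer`) still honors `dataset_kwargs={"skip_prepare_dataset": True}` as in earlier releases; I haven't verified this on 0.15.0:

```python
# Hedged workaround sketch: bypass SFTTrainer's internal _prepare_dataset(),
# which is the step that indexes ex["text"] and raises the KeyError below.
# dataset_kwargs / skip_prepare_dataset are assumed to behave as in earlier
# TRL releases; verify against your installed version.
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    logging_steps=1,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    dataset_kwargs={"skip_prepare_dataset": True},  # skip the failing map()
)
trainer = SFTTrainer(model=model, args=training_args, train_dataset=dataset)
```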
Output:
```text
root@813ecfcb235b:/workspaces/LLMTrain# /opt/conda/bin/python /workspaces/LLMTrain/finetune_example.py
tokenizer.eos_token_id=151645
tokenizer.pad_token_id=151643
dataset=Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 20480
})
Applied Liger kernels to Qwen2
Applying chat template to train dataset: 100%|████████████████████| 20480/20480 [00:02<00:00, 9587.95 examples/s]
Tokenizing train dataset:   0%|          | 0/20480 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/workspaces/LLMTrain/finetune_example.py", line 110, in <module>
    trainer = SFTTrainer(
              ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 198, in __init__
    train_dataset = self._prepare_dataset(
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 411, in _prepare_dataset
    dataset = dataset.map(lambda ex: processing_class(ex[args.dataset_text_field]), **map_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3073, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/opt/conda/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3446, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3338, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 411, in <lambda>
    dataset = dataset.map(lambda ex: processing_class(ex[args.dataset_text_field]), **map_kwargs)
                                                      ~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 277, in __getitem__
    value = self.data[key]
            ~~~~~~~~~^^^^^
KeyError: 'text'
```
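The failure itself is easy to isolate: `_prepare_dataset` maps `lambda ex: processing_class(ex[args.dataset_text_field])` over the dataset, and `dataset_text_field` defaults to `"text"` (per TRL's `SFTConfig`), a column a pre-tokenized dataset doesn't have. The following standalone snippet reproduces the same `KeyError` without TRL:

```python
# Standalone repro of the failing step from the traceback: mapping a function
# that indexes ex["text"] over a dataset that only has tokenized columns.
from datasets import Dataset

ds = Dataset.from_list([{"input_ids": [0, 0], "attention_mask": [0, 0]}])
ds.map(lambda ex: {"length": len(ex["text"])})  # raises KeyError: 'text'
```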
System Info
- Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
- Python version: 3.11.10
- PyTorch version: 2.5.1+cu124
- CUDA device(s): NVIDIA RTX 6000 Ada Generation
- Transformers version: 4.49.0.dev0
- Accelerate version: 1.4.0.dev0
- Accelerate config: not found
- Datasets version: 3.2.0
- HF Hub version: 0.28.1
- TRL version: 0.15.0.dev0
- bitsandbytes version: 0.45.2
- DeepSpeed version: not installed
- Diffusers version: not installed
- Liger-Kernel version: 0.5.3
- LLM-Blender version: not installed
- OpenAI version: 1.63.0
- PEFT version: 0.14.1.dev0
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
- Any traceback provided is complete