RLOO Trainer Stopping After 1 Epoch #2401

@asparius

Description

@asparius

System Info

  • Platform: Linux-3.10.0-693.11.6.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.9.5
  • PyTorch version: 2.4.0
  • CUDA device(s): not available
  • Transformers version: 4.46.2
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • Datasets version: 3.1.0
  • HF Hub version: 0.26.2
  • TRL version: 0.13.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: 0.15.4
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.54.4
  • PEFT version: not installed

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

While reproducing RLOO on a multi-GPU setup with the official script, training consistently halts midway, regardless of whether it is configured for 1,000 or 1 million episodes. For example, a wandb run ended at 1954 steps, whereas it should have run for 3908.
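The observed step count is exactly half the expected one, which is what you would see if the episode count were divided by the world size one extra time when the schedule is computed. A minimal sketch of that arithmetic (the function and variable names below are hypothetical, not TRL's actual internals):

```python
# Hypothetical sketch: steps expected when total_episodes is divided
# exactly once by the global batch size. Names are illustrative only.

def total_steps(total_episodes: int, per_device_batch_size: int, num_processes: int) -> int:
    """Number of optimizer steps if episodes are consumed global-batch at a time."""
    global_batch_size = per_device_batch_size * num_processes
    return total_episodes // global_batch_size

# The reported run stopped at 1954 steps instead of 3908 -- exactly half,
# consistent with an extra division by the process count somewhere in the
# step-schedule calculation on multi-GPU.
expected = 3908
observed = 1954
print(observed == expected // 2)  # True
```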

Expected behavior

Training should have run for 3908 steps; alternatively, the reported total step count may itself be miscalculated.

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete
