
In SFT script, distributed training gets stuck when packing=false #2879

@ChenDRAG

Description

Reproduction

I tried running SFT experiments using trl. However, I find that if I set packing = False, the process gets stuck at the very beginning of training.
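For context, here is a minimal sketch of the kind of script I am running (the model and dataset below are placeholders rather than my exact setup; the only setting that matters for the hang is packing=False). It is launched on multiple GPUs with accelerate launch and DeepSpeed:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; any conversational SFT dataset gives the same setup.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="sft-out",
    packing=False,                    # with packing=True the run proceeds normally
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",        # placeholder model id
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```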

This works fine in a single-GPU setting, so I assume it is related to NCCL communication.

Any ideas why this happens?
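To narrow it down, I can enable NCCL debug output at the top of the training script. These are standard NCCL/PyTorch environment variables (nothing trl-specific) and must be set before the process group is initialized; this is only a debugging sketch, not a fix:

```python
import os

# Print per-rank NCCL activity so the collective that never completes
# shows up in the logs; restrict logging to collective operations.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "COLL")

# Turn silent hangs into errors/timeouts where supported.
os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")
```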

Also, setting per_device_train_batch_size=1 produces the following error:



wandb: ⭐️ View project at https://wandb.ai/noteam2235/huggingface
wandb: 🚀 View run at https://wandb.ai/noteam2235/huggingface/runs/b11ld1bd
  0%|          | 0/9 [00:00<?, ?it/s]
W0217 07:42:20.851000 470998 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 471406 closing signal SIGTERM
W0217 07:42:20.852000 470998 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 471407 closing signal SIGTERM
W0217 07:42:20.852000 470998 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 471408 closing signal SIGTERM
E0217 07:42:21.080000 470998 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 3 (pid: 471409) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1157, in launch_command
    deepspeed_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 845, in deepspeed_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

System Info

Name: trl
Version: 0.16.0.dev0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete
