
In SFT script, distributed training gets stuck when packing=false #2879

@ChenDRAG

Description

Reproduction

I tried running SFT experiments using trl. However, I find that if I set packing = False, the process gets stuck at the very beginning of training.
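For context, here is a minimal sketch of the kind of script I am running (the model and dataset below are placeholders rather than my exact setup; the only setting that matters for the hang is packing=False). It is launched on multiple GPUs with accelerate launch and DeepSpeed:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; any conversational SFT dataset gives the same setup.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="sft-out",
    packing=False,                    # with packing=True the run proceeds normally
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",        # placeholder model id
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```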

This works fine in a single-GPU setting, so I assume it is related to NCCL communication.

Any ideas why this happens?
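To narrow it down, I can enable NCCL debug output at the top of the training script. These are standard NCCL/PyTorch environment variables (nothing trl-specific) and must be set before the process group is initialized; this is only a debugging sketch, not a fix:

```python
import os

# Print per-rank NCCL activity so the collective that never completes
# shows up in the logs; restrict logging to collective operations.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "COLL")

# Turn silent hangs into errors/timeouts where supported.
os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")
```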

Also, setting per_device_train_batch_size=1 produces the following error:



wandb: ⭐️ View project at https://wandb.ai/noteam2235/huggingface
wandb: 🚀 View run at https://wandb.ai/noteam2235/huggingface/runs/b11ld1bd
  0%|          | 0/9 [00:00<?, ?it/s]
W0217 07:42:20.851000 470998 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 471406 closing signal SIGTERM
W0217 07:42:20.852000 470998 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 471407 closing signal SIGTERM
W0217 07:42:20.852000 470998 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 471408 closing signal SIGTERM
E0217 07:42:21.080000 470998 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 3 (pid: 471409) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1157, in launch_command
    deepspeed_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 845, in deepspeed_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

System Info

Name: trl
Version: 0.16.0.dev0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete
