Labels
⚡ accelerate · 🏋 SFT · 🐛 bug
Description
Reproduction
I tried running SFT experiments using trl. However, I found that if I set packing=False, the process gets stuck at the beginning of training.
This works fine in a single-GPU setting, so I assume it must be related to NCCL communication.
Any ideas why this happens?
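For reference, here is a minimal sketch of the kind of setup I'm describing (the model and dataset names below are placeholders, not the ones from my actual run):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; any conversational/text SFT dataset shows the same behavior.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="./sft-output",
    packing=False,                  # the setting that triggers the hang in multi-GPU runs
    per_device_train_batch_size=2,  # see below for the error with batch size 1
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",      # placeholder model name
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

This is launched with `accelerate launch` using a DeepSpeed config across multiple GPUs, as shown in the traceback below.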
Also, if I set per_device_train_batch_size=1, I get this error:
wandb: ⭐️ View project at https://wandb.ai/noteam2235/huggingface
wandb: 🚀 View run at https://wandb.ai/noteam2235/huggingface/runs/b11ld1bd
0%| | 0/9 [00:00<?, ?it/s]W0217 07:42:20.851000 470998 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 471406 closing signal SIGTERM
W0217 07:42:20.852000 470998 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 471407 closing signal SIGTERM
W0217 07:42:20.852000 470998 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 471408 closing signal SIGTERM
E0217 07:42:21.080000 470998 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 3 (pid: 471409) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1157, in launch_command
deepspeed_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 845, in deepspeed_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
System Info
Name: trl
Version: 0.16.0.dev0
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
- Any traceback provided is complete