Describe the bug
I am running the DreamBooth FLUX training script with FSDP on two GPUs. Training proceeds normally, but when the first checkpoint is saved at step 500 the NCCL collectives time out (600 s watchdog) and both ranks are taken down with SIGABRT. Full logs are below.
Reproduction
Run `accelerate config` with the following settings:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
fsdp_activation_checkpointing: true
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: true
fsdp_offload_params: true
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install -e .
!pip install -r examples/dreambooth/requirements_flux.txt
!pip install prodigyopt
import huggingface_hub
huggingface_hub.notebook_login()
MODEL_NAME="black-forest-labs/FLUX.1-dev"
INSTANCE_DIR="/dreambooth-datasets/yaremovaa"
OUTPUT_DIR="/flux-dreambooth-outputs/dreamboot-yaremovaa"
!accelerate launch examples/dreambooth/train_dreambooth_flux.py \
--pretrained_model_name_or_path={MODEL_NAME} \
--instance_data_dir={INSTANCE_DIR} \
--output_dir={OUTPUT_DIR} \
--mixed_precision="bf16" \
--instance_prompt="a photo of sks girl" \
--resolution=512 \
--train_batch_size=1 \
--guidance_scale=1 \
--gradient_accumulation_steps=4 \
--optimizer="prodigy" \
--learning_rate=1. \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=1000 \
--seed="0"
Logs
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
09/23/2024 14:04:51 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
09/23/2024 14:04:51 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: bf16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 15947.92it/s]
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 17476.27it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 7.41it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 3.43it/s]
Fetching 3 files: 100%|█████████████████████████| 3/3 [00:00<00:00, 9446.63it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 77672.30it/s]
Using decoupled weight decay
Using decoupled weight decay
09/23/2024 14:05:08 - INFO - __main__ - ***** Running training *****
09/23/2024 14:05:08 - INFO - __main__ - Num examples = 10
09/23/2024 14:05:08 - INFO - __main__ - Num batches each epoch = 5
09/23/2024 14:05:08 - INFO - __main__ - Num Epochs = 500
09/23/2024 14:05:08 - INFO - __main__ - Instantaneous batch size per device = 1
09/23/2024 14:05:08 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
09/23/2024 14:05:08 - INFO - __main__ - Gradient Accumulation steps = 4
09/23/2024 14:05:08 - INFO - __main__ - Total optimization steps = 1000
Steps: 0%| | 0/1000 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Steps: 0%| | 0/1000 [00:25<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/1000 [00:26<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/1000 [00:27<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 1/1000 [00:30<8:20:36, 30.07s/it, loss=0.592, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 2/1000 [00:32<3:50:25, 13.85s/it, loss=0.456, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 2/1000 [00:33<3:50:25, 13.85s/it, loss=0.563, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 2/1000 [00:34<3:50:25, 13.85s/it, loss=0.355, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 2/1000 [00:36<3:50:25, 13.85s/it, loss=0.399, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 3/1000 [00:38<2:51:47, 10.34s/it, loss=0.438, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 4/1000 [00:40<1:58:38, 7.15s/it, loss=0.347, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 4/1000 [00:42<1:58:38, 7.15s/it, loss=0.585, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 4/1000 [00:43<1:58:38, 7.15s/it, loss=0.652, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 4/1000 [00:44<1:58:38, 7.15s/it, loss=0.336, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 5/1000 [00:46<1:51:42, 6.74s/it, loss=0.323, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 1%| | 6/1000 [00:49<1:28:16, 5.33s/it, loss=0.622, lr=1]Passing `txt_ids` ...
Steps: 50%|███████▍ | 499/1000 [35:19<37:43, 4.52s/it, loss=0.531, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 50%|███████▌ | 500/1000 [35:21<32:10, 3.86s/it, loss=0.531, lr=1]09/23/2024 14:40:30 - INFO - accelerate.accelerator - Saving current state to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-500
09/23/2024 14:40:30 - INFO - accelerate.accelerator - Saving FSDP model
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(
[rank0]:[E923 14:50:30.723437511 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=_ALLGATHER_BASE, NumelIn=32062496, NumelOut=64124992, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
[rank0]:[E923 14:50:30.723605561 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 217211, last enqueued NCCL work: 217211, last completed NCCL work: 217210.
[rank1]:[E923 14:50:30.752543374 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
[rank1]:[E923 14:50:30.752732295 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 217211, last enqueued NCCL work: 217211, last completed NCCL work: 217210.
[rank1]:[E923 14:50:30.965671382 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 217211, last enqueued NCCL work: 217211, last completed NCCL work: 217210.
[rank1]:[E923 14:50:30.965698772 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E923 14:50:30.965704392 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E923 14:50:30.975447887 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f05df318f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f0532b0e8f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f0532b15333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0532b1771c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f05e0957df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f05e4690609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f05e47ca353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f05df318f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f0532b0e8f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f0532b15333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0532b1771c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f05e0957df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f05e4690609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f05e47ca353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f05df318f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f05327a0a84 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f05e0957df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f05e4690609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f05e47ca353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E923 14:50:31.301026792 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 217211, last enqueued NCCL work: 217211, last completed NCCL work: 217210.
[rank0]:[E923 14:50:31.301055303 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E923 14:50:31.301060703 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E923 14:50:31.302001924 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=_ALLGATHER_BASE, NumelIn=32062496, NumelOut=64124992, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f80188dbf86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f7f6c0d18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f7f6c0d8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7f6c0da71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f8019f1adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f801dc53609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f801dd8d353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=217211, OpType=_ALLGATHER_BASE, NumelIn=32062496, NumelOut=64124992, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f80188dbf86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f7f6c0d18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f7f6c0d8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7f6c0da71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f8019f1adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f801dc53609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f801dd8d353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f80188dbf86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f7f6bd63a84 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f8019f1adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f801dc53609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f801dd8d353 in /lib/x86_64-linux-gnu/libc.so.6)
W0923 14:51:08.002416 140631144875840 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 97042 closing signal SIGTERM
W0923 14:51:38.003378 140631144875840 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 97042 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0923 14:51:44.734884 140631144875840 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 97041) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-23_14:51:08
host : x2-h100.internal.cloudapp.net
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 97041)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 97041
======================================================
System Info
Ubuntu 20.04
2x NVIDIA H100
CUDA 12.2
torch==2.4.1
torchvision==0.19.1
Diffusers commit: ba5af5a
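Additional context: since the watchdog fires exactly 600 s after the SHARDED_STATE_DICT save starts, one untested workaround I am considering is raising the process-group timeout via Accelerate's `InitProcessGroupKwargs`. The snippet below is only a sketch of where that kwargs handler would be plumbed into the script's `Accelerator` construction; it is not something `train_dreambooth_flux.py` currently does, and the two-hour value is arbitrary (just larger than the 600 s seen in the logs).

# Untested sketch: raise the NCCL/process-group timeout so the watchdog does not
# abort the run while the sharded FSDP checkpoint is being written.
# train_dreambooth_flux.py builds its own Accelerator; this only shows where the
# kwargs handler would go.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))  # arbitrary, > 600 s default
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])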
Who can help?
No response