Describe the bug
This time I set the number of training steps to 2, so I don't have to wait an hour of training to check whether the model is saved correctly at the end. It is not: both steps complete, but the run then crashes with an NCCL collective timeout while the final pipeline is being saved.
Reproduction
Run accelerate config with the following settings:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
fsdp_activation_checkpointing: true
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: true
fsdp_offload_params: true
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
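For reference, these settings end up in Accelerate's default config file. A minimal check of what accelerate launch will actually pick up (the path ~/.cache/huggingface/accelerate/default_config.yaml is an assumption about an otherwise unmodified setup):
import os
import yaml  # PyYAML, pulled in as a dependency of accelerate

# Load the config written by `accelerate config` and print the settings relevant to this report.
config_path = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")
with open(config_path) as f:
    cfg = yaml.safe_load(f)

print("distributed_type:", cfg.get("distributed_type"))
print("num_processes:", cfg.get("num_processes"))
print("mixed_precision:", cfg.get("mixed_precision"))
print("use_cpu:", cfg.get("use_cpu"))
print("fsdp_config:", cfg.get("fsdp_config"))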
!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install -e .
!pip install -r examples/dreambooth/requirements_flux.txt
!pip install prodigyopt
import huggingface_hub
huggingface_hub.notebook_login()
MODEL_NAME="black-forest-labs/FLUX.1-dev"
INSTANCE_DIR="/dreambooth-datasets/yaremovaa"
OUTPUT_DIR="/flux-dreambooth-outputs/dreamboot-yaremovaa"
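For completeness, a quick environment check run before launching (not part of the original reproduction, just to confirm the versions listed under System Info below):
import torch
import diffusers

# Print the library versions and visible GPU count used for the run below.
print("diffusers:", diffusers.__version__)
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("visible GPUs:", torch.cuda.device_count())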
!accelerate launch examples/dreambooth/train_dreambooth_flux.py \
--pretrained_model_name_or_path={MODEL_NAME} \
--instance_data_dir={INSTANCE_DIR} \
--output_dir={OUTPUT_DIR} \
--mixed_precision="bf16" \
--instance_prompt="a photo of sks girl" \
--resolution=512 \
--train_batch_size=1 \
--guidance_scale=1 \
--gradient_accumulation_steps=4 \
--optimizer="prodigy" \
--learning_rate=1. \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=2 \
--seed="0" \
--checkpointing_steps=9999999999999999
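The logs below show a 600 s NCCL collective timeout (_ALLGATHER_BASE) that fires right after the transformer config is saved, i.e. while the sharded transformer weights are being gathered for the final save. Purely as a sketch (I have not verified this as a fix, and where exactly train_dreambooth_flux.py constructs its Accelerator is an assumption), a longer distributed timeout could be passed like this:
from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the default 600 s collective timeout so the full-state-dict gather
# performed while saving the final pipeline has more time to complete.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(mixed_precision="bf16", kwargs_handlers=[ipg_kwargs])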
Logs
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
09/23/2024 15:02:12 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
09/23/2024 15:02:12 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: bf16
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 16225.55it/s]
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 17623.13it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 8.39it/s]
Fetching 3 files: 100%|█████████████████████████| 3/3 [00:00<00:00, 7033.49it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 2.42it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 63230.71it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Using decoupled weight decay
Using decoupled weight decay
09/23/2024 15:02:30 - INFO - __main__ - ***** Running training *****
09/23/2024 15:02:30 - INFO - __main__ - Num examples = 10
09/23/2024 15:02:30 - INFO - __main__ - Num batches each epoch = 5
09/23/2024 15:02:30 - INFO - __main__ - Num Epochs = 1
09/23/2024 15:02:30 - INFO - __main__ - Instantaneous batch size per device = 1
09/23/2024 15:02:30 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
09/23/2024 15:02:30 - INFO - __main__ - Gradient Accumulation steps = 4
09/23/2024 15:02:30 - INFO - __main__ - Total optimization steps = 2
Steps: 0%| | 0/2 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Steps: 0%| | 0/2 [00:23<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/2 [00:24<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/2 [00:25<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 50%|██████████ | 1/2 [00:28<00:28, 28.25s/it, loss=0.592, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 100%|████████████████████| 2/2 [00:30<00:00, 13.05s/it, loss=0.456, lr=1]
Loading pipeline components...: 0%| | 0/7 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loaded tokenizer_2 as T5TokenizerFast from `tokenizer_2` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 43%|█████▌ | 3/7 [00:00<00:00, 19.11it/s]Loaded vae as AutoencoderKL from `vae` subfolder of black-forest-labs/FLUX.1-dev.
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████████ | 1/2 [00:00<00:00, 2.69it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 2.87it/s]
Loaded text_encoder_2 as T5EncoderModel from `text_encoder_2` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 71%|█████████▎ | 5/7 [00:00<00:00, 4.63it/s]Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 86%|███████████▏ | 6/7 [00:01<00:00, 5.05it/s]Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|█████████████| 7/7 [00:01<00:00, 6.26it/s]
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/vae/config.json
Model weights saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/vae/diffusion_pytorch_model.safetensors
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/transformer/config.json
[rank0]:[E923 15:13:12.047707465 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank0]:[E923 15:13:12.047923067 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1080, last enqueued NCCL work: 1080, last completed NCCL work: 1079.
[rank0]:[E923 15:13:13.598657835 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1080, last enqueued NCCL work: 1080, last completed NCCL work: 1079.
[rank0]:[E923 15:13:13.598687105 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E923 15:13:13.598692476 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E923 15:13:13.599794925 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe9330c6f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe8854e18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe8854e8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe8854ea71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fe93330edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fe93704e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe937188353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe9330c6f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe8854e18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe8854e8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe8854ea71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fe93330edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fe93704e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe937188353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe9330c6f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fe885173a84 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fe93330edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fe93704e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fe937188353 in /lib/x86_64-linux-gnu/libc.so.6)
E0923 15:13:20.499235 139957517330240 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 127496) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-23_15:13:20
host : x2-h100.internal.cloudapp.net
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 127496)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 127496
=======================================================
System Info
Ubuntu 20.04
2x NVIDIA H100
CUDA 12.2
torch==2.4.1
torchvision==0.19.1
Diffusers commit: ba5af5aebbac0cc18168076a18836f175753d1c7
Who can help?
No response