Dreambooth Flux training does not save a model for around 10-15 minutes #9501

@kopyl

Describe the bug

This time I set the number of training steps to 2 so I could confirm the model saves correctly without first sitting through an hour of training. But it does not save: after the two steps finish, saving hangs and eventually fails with an NCCL collective timeout (see the logs below).

Reproduction

Run `accelerate config` and use the following configuration:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
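
Optionally, before launching, verify that the config `accelerate launch` will actually pick up matches the YAML above (it reads the default config written by `accelerate config`):

!accelerate env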
!git clone https://github.com/huggingface/diffusers
%cd diffusers

!pip install -e .
!pip install -r examples/dreambooth/requirements_flux.txt
!pip install prodigyopt

import huggingface_hub
huggingface_hub.notebook_login()

MODEL_NAME="black-forest-labs/FLUX.1-dev"
INSTANCE_DIR="/dreambooth-datasets/yaremovaa"
OUTPUT_DIR="/flux-dreambooth-outputs/dreamboot-yaremovaa"

!accelerate launch examples/dreambooth/train_dreambooth_flux.py \
  --pretrained_model_name_or_path={MODEL_NAME}  \
  --instance_data_dir={INSTANCE_DIR} \
  --output_dir={OUTPUT_DIR} \
  --mixed_precision="bf16" \
  --instance_prompt="a photo of sks girl" \
  --resolution=512 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1. \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=2 \
  --seed="0" \
  --checkpointing_steps=9999999999999999
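
If the run were to complete, the saved pipeline should be loadable from OUTPUT_DIR. A minimal check, assuming the script wrote the full pipeline (including the fine-tuned transformer) there; the prompt and offload call are just for illustration:

import torch
from diffusers import FluxPipeline

# Load the pipeline that train_dreambooth_flux.py saved to OUTPUT_DIR.
pipe = FluxPipeline.from_pretrained(
    "/flux-dreambooth-outputs/dreamboot-yaremovaa", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # Flux is large; offloading keeps it within one GPU
image = pipe("a photo of sks girl", num_inference_steps=28).images[0]
image.save("sks_girl.png")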

Logs

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
09/23/2024 15:02:12 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
09/23/2024 15:02:12 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16

You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 16225.55it/s]
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 17623.13it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  8.39it/s]
Fetching 3 files: 100%|█████████████████████████| 3/3 [00:00<00:00, 7033.49it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  2.42it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 63230.71it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Using decoupled weight decay
Using decoupled weight decay
09/23/2024 15:02:30 - INFO - __main__ - ***** Running training *****
09/23/2024 15:02:30 - INFO - __main__ -   Num examples = 10
09/23/2024 15:02:30 - INFO - __main__ -   Num batches each epoch = 5
09/23/2024 15:02:30 - INFO - __main__ -   Num Epochs = 1
09/23/2024 15:02:30 - INFO - __main__ -   Instantaneous batch size per device = 1
09/23/2024 15:02:30 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 8
09/23/2024 15:02:30 - INFO - __main__ -   Gradient Accumulation steps = 4
09/23/2024 15:02:30 - INFO - __main__ -   Total optimization steps = 2
Steps:   0%|                                              | 0/2 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Steps:   0%|                              | 0/2 [00:23<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                            | 0/2 [00:24<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                            | 0/2 [00:25<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:  50%|██████████          | 1/2 [00:28<00:28, 28.25s/it, loss=0.592, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 100%|████████████████████| 2/2 [00:30<00:00, 13.05s/it, loss=0.456, lr=1]
Loading pipeline components...:   0%|                     | 0/7 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loaded tokenizer_2 as T5TokenizerFast from `tokenizer_2` subfolder of black-forest-labs/FLUX.1-dev.

Loading pipeline components...:  43%|█████▌       | 3/7 [00:00<00:00, 19.11it/s]Loaded vae as AutoencoderKL from `vae` subfolder of black-forest-labs/FLUX.1-dev.


Loading checkpoint shards:   0%|                          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:  50%|█████████         | 1/2 [00:00<00:00,  2.69it/s]

Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  2.87it/s]
Loaded text_encoder_2 as T5EncoderModel from `text_encoder_2` subfolder of black-forest-labs/FLUX.1-dev.

Loading pipeline components...:  71%|█████████▎   | 5/7 [00:00<00:00,  4.63it/s]Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of black-forest-labs/FLUX.1-dev.

Loading pipeline components...:  86%|███████████▏ | 6/7 [00:01<00:00,  5.05it/s]Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|█████████████| 7/7 [00:01<00:00,  6.26it/s]
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/vae/config.json
Model weights saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/vae/diffusion_pytorch_model.safetensors
Configuration saved in /flux-dreambooth-outputs/dreamboot-yaremovaa/transformer/config.json
[rank0]:[E923 15:13:12.047707465 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank0]:[E923 15:13:12.047923067 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1080, last enqueued NCCL work: 1080, last completed NCCL work: 1079.
[rank0]:[E923 15:13:13.598657835 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1080, last enqueued NCCL work: 1080, last completed NCCL work: 1079.
[rank0]:[E923 15:13:13.598687105 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E923 15:13:13.598692476 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E923 15:13:13.599794925 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe9330c6f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe8854e18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe8854e8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe8854ea71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fe93330edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fe93704e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe937188353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1080, OpType=_ALLGATHER_BASE, NumelIn=169915648, NumelOut=339831296, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe9330c6f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fe8854e18f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe8854e8333 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe8854ea71c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fe93330edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fe93704e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe937188353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe9330c6f86 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fe885173a84 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fe93330edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fe93704e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fe937188353 in /lib/x86_64-linux-gnu/libc.so.6)

E0923 15:13:20.499235 139957517330240 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 127496) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-23_15:13:20
  host      : x2-h100.internal.cloudapp.net
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 127496)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 127496
=======================================================
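
The `Timeout(ms)=600000` in the watchdog line above is torch.distributed's default 10-minute collective timeout, and it appears to be hit while the FSDP-sharded transformer weights are being gathered for saving. A possible workaround sketch, assuming one edits the `Accelerator` construction in `train_dreambooth_flux.py` (the two-hour value below is illustrative):

from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

# Give NCCL collectives more than the default 10 minutes so the
# weight gather during saving does not trip the watchdog.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(
    gradient_accumulation_steps=4,
    mixed_precision="bf16",
    kwargs_handlers=[ipg_kwargs],
)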

System Info

Ubuntu 20.04
2x NVIDIA H100
CUDA 12.2
torch==2.4.1
torchvision==0.19.1
Diffusers commit: ba5af5aebbac0cc18168076a18836f175753d1c7

Who can help?

No response
