
How do I freeze weights when using FSDP? #807

@antopost

Description


System Info

- `Accelerate` version: 0.14.0.dev0
- Platform: Linux-5.4.0-128-generic-x86_64-with-glibc2.17
- Python version: 3.8.12
- Numpy version: 1.21.5
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: fp16
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: None
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {'fsdp_auto_wrap_policy': 'NO_WRAP', 'fsdp_backward_prefetch_policy': 'NO_PREFETCH', 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT'}
        - downcast_bf16: no

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I'm running into issues when freezing weights during multi-GPU training with FSDP.
I've tried preparing my model both before and after freezing the weights, with different but equally disappointing results.

def freeze_layers(model, to_freeze, verbose=True):
    # Set requires_grad=False on parameters whose index appears in to_freeze
    for i, (name, param) in enumerate(model.named_parameters()):
        freeze = i in to_freeze
        if verbose:
            marker = '--> freeze' if freeze else ''
            print(i, name, ' ' * (45 - len(name) - len(str(i))), marker)
        if freeze:
            param.requires_grad = False
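
The helper itself behaves as intended on an unwrapped model; here is a minimal single-process sanity check (plain PyTorch, no FSDP involved, with the print logic stripped down):

```python
import torch.nn as nn

def freeze_layers(model, to_freeze):
    # Set requires_grad=False on parameters whose index appears in to_freeze
    for i, (name, param) in enumerate(model.named_parameters()):
        if i in to_freeze:
            param.requires_grad = False

# Two Linear layers -> four parameters: weight and bias of each layer
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
freeze_layers(model, to_freeze=[0, 1])  # freeze the first layer's weight and bias
print([p.requires_grad for p in model.parameters()])
# [False, False, True, True]
```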

Preparing before:

accelerator = Accelerator()
device = accelerator.device

# init model
model = load_model(args.model).to(device)
model = accelerator.prepare(model)

# freeze layers
to_freeze = [0, 1, 2, 3]
freeze_layers(model, to_freeze)

freeze_layers prints only this single line, because FSDP has already flattened every parameter into one `flat_param`:

0 _fsdp_wrapped_module.flat_param --> freeze
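
As I understand it (an assumption on my part about the internals), with `fsdp_auto_wrap_policy: NO_WRAP` FSDP wraps the whole model as one unit and concatenates every parameter into that single `flat_param`, so "index 0" is the entire model and freezing it freezes everything. A rough plain-PyTorch analogy (the names here are mine, not FSDP's actual internals):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Rough analogy to FSDP's flattening: all parameters collapse into one
# flat tensor, leaving a single requires_grad flag for the whole unit.
flat_param = nn.Parameter(torch.cat([p.detach().flatten() for p in model.parameters()]))
flat_param.requires_grad = False  # "freezing layer 0" now freezes every weight

print(flat_param.numel())  # 4*4 + 4 + 4*2 + 2 = 30
print(flat_param.requires_grad)  # False
```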

Preparing after:

accelerator = Accelerator()
device = accelerator.device

# init model
model = load_model(args.model).to(device)

# freeze layers 0 to 3
to_freeze = [0, 1, 2, 3]
freeze_layers(model, to_freeze)

# wait for all processes to freeze the desired layers
accelerator.wait_for_everyone()
model = accelerator.prepare(model)

I get this error:

Traceback (most recent call last):
  File "train.py", line 673, in <module>
    main(config, output_dir, args)
  File "train.py", line 636, in main
    TA = TrainAgent(config, output_dir, args)
  File "train.py", line 112, in __init__
    self.model = self.accelerator.prepare(self.model)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 681, in prepare
    result = tuple(
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 682, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 556, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 731, in prepare_model
    model = FSDP(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 814, in __init__
    self._fsdp_wrapped_module: FlattenParamsWrapper = FlattenParamsWrapper(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 319, in __init__
    params, param_infos, shared_param_infos = self._init_flatten_params()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 370, in _init_flatten_params
    assert (
AssertionError: expects all parameters to have same requires_grad
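
The assertion comes from FSDP's FlattenParamsWrapper, which requires every parameter inside a flattened unit to share the same `requires_grad`. One possible workaround (my guess, not verified against this exact Accelerate version): freeze whole submodules before `prepare` and use an auto-wrap policy other than NO_WRAP, so that frozen and trainable layers land in separate FSDP units, each internally uniform. A small helper to check that invariant before wrapping:

```python
import torch.nn as nn

def uniform_requires_grad(module):
    # FSDP flattens every parameter of a wrapped unit into one tensor,
    # so all parameters in that unit must share one requires_grad value.
    flags = {p.requires_grad for p in module.parameters()}
    return len(flags) <= 1

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
for p in model[0].parameters():  # freeze the first layer entirely
    p.requires_grad = False

print(uniform_requires_grad(model))     # False: mixed flags, NO_WRAP would hit the assertion
print(uniform_requires_grad(model[0]))  # True: all frozen
print(uniform_requires_grad(model[1]))  # True: all trainable
```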

Any help would be much appreciated :)

Expected behavior

The specified model layers of each respective process should be set to `requires_grad=False`.
