Description
System Info
- `Accelerate` version: 0.14.0.dev0
- Platform: Linux-5.4.0-128-generic-x86_64-with-glibc2.17
- Python version: 3.8.12
- Numpy version: 1.21.5
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: fp16
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: None
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {'fsdp_auto_wrap_policy': 'NO_WRAP', 'fsdp_backward_prefetch_policy': 'NO_PREFETCH', 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT'}
- downcast_bf16: no
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
I'm running into issues when freezing weights during multi-GPU training with FSDP. I've tried preparing the model both before and after freezing the weights, with different but equally disappointing results.
```python
def freeze_layers(model, to_freeze, verbose=True):
    for i, (name, param) in enumerate(model.named_parameters()):
        freeze = '--> freeze' if i in to_freeze else ''
        if verbose:
            print(i, name, ' ' * (45 - len(name) - len(str(i))), freeze)  # >>> layer_name ---> freeze
        if freeze:
            param.requires_grad = False
```
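For reference, here is a minimal standalone check of this helper on a toy model (plain `nn.Sequential`, no FSDP involved — the model is mine, just for illustration):

```python
import torch.nn as nn

def freeze_layers(model, to_freeze, verbose=True):
    # same helper as above: freeze parameters by index in named_parameters()
    for i, (name, param) in enumerate(model.named_parameters()):
        if i in to_freeze:
            param.requires_grad = False

# toy model: two Linear layers -> 4 parameters (weight and bias for each)
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
freeze_layers(model, to_freeze=[0, 1], verbose=False)  # freeze the first layer's weight and bias

flags = [p.requires_grad for p in model.parameters()]
print(flags)  # [False, False, True, True]
```

Without FSDP this behaves exactly as expected; the problem only shows up once the model is wrapped.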
Preparing before:
```python
accelerator = Accelerator()
device = accelerator.device

# init model
model = load_model(args.model).to(device)
model = accelerator.prepare(model)

# freeze layers
to_freeze = [0, 1, 2, 3]
freeze_layers(model, to_freeze)
```
`freeze_layers` prints this to console:
```
0 _fsdp_wrapped_module.flat_param --> freeze
```
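That single entry is expected with `fsdp_auto_wrap_policy: 'NO_WRAP'`: FSDP flattens every parameter of the wrapped module into one `flat_param`, so index-based freezing no longer maps to individual layers. A rough illustration of the flattening idea (this is my own sketch, not the actual FSDP implementation):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

# roughly what FlattenParamsWrapper does: concatenate all parameters of the
# wrapped module into a single 1-D tensor, which then shows up as the one
# and only named parameter (flat_param)
flat_param = torch.cat([p.detach().reshape(-1) for p in model.parameters()])
n_params = sum(p.numel() for p in model.parameters())
print(flat_param.numel() == n_params)  # True: one flat tensor holds everything
```

Freezing index 0 therefore freezes the entire model, not layer 0.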
Preparing after:
```python
accelerator = Accelerator()
device = accelerator.device

# init model
model = load_model(args.model).to(device)

# freeze layers 0 to 3
to_freeze = [0, 1, 2, 3]
freeze_layers(model, to_freeze)

# wait for all processes to freeze the desired layers
accelerator.wait_for_everyone()
model = accelerator.prepare(model)
```
I get this error:
```
Traceback (most recent call last):
  File "train.py", line 673, in <module>
    main(config, output_dir, args)
  File "train.py", line 636, in main
    TA = TrainAgent(config, output_dir, args)
  File "train.py", line 112, in __init__
    self.model = self.accelerator.prepare(self.model)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 681, in prepare
    result = tuple(
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 682, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 556, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 731, in prepare_model
    model = FSDP(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 814, in __init__
    self._fsdp_wrapped_module: FlattenParamsWrapper = FlattenParamsWrapper(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 319, in __init__
    params, param_infos, shared_param_infos = self._init_flatten_params()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 370, in _init_flatten_params
    assert (
AssertionError: expects all parameters to have same requires_grad
```
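The assertion comes from FSDP's flattening step: all parameters inside one flatten unit must share the same `requires_grad` value. One possible workaround (my own assumption, not an official recipe) is to freeze whole submodules before `prepare()` and use an auto-wrap policy so that each submodule becomes its own FSDP unit; the per-unit uniformity can be sanity-checked without any distributed setup:

```python
import torch.nn as nn

def requires_grad_is_uniform(module: nn.Module) -> bool:
    """True if all parameters under `module` share one requires_grad value."""
    flags = {p.requires_grad for p in module.parameters()}
    return len(flags) <= 1

# toy model standing in for the real one
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
for p in model[0].parameters():  # freeze the whole first layer, not individual indices
    p.requires_grad = False

print(requires_grad_is_uniform(model))                   # False: mixed flags at the top level
print(all(requires_grad_is_uniform(m) for m in model))   # True: each child is uniform
```

With `NO_WRAP` the whole model is one flatten unit (the top-level check), which is exactly what the assertion rejects when some layers are frozen and others are not.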
Any help would be much appreciated :)
Expected behavior
The specified model layers in each respective process should be set to requires_grad=False.