
FSDP state dict OOM during model saving #98823

@wanchaol

🐛 Describe the bug

See related user reports in tatsu-lab/stanford_alpaca#81 and lm-sys/FastChat#256.

A workaround that the community is applying is:

Assuming you are using torch==1.13.0, in `python/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224`, change `state_dict[fqn] = state_dict[fqn].clone().detach()` to `state_dict[fqn] = state_dict[fqn].cpu().clone().detach()`

This is pretty manual monkey patching, and we should really fix this in PyTorch directly.
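As an aside, a workaround that does not require patching installed sources is to ask FSDP to offload the full state dict to CPU as it is gathered, via `FullStateDictConfig(offload_to_cpu=True)` together with the `FSDP.state_dict_type` context manager. A minimal sketch (the `model` and `save_path` names are placeholders; assumes a recent torch where `FullStateDictConfig` is available):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

def save_full_state_dict(model: FSDP, save_path: str) -> None:
    # Gather the full (unsharded) state dict, moving tensors to CPU as
    # they are assembled instead of materializing everything on GPU.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = model.state_dict()
    # With rank0_only=True, only rank 0 holds the full dict; other
    # ranks receive an empty mapping and skip the save.
    if dist.get_rank() == 0:
        torch.save(state_dict, save_path)
```

This avoids the GPU OOM because each gathered parameter is copied to host memory before the next one is materialized, rather than accumulating the whole full state dict on the device.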

@fegin @awgu @rohan-varma @zhaojuanmao

Versions

This has been happening since PyTorch 1.13, and I don't think it has been fixed so far.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

Labels: module: fsdp, oncall: distributed, triaged
