- Compare FSDP2 and FSDP1
  - w/ TP > 1
  - sequence parallel
  - activation checkpointing
  - cpu offload
  - target context length: 32k, llama3-8b
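
One way to make the comparison concrete is to enumerate the runs it implies. The sketch below is only an illustration: the axes come from the items above, but the specific TP degrees, the on/off treatment of each feature, the reading of 32k as 32 * 1024 tokens, and all names (`build_matrix`, the dict keys) are assumptions, not part of any existing script.

```python
from itertools import product

# Axes taken from the items above. The concrete values are assumptions.
FSDP_VERSIONS = ("FSDP1", "FSDP2")
TP_DEGREES = (2, 4, 8)      # hypothetical choices satisfying TP > 1
TOGGLES = (False, True)     # each feature is treated as simply off/on

MODEL = "llama3-8b"
SEQ_LEN = 32 * 1024         # assumes "32k" means 32768 tokens


def build_matrix():
    """Return one config dict per combination of the axes above."""
    runs = []
    for fsdp, tp, sp, ac, offload in product(
        FSDP_VERSIONS, TP_DEGREES, TOGGLES, TOGGLES, TOGGLES
    ):
        runs.append(
            {
                "fsdp": fsdp,
                "tp": tp,
                "sequence_parallel": sp,
                "activation_checkpointing": ac,
                "cpu_offload": offload,
                "model": MODEL,
                "seq_len": SEQ_LEN,
            }
        )
    return runs


runs = build_matrix()
print(f"{len(runs)} runs")
for run in runs[:3]:
    print(run)
```

Every non-FSDP setting appears once under FSDP1 and once under FSDP2, so each pair of runs differs only in the FSDP version. In practice the full cross product is likely more than is needed, and pruning it (for example, fixing one TP degree) would be a reasonable first step.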