
Conversation

@ananyahjha93 (Contributor) commented May 30, 2024

Added tests (CPU and GPU) comparing torch clipping with OLMo clipping, and fixed clipping for DDP and FSDP `no_shard`.

@epwalsh can we merge this PR so that I can push the DDP one after this?
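
For context, a comparison test along the lines described above might look roughly like this. This is a hedged sketch only, not the actual test added in this PR; `clip_fn` stands in for the project's clipping routine and its signature is an assumption.

```python
import torch

def check_clipping_matches_torch(clip_fn, max_norm=1.0):
    # Hypothetical CPU check: `clip_fn(params, max_norm)` is assumed to clip
    # each param.grad in place and return the pre-clip total gradient norm.
    torch.manual_seed(0)
    params = [torch.nn.Parameter(torch.randn(8, 8)) for _ in range(3)]
    loss = sum((p ** 2).sum() for p in params)
    loss.backward()

    # Reference: torch's own clipping, run on a cloned copy of the gradients.
    ref_params = [torch.nn.Parameter(p.detach().clone()) for p in params]
    for rp, p in zip(ref_params, params):
        rp.grad = p.grad.detach().clone()
    ref_norm = torch.nn.utils.clip_grad_norm_(ref_params, max_norm)

    # Implementation under test.
    norm = clip_fn(params, max_norm)

    torch.testing.assert_close(torch.as_tensor(norm), ref_norm)
    for p, rp in zip(params, ref_params):
        torch.testing.assert_close(p.grad, rp.grad)
```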

@ananyahjha93 requested review from epwalsh and dirkgr May 30, 2024 09:50
@ananyahjha93 merged commit 55c1e2f into main May 30, 2024
@ananyahjha93 deleted the no_shard_ddp_clip branch May 30, 2024 13:40
@epwalsh (Member) left a comment


@ananyahjha93 please don't merge anything into main without a review. This PR touches some critical components, and I'm pretty sure your changes will force an unnecessary host-device sync on every batch, which could have a significant negative impact on training throughput.
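
For readers following along: the concern is that reading any GPU scalar back on the CPU every step blocks the host until the device has finished all queued work. A minimal sketch of the pattern, assuming a hypothetical helper (this is not the actual OLMo clipping code; the name, arguments, and structure are illustrative only):

```python
import torch

def clip_and_maybe_collect(params, max_norm, collect_param_metrics=False):
    # Hypothetical helper, loosely mirroring the pattern under discussion.
    grads = [p.grad for p in params if p.grad is not None]

    # Total gradient norm, computed entirely on the device -- no sync yet.
    total_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads])
    )
    clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)
    for g in grads:
        g.mul_(clip_coef)  # still asynchronous on CUDA devices

    metrics = {}
    if collect_param_metrics:
        # .item() copies the scalar to the host and blocks the CPU until every
        # kernel queued so far has finished -- this is the host-device sync.
        metrics["total_grad_norm"] = total_norm.item()
    return metrics
```

With `collect_param_metrics=False` the step stays asynchronous and the CPU can keep queuing work for the next batch, which is why hard-coding the flag to `True` (as in the hunks below) would pay the sync cost on every batch rather than only on steps where the metrics are actually logged.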

Comment on lines -214 to +222
```diff
-group, max_norm_ratio, global_step, all_metrics, collect_param_metrics=collect_param_metrics
+group, max_norm_ratio, global_step, all_metrics, collect_param_metrics=True
```
@epwalsh (Member)

Why this change? This will force a host-device sync on every batch.

@ananyahjha93 (Contributor, Author)

let me send in a PR for this fix!

Comment on lines -218 to +226
```diff
-group, max_norm, all_metrics, collect_param_metrics=collect_param_metrics
+group, max_norm, all_metrics, collect_param_metrics=True
```
@epwalsh (Member)

Same question here.
