Fixes clipping #601
Conversation
@ananyahjha93 please don't merge anything into main without a review. This PR touches some critical components, and I'm pretty sure your changes will force an unnecessary host-device sync on every batch, which could have a big negative impact on training throughput.
```diff
-    group, max_norm_ratio, global_step, all_metrics, collect_param_metrics=collect_param_metrics
+    group, max_norm_ratio, global_step, all_metrics, collect_param_metrics=True
```
Why this change? This will force a host-device sync on every batch.
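A minimal sketch of the concern (illustrative only, not the repo's actual clipping code or API): when per-parameter metrics are collected unconditionally, scalar values typically get pulled to the host via `.item()` or `.cpu()`, and any such call on a CUDA tensor blocks the CPU until all queued kernels finish, adding a synchronization point to every step.

```python
import torch

def clip_grads(params, max_norm, collect_param_metrics=False):
    # Hypothetical clipping routine, named only for illustration.
    metrics = {}
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads])
    )
    clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)
    for g in grads:
        g.mul_(clip_coef)  # stays on device; no sync required
    if collect_param_metrics:
        # .item() copies the scalar to the host, which forces the CPU to wait
        # for the GPU -- a host-device sync on every batch this branch runs.
        metrics["grad/total_norm"] = total_norm.item()
    return metrics
```

Keeping `collect_param_metrics` as a pass-through flag lets the trainer skip that branch on most steps, which is why hard-coding it to `True` is costly.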
Let me send in a PR for this fix!
```diff
-    group, max_norm, all_metrics, collect_param_metrics=collect_param_metrics
+    group, max_norm, all_metrics, collect_param_metrics=True
```
Same question here.
Added tests (CPU and GPU) that compare torch clipping against OLMo clipping, and fixed clipping for DDP and FSDP no_shard.
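A hedged sketch of the kind of parity test described above; the real tests would exercise the repo's clipping path, which is stood in for here by a small manual clip so the example runs on its own.

```python
import copy
import torch

def test_clipping_matches_torch(max_norm: float = 0.1):
    torch.manual_seed(0)
    ref_model = torch.nn.Linear(8, 8)
    test_model = copy.deepcopy(ref_model)

    x = torch.randn(4, 8)
    ref_model(x).sum().backward()
    test_model(x).sum().backward()

    # Reference: PyTorch's built-in clipping.
    torch.nn.utils.clip_grad_norm_(ref_model.parameters(), max_norm)

    # Candidate: in the real tests this would be the repo's clipping routine
    # (replaced here by a manual clip purely for illustration).
    grads = [p.grad for p in test_model.parameters()]
    total_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads])
    )
    clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)
    for g in grads:
        g.mul_(clip_coef)

    # The two clipping implementations should leave identical gradients.
    for p_ref, p_test in zip(ref_model.parameters(), test_model.parameters()):
        torch.testing.assert_close(p_ref.grad, p_test.grad)
```

The same check can be parameterized over device (CPU/GPU) and over DDP / FSDP no_shard wrappers to cover the cases this PR fixes.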
@epwalsh can we merge this PR so that I can push the DDP one after this?