[hybrid bug fix] Fix mp multi gradient clip prob #35713
Conversation
Thanks for your contribution!
Please add a comparison of hybrid-parallel accuracy with MP enabled vs. disabled.
Force-pushed from 5e1fe33 to 7ebecff
# Therefore, we prune those duplicated vars for grad clip.
if mp_rank > 0 and (not (hasattr(input_var, 'is_distributed')
                         and input_var.is_distributed)):
    removed_op_idx.append(idx)
Strictly speaking, this way of pruning is not rigorous: it would cause problems with the earlier global_clip implementation based on square and sum ops. But since the implementation has been switched to squarel2norm, it works fine now.
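To make the concern concrete, here is a toy sketch (plain Python dicts standing in for program ops; the helper and field names are hypothetical, only `is_distributed` mirrors the attribute checked in this PR) of why pruning whole ops is exact when each grad has its own squared_l2_norm op, but ambiguous when a single sum op mixes replicated and distributed inputs:

```python
# Toy illustration only: dicts stand in for ops; everything except
# "is_distributed" is a hypothetical name for this sketch.

def prune_for_mp_rank(ops, mp_rank):
    """Drop an op on mp_rank > 0 if any grad it reads is not distributed."""
    kept = []
    for op in ops:
        if mp_rank > 0 and any(not v["is_distributed"] for v in op["inputs"]):
            continue  # replicated grad: counted on mp_rank 0 only
        kept.append(op)
    return kept

# squared_l2_norm style: one op per grad, so pruning removes exactly that
# grad's contribution and nothing else.
per_grad_ops = [
    {"type": "squared_l2_norm",
     "inputs": [{"name": "scale@GRAD", "is_distributed": False}]},
    {"type": "squared_l2_norm",
     "inputs": [{"name": "fc_w@GRAD", "is_distributed": True}]},
]
print([op["inputs"][0]["name"] for op in prune_for_mp_rank(per_grad_ops, 1)])
# -> ['fc_w@GRAD']

# square + sum style: one sum op reads both kinds of inputs, so dropping the
# whole op would also lose the distributed grad's contribution -- the case
# the op-level pruning rule cannot handle cleanly.
mixed_ops = [
    {"type": "sum", "inputs": [
        {"name": "scale@GRAD.square", "is_distributed": False},
        {"name": "fc_w@GRAD.square", "is_distributed": True},
    ]},
]
print(prune_for_mp_rank(mixed_ops, 1))  # -> [] (everything pruned, incorrectly)
```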
Force-pushed from a9763af to ee1eff2
Force-pushed from 70baf73 to d9b6d50
LGTM
'cast', 'sum', 'fill_constant', 'cast', 'sum', 'fill_constant',
'cast', 'sum', 'c_sync_comm_stream', 'check_finite_and_unscale',
'cast', 'c_allreduce_max', 'c_allreduce_max', 'cast',
'update_loss_scaling', 'fill_constant', 'c_allreduce_sum',
👻
        OP_ROLE_KEY: OpRole.Optimize,
    })
    return
for idx, op in list(enumerate(block.ops)):
Actually, traversing in reverse order would be a bit better here.
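For reference, a minimal sketch (a plain list standing in for block.ops, not Paddle code) of why a reverse traversal is safer when ops are removed by index: deleting while walking forward shifts the indices that are still to be visited.

```python
# Minimal sketch: a plain list stands in for block.ops.
ops = ["fill_constant", "square", "sum", "clip"]
removed_op_idx = [1, 2]

# Deleting from the back means earlier indices are never shifted.
for idx in sorted(removed_op_idx, reverse=True):
    del ops[idx]

print(ops)  # ['fill_constant', 'clip']
```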
LGTM
PR types
Bug fixes
PR changes
Others
Describe
Under mp (model parallelism), some vars are not distributed, e.g. scale, bias, etc. If these vars are accumulated on every mp rank during GradientClipByGlobalNorm, the final global norm comes out slightly larger than it should be.

After the fix, grads whose is_distributed is False contribute to the norm accumulation only on the mp_rank 0 node; the global norm is then obtained via c_allreduce_sum.
mp_rank=1 program before the fix:
mp_rank=1 program after the fix:
