[hybrid bug fix] Fix mp multi gradient clip prob #35713
Conversation
Thanks for your contribution!
Please add a comparison of hybrid-parallel accuracy with MP enabled vs. disabled.
Force-pushed from 5e1fe33 to 7ebecff
# Therefore, we prune those duplicated vars for grad clip.
if mp_rank > 0 and (not (hasattr(input_var, 'is_distributed')
                         and input_var.is_distributed)):
    removed_op_idx.append(idx)
Strictly speaking, this way of pruning is not rigorous: it would cause problems with the earlier global_clip implementation based on square and sum ops. But since the implementation has been switched to squarel2norm, it works fine now.
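To make the concern concrete, here is a toy sketch (plain Python dicts standing in for program ops; the helper and field names are hypothetical, only `is_distributed` mirrors the attribute checked in this PR) of why pruning whole ops is exact when each grad has its own squared_l2_norm op, but ambiguous when a single sum op mixes replicated and distributed inputs:

```python
# Toy illustration only: dicts stand in for ops; everything except
# "is_distributed" is a hypothetical name for this sketch.

def prune_for_mp_rank(ops, mp_rank):
    """Drop an op on mp_rank > 0 if any grad it reads is not distributed."""
    kept = []
    for op in ops:
        if mp_rank > 0 and any(not v["is_distributed"] for v in op["inputs"]):
            continue  # replicated grad: counted on mp_rank 0 only
        kept.append(op)
    return kept

# squared_l2_norm style: one op per grad, so pruning removes exactly that
# grad's contribution and nothing else.
per_grad_ops = [
    {"type": "squared_l2_norm",
     "inputs": [{"name": "scale@GRAD", "is_distributed": False}]},
    {"type": "squared_l2_norm",
     "inputs": [{"name": "fc_w@GRAD", "is_distributed": True}]},
]
print([op["inputs"][0]["name"] for op in prune_for_mp_rank(per_grad_ops, 1)])
# -> ['fc_w@GRAD']

# square + sum style: one sum op reads both kinds of inputs, so dropping the
# whole op would also lose the distributed grad's contribution -- the case
# the op-level pruning rule cannot handle cleanly.
mixed_ops = [
    {"type": "sum", "inputs": [
        {"name": "scale@GRAD.square", "is_distributed": False},
        {"name": "fc_w@GRAD.square", "is_distributed": True},
    ]},
]
print(prune_for_mp_rank(mixed_ops, 1))  # -> [] (everything pruned, incorrectly)
```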
Force-pushed from a9763af to ee1eff2
Force-pushed from 70baf73 to d9b6d50
LGTM
'cast', 'sum', 'fill_constant', 'cast', 'sum', 'fill_constant',
'cast', 'sum', 'c_sync_comm_stream', 'check_finite_and_unscale',
'cast', 'c_allreduce_max', 'c_allreduce_max', 'cast',
'update_loss_scaling', 'fill_constant', 'c_allreduce_sum',
👻
        OP_ROLE_KEY: OpRole.Optimize,
    })
    return
for idx, op in list(enumerate(block.ops)):
Actually, traversing in reverse order would be a bit better here.
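For reference, a minimal sketch (a plain list standing in for block.ops, not Paddle code) of why a reverse traversal is safer when ops are removed by index: deleting while walking forward shifts the indices that are still to be visited.

```python
# Minimal sketch: a plain list stands in for block.ops.
ops = ["fill_constant", "square", "sum", "clip"]
removed_op_idx = [1, 2]

# Deleting from the back means earlier indices are never shifted.
for idx in sorted(removed_op_idx, reverse=True):
    del ops[idx]

print(ops)  # ['fill_constant', 'clip']
```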
LGTM
PR types
Bug fixes
PR changes
Others
Describe
Under mp (model parallelism), some vars are not distributed, e.g. scale, bias, etc. If these vars are accumulated on every mp rank during GradientClipByGlobalNorm, the final global norm comes out slightly larger than it should be.

After the fix, grads whose is_distributed is False contribute to the norm accumulation only on the mp_rank 0 node; the global norm is then obtained via c_allreduce_sum.
mp_rank=1 program before the fix:
mp_rank=1 program after the fix:
