@JamesLim-sy JamesLim-sy commented Sep 15, 2022

PR types

Function optimization

PR changes

OPs

Describe

  • Feature:
    Package the data loading for cases where most input tensors need broadcasting, and improve broadcast kernel performance under the following conditions:

  • Source: op benchmark case_8

| input_1.shape | input_2.shape | Dtype | PaddlePR (µs) | PaddleDev (µs) | Perf diff vs. Dev | PyTorch (µs) | Perf diff vs. PyTorch |
|---|---|---|---|---|---|---|---|
| [32,1,1,128] | [1,12,128,1] | FP16 | 24.2 | 35.03 | +30.92% | 27.5 | +12% |
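To make the benchmark shapes concrete, here is a minimal sketch of how a broadcast output shape can be derived from the input shapes. It assumes the usual NumPy-style rule (shapes are right-aligned, and along each axis every extent is either 1 or equal to the common extent); the helper names `broadcast_shape` and `needs_broadcast` are hypothetical and are not part of Paddle's API.

```python
from typing import List, Sequence


def broadcast_shape(*shapes: Sequence[int]) -> List[int]:
    """Return the NumPy-style broadcast result shape of the given shapes."""
    rank = max(len(s) for s in shapes)
    # Left-pad shorter shapes with 1s so every shape has the same rank.
    padded = [[1] * (rank - len(s)) + list(s) for s in shapes]
    result = []
    for dims in zip(*padded):
        # Along each axis, all extents other than 1 must agree.
        extents = {d for d in dims if d != 1}
        if len(extents) > 1:
            raise ValueError(f"incompatible extents {dims}")
        result.append(extents.pop() if extents else 1)
    return result


def needs_broadcast(shape: Sequence[int], out_shape: Sequence[int]) -> bool:
    """True if an input differs from the output shape and must be expanded."""
    return list(shape) != list(out_shape)


if __name__ == "__main__":
    a, b = [32, 1, 1, 128], [1, 12, 128, 1]
    out = broadcast_shape(a, b)
    print(out, needs_broadcast(a, out), needs_broadcast(b, out))
```

For the case_8 shapes, both inputs differ from the output shape, so every input needs broadcasting, which is the situation this PR targets.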
  • Source: typical AlphaFold ternary broadcast cases

| input_1.shape | input_2.shape | input_3.shape | Dtype | PaddlePR (µs) | Speedup vs. FP32 (PR) | PaddleDev (µs) | Speedup vs. FP32 (Dev) | PR perf vs. Dev |
|---|---|---|---|---|---|---|---|---|
| [1, 256, 4, 256, 256] | [1, 256, 1, 1, 256] | [1, 1, 4, 256, 256] | FP32 | 398.46 | 1.00 | 434.82 | 1.00 | +8.36% |
| -- | -- | -- | BF16 | 263.36 | 1.51 | 411.87 | 1.06 | +36.06% |
| -- | -- | -- | FP16 | 242.72 | 1.64 | 406.26 | 1.07 | +40.26% |
| [1, 2048, 3584] | [1, 1, 3584] | [1, 2048, 1] | FP32 | 49.93 | 1.00 | 49.92 | 1.00 | -0.02% |
| -- | -- | -- | BF16 | 30.08 | 1.66 | 38.62 | 1.29 | +22.10% |
| -- | -- | -- | FP16 | 27.56 | 1.81 | 36.23 | 1.38 | +23.93% |
| [1, 256, 256] | [1, 1, 256] | [1, 256, 1] | FP32 | 5.86 | 1.00 | 5.96 | 1.00 | +1.54% |
| -- | -- | -- | BF16 | 5.73 | 1.02 | 5.94 | 1.00 | +3.53% |
| -- | -- | -- | FP16 | 5.67 | 1.03 | 5.77 | 1.03 | +1.80% |
  • Source: ternary add kernel performance in fused_gate_attention in AlphaFold

| Dtype | PaddlePR (µs) | Speedup vs. FP32 (PR) | PaddleDev (µs) | Speedup vs. FP32 (Dev) | PR perf vs. Dev |
|---|---|---|---|---|---|
| FP32 | 465.69 | 1.00 | 558.19 | 1.00 | +16.57% |
| BF16 | 307.31 | 1.52 | 595.16 | 0.94 | +48.36% |
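The PR does not state how the derived columns are computed. The sketch below is an assumption that reproduces the listed values to within rounding: the speedup column is taken as the FP32 time divided by the time for the given dtype on the same branch, and the "PR perf" column as the time saved relative to the Dev branch. The function names are hypothetical.

```python
def speedup_vs_fp32(fp32_us: float, dtype_us: float) -> float:
    """Assumed definition: FP32 kernel time divided by the dtype's kernel time."""
    return fp32_us / dtype_us


def perf_gain_over_dev(pr_us: float, dev_us: float) -> float:
    """Assumed definition: percentage of Dev kernel time saved by the PR."""
    return (dev_us - pr_us) / dev_us * 100.0


if __name__ == "__main__":
    # fused_gate_attention ternary add, timings in microseconds from the PR.
    print(round(perf_gain_over_dev(465.69, 558.19), 2))  # FP32, PR vs. Dev
    print(round(speedup_vs_fp32(465.69, 307.31), 2))     # BF16 speedup on PR
    print(round(speedup_vs_fp32(558.19, 595.16), 2))     # BF16 speedup on Dev
```

Under this reading, a Dev speedup below 1.00 means that BF16 was slower than FP32 before this change.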

paddle-bot bot commented Sep 15, 2022

Your PR has been submitted. Thanks for your contribution to the open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JamesLim-sy changed the title from "first commit" to "Performance fix for broadcast kernel [Part4]" Sep 15, 2022
@JamesLim-sy changed the title from "Performance fix for broadcast kernel [Part4]" to "Performance fix for broadcast kernel [Part3]" Sep 17, 2022
@JamesLim-sy
Contributor Author

Successfully built in the local Kunlun-KP-Build environment.

num,
block_offset,
read_lens,
func);
}
#else
Contributor


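Actually, the original idea behind KP was to avoid adding this kind of conditional wherever possible. Once it is added, there is no difference from writing two separate kernels...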

Contributor Author


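I really could not think of anything else left to optimize for AlphaFold... Optimizing a kernel like this, where the computation is simple but the performance requirements are very high, is like growing flowers in the desert T_T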
