
[Open Source Task] Full Rollout of Paddle CPU/GPU Kernel Precision Fixes #72667


Description

@lshpku

1. Background

Paddle is currently carrying out a systematic review of the boundary correctness of its entire API surface, and we have developed PaddleAPITest to find APIs with correctness problems. By running each API against the corresponding Torch API and comparing results, we found that some Paddle APIs show precision diffs relative to Torch. An initial check of a small number of APIs confirmed that Paddle does have some genuine correctness problems (along the way we also found a few correctness problems in Torch APIs, such as torch.tril and torch.triu). We are now publishing these problematic Paddle APIs and inviting the community to help fix them.
By taking part in this task, you will learn the design of Paddle's operator library framework, get a detailed view of how Paddle CPU and GPU kernels are implemented, and build practical experience in debugging operator precision issues.

2. Task Description

2.1 Task Overview and Assignment

For the Paddle APIs that PaddleAPITest has flagged as having precision diffs against Torch, find the root cause of each diff and fix it.
The APIs with precision diffs, and their current assignments, are listed below:

Important

Difficulty per task: 0.15×🌟
A walkthrough of the tasks is in the recording: https://meeting.tencent.com/crm/l59EWmRZc4 (00:52:00~00:59:30)

| No. | API | Kernel | Assignee / Status / PR |
|-----|-----|--------|------------------------|
| 1 | paddle.add_n | CPU | @BeingGod<br>@cangtianhuang #73562 |
| 2 | paddle.all | GPU | @BeingGod<br>@cangtianhuang<br>@co63oc #280 |
| 3 | paddle.all | CPU | @BeingGod<br>@cangtianhuang<br>@co63oc #280 |
| 4 | paddle.any | GPU | @BeingGod<br>@cangtianhuang<br>@co63oc #280 |
| 5 | paddle.any | CPU | @BeingGod<br>@cangtianhuang<br>@co63oc #280 |
| 6 | paddle.argsort | GPU | @cszdrg #219 |
| 7 | paddle.argsort | CPU | @cszdrg #219 |
| 8 | paddle.bitwise_right_shift | CPU | @ooooo-create #194 |
| 9 | paddle.broadcast_to | GPU | @ooooo-create #72992 |
| 10 | paddle.broadcast_to | CPU | @ooooo-create #72992 |
| 11 | paddle.clip | CPU | @BeingGod<br>@ooooo-create<br>@co63oc #73492 |
| 12 | paddle.concat | CPU | @BeingGod<br>@NKNaN #329 |
| 13 | paddle.copysign | GPU | @BeingGod |
| 14 | paddle.count_nonzero | CPU | @BeingGod<br>@co63oc #280 |
| 15 | paddle.cumprod | CPU | @BeingGod<br>@ooooo-create #72897 #212 |
| 16 | paddle.cumsum | GPU | @BeingGod<br>@cangtianhuang #73658 #74081 |
| 17 | paddle.cumsum | CPU | @BeingGod<br>@cangtianhuang<br>@fxyfxy777 #526 |
| 18 | paddle.diag | CPU | @cangtianhuang #269 |
| 19 | paddle.dot | GPU | @co63oc #316 |
| 20 | paddle.expand | CPU | @ccsuzzh<br>@ooooo-create #72992 |
| 21 | paddle.heaviside | CPU | @ooooo-create #72894 |
| 22 | paddle.kron | CPU | @ooooo-create #73521 |
| 23 | paddle.linalg.cholesky_solve | GPU | @NKNaN #330 |
| 24 | paddle.linalg.cholesky_solve | CPU | @NKNaN #330 |
| 25 | paddle.linalg.eigh | GPU | @NKNaN #276 |
| 26 | paddle.linalg.eigvals | CPU | @XvLingWYY<br>@ooooo-create #429 |
| 27 | paddle.linalg.eigvalsh | CPU | @XvLingWYY<br>@ooooo-create #429 |
| 28 | paddle.linalg.norm | GPU | @ooooo-create #197 |
| 29 | paddle.linalg.norm | CPU | @ooooo-create #197 |
| 30 | paddle.linalg.pinv | GPU | @hushenwei2000<br>@ooooo-create #411 |
| 31 | paddle.linalg.pinv | CPU | @hushenwei2000<br>@ooooo-create #411 |
| 32 | paddle.linalg.svd_lowrank | GPU | @co63oc #347 |
| 33 | paddle.linalg.svd_lowrank | CPU | @co63oc #347 |
| 34 | paddle.logit | GPU | @ooooo-create #72973 |
| 35 | paddle.logit | CPU | @ooooo-create #72973 |
| 36 | paddle.maximum | CPU | @BeingGod<br>@Z-NAVY<br>@co63oc #286 |
| 37 | paddle.minimum | CPU | @BeingGod<br>@Z-NAVY<br>@co63oc #286 |
| 38 | paddle.nextafter | GPU | @ooooo-create #72965 |
| 39 | paddle.nextafter | CPU | @ooooo-create #72965 |
| 40 | paddle.nn.functional.cosine_similarity | GPU | @BeingGod<br>@Z-NAVY<br>@Cutelemon6 #73014 |
| 41 | paddle.nn.functional.cosine_similarity | CPU | @BeingGod<br>@Z-NAVY<br>@Cutelemon6 #73014 |
| 42 | paddle.nn.functional.embedding | CPU | @Z-NAVY<br>@ooooo-create #73445 |
| 43 | paddle.nn.functional.grid_sample | GPU | @Juggler-YAN<br>@ghost<br>@ooooo-create #74204<br>@zhengshengning #74555 |
| 44 | paddle.nn.functional.grid_sample | CPU | @Juggler-YAN<br>@ghost<br>@ooooo-create #74204<br>@zhengshengning #74555 |
| 45 | paddle.nn.functional.gumbel_softmax | GPU | @ooooo-create #303 |
| 46 | paddle.nn.functional.gumbel_softmax | CPU | @ooooo-create #303 |
| 47 | paddle.nn.functional.hardsigmoid | CPU | @BeingGod<br>@ooooo-create<br>@co63oc #281 |
| 48 | paddle.nn.functional.rrelu | GPU | @BeingGod<br>@Z-NAVY<br>@ooooo-create #234 |
| 49 | paddle.nn.functional.rrelu | CPU | @BeingGod<br>@Z-NAVY<br>@ooooo-create #234 |
| 50 | paddle.prod | CPU | @BeingGod<br>@ooooo-create #251 |
| 51 | paddle.put_along_axis | CPU | @ooooo-create #218 |
| 52 | paddle.reciprocal | GPU | @BeingGod<br>@Z-NAVY<br>@ooooo-create #73128 |
| 53 | paddle.rsqrt | GPU | @BeingGod<br>@Z-NAVY<br>@co63oc #314 |
| 54 | paddle.signal.istft | CPU | @co63oc #287 |
| 55 | paddle.std | GPU | @Z-NAVY<br>@ooooo-create #72879 |
| 56 | paddle.std | CPU | @Z-NAVY<br>@ooooo-create #72879 |
| 57 | paddle.sum | CPU | @BeingGod<br>@ooooo-create #73012 |
| 58 | paddle.Tensor.argsort | CPU | @cszdrg #219 |
| 59 | paddle.Tensor.cholesky_solve | GPU | @Juggler-YAN<br>@ghost |
| 60 | paddle.Tensor.cholesky_solve | CPU | @Juggler-YAN<br>@ghost |
| 61 | paddle.Tensor.expand | CPU | @ooooo-create #72992 |
| 62 | paddle.Tensor.fill_diagonal_ | CPU | @co63oc #288 |
| 63 | paddle.Tensor.logit | GPU | @ooooo-create #72973 |
| 64 | paddle.Tensor.logit | CPU | @ooooo-create #72973 |
| 65 | paddle.Tensor.median | GPU | @Z-NAVY<br>@NKNaN #265 |
| 66 | paddle.Tensor.median | CPU | @Z-NAVY<br>@NKNaN #265 |
| 67 | paddle.Tensor.put_along_axis | CPU | @ooooo-create #218 |
| 68 | paddle.Tensor.set_ | CPU | @NKNaN #73294<br>@ooooo-create #427 |
| 69 | paddle.Tensor.sum | CPU | @BeingGod<br>@ooooo-create #73012 |
| 70 | paddle.Tensor.topk | CPU | @ooooo-create #217 |
| 71 | paddle.tensordot | CPU | @co63oc #288 |
| 72 | paddle.trace | CPU | @ooooo-create #73018 |
| 73 | paddle.unique | CPU | @ccsuzzh |
| 74 | paddle.unique_consecutive | GPU | @ccsuzzh #72948 #192 |
| 75 | paddle.unique_consecutive | CPU | @ccsuzzh #72948 #192 |
| 76 | paddle.var | GPU | @ooooo-create #72879 |
| 77 | paddle.var | CPU | @ooooo-create #72879 |
| 78 | paddle.cast | CPU | @Glencsa #73710<br>@zrr1999 #528 |
| 79 | paddle.combinations | GPU | @ooooo-create #73293 |
| 80 | paddle.cumulative_trapezoid | CPU | @ooooo-create #73317 |
| 81 | paddle.diag | GPU | @cangtianhuang #269 |
| 82 | paddle.diagflat | GPU | @cangtianhuang #269 |
| 83 | paddle.diagonal_scatter | GPU | @ooooo-create #304 |
| 84 | paddle.einsum | GPU | @hushenwei2000<br>@ooooo-create #74257 |
| 85 | paddle.einsum | CPU | @hushenwei2000<br>@zrr1999 #445<br>@ooooo-create #74257 |
| 86 | paddle.gammaln | GPU | @NKNaN #73344 |
| 87 | paddle.geometric.send_ue_recv | CPU | @ooooo-create #327 |
| 88 | paddle.geometric.send_uv | CPU | @cangtianhuang #249 |
| 89 | paddle.incubate.nn.functional.fused_bias_act | GPU | @hushenwei2000 #389 |
| 90 | paddle.incubate.nn.functional.fused_bias_dropout_residual_layer_norm | GPU | @hushenwei2000 #74149 |
| 91 | paddle.incubate.nn.functional.fused_layer_norm | CPU | @ooooo-create #343 |
| 92 | paddle.incubate.nn.functional.fused_linear | GPU | @fangfangssj |
| 93 | paddle.incubate.nn.functional.fused_linear_activation | GPU | @fangfangssj |
| 94 | paddle.incubate.nn.functional.fused_multi_head_attention | GPU | @Cutelemon6<br>@fangfangssj |
| 95 | paddle.incubate.nn.functional.fused_rotary_position_embedding | GPU | @hushenwei2000 #453<br>@fangfangssj |
| 96 | paddle.incubate.nn.functional.variable_length_memory_efficient_attention | GPU | @fangfangssj<br>@hushenwei2000 #498 |
| 97 | paddle.incubate.softmax_mask_fuse | GPU | @ooooo-create #342 |
| 98 | paddle.index_put | GPU | @ooooo-create #308 #446<br>@zrr1999 #510 |
| 99 | paddle.lerp | GPU | @co63oc #278 |
| 100 | paddle.linalg.cond | CPU | @ooooo-create #73229 |
| 101 | paddle.linalg.eigh | CPU | @NKNaN #276 |
| 102 | paddle.linalg.lstsq | CPU | @ooooo-create #407 #74160 |
| 103 | paddle.linalg.matrix_rank | CPU | @ooooo-create #73295 |
| 104 | paddle.matmul | GPU | @co63oc #73569 |
| 105 | paddle.max | CPU | @ooooo-create #73229 |
| 106 | paddle.nn.functional.adaptive_avg_pool2d | GPU | @zrr1999 #74077 |
| 107 | paddle.nn.functional.adaptive_avg_pool3d | GPU | @zrr1999 #74102 |
| 108 | paddle.nn.functional.avg_pool3d | GPU | @zrr1999 #400 |
| 109 | paddle.nn.functional.batch_norm | GPU | @zrr1999 #402 |
| 110 | paddle.nn.functional.binary_cross_entropy | GPU | @NKNaN #267 #270 |
| 111 | paddle.nn.functional.binary_cross_entropy_with_logits | GPU | @NKNaN #267 #270 |
| 112 | paddle.nn.functional.conv1d | CPU | @co63oc #306 #73535 |
| 113 | paddle.nn.functional.conv2d | GPU | @co63oc #346 |
| 114 | paddle.nn.functional.conv2d_transpose | CPU | @zrr1999 |
| 115 | paddle.nn.functional.cross_entropy | GPU | @zrr1999 #410 |
| 116 | paddle.nn.functional.ctc_loss | GPU | @ooooo-create #277 |
| 117 | paddle.nn.functional.dice_loss | GPU | @NKNaN #319 |
| 118 | paddle.nn.functional.embedding | GPU | @ooooo-create #73445 |
| 119 | paddle.nn.functional.gaussian_nll_loss | GPU | @cangtianhuang #272 |
| 120 | paddle.nn.functional.interpolate | GPU | @cszdrg #74219 #431 |
| 121 | paddle.nn.functional.interpolate | CPU | @cszdrg #74219 #431 |
| 122 | paddle.nn.functional.kl_div | GPU | @co63oc #279 |
| 123 | paddle.nn.functional.linear | CPU | @zrr1999 #445 |
| 124 | paddle.nn.functional.log_softmax | GPU | @ooooo-create #271 |
| 125 | paddle.nn.functional.multi_margin_loss | GPU | @co63oc #282 #73739 |
| 126 | paddle.nn.functional.rnnt_loss | CPU | @ooooo-create #266 |
| 127 | paddle.nn.functional.sigmoid_focal_loss | GPU | @NKNaN #73430 #292 |
| 128 | paddle.nn.functional.softmax_with_cross_entropy | GPU | @NKNaN #317 |
| 129 | paddle.nn.functional.temporal_shift | GPU | @ooooo-create #326 |
| 130 | paddle.nn.functional.upsample | GPU | @cszdrg #74219 #431 |
| 131 | paddle.nn.utils.parameters_to_vector | GPU | @co63oc #278 |
| 132 | paddle.outer | GPU | @NKNaN #73324<br>@ooooo-create #74182 |
| 133 | paddle.pow | GPU | @cszdrg #73244 #73274 |
| 134 | paddle.prod | GPU | @ooooo-create #251 |
| 135 | paddle.put_along_axis | GPU | @zrr1999 |
| 136 | paddle.scale | GPU | @NKNaN #264 |
| 137 | paddle.scatter | GPU | @ooooo-create #302 |
| 138 | paddle.sort | CPU | @ooooo-create #259 |
| 139 | paddle.strided_slice | CPU | @ooooo-create #250 |
| 140 | paddle.Tensor.matmul | CPU | @co63oc #331 #73094 |
| 141 | paddle.Tensor.mul | CPU | @Glencsa #73770<br>@ooooo-create #74198 |
| 142 | paddle.Tensor.astype | CPU | @Glencsa #73710<br>@zrr1999 #528 |
| 143 | paddle.Tensor.cast | CPU | @Glencsa #73710<br>@zrr1999 #528 |
| 144 | paddle.Tensor.clip | GPU | @co63oc #73492 |
| 145 | paddle.Tensor.fill_diagonal_tensor | GPU | @cangtianhuang #272 |
| 146 | paddle.Tensor.mean | GPU | @NKNaN #73320 |
| 147 | paddle.Tensor.set_ | GPU | @NKNaN #73294<br>@ooooo-create #427 |
| 148 | paddle.Tensor.tile | GPU | @ooooo-create #73454 |
| 149 | paddle.tensordot | GPU | @co63oc #288 |
| 150 | paddle.trapezoid | CPU | @ooooo-create #73317 |
| 151 | paddle.vander | CPU | @cszdrg #252 |
| 152 | paddle.vision.ops.deform_conv2d | GPU | @co63oc #366 |
| 153 | paddle.vision.ops.deform_conv2d | CPU | @co63oc #74009 |
| 154 | paddle.vision.ops.distribute_fpn_proposals | GPU | @cszdrg #252 |
| 155 | paddle.incubate.nn.functional.fused_layer_norm | GPU | output dtype mismatch; do not sign up for now |
| 156 | paddle.nn.functional.conv2d_transpose | GPU | output dtype mismatch; do not sign up for now |
| 157 | paddle.nn.functional.linear | GPU | output dtype mismatch; do not sign up for now |
| 158 | paddle.vision.ops.roi_align | GPU | output dtype mismatch; do not sign up for now |
| 159 | paddle.Tensor.cumsum | GPU | output dtype mismatch; do not sign up for now |
| 160 | paddle.Tensor.frexp | GPU | output dtype mismatch; do not sign up for now |
| 161 | paddle.add | GPU | output dtype mismatch; do not sign up for now |
| 162 | paddle.add_n | GPU | output dtype mismatch; do not sign up for now |
| 163 | paddle.clip | GPU | output dtype mismatch; do not sign up for now<br>@hushenwei2000 #74719 |
| 164 | paddle.copysign | GPU | output dtype mismatch; do not sign up for now |
| 165 | paddle.cummax | GPU | output dtype mismatch; do not sign up for now |
| 166 | paddle.cummin | GPU | output dtype mismatch; do not sign up for now |
| 167 | paddle.cumsum | GPU | output dtype mismatch; do not sign up for now<br>@hushenwei2000 #74625 |
| 168 | paddle.floor | GPU | output dtype mismatch; do not sign up for now<br>@hushenwei2000 #74598 |
| 169 | paddle.frexp | GPU | output dtype mismatch; do not sign up for now |
| 170 | paddle.histogram | GPU | output dtype mismatch; do not sign up for now |
| 171 | paddle.linalg.lstsq | GPU | output dtype mismatch; do not sign up for now |
| 172 | paddle.nn.functional.adaptive_max_pool1d | GPU | output dtype mismatch; do not sign up for now |
| 173 | paddle.nn.functional.adaptive_max_pool2d | GPU | output dtype mismatch; do not sign up for now |
| 174 | paddle.nn.functional.adaptive_max_pool3d | GPU | output dtype mismatch; do not sign up for now |
| 175 | paddle.nn.functional.max_pool1d | GPU | output dtype mismatch; do not sign up for now |
| 176 | paddle.nn.functional.max_pool2d | GPU | output dtype mismatch; do not sign up for now |
| 177 | paddle.nn.functional.max_pool3d | GPU | output dtype mismatch; do not sign up for now |
| 178 | paddle.nn.functional.one_hot | GPU | output dtype mismatch; do not sign up for now |
| 179 | paddle.nn.functional.smooth_l1_loss | GPU | output dtype mismatch; do not sign up for now |
| 180 | paddle.where | GPU | output dtype mismatch; do not sign up for now |
| 181 | paddle.maximum | GPU | @wanghuancoder #71716 |
| 182 | paddle.minimum | GPU | @wanghuancoder #71716 |
| 183 | paddle.Tensor.getitem | GPU | @wanghuancoder #71716 |
| 184 | paddle.ldexp | GPU | output dtype mismatch under float16; do not sign up for now |
| 185 | paddle.ldexp | CPU | output dtype mismatch under float16; do not sign up for now |
| 186 | paddle.ldexp_ | GPU | output dtype mismatch under float16; do not sign up for now |
| 187 | paddle.ldexp_ | CPU | output dtype mismatch under float16; do not sign up for now |

2.2 Fix Suggestions

  1. For precision diff problems, you can dump the input and output Tensors to files with numpy.savetxt and analyze the diff data to look for an error pattern (see the sketch after this list).
  2. For CUDA Error failures, you can debug with the following function:
```cpp
void test_cuda(const std::string& str) {
  std::cout << str << " begin" << std::endl;
  // 1. wait for all kernels to finish
  PADDLE_ENFORCE_GPU_SUCCESS(cudaDeviceSynchronize());
  // 2. get the error state
  PADDLE_ENFORCE_GPU_SUCCESS(cudaGetLastError());
  // 3. check for CUDA error 700 (illegal memory access)
  size_t bytes = 256;
  char* cuda_mem;
  char* cpu_mem = new char[bytes + 1];
  cudaMalloc(&cuda_mem, bytes + 1);
  cudaMemset(cuda_mem, 0, bytes + 1);
  cudaMemcpyAsync(cpu_mem, cuda_mem, bytes, cudaMemcpyDeviceToHost);
  cudaFree(cuda_mem);
  delete[] cpu_mem;
  std::cout << str << " end" << std::endl;
}
```

Insert "stakes" (calls to this function) at each stage of the XXXKernel implementation; the code just before the first crash site is usually what caused the CUDA Error.
  3. For Tensor precision debugging where the files saved by numpy.savetxt would be too large to be practical, you can debug as follows:

```python
import torch
import paddle
import numpy

device = torch.device("cuda:0")
torch.set_default_device(device)

def init_input(numpy_tensor):
    # Build a Paddle tensor and a Torch tensor from the same numpy data,
    # both requiring gradients, and verify the inputs are identical.
    paddle_x = paddle.to_tensor(numpy_tensor)
    torch_x = torch.tensor(numpy_tensor, requires_grad=True)
    paddle_x.stop_gradient = False
    numpy.testing.assert_allclose(
        paddle_x.numpy(),
        torch_x.cpu().detach().numpy(),
        1e-10,
        1e-10,
        err_msg='input diff'
    )
    return paddle_x, torch_x

# paddle.amax(Tensor([3, 38028357, 4, 5],"float32"), axis=-1, keepdim=True, )
input_tensor = (numpy.random.random([3, 38028357, 4, 5]) - 0.5).astype("float32")
paddle_x, torch_x = init_input(input_tensor)
paddle_out = paddle.amax(paddle_x, axis=-1, keepdim=True)
torch_out = torch.amax(torch_x, dim=-1, keepdim=True)
print(paddle_out.shape)
print(torch_out.shape)

numpy_tensor = (numpy.random.random([3, 38028357, 4, 1]) - 0.5).astype("float32")
paddle_grad, torch_grad = init_input(numpy_tensor)
torch_x_grad = torch.autograd.grad([torch_out], [torch_x], grad_outputs=torch_grad)
paddle_x_grad = paddle.grad([paddle_out], [paddle_x], grad_outputs=paddle_grad, allow_unused=True)

# Compare the gradients element by element and print each mismatch with its context
p = paddle_x_grad[0].numpy()
t = torch_x_grad[0].cpu().detach().numpy()
for i in range(3):
    for j in range(38028357):
        for k in range(4):
            for m in range(5):
                if p[i][j][k][m] != t[i][j][k][m]:
                    print("i = {}".format(i), "j = {}".format(j), "k = {}".format(k), "m = {}".format(m), "paddle = {}".format(p[i][j][k][m]), "torch = {}".format(t[i][j][k][m]), "numpy_tensor = {}".format(numpy_tensor[i][j][k][0]), "paddle_out = {}".format(paddle_out[i][j][k][0]), "torch_out = {}".format(torch_out[i][j][k][0]))
                    print("input_tensor = ", "{}".format(input_tensor[i][j][k][0]), "{}".format(input_tensor[i][j][k][1]), "{}".format(input_tensor[i][j][k][2]), "{}".format(input_tensor[i][j][k][3]), "{}".format(input_tensor[i][j][k][4]))
                    print("p = ", "{}".format(p[i][j][k][0]), "{}".format(p[i][j][k][1]), "{}".format(p[i][j][k][2]), "{}".format(p[i][j][k][3]), "{}".format(p[i][j][k][4]))
                    print("t = ", "{}".format(t[i][j][k][0]), "{}".format(t[i][j][k][1]), "{}".format(t[i][j][k][2]), "{}".format(t[i][j][k][3]), "{}".format(t[i][j][k][4]))

numpy.testing.assert_allclose(
    paddle_out.numpy(),
    torch_out.cpu().detach().numpy(),
    1e-2,
    1e-2,
    err_msg='output diff'
)
numpy.testing.assert_allclose(
    paddle_x_grad[0].numpy(),
    torch_x_grad[0].cpu().detach().numpy(),
    1e-2,
    1e-2,
    err_msg='grad diff'
)
```
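To complement suggestion 1, here is a minimal sketch of the numpy.savetxt workflow, with a vectorized mismatch search via numpy.argwhere as a faster alternative to nested Python loops. The arrays and file names below are hypothetical stand-ins for real Paddle/Torch outputs:

```python
import numpy

# Hypothetical arrays standing in for Paddle and Torch outputs; numpy.savetxt
# only handles 1-D/2-D arrays, so reshape higher-rank tensors before saving.
paddle_out = numpy.random.random([128, 64]).astype("float32")
torch_out = paddle_out.copy()
torch_out[3, 7] += 1e-3  # inject one artificial diff for the demo

numpy.savetxt("paddle_out.txt", paddle_out)
numpy.savetxt("torch_out.txt", torch_out)

# Vectorized mismatch search instead of nested Python loops
diff_idx = numpy.argwhere(~numpy.isclose(paddle_out, torch_out, rtol=1e-6, atol=1e-6))
for i, j in diff_idx[:20]:  # print at most the first 20 mismatches
    print("index = ({}, {})".format(i, j), "paddle = {}".format(paddle_out[i, j]), "torch = {}".format(torch_out[i, j]))
```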

2.3 Notes

  1. Failing configs and error logs can be found under https://github.com/PFCCLab/PaddleAPITest/tree/main/report/ci_ce_cpu and https://github.com/PFCCLab/PaddleAPITest/tree/main/report/ci_ce_gpu
  2. Submit PRs to Paddle, one PR per fixed API.
  3. Some APIs, such as paddle.max and paddle.Tensor.max, are actually the same API; try to sign up for them together to avoid duplicate fixes. Related APIs such as paddle.max and paddle.min are also best analyzed and fixed together.
  4. After fixing an API, add new test cases to its unit tests (a minimal sketch follows this list).
  5. If none of the configs reproduce the error, it may already have been fixed by someone else; please report the non-reproducible API in this issue.
  6. Reference fix PR for precision issues: PR71716
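For note 4, a minimal hypothetical sketch of such a regression test (the class name, input shape, and tolerances are placeholders; in practice, follow the conventions of the API's existing test file in the Paddle repo):

```python
import unittest
import numpy as np
import paddle

class TestAbsPrecisionRegression(unittest.TestCase):  # hypothetical test, using paddle.abs as an example
    def test_float64_input(self):
        # Reuse the exact config that exposed the precision diff
        x_np = (np.random.random([1, 100]) - 0.5).astype("float64")
        out = paddle.abs(paddle.to_tensor(x_np))
        np.testing.assert_allclose(out.numpy(), np.abs(x_np), rtol=1e-12, atol=1e-12)

if __name__ == "__main__":
    unittest.main()
```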

3. Testing Tool

This project uses the PaddleAPITest tool to track down problematic APIs and problematic cases.
The following command runs a forward and backward precision comparison between paddle.abs and torch.abs: python engine.py --accuracy=True --api_config='paddle.abs(Tensor([1, 100],"float64"), )'. Here "accuracy" selects the precision comparison test and "api_config" is the config to test. Note: the double quotes inside the config need to be escaped with backslashes, as shown below.
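For example, when the config itself is wrapped in double quotes, the inner double quotes are backslash-escaped (this is plain shell quoting, not an extra engine.py option):

```bash
python engine.py --accuracy=True --api_config="paddle.abs(Tensor([1, 100],\"float64\"), )"
```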
Before running any test, you must create the directory PaddleAPITest/tester/api_config/test_log/, which stores the test results and the checkpoint.
PaddleAPITest currently supports three test modes: paddle_only, accuracy, and paddle_cinn:

  • paddle_only: simply runs the config once in Paddle dynamic graph mode, to verify that the PaddleAPITest "engine" supports the config; it does not check precision.
  • accuracy: compares the forward and backward results of a Paddle API against Torch. This is the mode this task mainly uses.
  • paddle_cinn: compares Paddle dynamic graph results against the Paddle static graph compiler. Not relevant to this task.
    In addition:
  • api_config tests a single config, e.g.: python engine.py --accuracy=True --api_config='paddle.abs(Tensor([1, 100],"float64"), )'
  • api_config_file tests multiple configs; put all the configs to test in one file (see the example after this list), e.g.: python engine.py --api_config_file=PaddleAPITest/tester/api_config/api_config.txt --accuracy=True > tester/api_config/test_log/log.log 2>&1
  • Because the tests can trigger bugs that cause core dumps and interrupt the run, PaddleAPITest provides a checkpoint mechanism that records "already tested" configs. A batch run can be launched with: for i in {1..10000}; do python engine.py --api_config_file=PaddleAPITest/tester/api_config/api_config_merge.txt --accuracy=True >> tester/api_config/test_log/log.log 2>&1; done
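For reference, an api_config file simply holds one config per line; a minimal example (the first line is the config used above, the second reuses the amax config from section 2.2):

```text
paddle.abs(Tensor([1, 100],"float64"), )
paddle.amax(Tensor([3, 38028357, 4, 5],"float32"), axis=-1, keepdim=True, )
```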

4. Getting Started

4.1 Task Sign-up

Important

Please follow the format exactly; otherwise the Happy Open Source assistant bot cannot update the status automatically.
Claim tasks by commenting on this issue; otherwise the sign-up will not be registered. The format is:

【报名】: 1、2、3-5

Separate multiple task numbers with the enumeration comma (、); a range of consecutive tasks can be written with a hyphen.

4.2 Environment Setup

Development in a Docker environment is recommended (see the sketch at the end of this subsection).

  • Install Torch from the prebuilt wheels:

```bash
wget https://paddle-qa.bj.bcebos.com/benchmark/pretrained/torch_whl_250.tar.gz
tar xf torch_whl_250.tar.gz && cd torch_whl_250
pip install torch*.whl --no-index --find-links .
```

  • Install PaddleAPITest: just clone the repo; no installation step is needed.
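As a rough sketch of the Docker setup (the image tag below is a placeholder; pick the current Paddle dev image from the official installation docs):

```bash
# Hypothetical image tag -- substitute the current Paddle dev image
docker run --gpus all -it --name paddle-dev \
    -v "$PWD":/workspace -w /workspace \
    registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn9.0-gcc12.2 /bin/bash
```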

4.3 Submitting a PR

Use the following template for the PR:

```text
// ------- PR title --------

[Accuracy diff No.xxx] Fix accuracy diff for xxx API

// ------- PR description --------

PR Category
Execute Infrastructure

PR types
Bug fixes

Description
The process of fixing xx.
(change details, test results, etc.)
```

Note: to make review easier, please be sure to include the following information in the PR:

Board Information

| Task direction | Tasks | Submitted / Claimed | Submission rate | Completed | Completion rate |
|---|---|---|---|---|---|
| Precision issue fixes | 187 | 151 / 187 | 80.75% | 145 | 77.54% |

Statistics

In no particular order: @cangtianhuang (8) @co63oc (27) @cszdrg (9) @ooooo-create (61) @fxyfxy777 (1) @NKNaN (14) @Cutelemon6 (2) @zhengshengning (2) @ccsuzzh (2) @zrr1999 (11) @hushenwei2000 (5) @wanghuancoder (3)
