-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Performance Regression on CUDA10 #14725
Description
Description
I have observed performance regression on LSTM model. The performance drop from around 950 samples/ sec to around 780 samples/sec. I compared the result with mxnet-cu100mkl and mxnet-cu92mkl. Both of them are 1.4.0 without using nightly build.
Environment info (Required)
Deep Learning AMI (Ubuntu) Version 22.0 with P3.16xlarge
Steps to reproduce
(Paste the commands you ran that produced the error.)
- lanuch a DL ami or DL base ami with p3.16xlarge(it will only use 1 gpu)
- sudo apt-get install python3-pip && sudo pip3 install virtualenv
- python3 -m venv mxnet-cu100mkl/mxnet-cu92mkl
- source mxnet-cu100mkl/bin/activate or mxnet-cu92mkl/bin/activate
- pip2 install psutil --user && pip2 install --upgrade pandas --user && pip2 install mxnet-cu100mkl or mxnet-cu92mkl
- git clone https://github.com/awslabs/deeplearning-benchmark.git
- cd deeplearning-benchmark
- python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 10 --metrics-suffix test --kvstore local
- use sudo rm /usr/local/cuda/ sudo ln -s /usr/local/cuda-9.2 /usr/local/cuda to switch cuda version
Sample result
cu100mkl result:
INFO:root:{'mxnet.mkl_lstm_ptb_symbolic.validation_perplexity.test': 340.001303, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_std.test': 632.3600240369404, 'mxnet.mkl_lstm_ptb_symbolic.speed.test': 787.0409230769234, 'mxnet.mkl_lstm_ptb_symbolic.total_training_time.test': 534.887, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_mean.test': 1131.2, 'mxnet.mkl_lstm_ptb_symbolic.cpu_memory_usage.test': 796, 'mxnet.mkl_lstm_ptb_symbolic.train_perplexity.test': 361.069259, 'mxnet.mkl_lstm_ptb_symbolic.uptime_in_seconds.test': 4132.82, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_max.test': 1414.0}
cu92mkl result:
INFO:root:{'mxnet.mkl_lstm_ptb_symbolic.validation_perplexity.test': 328.856996, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_std.test': 633.2544512279404, 'mxnet.mkl_lstm_ptb_symbolic.speed.test': 916.6658846153846, 'mxnet.mkl_lstm_ptb_symbolic.total_training_time.test': 458.66499999999996, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_mean.test': 1132.8, 'mxnet.mkl_lstm_ptb_symbolic.cpu_memory_usage.test': 736, 'mxnet.mkl_lstm_ptb_symbolic.train_perplexity.test': 351.549512, 'mxnet.mkl_lstm_ptb_symbolic.uptime_in_seconds.test': 9839.61, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_max.test': 1416.0}
More data points
G3dn.16xlarge -cu100mkl aveage throughtput: 445
G3dn.16xlarge -cu92mkl average throughput: 444
On G3dn(Tesla M60) I don't see the performance regression
P3.16xlarge -cu100 aveage throughtput: 779
P3.16xlarge -cu92 aveage throughtput: 930
Update suggested by @KellenSunderland
nvprof --profile-child-processes -s python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 1 --metrics-suffix test --kvstore local > nvprof_output_cu100mkl 2>&1
cu100mkl
==22332== NVPROF is profiling process 22332, command: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==22332== Profiling application: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==22332== Warning: 29 API trace records have same start and end timestamps.
This can happen because of short execution duration of CUDA APIs and low timer resolution on the underlying operating system.
==22332== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 32.96% 11.4365s 139948 81.719us 60.160us 1.9775ms volta_sgemm_128x64_nt
26.94% 9.34508s 136024 68.701us 63.808us 83.135us volta_sgemm_32x32_sliced1x4_nn
11.46% 3.97589s 150520 26.414us 24.319us 34.656us volta_sgemm_64x32_sliced1x4_tn
3.53% 1.22311s 1308 935.10us 361.25us 2.0452ms volta_sgemm_128x64_nn
2.30% 797.69ms 373330 2.1360us 2.0150us 14.400us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
2.13% 739.36ms 349570 2.1150us 1.9840us 14.464us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
2.13% 738.49ms 687 1.0750ms 980.92us 1.2799ms volta_sgemm_32x128_tn
1.63% 564.77ms 1308 431.78us 163.55us 972.06us void mshadow::cuda::AddTakeGradKernel<int=6, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, float, long, long, int)
1.40% 484.33ms 133408 3.6300us 2.8800us 16.319us void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::plusto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
1.32% 456.59ms 151933 3.0050us 1.7920us 164.10us void mshadow::cuda::MapPlanKernel<mshadow::sv::plusto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
1.26% 435.71ms 291784 1.4930us 1.3750us 14.048us [CUDA memset]
1.16% 402.49ms 566 711.11us 570.72us 1.7457ms volta_sgemm_128x64_tn
1.14% 396.27ms 14388 27.541us 1.3760us 96.448us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_9SGDKernelEJPfS4_S4_ffffNS_9OpReqTypeEEEEviDpT0_
1.08% 376.09ms 1422 264.48us 110.75us 579.74us void mshadow::cuda::SoftmaxKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, long)
1.02% 352.83ms 225780 1.5620us 1.5030us 15.008us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=6, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
1.00% 345.93ms 207960 1.6630us 1.5990us 13.696us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=11, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
0.91% 316.27ms 207960 1.5200us 1.3750us 13.824us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op5rightEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
0.87% 300.73ms 205344 1.4640us 1.3110us 13.472us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op4leftEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
0.83% 288.70ms 228510 1.2630us 992ns 13.760us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3mulELi1EEEJPfS7_S7_EEEviDpT0_
0.75% 260.61ms 150520 1.7310us 1.5680us 14.080us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=7, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
0.68% 236.88ms 138640 1.7080us 1.6310us 13.664us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=12, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
0.67% 230.86ms 153250 1.5060us 992ns 14.304us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_
0.61% 213.10ms 133408 1.5970us 1.3750us 14.368us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPfNS_9OpReqTypeES4_S4_EEEviDpT0_
0.38% 131.73ms 14616 9.0120us 1.0240us 108.19us [CUDA memcpy DtoD]
0.37% 127.57ms 1308 97.532us 40.031us 214.02us void mshadow::cuda::SoftmaxGradKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>>(mshadow::gpu, int=2, float, long)
0.33% 112.78ms 1299 86.822us 32.863us 164.67us void mshadow::cuda::MapPlanKernel<mshadow::sv::multo, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
0.28% 96.707ms 6540 14.787us 2.8160us 116.96us void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::saveto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
0.18% 62.853ms 169 371.91us 347.84us 465.53us volta_sgemm_128x128_tn
0.17% 57.555ms 5460 10.541us 9.8240us 13.824us void mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>(int, int, bool, float const *, mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>*, mshadow::Shape<int=2>, mshadow::Shape, mshadow::Shape, mshadow::Shape, int, bool)
0.15% 50.856ms 1332 38.180us 1.2480us 46.112us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
0.14% 47.081ms 5471 8.6050us 1.5030us 14.016ms [CUDA memcpy DtoH]
0.12% 42.434ms 5608 7.5660us 1.5030us 5.8883ms [CUDA memcpy HtoD]
0.03% 10.385ms 1422 7.3030us 3.5200us 13.824us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_7TakeGPUILb1EEEJPfS5_S5_llEEEviDpT0_
0.02% 6.6979ms 2730 2.4530us 1.5680us 10.912us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_4pickILi2ELb1EEEJPfS5_PiiiN7mshadow5ShapeILi2EEES9_EEEviDpT0_
0.02% 6.3452ms 5688 1.1150us 991ns 12.768us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_
0.01% 4.1586ms 2730 1.5230us 1.1840us 10.592us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, int>, int>, mshadow::expr::Plan<mshadow::expr::TypecastExp<int, float, mshadow::Tensor<mshadow::gpu, int=1, float>, int=1>, int>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=1)
0.01% 3.6414ms 2730 1.3330us 992ns 7.9680us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op2eqELi1EEEJPfS7_fEEEviDpT0_
0.01% 3.4687ms 2730 1.2700us 1.0560us 13.024us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3logELi1EEEJPfS7_EEEviDpT0_
0.01% 3.2744ms 2730 1.1990us 960ns 14.112us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op6rminusELi1EEEJPfS7_fEEEviDpT0_
0.01% 3.0465ms 2730 1.1150us 992ns 13.184us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op7maximumELi1EEEJPfS7_fEEEviDpT0_
0.01% 2.1137ms 9 234.85us 232.29us 236.06us void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::plusto, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
0.01% 2.1085ms 9 234.28us 232.29us 236.29us void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::multo, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
API calls: 56.38% 44.7969s 3145400 14.242us 6.1280us 23.314ms cudaLaunchKernel
19.23% 15.2759s 336467 45.400us 3.7720us 4.0563ms cudaStreamSynchronize
5.44% 4.32640s 1503 2.8785ms 155.14us 3.87303s cudaMemGetInfo
4.45% 3.53308s 291784 12.108us 4.8700us 95.429ms cudaMemsetAsync
3.03% 2.40724s 18 133.74ms 884ns 635.37ms cudaFree
2.67% 2.12090s 291776 7.2680us 2.0380us 2.3136ms cudaEventQuery
2.35% 1.86598s 1993278 936ns 0ns 1.0557ms cudaPeekAtLastError
2.23% 1.77185s 25683 68.989us 8.3060us 23.490ms cudaMemcpy2DAsync
1.38% 1.10031s 1152674 954ns 0ns 975.49us cudaGetLastError
1.10% 877.91ms 291778 3.0080us 981ns 12.349ms cudaEventRecord
0.89% 704.22ms 288598 2.4400us 624ns 10.880ms cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.42% 333.73ms 166750 2.0010us 73ns 1.1830ms cudaSetDevice
0.24% 194.41ms 67011 2.9010us 365ns 1.2253ms cudaGetDevice
0.12% 93.806ms 1547 60.637us 10.066us 2.2598ms cudaMalloc
0.04% 30.262ms 2256 13.413us 467ns 1.6574ms cuDeviceGetAttribute
0.01% 9.0654ms 24 377.73us 239.21us 574.83us cuDeviceTotalMem
0.01% 5.5531ms 4 1.3883ms 954.64us 1.7572ms cudaGetDeviceProperties
0.00% 2.2392ms 24 93.299us 71.233us 155.01us cuDeviceGetName
0.00% 1.5461ms 912 1.6950us 518ns 73.179us cudaFuncSetAttribute
0.00% 1.4418ms 216 6.6750us 830ns 383.84us cudaEventCreateWithFlags
0.00% 1.3181ms 2 659.06us 35.784us 1.2823ms cudaHostAlloc
0.00% 841.92us 12 70.159us 17.552us 539.11us cudaMemcpy
0.00% 812.45us 22 36.929us 13.082us 92.853us cudaStreamCreateWithFlags
0.00% 610.40us 4 152.60us 62.885us 301.45us cudaStreamCreate
0.00% 481.01us 8 60.126us 13.224us 343.28us cudaStreamCreateWithPriority
0.00% 277.53us 202 1.3730us 658ns 23.334us cudaDeviceGetAttribute
0.00% 48.017us 32 1.5000us 635ns 4.4500us cuDeviceGet
0.00% 28.351us 8 3.5430us 2.8770us 5.5370us cuDeviceGetPCIBusId
0.00% 21.491us 6 3.5810us 1.7350us 6.8090us cudaEventCreate
0.00% 21.416us 24 892ns 522ns 1.5600us cuDeviceGetUuid
0.00% 6.6090us 2 3.3040us 3.0640us 3.5450us cudaHostGetDevicePointer
0.00% 6.2010us 5 1.2400us 631ns 2.1050us cuDeviceGetCount
0.00% 4.8960us 4 1.2240us 709ns 2.1750us cudaGetDeviceCount
0.00% 4.5440us 2 2.2720us 2.0340us 2.5100us cudaDeviceGetStreamPriorityRange
0.00% 4.3850us 2 2.1920us 1.7190us 2.6660us cuInit
0.00% 3.8450us 2 1.9220us 1.3320us 2.5130us cuDriverGetVersion
cu92mkl
==21941== NVPROF is profiling process 21941, command: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==21941== Profiling application: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==21941== Warning: Found 55 invalid records in the result.
==21941== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==21941== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 33.70% 9.45188s 136024 69.486us 65.248us 87.616us volta_sgemm_32x32_sliced1x4_nn
16.70% 4.68282s 139948 33.461us 17.664us 2.2807ms volta_sgemm_128x64_nt
15.80% 4.43089s 150520 29.437us 25.984us 39.232us volta_sgemm_32x32_sliced1x4_tn
4.36% 1.22354s 1308 935.43us 361.79us 2.3943ms volta_sgemm_128x64_nn
4.17% 1.16848s 1422 821.72us 346.91us 1.9999ms volta_sgemm_128x128_tn
2.85% 800.01ms 373330 2.1420us 2.0150us 13.888us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
2.66% 745.25ms 349570 2.1310us 1.9840us 14.208us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
2.02% 565.53ms 1308 432.36us 163.46us 1.0837ms void mshadow::cuda::AddTakeGradKernel<int=6, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, float, long, long, int)
1.70% 477.35ms 133408 3.5780us 2.9120us 14.623us void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::plusto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
1.63% 456.01ms 151933 3.0010us 2.0150us 163.81us void mshadow::cuda::MapPlanKernel<mshadow::sv::plusto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
1.41% 395.84ms 14388 27.511us 1.3120us 99.583us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_9SGDKernelEJPfS4_S4_ffffNS_9OpReqTypeEEEEviDpT0_
1.33% 371.83ms 1422 261.48us 109.66us 580.57us void mshadow::cuda::SoftmaxKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, long)
1.26% 352.15ms 225780 1.5590us 1.5030us 13.088us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=6, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
1.24% 346.55ms 207960 1.6660us 1.5990us 12.320us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=11, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
1.13% 315.67ms 207960 1.5170us 1.3750us 13.888us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op5rightEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
1.07% 300.31ms 205344 1.4620us 1.3110us 14.304us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op4leftEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
1.03% 289.23ms 228510 1.2650us 992ns 13.600us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3mulELi1EEEJPfS7_S7_EEEviDpT0_
0.92% 257.58ms 150520 1.7110us 1.5360us 13.344us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=7, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
0.84% 235.80ms 138640 1.7000us 1.5990us 13.792us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=12, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
0.82% 229.41ms 153250 1.4960us 992ns 13.728us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_
0.80% 224.70ms 133408 1.6840us 1.4710us 13.824us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPfNS_9OpReqTypeES4_S4_EEEviDpT0_
0.47% 132.46ms 14616 9.0620us 1.0880us 120.67us [CUDA memcpy DtoD]
0.45% 127.20ms 1308 97.248us 40.288us 213.15us void mshadow::cuda::SoftmaxGradKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>>(mshadow::gpu, int=2, float, long)
0.40% 112.83ms 1299 86.860us 32.864us 164.86us void mshadow::cuda::MapPlanKernel<mshadow::sv::multo, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
0.37% 104.37ms 6540 15.958us 3.2640us 120.22us void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::saveto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
0.21% 58.328ms 5460 10.682us 10.080us 17.088us void mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>(int, int, bool, float const *, mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>*, mshadow::Shape<int=2>, mshadow::Shape, mshadow::Shape, mshadow::Shape, int, bool)
0.18% 50.957ms 1332 38.256us 1.2480us 44.320us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
0.17% 46.791ms 5471 8.5520us 1.5030us 13.900ms [CUDA memcpy DtoH]
0.16% 44.861ms 5608 7.9990us 1.5040us 5.6511ms [CUDA memcpy HtoD]
0.04% 10.481ms 1422 7.3700us 3.3600us 14.272us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_7TakeGPUILb1EEEJPfS5_S5_llEEEviDpT0_
0.03% 7.1928ms 2730 2.6340us 2.1430us 14.304us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_4pickILi2ELb1EEEJPfS5_PiiiN7mshadow5ShapeILi2EEES9_EEEviDpT0_
0.02% 6.2377ms 5688 1.0960us 991ns 9.6000us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_
0.01% 4.1941ms 2730 1.5360us 1.1840us 12.704us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, int>, int>, mshadow::expr::Plan<mshadow::expr::TypecastExp<int, float, mshadow::Tensor<mshadow::gpu, int=1, float>, int=1>, int>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=1)
0.01% 3.6781ms 2730 1.3470us 992ns 7.9680us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op2eqELi1EEEJPfS7_fEEEviDpT0_
0.01% 3.4240ms 2730 1.2540us 1.0560us 8.5440us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3logELi1EEEJPfS7_EEEviDpT0_
0.01% 3.2767ms 2730 1.2000us 992ns 11.456us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op7maximumELi1EEEJPfS7_fEEEviDpT0_
0.01% 3.0875ms 2730 1.1300us 992ns 11.648us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op6rminusELi1EEEJPfS7_fEEEviDpT0_
0.01% 2.8087ms 1821 1.5420us 1.3760us 13.760us [CUDA memset]
0.01% 2.1258ms 9 236.20us 234.02us 239.55us void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::plusto, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
0.01% 2.1201ms 9 235.57us 234.24us 238.05us void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::multo, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
API calls: 63.12% 42.1805s 3145400 13.410us 5.3740us 25.822ms cudaLaunchKernel
18.65% 12.4655s 336467 37.048us 3.8090us 3.6356ms cudaStreamSynchronize
6.36% 4.24941s 1498 2.8367ms 154.77us 3.89087s cudaMemGetInfo
3.36% 2.24470s 18 124.71ms 728ns 566.12ms cudaFree
2.92% 1.95056s 1993245 978ns 2ns 1.4790ms cudaPeekAtLastError
2.70% 1.80358s 25683 70.224us 8.0210us 20.948ms cudaMemcpy2DAsync
1.76% 1.17849s 1152100 1.0220us 3ns 1.1467ms cudaGetLastError
0.49% 327.96ms 166750 1.9660us 68ns 993.66us cudaSetDevice
0.29% 191.04ms 67011 2.8500us 221ns 910.93us cudaGetDevice
0.16% 110.12ms 1542 71.413us 9.3680us 18.735ms cudaMalloc
0.07% 48.392ms 1821 26.574us 9.4630us 9.4157ms cudaMemsetAsync
0.04% 27.123ms 2256 12.022us 458ns 575.77us cuDeviceGetAttribute
0.02% 16.629ms 1813 9.1710us 4.4710us 62.413us cudaEventQuery
0.01% 8.4049ms 24 350.20us 240.68us 575.37us cuDeviceTotalMem
0.01% 7.6088ms 1815 4.1920us 2.1630us 44.168us cudaEventRecord
0.01% 5.3368ms 4 1.3342ms 965.25us 1.4861ms cudaGetDeviceProperties
0.00% 2.1962ms 24 91.507us 69.921us 158.19us cuDeviceGetName
0.00% 2.1156ms 216 9.7940us 833ns 1.2832ms cudaEventCreateWithFlags
0.00% 1.2896ms 2 644.81us 40.351us 1.2493ms cudaHostAlloc
0.00% 1.1971ms 4 299.28us 51.603us 761.62us cudaStreamCreate
0.00% 841.37us 12 70.113us 13.792us 535.50us cudaMemcpy
0.00% 799.59us 384 2.0820us 772ns 97.788us cudaFuncSetAttribute
0.00% 739.83us 22 33.628us 14.586us 100.29us cudaStreamCreateWithFlags
0.00% 473.78us 8 59.223us 13.159us 335.79us cudaStreamCreateWithPriority
0.00% 247.13us 202 1.2230us 642ns 3.6050us cudaDeviceGetAttribute
0.00% 99.833us 32 3.1190us 527ns 16.689us cuDeviceGet
0.00% 21.441us 6 3.5730us 1.7490us 6.7460us cudaEventCreate
0.00% 21.325us 8 2.6650us 2.1230us 4.6420us cuDeviceGetPCIBusId
0.00% 6.8890us 2 3.4440us 3.3850us 3.5040us cudaHostGetDevicePointer
0.00% 4.9550us 5 991ns 470ns 1.4300us cuDeviceGetCount
0.00% 4.8480us 2 2.4240us 2.0320us 2.8160us cudaDeviceGetStreamPriorityRange
0.00% 4.5770us 2 2.2880us 1.6370us 2.9400us cuInit
0.00% 4.5240us 4 1.1310us 678ns 1.7230us cudaGetDeviceCount
0.00% 4.3320us 2 2.1660us 1.2450us 3.0870us cuDriverGetVersion
MXNet Profiler API result for 1 epoch
cu100mkl
Profile Statistics.
Note that counter items are counter values and not time units.
Device Storage
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
Memory: cpu/0 193 81507.3906 26000.0000 328184.1875 151092.0938
Memory: gpu/0 66993 486890.5625 7.6800 486936.6250 243464.4844
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArraySyncCopyFromCPU 24 165.8120 0.0640 89.2050 6.9088
MXNDArrayReshape64 2730 11.1690 0.0020 0.0400 0.0041
MXExecutorForward 1422 1260.3890 0.1420 11.0660 0.8863
MXExecutorBackwardEx 1308 1223.9100 0.1730 9.1420 0.9357
MXNDArraySlice 5574 13.2250 0.0000 0.0500 0.0024
MXNDArrayFree 41162 366.8830 0.0000 0.2150 0.0089
MXNDArrayGetContext 10958 10.5630 0.0000 0.1010 0.0010
MXSymbolListAttr 3 5.6350 0.9650 2.7530 1.8783
MXImperativeInvokeEx 50115 1389.6400 0.0070 1.7010 0.0277
MXNDArrayGetDType 11015 10.1360 0.0000 0.0400 0.0009
MXNet C API Calls 195096 195.0960 0.0010 195.0960 97.5475
MXNet C API Concurrency 390192 0.0000 0.0000 0.0010 0.0005
MXExecutorSimpleBind 6 3734.5410 15.9950 3542.6250 622.4235
MXSymbolCreateAtomicSymbol 5736 36.1470 0.0000 0.1140 0.0063
MXNDArraySyncCopyToCPU 5460 80959.1719 0.0850 363.2810 14.8277
MXSymbolGetAttr 6 0.0140 0.0010 0.0040 0.0023
MXNDArrayGetStorageType 171 0.1540 0.0000 0.0050 0.0009
MXNDArrayCreateEx 8258 73.3280 0.0010 3.3300 0.0089
MXExecutorOutputs 6 0.0160 0.0020 0.0030 0.0027
MXExecutorPrint 6 28.1480 1.7110 10.1980 4.6913
MXNDArrayGetShape 51136 52.0260 0.0000 0.0440 0.0010
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_plus_scalar 22 24.1390 0.0160 4.5600 1.0972
CopyGPU2CPU 22 176.3340 0.0570 20.1660 8.0152
SoftmaxOutput 2844 812.7720 0.1220 1.4340 0.2858
broadcast_mul 5460 210.2000 0.0170 0.6040 0.0385
broadcast_add 5460 208.4420 0.0150 0.8420 0.0382
log 5460 159.6020 0.0150 0.4100 0.0292
add_n 266816 9588.1748 0.0110 2.0040 0.0359
_backward_Activation 693200 13628.1191 0.0120 2.3540 0.0197
_backward_mul 415920 9997.9492 0.0110 2.0910 0.0240
_backward_copy 71936 479.4360 0.0030 0.4750 0.0067
sgd_update 28776 6515.3848 0.0130 2.8010 0.2264
_backward_FullyConnected 279896 37841.1094 0.0860 2.1650 0.1352
Reshape 5688 146.5510 0.0030 1.4950 0.0258
sum 10920 549.3320 0.0250 0.7170 0.0503
_zeros 11398 529.2270 0.0100 8.5110 0.0464
Activation 752600 21769.1758 0.0120 3.0950 0.0289
_equal_scalar 5460 224.6300 0.0160 1.3360 0.0411
_rminus_scalar 5460 233.0980 0.0190 0.2030 0.0427
CopyCPU2CPU 88 304.1030 0.0050 27.1770 3.4557
SetupExec 22932 48.5340 0.0000 0.0970 0.0021
Concat 2844 431.9350 0.0450 1.2210 0.1519
_full 10 0.0810 0.0020 0.0160 0.0081
_backward_SliceChannel 141256 5605.5112 0.0230 1.3650 0.0397
_slice_assign_scalar 4 0.2660 0.0190 0.1140 0.0665
pick 5460 282.8300 0.0180 3.2620 0.0518
_copy 20928 1165.9010 0.0150 1.3830 0.0557
_random_uniform 12 74.5140 2.9510 12.7100 6.2095
_backward_SoftmaxOutput 2616 578.9200 0.0920 1.2720 0.2213
SyncCopyGPU2CPU 10920 582.0080 0.0200 0.1760 0.0533
ResourceParallelRandomSetSeed 2 19.5030 9.7440 9.7590 9.7515
elemwise_mul 451560 9896.3320 0.0100 2.4910 0.0219
FullyConnected 303884 21313.8770 0.0450 4.1210 0.0701
_maximum_scalar 5460 193.6250 0.0140 1.3690 0.0355
DeleteVariable 65696 385.6180 0.0010 1.4810 0.0059
elemwise_add 301040 5272.2681 0.0100 1.7210 0.0175
SliceChannel 153364 6224.5142 0.0230 12.3770 0.0406
_div_scalar 22 22.9760 0.0100 5.3440 1.0444
expand_dims 75260 1019.0590 0.0030 1.1830 0.0135
_backward_Concat 2616 539.6740 0.0490 32.9550 0.2063
CopyCPU2GPU 11192 781.9070 0.0130 5.8030 0.0699
SetValueOp 48 3.2530 0.0500 0.1220 0.0678
Embedding 2844 161.1000 0.0190 1.7600 0.0566
_copyto_GPU2GPU 10920 567.0010 0.0130 0.2170 0.0519
_backward_Embedding 2616 1285.4290 0.2150 1.8690 0.4914
WaitForVar 10968 89.9160 0.0020 0.3290 0.0082
cu92mkl
Profile Statistics.
Note that counter items are counter values and not time units.
Device Storage
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
Memory: cpu/0 193 81094.4297 26000.0000 328184.1875 151092.0938
Memory: gpu/0 66993 486890.5625 7.6800 486936.6250 243464.4844
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArraySyncCopyFromCPU 24 176.6150 0.0330 69.3230 7.3590
MXNDArrayReshape64 2730 12.2640 0.0020 0.0410 0.0045
MXExecutorForward 1422 1316.2700 0.1480 14.0690 0.9256
MXExecutorBackwardEx 1308 1286.3680 0.1690 5.8390 0.9835
MXNDArraySlice 5574 13.4970 0.0000 0.0590 0.0024
MXNDArrayFree 41162 396.0320 0.0000 3.7350 0.0096
MXNDArrayGetContext 10958 10.7350 0.0000 0.0380 0.0010
MXSymbolListAttr 3 4.0050 1.2550 1.4000 1.3350
MXImperativeInvokeEx 50115 1499.5780 0.0070 2.2110 0.0299
MXNDArrayGetDType 11015 9.8290 0.0000 0.0370 0.0009
MXNet C API Calls 195096 195.0960 0.0010 195.0960 97.5475
MXNet C API Concurrency 390192 0.0000 0.0000 0.0010 0.0005
MXExecutorSimpleBind 6 3615.5139 8.9550 3481.5950 602.5857
MXSymbolCreateAtomicSymbol 5736 30.6770 0.0010 0.1260 0.0053
MXNDArraySyncCopyToCPU 5460 72482.7344 0.0850 249.1380 13.2752
MXSymbolGetAttr 6 0.0140 0.0010 0.0030 0.0023
MXNDArrayGetStorageType 171 0.1090 0.0000 0.0030 0.0006
MXNDArrayCreateEx 8258 67.8010 0.0010 3.1960 0.0082
MXExecutorOutputs 6 0.0150 0.0020 0.0040 0.0025
MXExecutorPrint 6 25.9240 0.9360 12.1280 4.3207
MXNDArrayGetShape 51136 55.6040 0.0000 0.0500 0.0011
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_plus_scalar 22 47.2760 0.0080 10.5180 2.1489
CopyGPU2CPU 22 156.4810 0.0340 20.2620 7.1128
broadcast_add 5460 198.4270 0.0160 3.6910 0.0363
_backward_Activation 693200 13550.5508 0.0130 1.9900 0.0195
_backward_mul 415920 10045.2979 0.0110 2.0190 0.0242
_backward_copy 71936 495.9330 0.0030 0.6440 0.0069
SoftmaxOutput 2844 806.0880 0.1250 0.6040 0.2834
Reshape 5688 145.6910 0.0030 1.0440 0.0256
add_n 266816 9710.6777 0.0120 1.8540 0.0364
log 5460 162.8830 0.0150 2.2690 0.0298
broadcast_mul 5460 218.6930 0.0170 1.6100 0.0401
sgd_update 28776 6457.4722 0.0130 2.4300 0.2244
_backward_FullyConnected 279896 23458.5215 0.0410 4.9550 0.0838
_backward_Concat 2616 520.1320 0.0490 24.3040 0.1988
CopyCPU2GPU 11192 753.7490 0.0140 7.5330 0.0673
_backward_Embedding 2616 1299.2860 0.2160 5.2340 0.4967
WaitForVar 10968 92.5920 0.0020 0.2480 0.0084
_rminus_scalar 5460 209.0210 0.0170 1.5620 0.0383
CopyCPU2CPU 88 308.8270 0.0040 40.2790 3.5094
SetupExec 22932 59.5050 0.0000 3.3250 0.0026
Concat 2844 438.0390 0.0470 0.6820 0.1540
_full 10 4.8940 0.0070 2.4050 0.4894
_backward_SliceChannel 141256 5496.1040 0.0230 3.3710 0.0389
_slice_assign_scalar 4 9.3150 0.0330 4.6260 2.3287
pick 5460 291.0040 0.0180 5.0680 0.0533
_copy 20928 1201.1370 0.0160 0.7570 0.0574
_random_uniform 12 97.6060 2.5460 23.4520 8.1338
sum 10920 555.5390 0.0260 4.9930 0.0509
_zeros 11398 543.3210 0.0110 14.9830 0.0477
_equal_scalar 5460 238.2550 0.0170 9.3850 0.0436
Activation 752600 21566.7363 0.0120 3.8930 0.0287
ResourceParallelRandomSetSeed 2 19.7140 9.8530 9.8610 9.8570
Embedding 2844 169.3260 0.0180 7.2920 0.0595
SetValueOp 48 4.7550 0.0320 0.9550 0.0991
_copyto_GPU2GPU 10920 581.7290 0.0140 11.1590 0.0533
SliceChannel 153364 6120.6851 0.0230 9.5310 0.0399
FullyConnected 303884 19919.7012 0.0440 2.4720 0.0656
elemwise_mul 451560 9846.6777 0.0110 2.3670 0.0218
_maximum_scalar 5460 186.1150 0.0150 1.6400 0.0341
DeleteVariable 65696 389.9880 0.0010 3.7150 0.0059
elemwise_add 301040 5299.9009 0.0110 1.6230 0.0176
SyncCopyGPU2CPU 10920 570.1200 0.0240 0.3210 0.0522
_backward_SoftmaxOutput 2616 576.9860 0.0930 2.0890 0.2206
_div_scalar 22 27.2660 0.0070 7.1150 1.2394
expand_dims 75260 835.0670 0.0030 0.1240 0.0111