Performance Regression on CUDA10

## Description
I have observed performance regression on LSTM model. The performance drop from around 950 samples/ sec to around 780 samples/sec. I compared the result with mxnet-cu100mkl and mxnet-cu92mkl. Both of them are 1.4.0 without using nightly build.

## Environment info (Required)

```
Deep Learning AMI (Ubuntu) Version 22.0 with P3.16xlarge
```

## Steps to reproduce
(Paste the commands you ran that produced the error.)

1.  lanuch a DL ami or DL base ami with p3.16xlarge(it will only use 1 gpu)
2. sudo apt-get install python3-pip && sudo pip3 install virtualenv 
3. python3 -m venv mxnet-cu100mkl/mxnet-cu92mkl 
4. source mxnet-cu100mkl/bin/activate or mxnet-cu92mkl/bin/activate
5. pip2 install psutil --user && pip2 install --upgrade pandas --user && pip2 install mxnet-cu100mkl or mxnet-cu92mkl
6. git clone https://github.com/awslabs/deeplearning-benchmark.git 
7. cd deeplearning-benchmark
8. python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 10 --metrics-suffix test --kvstore local 
9. use sudo rm /usr/local/cuda/ sudo ln -s /usr/local/cuda-9.2 /usr/local/cuda to switch cuda version 

Sample result
**cu100mkl** result: 
INFO:root:{'mxnet.mkl_lstm_ptb_symbolic.validation_perplexity.test': 340.001303, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_std.test': 632.3600240369404, 'mxnet.mkl_lstm_ptb_symbolic.speed.test': 787.0409230769234, 'mxnet.mkl_lstm_ptb_symbolic.total_training_time.test': 534.887, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_mean.test': 1131.2, 'mxnet.mkl_lstm_ptb_symbolic.cpu_memory_usage.test': 796, 'mxnet.mkl_lstm_ptb_symbolic.train_perplexity.test': 361.069259, 'mxnet.mkl_lstm_ptb_symbolic.uptime_in_seconds.test': 4132.82, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_max.test': 1414.0} 

**cu92mkl** result: 
INFO:root:{'mxnet.mkl_lstm_ptb_symbolic.validation_perplexity.test': 328.856996, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_std.test': 633.2544512279404, 'mxnet.mkl_lstm_ptb_symbolic.speed.test': 916.6658846153846, 'mxnet.mkl_lstm_ptb_symbolic.total_training_time.test': 458.66499999999996, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_mean.test': 1132.8, 'mxnet.mkl_lstm_ptb_symbolic.cpu_memory_usage.test': 736, 'mxnet.mkl_lstm_ptb_symbolic.train_perplexity.test': 351.549512, 'mxnet.mkl_lstm_ptb_symbolic.uptime_in_seconds.test': 9839.61, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_max.test': 1416.0} 

More data points 
G3dn.16xlarge -cu100mkl aveage throughtput: 445 
G3dn.16xlarge -cu92mkl average throughput: 444 
On G3dn(Tesla M60) I don't see the performance regression
P3.16xlarge -cu100 aveage throughtput: 779
P3.16xlarge -cu92 aveage throughtput: 930

Update suggested by @KellenSunderland 
nvprof --profile-child-processes -s python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 1 --metrics-suffix test --kvstore local > nvprof_output_cu100mkl 2>&1

****
cu100mkl
```
 ==22332== NVPROF is profiling process 22332, command: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==22332== Profiling application: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==22332== Warning: 29 API trace records have same start and end timestamps.
This can happen because of short execution duration of CUDA APIs and low timer resolution on the underlying operating system.
==22332== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   32.96%  11.4365s    139948  81.719us  60.160us  1.9775ms  volta_sgemm_128x64_nt
                   26.94%  9.34508s    136024  68.701us  63.808us  83.135us  volta_sgemm_32x32_sliced1x4_nn
                   11.46%  3.97589s    150520  26.414us  24.319us  34.656us  volta_sgemm_64x32_sliced1x4_tn
                    3.53%  1.22311s      1308  935.10us  361.25us  2.0452ms  volta_sgemm_128x64_nn
                    2.30%  797.69ms    373330  2.1360us  2.0150us  14.400us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
                    2.13%  739.36ms    349570  2.1150us  1.9840us  14.464us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
                    2.13%  738.49ms       687  1.0750ms  980.92us  1.2799ms  volta_sgemm_32x128_tn
                    1.63%  564.77ms      1308  431.78us  163.55us  972.06us  void mshadow::cuda::AddTakeGradKernel<int=6, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, float, long, long, int)
                    1.40%  484.33ms    133408  3.6300us  2.8800us  16.319us  void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::plusto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
                    1.32%  456.59ms    151933  3.0050us  1.7920us  164.10us  void mshadow::cuda::MapPlanKernel<mshadow::sv::plusto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    1.26%  435.71ms    291784  1.4930us  1.3750us  14.048us  [CUDA memset]
                    1.16%  402.49ms       566  711.11us  570.72us  1.7457ms  volta_sgemm_128x64_tn
                    1.14%  396.27ms     14388  27.541us  1.3760us  96.448us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_9SGDKernelEJPfS4_S4_ffffNS_9OpReqTypeEEEEviDpT0_
                    1.08%  376.09ms      1422  264.48us  110.75us  579.74us  void mshadow::cuda::SoftmaxKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, long)
                    1.02%  352.83ms    225780  1.5620us  1.5030us  15.008us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=6, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    1.00%  345.93ms    207960  1.6630us  1.5990us  13.696us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=11, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.91%  316.27ms    207960  1.5200us  1.3750us  13.824us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op5rightEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
                    0.87%  300.73ms    205344  1.4640us  1.3110us  13.472us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op4leftEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
                    0.83%  288.70ms    228510  1.2630us     992ns  13.760us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3mulELi1EEEJPfS7_S7_EEEviDpT0_
                    0.75%  260.61ms    150520  1.7310us  1.5680us  14.080us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=7, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.68%  236.88ms    138640  1.7080us  1.6310us  13.664us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=12, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.67%  230.86ms    153250  1.5060us     992ns  14.304us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_
                    0.61%  213.10ms    133408  1.5970us  1.3750us  14.368us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPfNS_9OpReqTypeES4_S4_EEEviDpT0_
                    0.38%  131.73ms     14616  9.0120us  1.0240us  108.19us  [CUDA memcpy DtoD]
                    0.37%  127.57ms      1308  97.532us  40.031us  214.02us  void mshadow::cuda::SoftmaxGradKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>>(mshadow::gpu, int=2, float, long)
                    0.33%  112.78ms      1299  86.822us  32.863us  164.67us  void mshadow::cuda::MapPlanKernel<mshadow::sv::multo, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    0.28%  96.707ms      6540  14.787us  2.8160us  116.96us  void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::saveto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
                    0.18%  62.853ms       169  371.91us  347.84us  465.53us  volta_sgemm_128x128_tn
                    0.17%  57.555ms      5460  10.541us  9.8240us  13.824us  void mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>(int, int, bool, float const *, mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>*, mshadow::Shape<int=2>, mshadow::Shape, mshadow::Shape, mshadow::Shape, int, bool)
                    0.15%  50.856ms      1332  38.180us  1.2480us  46.112us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    0.14%  47.081ms      5471  8.6050us  1.5030us  14.016ms  [CUDA memcpy DtoH]
                    0.12%  42.434ms      5608  7.5660us  1.5030us  5.8883ms  [CUDA memcpy HtoD]
                    0.03%  10.385ms      1422  7.3030us  3.5200us  13.824us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_7TakeGPUILb1EEEJPfS5_S5_llEEEviDpT0_
                    0.02%  6.6979ms      2730  2.4530us  1.5680us  10.912us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_4pickILi2ELb1EEEJPfS5_PiiiN7mshadow5ShapeILi2EEES9_EEEviDpT0_
                    0.02%  6.3452ms      5688  1.1150us     991ns  12.768us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_
                    0.01%  4.1586ms      2730  1.5230us  1.1840us  10.592us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, int>, int>, mshadow::expr::Plan<mshadow::expr::TypecastExp<int, float, mshadow::Tensor<mshadow::gpu, int=1, float>, int=1>, int>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=1)
                    0.01%  3.6414ms      2730  1.3330us     992ns  7.9680us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op2eqELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  3.4687ms      2730  1.2700us  1.0560us  13.024us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3logELi1EEEJPfS7_EEEviDpT0_
                    0.01%  3.2744ms      2730  1.1990us     960ns  14.112us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op6rminusELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  3.0465ms      2730  1.1150us     992ns  13.184us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op7maximumELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  2.1137ms         9  234.85us  232.29us  236.06us  void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::plusto, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
                    0.01%  2.1085ms         9  234.28us  232.29us  236.29us  void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::multo, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
      API calls:   56.38%  44.7969s   3145400  14.242us  6.1280us  23.314ms  cudaLaunchKernel
                   19.23%  15.2759s    336467  45.400us  3.7720us  4.0563ms  cudaStreamSynchronize
                    5.44%  4.32640s      1503  2.8785ms  155.14us  3.87303s  cudaMemGetInfo
                    4.45%  3.53308s    291784  12.108us  4.8700us  95.429ms  cudaMemsetAsync
                    3.03%  2.40724s        18  133.74ms     884ns  635.37ms  cudaFree
                    2.67%  2.12090s    291776  7.2680us  2.0380us  2.3136ms  cudaEventQuery
                    2.35%  1.86598s   1993278     936ns       0ns  1.0557ms  cudaPeekAtLastError
                    2.23%  1.77185s     25683  68.989us  8.3060us  23.490ms  cudaMemcpy2DAsync
                    1.38%  1.10031s   1152674     954ns       0ns  975.49us  cudaGetLastError
                    1.10%  877.91ms    291778  3.0080us     981ns  12.349ms  cudaEventRecord
                    0.89%  704.22ms    288598  2.4400us     624ns  10.880ms  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.42%  333.73ms    166750  2.0010us      73ns  1.1830ms  cudaSetDevice
                    0.24%  194.41ms     67011  2.9010us     365ns  1.2253ms  cudaGetDevice
                    0.12%  93.806ms      1547  60.637us  10.066us  2.2598ms  cudaMalloc
                    0.04%  30.262ms      2256  13.413us     467ns  1.6574ms  cuDeviceGetAttribute
                    0.01%  9.0654ms        24  377.73us  239.21us  574.83us  cuDeviceTotalMem
                    0.01%  5.5531ms         4  1.3883ms  954.64us  1.7572ms  cudaGetDeviceProperties
                    0.00%  2.2392ms        24  93.299us  71.233us  155.01us  cuDeviceGetName
                    0.00%  1.5461ms       912  1.6950us     518ns  73.179us  cudaFuncSetAttribute
                    0.00%  1.4418ms       216  6.6750us     830ns  383.84us  cudaEventCreateWithFlags
                    0.00%  1.3181ms         2  659.06us  35.784us  1.2823ms  cudaHostAlloc
                    0.00%  841.92us        12  70.159us  17.552us  539.11us  cudaMemcpy
                    0.00%  812.45us        22  36.929us  13.082us  92.853us  cudaStreamCreateWithFlags
                    0.00%  610.40us         4  152.60us  62.885us  301.45us  cudaStreamCreate
                    0.00%  481.01us         8  60.126us  13.224us  343.28us  cudaStreamCreateWithPriority
                    0.00%  277.53us       202  1.3730us     658ns  23.334us  cudaDeviceGetAttribute
                    0.00%  48.017us        32  1.5000us     635ns  4.4500us  cuDeviceGet
                    0.00%  28.351us         8  3.5430us  2.8770us  5.5370us  cuDeviceGetPCIBusId
                    0.00%  21.491us         6  3.5810us  1.7350us  6.8090us  cudaEventCreate
                    0.00%  21.416us        24     892ns     522ns  1.5600us  cuDeviceGetUuid
                    0.00%  6.6090us         2  3.3040us  3.0640us  3.5450us  cudaHostGetDevicePointer
                    0.00%  6.2010us         5  1.2400us     631ns  2.1050us  cuDeviceGetCount
                    0.00%  4.8960us         4  1.2240us     709ns  2.1750us  cudaGetDeviceCount
                    0.00%  4.5440us         2  2.2720us  2.0340us  2.5100us  cudaDeviceGetStreamPriorityRange
                    0.00%  4.3850us         2  2.1920us  1.7190us  2.6660us  cuInit
                    0.00%  3.8450us         2  1.9220us  1.3320us  2.5130us  cuDriverGetVersion

```

**cu92mkl**
```
==21941== NVPROF is profiling process 21941, command: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==21941== Profiling application: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==21941== Warning: Found 55 invalid records in the result.
==21941== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==21941== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   33.70%  9.45188s    136024  69.486us  65.248us  87.616us  volta_sgemm_32x32_sliced1x4_nn
                   16.70%  4.68282s    139948  33.461us  17.664us  2.2807ms  volta_sgemm_128x64_nt
                   15.80%  4.43089s    150520  29.437us  25.984us  39.232us  volta_sgemm_32x32_sliced1x4_tn
                    4.36%  1.22354s      1308  935.43us  361.79us  2.3943ms  volta_sgemm_128x64_nn
                    4.17%  1.16848s      1422  821.72us  346.91us  1.9999ms  volta_sgemm_128x128_tn
                    2.85%  800.01ms    373330  2.1420us  2.0150us  13.888us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
                    2.66%  745.25ms    349570  2.1310us  1.9840us  14.208us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
                    2.02%  565.53ms      1308  432.36us  163.46us  1.0837ms  void mshadow::cuda::AddTakeGradKernel<int=6, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, float, long, long, int)
                    1.70%  477.35ms    133408  3.5780us  2.9120us  14.623us  void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::plusto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
                    1.63%  456.01ms    151933  3.0010us  2.0150us  163.81us  void mshadow::cuda::MapPlanKernel<mshadow::sv::plusto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    1.41%  395.84ms     14388  27.511us  1.3120us  99.583us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_9SGDKernelEJPfS4_S4_ffffNS_9OpReqTypeEEEEviDpT0_
                    1.33%  371.83ms      1422  261.48us  109.66us  580.57us  void mshadow::cuda::SoftmaxKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, long)
                    1.26%  352.15ms    225780  1.5590us  1.5030us  13.088us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=6, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    1.24%  346.55ms    207960  1.6660us  1.5990us  12.320us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=11, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    1.13%  315.67ms    207960  1.5170us  1.3750us  13.888us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op5rightEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
                    1.07%  300.31ms    205344  1.4620us  1.3110us  14.304us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op4leftEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
                    1.03%  289.23ms    228510  1.2650us     992ns  13.600us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3mulELi1EEEJPfS7_S7_EEEviDpT0_
                    0.92%  257.58ms    150520  1.7110us  1.5360us  13.344us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=7, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.84%  235.80ms    138640  1.7000us  1.5990us  13.792us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=12, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.82%  229.41ms    153250  1.4960us     992ns  13.728us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_
                    0.80%  224.70ms    133408  1.6840us  1.4710us  13.824us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPfNS_9OpReqTypeES4_S4_EEEviDpT0_
                    0.47%  132.46ms     14616  9.0620us  1.0880us  120.67us  [CUDA memcpy DtoD]
                    0.45%  127.20ms      1308  97.248us  40.288us  213.15us  void mshadow::cuda::SoftmaxGradKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>>(mshadow::gpu, int=2, float, long)
                    0.40%  112.83ms      1299  86.860us  32.864us  164.86us  void mshadow::cuda::MapPlanKernel<mshadow::sv::multo, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    0.37%  104.37ms      6540  15.958us  3.2640us  120.22us  void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::saveto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
                    0.21%  58.328ms      5460  10.682us  10.080us  17.088us  void mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>(int, int, bool, float const *, mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>*, mshadow::Shape<int=2>, mshadow::Shape, mshadow::Shape, mshadow::Shape, int, bool)
                    0.18%  50.957ms      1332  38.256us  1.2480us  44.320us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    0.17%  46.791ms      5471  8.5520us  1.5030us  13.900ms  [CUDA memcpy DtoH]
                    0.16%  44.861ms      5608  7.9990us  1.5040us  5.6511ms  [CUDA memcpy HtoD]
                    0.04%  10.481ms      1422  7.3700us  3.3600us  14.272us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_7TakeGPUILb1EEEJPfS5_S5_llEEEviDpT0_
                    0.03%  7.1928ms      2730  2.6340us  2.1430us  14.304us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_4pickILi2ELb1EEEJPfS5_PiiiN7mshadow5ShapeILi2EEES9_EEEviDpT0_
                    0.02%  6.2377ms      5688  1.0960us     991ns  9.6000us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_
                    0.01%  4.1941ms      2730  1.5360us  1.1840us  12.704us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, int>, int>, mshadow::expr::Plan<mshadow::expr::TypecastExp<int, float, mshadow::Tensor<mshadow::gpu, int=1, float>, int=1>, int>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=1)
                    0.01%  3.6781ms      2730  1.3470us     992ns  7.9680us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op2eqELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  3.4240ms      2730  1.2540us  1.0560us  8.5440us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3logELi1EEEJPfS7_EEEviDpT0_
                    0.01%  3.2767ms      2730  1.2000us     992ns  11.456us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op7maximumELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  3.0875ms      2730  1.1300us     992ns  11.648us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op6rminusELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  2.8087ms      1821  1.5420us  1.3760us  13.760us  [CUDA memset]
                    0.01%  2.1258ms         9  236.20us  234.02us  239.55us  void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::plusto, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
                    0.01%  2.1201ms         9  235.57us  234.24us  238.05us  void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::multo, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
      API calls:   63.12%  42.1805s   3145400  13.410us  5.3740us  25.822ms  cudaLaunchKernel
                   18.65%  12.4655s    336467  37.048us  3.8090us  3.6356ms  cudaStreamSynchronize
                    6.36%  4.24941s      1498  2.8367ms  154.77us  3.89087s  cudaMemGetInfo
                    3.36%  2.24470s        18  124.71ms     728ns  566.12ms  cudaFree
                    2.92%  1.95056s   1993245     978ns       2ns  1.4790ms  cudaPeekAtLastError
                    2.70%  1.80358s     25683  70.224us  8.0210us  20.948ms  cudaMemcpy2DAsync
                    1.76%  1.17849s   1152100  1.0220us       3ns  1.1467ms  cudaGetLastError
                    0.49%  327.96ms    166750  1.9660us      68ns  993.66us  cudaSetDevice
                    0.29%  191.04ms     67011  2.8500us     221ns  910.93us  cudaGetDevice
                    0.16%  110.12ms      1542  71.413us  9.3680us  18.735ms  cudaMalloc
                    0.07%  48.392ms      1821  26.574us  9.4630us  9.4157ms  cudaMemsetAsync
                    0.04%  27.123ms      2256  12.022us     458ns  575.77us  cuDeviceGetAttribute
                    0.02%  16.629ms      1813  9.1710us  4.4710us  62.413us  cudaEventQuery
                    0.01%  8.4049ms        24  350.20us  240.68us  575.37us  cuDeviceTotalMem
                    0.01%  7.6088ms      1815  4.1920us  2.1630us  44.168us  cudaEventRecord
                    0.01%  5.3368ms         4  1.3342ms  965.25us  1.4861ms  cudaGetDeviceProperties
                    0.00%  2.1962ms        24  91.507us  69.921us  158.19us  cuDeviceGetName
                    0.00%  2.1156ms       216  9.7940us     833ns  1.2832ms  cudaEventCreateWithFlags
                    0.00%  1.2896ms         2  644.81us  40.351us  1.2493ms  cudaHostAlloc
                    0.00%  1.1971ms         4  299.28us  51.603us  761.62us  cudaStreamCreate
                    0.00%  841.37us        12  70.113us  13.792us  535.50us  cudaMemcpy
                    0.00%  799.59us       384  2.0820us     772ns  97.788us  cudaFuncSetAttribute
                    0.00%  739.83us        22  33.628us  14.586us  100.29us  cudaStreamCreateWithFlags
                    0.00%  473.78us         8  59.223us  13.159us  335.79us  cudaStreamCreateWithPriority
                    0.00%  247.13us       202  1.2230us     642ns  3.6050us  cudaDeviceGetAttribute
                    0.00%  99.833us        32  3.1190us     527ns  16.689us  cuDeviceGet
                    0.00%  21.441us         6  3.5730us  1.7490us  6.7460us  cudaEventCreate
                    0.00%  21.325us         8  2.6650us  2.1230us  4.6420us  cuDeviceGetPCIBusId
                    0.00%  6.8890us         2  3.4440us  3.3850us  3.5040us  cudaHostGetDevicePointer
                    0.00%  4.9550us         5     991ns     470ns  1.4300us  cuDeviceGetCount
                    0.00%  4.8480us         2  2.4240us  2.0320us  2.8160us  cudaDeviceGetStreamPriorityRange
                    0.00%  4.5770us         2  2.2880us  1.6370us  2.9400us  cuInit
                    0.00%  4.5240us         4  1.1310us     678ns  1.7230us  cudaGetDeviceCount
                    0.00%  4.3320us         2  2.1660us  1.2450us  3.0870us  cuDriverGetVersion

```
MXNet Profiler API result for 1 epoch
**cu100mkl**
```
Profile Statistics.
	Note that counter items are counter values and not time units.
Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                         193       81507.3906       26000.0000      328184.1875      151092.0938
Memory: gpu/0                       66993      486890.5625           7.6800      486936.6250      243464.4844

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArraySyncCopyFromCPU               24         165.8120           0.0640          89.2050           6.9088
MXNDArrayReshape64                   2730          11.1690           0.0020           0.0400           0.0041
MXExecutorForward                    1422        1260.3890           0.1420          11.0660           0.8863
MXExecutorBackwardEx                 1308        1223.9100           0.1730           9.1420           0.9357
MXNDArraySlice                       5574          13.2250           0.0000           0.0500           0.0024
MXNDArrayFree                       41162         366.8830           0.0000           0.2150           0.0089
MXNDArrayGetContext                 10958          10.5630           0.0000           0.1010           0.0010
MXSymbolListAttr                        3           5.6350           0.9650           2.7530           1.8783
MXImperativeInvokeEx                50115        1389.6400           0.0070           1.7010           0.0277
MXNDArrayGetDType                   11015          10.1360           0.0000           0.0400           0.0009
MXNet C API Calls                  195096         195.0960           0.0010         195.0960          97.5475
MXNet C API Concurrency            390192           0.0000           0.0000           0.0010           0.0005
MXExecutorSimpleBind                    6        3734.5410          15.9950        3542.6250         622.4235
MXSymbolCreateAtomicSymbol            5736          36.1470           0.0000           0.1140           0.0063
MXNDArraySyncCopyToCPU               5460       80959.1719           0.0850         363.2810          14.8277
MXSymbolGetAttr                         6           0.0140           0.0010           0.0040           0.0023
MXNDArrayGetStorageType               171           0.1540           0.0000           0.0050           0.0009
MXNDArrayCreateEx                    8258          73.3280           0.0010           3.3300           0.0089
MXExecutorOutputs                       6           0.0160           0.0020           0.0030           0.0027
MXExecutorPrint                         6          28.1480           1.7110          10.1980           4.6913
MXNDArrayGetShape                   51136          52.0260           0.0000           0.0440           0.0010

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_plus_scalar                           22          24.1390           0.0160           4.5600           1.0972
CopyGPU2CPU                            22         176.3340           0.0570          20.1660           8.0152
SoftmaxOutput                        2844         812.7720           0.1220           1.4340           0.2858
broadcast_mul                        5460         210.2000           0.0170           0.6040           0.0385
broadcast_add                        5460         208.4420           0.0150           0.8420           0.0382
log                                  5460         159.6020           0.0150           0.4100           0.0292
add_n                              266816        9588.1748           0.0110           2.0040           0.0359
_backward_Activation               693200       13628.1191           0.0120           2.3540           0.0197
_backward_mul                      415920        9997.9492           0.0110           2.0910           0.0240
_backward_copy                      71936         479.4360           0.0030           0.4750           0.0067
sgd_update                          28776        6515.3848           0.0130           2.8010           0.2264
_backward_FullyConnected           279896       37841.1094           0.0860           2.1650           0.1352
Reshape                              5688         146.5510           0.0030           1.4950           0.0258
sum                                 10920         549.3320           0.0250           0.7170           0.0503
_zeros                              11398         529.2270           0.0100           8.5110           0.0464
Activation                         752600       21769.1758           0.0120           3.0950           0.0289
_equal_scalar                        5460         224.6300           0.0160           1.3360           0.0411
_rminus_scalar                       5460         233.0980           0.0190           0.2030           0.0427
CopyCPU2CPU                            88         304.1030           0.0050          27.1770           3.4557
SetupExec                           22932          48.5340           0.0000           0.0970           0.0021
Concat                               2844         431.9350           0.0450           1.2210           0.1519
_full                                  10           0.0810           0.0020           0.0160           0.0081
_backward_SliceChannel             141256        5605.5112           0.0230           1.3650           0.0397
_slice_assign_scalar                    4           0.2660           0.0190           0.1140           0.0665
pick                                 5460         282.8300           0.0180           3.2620           0.0518
_copy                               20928        1165.9010           0.0150           1.3830           0.0557
_random_uniform                        12          74.5140           2.9510          12.7100           6.2095
_backward_SoftmaxOutput              2616         578.9200           0.0920           1.2720           0.2213
SyncCopyGPU2CPU                     10920         582.0080           0.0200           0.1760           0.0533
ResourceParallelRandomSetSeed               2          19.5030           9.7440           9.7590           9.7515
elemwise_mul                       451560        9896.3320           0.0100           2.4910           0.0219
FullyConnected                     303884       21313.8770           0.0450           4.1210           0.0701
_maximum_scalar                      5460         193.6250           0.0140           1.3690           0.0355
DeleteVariable                      65696         385.6180           0.0010           1.4810           0.0059
elemwise_add                       301040        5272.2681           0.0100           1.7210           0.0175
SliceChannel                       153364        6224.5142           0.0230          12.3770           0.0406
_div_scalar                            22          22.9760           0.0100           5.3440           1.0444
expand_dims                         75260        1019.0590           0.0030           1.1830           0.0135
_backward_Concat                     2616         539.6740           0.0490          32.9550           0.2063
CopyCPU2GPU                         11192         781.9070           0.0130           5.8030           0.0699
SetValueOp                             48           3.2530           0.0500           0.1220           0.0678
Embedding                            2844         161.1000           0.0190           1.7600           0.0566
_copyto_GPU2GPU                     10920         567.0010           0.0130           0.2170           0.0519
_backward_Embedding                  2616        1285.4290           0.2150           1.8690           0.4914
WaitForVar                          10968          89.9160           0.0020           0.3290           0.0082
```
**cu92mkl**
```
Profile Statistics.
	Note that counter items are counter values and not time units.
Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                         193       81094.4297       26000.0000      328184.1875      151092.0938
Memory: gpu/0                       66993      486890.5625           7.6800      486936.6250      243464.4844

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArraySyncCopyFromCPU               24         176.6150           0.0330          69.3230           7.3590
MXNDArrayReshape64                   2730          12.2640           0.0020           0.0410           0.0045
MXExecutorForward                    1422        1316.2700           0.1480          14.0690           0.9256
MXExecutorBackwardEx                 1308        1286.3680           0.1690           5.8390           0.9835
MXNDArraySlice                       5574          13.4970           0.0000           0.0590           0.0024
MXNDArrayFree                       41162         396.0320           0.0000           3.7350           0.0096
MXNDArrayGetContext                 10958          10.7350           0.0000           0.0380           0.0010
MXSymbolListAttr                        3           4.0050           1.2550           1.4000           1.3350
MXImperativeInvokeEx                50115        1499.5780           0.0070           2.2110           0.0299
MXNDArrayGetDType                   11015           9.8290           0.0000           0.0370           0.0009
MXNet C API Calls                  195096         195.0960           0.0010         195.0960          97.5475
MXNet C API Concurrency            390192           0.0000           0.0000           0.0010           0.0005
MXExecutorSimpleBind                    6        3615.5139           8.9550        3481.5950         602.5857
MXSymbolCreateAtomicSymbol            5736          30.6770           0.0010           0.1260           0.0053
MXNDArraySyncCopyToCPU               5460       72482.7344           0.0850         249.1380          13.2752
MXSymbolGetAttr                         6           0.0140           0.0010           0.0030           0.0023
MXNDArrayGetStorageType               171           0.1090           0.0000           0.0030           0.0006
MXNDArrayCreateEx                    8258          67.8010           0.0010           3.1960           0.0082
MXExecutorOutputs                       6           0.0150           0.0020           0.0040           0.0025
MXExecutorPrint                         6          25.9240           0.9360          12.1280           4.3207
MXNDArrayGetShape                   51136          55.6040           0.0000           0.0500           0.0011

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_plus_scalar                           22          47.2760           0.0080          10.5180           2.1489
CopyGPU2CPU                            22         156.4810           0.0340          20.2620           7.1128
broadcast_add                        5460         198.4270           0.0160           3.6910           0.0363
_backward_Activation               693200       13550.5508           0.0130           1.9900           0.0195
_backward_mul                      415920       10045.2979           0.0110           2.0190           0.0242
_backward_copy                      71936         495.9330           0.0030           0.6440           0.0069
SoftmaxOutput                        2844         806.0880           0.1250           0.6040           0.2834
Reshape                              5688         145.6910           0.0030           1.0440           0.0256
add_n                              266816        9710.6777           0.0120           1.8540           0.0364
log                                  5460         162.8830           0.0150           2.2690           0.0298
broadcast_mul                        5460         218.6930           0.0170           1.6100           0.0401
sgd_update                          28776        6457.4722           0.0130           2.4300           0.2244
_backward_FullyConnected           279896       23458.5215           0.0410           4.9550           0.0838
_backward_Concat                     2616         520.1320           0.0490          24.3040           0.1988
CopyCPU2GPU                         11192         753.7490           0.0140           7.5330           0.0673
_backward_Embedding                  2616        1299.2860           0.2160           5.2340           0.4967
WaitForVar                          10968          92.5920           0.0020           0.2480           0.0084
_rminus_scalar                       5460         209.0210           0.0170           1.5620           0.0383
CopyCPU2CPU                            88         308.8270           0.0040          40.2790           3.5094
SetupExec                           22932          59.5050           0.0000           3.3250           0.0026
Concat                               2844         438.0390           0.0470           0.6820           0.1540
_full                                  10           4.8940           0.0070           2.4050           0.4894
_backward_SliceChannel             141256        5496.1040           0.0230           3.3710           0.0389
_slice_assign_scalar                    4           9.3150           0.0330           4.6260           2.3287
pick                                 5460         291.0040           0.0180           5.0680           0.0533
_copy                               20928        1201.1370           0.0160           0.7570           0.0574
_random_uniform                        12          97.6060           2.5460          23.4520           8.1338
sum                                 10920         555.5390           0.0260           4.9930           0.0509
_zeros                              11398         543.3210           0.0110          14.9830           0.0477
_equal_scalar                        5460         238.2550           0.0170           9.3850           0.0436
Activation                         752600       21566.7363           0.0120           3.8930           0.0287
ResourceParallelRandomSetSeed               2          19.7140           9.8530           9.8610           9.8570
Embedding                            2844         169.3260           0.0180           7.2920           0.0595
SetValueOp                             48           4.7550           0.0320           0.9550           0.0991
_copyto_GPU2GPU                     10920         581.7290           0.0140          11.1590           0.0533
SliceChannel                       153364        6120.6851           0.0230           9.5310           0.0399
FullyConnected                     303884       19919.7012           0.0440           2.4720           0.0656
elemwise_mul                       451560        9846.6777           0.0110           2.3670           0.0218
_maximum_scalar                      5460         186.1150           0.0150           1.6400           0.0341
DeleteVariable                      65696         389.9880           0.0010           3.7150           0.0059
elemwise_add                       301040        5299.9009           0.0110           1.6230           0.0176
SyncCopyGPU2CPU                     10920         570.1200           0.0240           0.3210           0.0522
_backward_SoftmaxOutput              2616         576.9860           0.0930           2.0890           0.2206
_div_scalar                            22          27.2660           0.0070           7.1150           1.2394
expand_dims                         75260         835.0670           0.0030           0.1240           0.0111
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance Regression on CUDA10 #14725

Description

Environment info (Required)

Steps to reproduce

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Performance Regression on CUDA10 #14725

Description

Description

Environment info (Required)

Steps to reproduce

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions