Slim multi-GPU performance problems #1390

@Ettard

Description

I was training the slim models on the Flowers dataset on Ubuntu 16.04.

TensorFlow version: 1.1.0rc2 (built from source)

Git commit:
34c738cc6d3badcb22e3f72482536ada29bd0e65

Bazel version:
Build label: 0.4.5
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Thu Mar 16 12:19:38 2017 (1489666778)
Build timestamp: 1489666778
Build timestamp as int: 1489666778

CUDA version: 8.0.44
cuDNN version: 5.1.5
GPUs: 3x GeForce GTX 1080 Ti (11 GB each)
Memory: 32 GB

I didn't change the source code.

With 1 GPU:

TRAIN_DIR=/tmp/train_logs
DATASET_DIR=/home/l/data/flowers
python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=flowers --dataset_split_name=train --dataset_dir=${DATASET_DIR} --model_name=inception_resnet_v2

(startup log omitted here; it looks the same as in the 3-GPU run below)

INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.2313 (0.96 sec/step)
INFO:tensorflow:global step 20: loss = 3.7792 (0.97 sec/step)
INFO:tensorflow:global step 30: loss = 2.9681 (0.96 sec/step)
INFO:tensorflow:global step 40: loss = 3.8321 (0.97 sec/step)
INFO:tensorflow:global step 50: loss = 3.2210 (0.96 sec/step)
...

When I use 3 GPUs:
python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=flowers --dataset_split_name=train --dataset_dir=${DATASET_DIR} --model_name=inception_resnet_v2 --num_clones=3
2017-04-24 14:26:11.885411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:05:00.0
Total memory: 10.91GiB
Free memory: 10.53GiB
2017-04-24 14:26:11.885472: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x5b62c2c0
2017-04-24 14:26:12.131777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 1 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:06:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-04-24 14:26:12.131848: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x5945f2d0
2017-04-24 14:26:12.369331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:09:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-04-24 14:26:12.371583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2
2017-04-24 14:26:12.371596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y Y Y
2017-04-24 14:26:12.371601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1: Y Y Y
2017-04-24 14:26:12.371606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2: Y Y Y
2017-04-24 14:26:12.371615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:05:00.0)
2017-04-24 14:26:12.371622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Graphics Device, pci bus id: 0000:06:00.0)
2017-04-24 14:26:12.371625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Graphics Device, pci bus id: 0000:09:00.0)
INFO:tensorflow:Restoring parameters from /tmp/train_logs/model.ckpt-0
2017-04-24 14:26:17.426353: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:2 for node 'clone_2/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
2017-04-24 14:26:17.427748: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:1 for node 'clone_1/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
2017-04-24 14:26:17.429099: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:0 for node 'clone_0/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /tmp/train_logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 2.9670 (0.98 sec/step)
INFO:tensorflow:global step 20: loss = 2.9945 (0.99 sec/step)
INFO:tensorflow:global step 30: loss = 3.0432 (0.99 sec/step)
INFO:tensorflow:global step 40: loss = 3.0007 (1.04 sec/step)
INFO:tensorflow:global step 50: loss = 2.8072 (1.03 sec/step)
...

I saw the "Ignoring device specification" messages, and the training speed (sec/step) didn't change.
This is the nvidia-smi output with 3 GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.13 Driver Version: 378.13 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Graphics Device Off | 0000:05:00.0 On | N/A |
| 49% 83C P2 140W / 250W | 10754MiB / 11171MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Graphics Device Off | 0000:06:00.0 Off | N/A |
| 47% 81C P2 137W / 250W | 10744MiB / 11172MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 2 Graphics Device Off | 0000:09:00.0 Off | N/A |
| 43% 74C P2 130W / 250W | 10744MiB / 11172MiB | 98% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1065 G /usr/lib/xorg/Xorg 160MiB |
| 0 1757 G compiz 81MiB |
| 0 14407 C python 10497MiB |
| 1 14407 C python 10729MiB |
| 2 14407 C python 10729MiB |
+-----------------------------------------------------------------------------+
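One caveat when comparing the step timings: if the `--batch_size` flag in `train_image_classifier.py` (default 32) is applied per clone (which is my assumption, not something I have verified), then with `--num_clones=3` each step processes three batches, and a similar sec/step would actually mean roughly 3x the throughput. A back-of-the-envelope check with the numbers from the logs above:

```python
# ASSUMPTION: --batch_size (default 32) is the batch size *per clone*,
# so num_clones=3 processes 3 batches per training step.
batch_size_per_clone = 32

sec_per_step_1gpu = 0.96  # from the 1-GPU log above
sec_per_step_3gpu = 1.00  # roughly, from the 3-GPU log above

# Images processed per second = (clones * batch per clone) / step time.
images_per_sec_1gpu = 1 * batch_size_per_clone / sec_per_step_1gpu
images_per_sec_3gpu = 3 * batch_size_per_clone / sec_per_step_3gpu

print(round(images_per_sec_1gpu, 1))  # ~33.3
print(round(images_per_sec_3gpu, 1))  # ~96.0
```

If that assumption is wrong and the batch is split across clones, then the numbers above really would indicate no speedup.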

Something else
I tried the inception model with 3 GPUs and it worked well, with a clear speed boost. There was no "Ignoring device specification" in the inception model logs, so I'm not sure whether those messages are related to the problem.

Similar problems:
#1338
tensorflow/tensorflow#8061 (I tried that script on TF 1.1.0 and "Ignoring device specification" appeared too. If someone needs details, I will post the logs.)

I changed the model to inception_v3; nothing seems to have changed.
I'm also considering dumping the batch contents, in case that would be helpful.
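To compare runs a bit more precisely than eyeballing the log, the sec/step figures can be scraped out of the slim training output. A small self-contained sketch (the regex matches the logging format shown above):

```python
import re

# Matches slim's "global step N: loss = X (Y sec/step)" log lines.
STEP_RE = re.compile(r"global step (\d+): loss = ([\d.]+) \(([\d.]+) sec/step\)")

def mean_sec_per_step(log_text):
    """Average step time across all matching log lines, or None."""
    times = [float(m.group(3)) for m in STEP_RE.finditer(log_text)]
    return sum(times) / len(times) if times else None

log = """\
INFO:tensorflow:global step 10: loss = 2.9670 (0.98 sec/step)
INFO:tensorflow:global step 20: loss = 2.9945 (0.99 sec/step)
INFO:tensorflow:global step 30: loss = 3.0432 (0.99 sec/step)
"""
print(round(mean_sec_per_step(log), 2))  # 0.99
```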
