Description
I was training the TF-Slim models on the flowers dataset on Ubuntu 16.04.
TensorFlow version: 1.1.0rc2 (built from source)
git version:
34c738cc6d3badcb22e3f72482536ada29bd0e65
Bazel version:
Build label: 0.4.5
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Thu Mar 16 12:19:38 2017 (1489666778)
Build timestamp: 1489666778
Build timestamp as int: 1489666778
CUDA version: 8.0.44
cuDNN version: 5.1.5
GPU: 3 GPUs, all GeForce GTX 1080 Ti (11 GB)
Memory: 32 GB
I did not modify the source code.
With 1 GPU:
TRAIN_DIR=/tmp/train_logs
DATASET_DIR=/home/l/data/flowers
python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=flowers --dataset_split_name=train --dataset_dir=${DATASET_DIR} --model_name=inception_resnet_v2
…
(the log here is the same as when running with 3 GPUs)
…
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.2313 (0.96 sec/step)
INFO:tensorflow:global step 20: loss = 3.7792 (0.97 sec/step)
INFO:tensorflow:global step 30: loss = 2.9681 (0.96 sec/step)
INFO:tensorflow:global step 40: loss = 3.8321 (0.97 sec/step)
INFO:tensorflow:global step 50: loss = 3.2210 (0.96 sec/step)
...
With 3 GPUs:
python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=flowers --dataset_split_name=train --dataset_dir=${DATASET_DIR} --model_name=inception_resnet_v2 --num_clones=3
2017-04-24 14:26:11.885411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:05:00.0
Total memory: 10.91GiB
Free memory: 10.53GiB
2017-04-24 14:26:11.885472: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x5b62c2c0
2017-04-24 14:26:12.131777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 1 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:06:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-04-24 14:26:12.131848: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x5945f2d0
2017-04-24 14:26:12.369331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:09:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-04-24 14:26:12.371583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2
2017-04-24 14:26:12.371596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y Y Y
2017-04-24 14:26:12.371601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1: Y Y Y
2017-04-24 14:26:12.371606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2: Y Y Y
2017-04-24 14:26:12.371615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:05:00.0)
2017-04-24 14:26:12.371622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Graphics Device, pci bus id: 0000:06:00.0)
2017-04-24 14:26:12.371625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Graphics Device, pci bus id: 0000:09:00.0)
INFO:tensorflow:Restoring parameters from /tmp/train_logs/model.ckpt-0
2017-04-24 14:26:17.426353: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:2 for node 'clone_2/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
2017-04-24 14:26:17.427748: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:1 for node 'clone_1/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
2017-04-24 14:26:17.429099: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:0 for node 'clone_0/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /tmp/train_logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 2.9670 (0.98 sec/step)
INFO:tensorflow:global step 20: loss = 2.9945 (0.99 sec/step)
INFO:tensorflow:global step 30: loss = 3.0432 (0.99 sec/step)
INFO:tensorflow:global step 40: loss = 3.0007 (1.04 sec/step)
INFO:tensorflow:global step 50: loss = 2.8072 (1.03 sec/step)
...
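With 3 GPUs the log shows "Ignoring device specification" messages, and the time per step did not improve: about 0.98 to 1.04 sec/step, versus 0.96 to 0.97 sec/step with 1 GPU.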
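To compare the two runs on more than eyeballed numbers, the per-step timings can be pulled out of the "global step" lines above. This is only a minimal sketch: the function names and the regular expression are hypothetical helpers, not part of the slim scripts. The images-per-second figure rests on an assumption that is not confirmed by the logs, namely that each clone processes its own batch of `batch_size` images per step; the value 32 is used purely for illustration.

```python
import re

# Matches lines such as:
# INFO:tensorflow:global step 10: loss = 3.2313 (0.96 sec/step)
STEP_RE = re.compile(r"global step (\d+): loss = ([\d.]+) \(([\d.]+) sec/step\)")


def mean_sec_per_step(log_text):
    """Average the sec/step values found in a training log."""
    times = [float(m.group(3)) for m in STEP_RE.finditer(log_text)]
    if not times:
        raise ValueError("no 'global step' lines found")
    return sum(times) / len(times)


def images_per_sec(sec_per_step, batch_size, num_clones):
    # ASSUMPTION: every clone consumes one full batch of `batch_size`
    # images per step. If that is wrong, this number is wrong too.
    return batch_size * num_clones / sec_per_step


one_gpu_log = """
INFO:tensorflow:global step 10: loss = 3.2313 (0.96 sec/step)
INFO:tensorflow:global step 20: loss = 3.7792 (0.97 sec/step)
INFO:tensorflow:global step 30: loss = 2.9681 (0.96 sec/step)
INFO:tensorflow:global step 40: loss = 3.8321 (0.97 sec/step)
INFO:tensorflow:global step 50: loss = 3.2210 (0.96 sec/step)
"""

three_gpu_log = """
INFO:tensorflow:global step 10: loss = 2.9670 (0.98 sec/step)
INFO:tensorflow:global step 20: loss = 2.9945 (0.99 sec/step)
INFO:tensorflow:global step 30: loss = 3.0432 (0.99 sec/step)
INFO:tensorflow:global step 40: loss = 3.0007 (1.04 sec/step)
INFO:tensorflow:global step 50: loss = 2.8072 (1.03 sec/step)
"""

if __name__ == "__main__":
    for name, log, clones in [("1 GPU", one_gpu_log, 1), ("3 GPUs", three_gpu_log, 3)]:
        sec = mean_sec_per_step(log)
        # batch_size=32 is an illustrative value, not read from the logs.
        print(name, "mean sec/step:", round(sec, 3),
              "images/sec (assumed):", round(images_per_sec(sec, 32, clones), 1))
```

If the per-clone batch assumption holds, an unchanged sec/step with three clones would still mean roughly three times as many images per second. If it does not hold, the two runs really do have the same throughput, so it is worth checking which case applies before concluding anything from sec/step alone.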
This is the nvidia-smi output while training with 3 GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.13                 Driver Version: 378.13                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 0000:05:00.0      On |                  N/A |
| 49%   83C    P2   140W / 250W |  10754MiB / 11171MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     Off  | 0000:06:00.0     Off |                  N/A |
| 47%   81C    P2   137W / 250W |  10744MiB / 11172MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Graphics Device     Off  | 0000:09:00.0     Off |                  N/A |
| 43%   74C    P2   130W / 250W |  10744MiB / 11172MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1065    G   /usr/lib/xorg/Xorg                             160MiB |
|    0      1757    G   compiz                                          81MiB |
|    0     14407    C   python                                       10497MiB |
|    1     14407    C   python                                       10729MiB |
|    2     14407    C   python                                       10729MiB |
+-----------------------------------------------------------------------------+
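A single snapshot like the one above only shows one moment in time. To track utilization across a run, nvidia-smi can emit machine-readable CSV through its `--query-gpu` and `--format=csv,noheader,nounits` options. The following is a minimal sketch of parsing that output; `parse_gpu_csv` and `query_gpus` are hypothetical helper names, and the sample string is assembled by hand from the numbers in the table above rather than captured from a real run.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"]
    )
    return parse_gpu_csv(out.decode())


# Hand-written sample built from the table above, not real captured output.
sample = "0, 98, 10754, 11171\n1, 98, 10744, 11172\n2, 98, 10744, 11172\n"

if __name__ == "__main__":
    for row in parse_gpu_csv(sample):
        print(row)
```

Calling `query_gpus()` in a loop with a short sleep would show whether all three GPUs stay busy for the whole step or only in bursts, which the single 98% reading cannot distinguish.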
Something Else
I also tried the inception model with 3 GPUs, and it ran with a clear speedup. There was no "Ignoring device specification" message in the inception model logs, but I'm not sure whether that message is the cause of the problem.
Similar problems:
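To see exactly which nodes the placer message refers to, the "Ignoring device specification" lines can be summarized from a saved log. This is a minimal sketch with a hypothetical helper name and a regular expression written against the three lines in the log above; other TensorFlow versions may word the message differently.

```python
import re

# Written against the simple_placer.cc message shown in the log above.
IGNORE_RE = re.compile(
    r"Ignoring device specification (\S+) for node '([^']+)' because "
    r".* already has a device field set to (\S+)"
)


def ignored_placements(log_text):
    """Return (node, requested_device, actual_device) for each placer message."""
    return [
        (m.group(2), m.group(1), m.group(3))
        for m in IGNORE_RE.finditer(log_text)
    ]


sample_log = (
    "Ignoring device specification /device:GPU:2 for node "
    "'clone_2/fifo_queue_Dequeue' because the input edge from "
    "'prefetch_queue/fifo_queue' is a reference connection and already has "
    "a device field set to /device:CPU:0\n"
    "Ignoring device specification /device:GPU:1 for node "
    "'clone_1/fifo_queue_Dequeue' because the input edge from "
    "'prefetch_queue/fifo_queue' is a reference connection and already has "
    "a device field set to /device:CPU:0\n"
    "Ignoring device specification /device:GPU:0 for node "
    "'clone_0/fifo_queue_Dequeue' because the input edge from "
    "'prefetch_queue/fifo_queue' is a reference connection and already has "
    "a device field set to /device:CPU:0\n"
)

if __name__ == "__main__":
    for node, requested, actual in ignored_placements(sample_log):
        print(node, "requested", requested, "placed on", actual)
```

In the log above, the message appears only for the three `fifo_queue_Dequeue` nodes, one per clone, each kept on `/device:CPU:0` alongside the prefetch queue.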
#1338
tensorflow/tensorflow#8061 (I tried the script from that issue on TF 1.1.0, and "Ignoring device specification" appeared there too. If anyone needs details, I will post the logs.)
I changed the model to inception_v3, and nothing seemed to change.
I'm also considering printing the batch contents, in case that helps.