Multi-GPU Inference using FIFO Queues issue in TF v1.0.0

Here is an example multi-gpu inference code which produces incorrect Output in TF v1.0.0 but produces correct output in v0.11.0 (single GPU produces correct output in both versions). Please refer to the attached code for more details.

[InferenceTest.txt](https://github.com/tensorflow/tensorflow/files/817729/InferenceTest.txt)
[Inference.txt](https://github.com/tensorflow/tensorflow/files/817730/Inference.txt)


**Correct Output (TF v0.11) -- 2 GPUs** 
python InferenceTest.py 2
-------------------------
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
**************** Using 2 GPUs *****************
TensorFlow version:  0.11.0
GPU ID List:  [0, 1]
Device:  /gpu:0
Device:  /gpu:1
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:05.0
Total memory: 11.21GiB
Free memory: 11.09GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x37b9dc0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 1 with properties: 
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:09.0
Total memory: 11.21GiB
Free memory: 11.09GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1:   N Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:05.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla M40, pci bus id: 0000:00:09.0)
Input Value 0
Input Value 1
Input Value 2
Input Value 3
Input Value 4
Input Value 5
Input Value 6
Input Value 7
Input Value 8
Input Value 9
Input Queue Size=  10
Results Queue Size=  0
Processing: Input Queue Size=  8
Processing: Results Queue Size=  2
Processing: Input Queue Size=  6
Processing: Results Queue Size=  4
Processing: Input Queue Size=  4
Processing: Results Queue Size=  6
Processing: Input Queue Size=  2
Processing: Results Queue Size=  8
Processing: Input Queue Size=  0
Processing: Results Queue Size=  10
************* Dequeue Results ********************
Results Queue Size=  10
Dequeue Results: Results Output  1:
Dequeue Results: Results Queue Size=  9
Dequeue Results: Results Output  2:
Dequeue Results: Results Queue Size=  8
Dequeue Results: Results Output  3:
Dequeue Results: Results Queue Size=  7
Dequeue Results: Results Output  4:
Dequeue Results: Results Queue Size=  6
Dequeue Results: Results Output  6:
Dequeue Results: Results Queue Size=  5
Dequeue Results: Results Output  5:
Dequeue Results: Results Queue Size=  4
Dequeue Results: Results Output  7:
Dequeue Results: Results Queue Size=  3
Dequeue Results: Results Output  8:
Dequeue Results: Results Queue Size=  2
Dequeue Results: Results Output  9:
Dequeue Results: Results Queue Size=  1
Dequeue Results: Results Output  10:
Dequeue Results: Results Queue Size=  0
Final Input Queue Size=  0
Final Results Queue Size=  0

***************************************************************************************************
**Wrong Output (TF v1.0.0) -- 2 GPUs** 
python InferenceTest.py 2
-------------------------
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
**************** Using 2 GPUs *****************
TensorFlow version:  1.0.0
GPU ID List:  [0, 1]
Device:  /gpu:0
Device:  /gpu:1
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:05.0
Total memory: 11.21GiB
Free memory: 11.09GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x1999d60
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties: 
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:09.0
Total memory: 11.21GiB
Free memory: 11.09GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   N Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:05.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla M40, pci bus id: 0000:00:09.0)
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:1 for node 'Tower_1/Dequeue_Input_Data/fifo_queue_Dequeue' because the input edge from 'Input_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:1 for node 'Tower_1/Results_Enqueue/fifo_queue_enqueue' because the input edge from 'Results_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:0 for node 'Tower_0/Dequeue_Input_Data/fifo_queue_Dequeue' because the input edge from 'Input_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:0 for node 'Tower_0/Results_Enqueue/fifo_queue_enqueue' because the input edge from 'Results_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
Input Value 0
Input Value 1
Input Value 2
Input Value 3
Input Value 4
Input Value 5
Input Value 6
Input Value 7
Input Value 8
Input Value 9
Input Queue Size=  10
Results Queue Size=  0
Processing: Input Queue Size=  9
Processing: Results Queue Size=  2
Processing: Input Queue Size=  8
Processing: Results Queue Size=  4
Processing: Input Queue Size=  7
Processing: Results Queue Size=  6
Processing: Input Queue Size=  6
Processing: Results Queue Size=  8
Processing: Input Queue Size=  5
Processing: Results Queue Size=  10
Processing: Input Queue Size=  4
Processing: Results Queue Size=  12
Processing: Input Queue Size=  3
Processing: Results Queue Size=  14
Processing: Input Queue Size=  2
Processing: Results Queue Size=  16
Processing: Input Queue Size=  1
Processing: Results Queue Size=  18
Processing: Input Queue Size=  0
Processing: Results Queue Size=  20
************* Dequeue Results ********************
Results Queue Size=  20
Dequeue Results: Results Output  1:
Dequeue Results: Results Queue Size=  19
Dequeue Results: Results Output  1:
Dequeue Results: Results Queue Size=  18
Dequeue Results: Results Output  2:
Dequeue Results: Results Queue Size=  17
Dequeue Results: Results Output  2:
Dequeue Results: Results Queue Size=  16
Dequeue Results: Results Output  3:
Dequeue Results: Results Queue Size=  15
Dequeue Results: Results Output  3:
Dequeue Results: Results Queue Size=  14
Dequeue Results: Results Output  4:
Dequeue Results: Results Queue Size=  13
Dequeue Results: Results Output  4:
Dequeue Results: Results Queue Size=  12
Dequeue Results: Results Output  5:
Dequeue Results: Results Queue Size=  11
Dequeue Results: Results Output  5:
Dequeue Results: Results Queue Size=  10
Dequeue Results: Results Output  6:
Dequeue Results: Results Queue Size=  9
Dequeue Results: Results Output  6:
Dequeue Results: Results Queue Size=  8
Dequeue Results: Results Output  7:
Dequeue Results: Results Queue Size=  7
Dequeue Results: Results Output  7:
Dequeue Results: Results Queue Size=  6
Dequeue Results: Results Output  8:
Dequeue Results: Results Queue Size=  5
Dequeue Results: Results Output  8:
Dequeue Results: Results Queue Size=  4
Dequeue Results: Results Output  9:
Dequeue Results: Results Queue Size=  3
Dequeue Results: Results Output  9:
Dequeue Results: Results Queue Size=  2
Dequeue Results: Results Output  10:
Dequeue Results: Results Queue Size=  1
Dequeue Results: Results Output  10:
Dequeue Results: Results Queue Size=  0
Final Input Queue Size=  0
Final Results Queue Size=  0

### What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
None

### Environment info
Operating System: CentOS 7.2.1511

Installed version of CUDA and cuDNN: 
(please attach the output of `ls -l /path/to/cuda/lib/libcud*`):
![cudaversion](https://cloud.githubusercontent.com/assets/21690396/23567127/a771fd10-0009-11e7-88bb-daa27a1b28c2.PNG)

If installed from binary pip package, provide:

1. A link to the pip package you installed: https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.0-cp35-cp35m-linux_x86_64.whl 
2. The output from `python -c "import tensorflow; print(tensorflow.__version__)"`.

See the output above for 2 different tensor flow versions

If installed from source, provide 

1. The commit hash (`git rev-parse HEAD`)
2. The output of `bazel version`

### If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)


### What other attempted solutions have you tried?


### Logs or other output that would be helpful
(If logs are large, please upload as attachment or provide link).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Multi-GPU Inference using FIFO Queues issue in TF v1.0.0 #8061

Correct Output (TF v0.11) -- 2 GPUs
python InferenceTest.py 2

Wrong Output (TF v1.0.0) -- 2 GPUs
python InferenceTest.py 2

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

Environment info

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

What other attempted solutions have you tried?

Logs or other output that would be helpful

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Multi-GPU Inference using FIFO Queues issue in TF v1.0.0 #8061

Description

Correct Output (TF v0.11) -- 2 GPUs python InferenceTest.py 2

Wrong Output (TF v1.0.0) -- 2 GPUs python InferenceTest.py 2

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

Environment info

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

What other attempted solutions have you tried?

Logs or other output that would be helpful

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Correct Output (TF v0.11) -- 2 GPUs
python InferenceTest.py 2

Wrong Output (TF v1.0.0) -- 2 GPUs
python InferenceTest.py 2