Skip to content

Multi-GPU Inference using FIFO Queues issue in TF v1.0.0 #8061

@agupta74

Description

@agupta74

Here is an example multi-gpu inference code which produces incorrect Output in TF v1.0.0 but produces correct output in v0.11.0 (single GPU produces correct output in both versions). Please refer to the attached code for more details.

InferenceTest.txt
Inference.txt

Correct Output (TF v0.11) -- 2 GPUs
python InferenceTest.py 2

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
**************** Using 2 GPUs *****************
TensorFlow version: 0.11.0
GPU ID List: [0, 1]
Device: /gpu:0
Device: /gpu:1
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:05.0
Total memory: 11.21GiB
Free memory: 11.09GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x37b9dc0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 1 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:09.0
Total memory: 11.21GiB
Free memory: 11.09GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1: N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:05.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla M40, pci bus id: 0000:00:09.0)
Input Value 0
Input Value 1
Input Value 2
Input Value 3
Input Value 4
Input Value 5
Input Value 6
Input Value 7
Input Value 8
Input Value 9
Input Queue Size= 10
Results Queue Size= 0
Processing: Input Queue Size= 8
Processing: Results Queue Size= 2
Processing: Input Queue Size= 6
Processing: Results Queue Size= 4
Processing: Input Queue Size= 4
Processing: Results Queue Size= 6
Processing: Input Queue Size= 2
Processing: Results Queue Size= 8
Processing: Input Queue Size= 0
Processing: Results Queue Size= 10
************* Dequeue Results ********************
Results Queue Size= 10
Dequeue Results: Results Output 1:
Dequeue Results: Results Queue Size= 9
Dequeue Results: Results Output 2:
Dequeue Results: Results Queue Size= 8
Dequeue Results: Results Output 3:
Dequeue Results: Results Queue Size= 7
Dequeue Results: Results Output 4:
Dequeue Results: Results Queue Size= 6
Dequeue Results: Results Output 6:
Dequeue Results: Results Queue Size= 5
Dequeue Results: Results Output 5:
Dequeue Results: Results Queue Size= 4
Dequeue Results: Results Output 7:
Dequeue Results: Results Queue Size= 3
Dequeue Results: Results Output 8:
Dequeue Results: Results Queue Size= 2
Dequeue Results: Results Output 9:
Dequeue Results: Results Queue Size= 1
Dequeue Results: Results Output 10:
Dequeue Results: Results Queue Size= 0
Final Input Queue Size= 0
Final Results Queue Size= 0


Wrong Output (TF v1.0.0) -- 2 GPUs
python InferenceTest.py 2

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
**************** Using 2 GPUs *****************
TensorFlow version: 1.0.0
GPU ID List: [0, 1]
Device: /gpu:0
Device: /gpu:1
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:05.0
Total memory: 11.21GiB
Free memory: 11.09GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x1999d60
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:09.0
Total memory: 11.21GiB
Free memory: 11.09GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:05.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla M40, pci bus id: 0000:00:09.0)
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:1 for node 'Tower_1/Dequeue_Input_Data/fifo_queue_Dequeue' because the input edge from 'Input_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:1 for node 'Tower_1/Results_Enqueue/fifo_queue_enqueue' because the input edge from 'Results_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:0 for node 'Tower_0/Dequeue_Input_Data/fifo_queue_Dequeue' because the input edge from 'Input_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:0 for node 'Tower_0/Results_Enqueue/fifo_queue_enqueue' because the input edge from 'Results_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
Input Value 0
Input Value 1
Input Value 2
Input Value 3
Input Value 4
Input Value 5
Input Value 6
Input Value 7
Input Value 8
Input Value 9
Input Queue Size= 10
Results Queue Size= 0
Processing: Input Queue Size= 9
Processing: Results Queue Size= 2
Processing: Input Queue Size= 8
Processing: Results Queue Size= 4
Processing: Input Queue Size= 7
Processing: Results Queue Size= 6
Processing: Input Queue Size= 6
Processing: Results Queue Size= 8
Processing: Input Queue Size= 5
Processing: Results Queue Size= 10
Processing: Input Queue Size= 4
Processing: Results Queue Size= 12
Processing: Input Queue Size= 3
Processing: Results Queue Size= 14
Processing: Input Queue Size= 2
Processing: Results Queue Size= 16
Processing: Input Queue Size= 1
Processing: Results Queue Size= 18
Processing: Input Queue Size= 0
Processing: Results Queue Size= 20
************* Dequeue Results ********************
Results Queue Size= 20
Dequeue Results: Results Output 1:
Dequeue Results: Results Queue Size= 19
Dequeue Results: Results Output 1:
Dequeue Results: Results Queue Size= 18
Dequeue Results: Results Output 2:
Dequeue Results: Results Queue Size= 17
Dequeue Results: Results Output 2:
Dequeue Results: Results Queue Size= 16
Dequeue Results: Results Output 3:
Dequeue Results: Results Queue Size= 15
Dequeue Results: Results Output 3:
Dequeue Results: Results Queue Size= 14
Dequeue Results: Results Output 4:
Dequeue Results: Results Queue Size= 13
Dequeue Results: Results Output 4:
Dequeue Results: Results Queue Size= 12
Dequeue Results: Results Output 5:
Dequeue Results: Results Queue Size= 11
Dequeue Results: Results Output 5:
Dequeue Results: Results Queue Size= 10
Dequeue Results: Results Output 6:
Dequeue Results: Results Queue Size= 9
Dequeue Results: Results Output 6:
Dequeue Results: Results Queue Size= 8
Dequeue Results: Results Output 7:
Dequeue Results: Results Queue Size= 7
Dequeue Results: Results Output 7:
Dequeue Results: Results Queue Size= 6
Dequeue Results: Results Output 8:
Dequeue Results: Results Queue Size= 5
Dequeue Results: Results Output 8:
Dequeue Results: Results Queue Size= 4
Dequeue Results: Results Output 9:
Dequeue Results: Results Queue Size= 3
Dequeue Results: Results Output 9:
Dequeue Results: Results Queue Size= 2
Dequeue Results: Results Output 10:
Dequeue Results: Results Queue Size= 1
Dequeue Results: Results Output 10:
Dequeue Results: Results Queue Size= 0
Final Input Queue Size= 0
Final Results Queue Size= 0

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

None

Environment info

Operating System: CentOS 7.2.1511

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
cudaversion

If installed from binary pip package, provide:

  1. A link to the pip package you installed: https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.0-cp35-cp35m-linux_x86_64.whl
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)".

See the output above for 2 different tensor flow versions

If installed from source, provide

  1. The commit hash (git rev-parse HEAD)
  2. The output of bazel version

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

What other attempted solutions have you tried?

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions