-
Notifications
You must be signed in to change notification settings - Fork 74.8k
Description
Here is an example multi-gpu inference code which produces incorrect Output in TF v1.0.0 but produces correct output in v0.11.0 (single GPU produces correct output in both versions). Please refer to the attached code for more details.
InferenceTest.txt
Inference.txt
Correct Output (TF v0.11) -- 2 GPUs
python InferenceTest.py 2
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
**************** Using 2 GPUs *****************
TensorFlow version: 0.11.0
GPU ID List: [0, 1]
Device: /gpu:0
Device: /gpu:1
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:05.0
Total memory: 11.21GiB
Free memory: 11.09GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x37b9dc0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 1 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:09.0
Total memory: 11.21GiB
Free memory: 11.09GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1: N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:05.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla M40, pci bus id: 0000:00:09.0)
Input Value 0
Input Value 1
Input Value 2
Input Value 3
Input Value 4
Input Value 5
Input Value 6
Input Value 7
Input Value 8
Input Value 9
Input Queue Size= 10
Results Queue Size= 0
Processing: Input Queue Size= 8
Processing: Results Queue Size= 2
Processing: Input Queue Size= 6
Processing: Results Queue Size= 4
Processing: Input Queue Size= 4
Processing: Results Queue Size= 6
Processing: Input Queue Size= 2
Processing: Results Queue Size= 8
Processing: Input Queue Size= 0
Processing: Results Queue Size= 10
************* Dequeue Results ********************
Results Queue Size= 10
Dequeue Results: Results Output 1:
Dequeue Results: Results Queue Size= 9
Dequeue Results: Results Output 2:
Dequeue Results: Results Queue Size= 8
Dequeue Results: Results Output 3:
Dequeue Results: Results Queue Size= 7
Dequeue Results: Results Output 4:
Dequeue Results: Results Queue Size= 6
Dequeue Results: Results Output 6:
Dequeue Results: Results Queue Size= 5
Dequeue Results: Results Output 5:
Dequeue Results: Results Queue Size= 4
Dequeue Results: Results Output 7:
Dequeue Results: Results Queue Size= 3
Dequeue Results: Results Output 8:
Dequeue Results: Results Queue Size= 2
Dequeue Results: Results Output 9:
Dequeue Results: Results Queue Size= 1
Dequeue Results: Results Output 10:
Dequeue Results: Results Queue Size= 0
Final Input Queue Size= 0
Final Results Queue Size= 0
Wrong Output (TF v1.0.0) -- 2 GPUs
python InferenceTest.py 2
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
**************** Using 2 GPUs *****************
TensorFlow version: 1.0.0
GPU ID List: [0, 1]
Device: /gpu:0
Device: /gpu:1
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:05.0
Total memory: 11.21GiB
Free memory: 11.09GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x1999d60
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:09.0
Total memory: 11.21GiB
Free memory: 11.09GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:05.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla M40, pci bus id: 0000:00:09.0)
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:1 for node 'Tower_1/Dequeue_Input_Data/fifo_queue_Dequeue' because the input edge from 'Input_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:1 for node 'Tower_1/Results_Enqueue/fifo_queue_enqueue' because the input edge from 'Results_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:0 for node 'Tower_0/Dequeue_Input_Data/fifo_queue_Dequeue' because the input edge from 'Input_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:0 for node 'Tower_0/Results_Enqueue/fifo_queue_enqueue' because the input edge from 'Results_FIFO_Queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0
Input Value 0
Input Value 1
Input Value 2
Input Value 3
Input Value 4
Input Value 5
Input Value 6
Input Value 7
Input Value 8
Input Value 9
Input Queue Size= 10
Results Queue Size= 0
Processing: Input Queue Size= 9
Processing: Results Queue Size= 2
Processing: Input Queue Size= 8
Processing: Results Queue Size= 4
Processing: Input Queue Size= 7
Processing: Results Queue Size= 6
Processing: Input Queue Size= 6
Processing: Results Queue Size= 8
Processing: Input Queue Size= 5
Processing: Results Queue Size= 10
Processing: Input Queue Size= 4
Processing: Results Queue Size= 12
Processing: Input Queue Size= 3
Processing: Results Queue Size= 14
Processing: Input Queue Size= 2
Processing: Results Queue Size= 16
Processing: Input Queue Size= 1
Processing: Results Queue Size= 18
Processing: Input Queue Size= 0
Processing: Results Queue Size= 20
************* Dequeue Results ********************
Results Queue Size= 20
Dequeue Results: Results Output 1:
Dequeue Results: Results Queue Size= 19
Dequeue Results: Results Output 1:
Dequeue Results: Results Queue Size= 18
Dequeue Results: Results Output 2:
Dequeue Results: Results Queue Size= 17
Dequeue Results: Results Output 2:
Dequeue Results: Results Queue Size= 16
Dequeue Results: Results Output 3:
Dequeue Results: Results Queue Size= 15
Dequeue Results: Results Output 3:
Dequeue Results: Results Queue Size= 14
Dequeue Results: Results Output 4:
Dequeue Results: Results Queue Size= 13
Dequeue Results: Results Output 4:
Dequeue Results: Results Queue Size= 12
Dequeue Results: Results Output 5:
Dequeue Results: Results Queue Size= 11
Dequeue Results: Results Output 5:
Dequeue Results: Results Queue Size= 10
Dequeue Results: Results Output 6:
Dequeue Results: Results Queue Size= 9
Dequeue Results: Results Output 6:
Dequeue Results: Results Queue Size= 8
Dequeue Results: Results Output 7:
Dequeue Results: Results Queue Size= 7
Dequeue Results: Results Output 7:
Dequeue Results: Results Queue Size= 6
Dequeue Results: Results Output 8:
Dequeue Results: Results Queue Size= 5
Dequeue Results: Results Output 8:
Dequeue Results: Results Queue Size= 4
Dequeue Results: Results Output 9:
Dequeue Results: Results Queue Size= 3
Dequeue Results: Results Output 9:
Dequeue Results: Results Queue Size= 2
Dequeue Results: Results Output 10:
Dequeue Results: Results Queue Size= 1
Dequeue Results: Results Output 10:
Dequeue Results: Results Queue Size= 0
Final Input Queue Size= 0
Final Results Queue Size= 0
What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
None
Environment info
Operating System: CentOS 7.2.1511
Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*
):
If installed from binary pip package, provide:
- A link to the pip package you installed: https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.0-cp35-cp35m-linux_x86_64.whl
- The output from
python -c "import tensorflow; print(tensorflow.__version__)"
.
See the output above for 2 different tensor flow versions
If installed from source, provide
- The commit hash (
git rev-parse HEAD
) - The output of
bazel version
If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)
What other attempted solutions have you tried?
Logs or other output that would be helpful
(If logs are large, please upload as attachment or provide link).