#5873 (comment)
Log from 4-process setup:
pid = 21916, Device = 0
pid = 21919, Device = 0
pid = 21921, Device = 0
pid = 21924, Device = 0
All four processes report Device = 0, even though this setup should have been using 4 GPUs. See #5963 (comment).
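A quick way to spot this condition in a log is to group the reported pids by device ordinal. The following is only a minimal sketch: it assumes the log lines have exactly the `pid = N, Device = M` shape shown above, and the helper names `device_usage` and `shared_devices` are hypothetical, not part of XGBoost or its test suite.

```python
import re

# Sample log copied from the 4-process run above.
LOG = """\
pid = 21916, Device = 0
pid = 21919, Device = 0
pid = 21921, Device = 0
pid = 21924, Device = 0
"""

# Assumed line format: "pid = <int>, Device = <int>".
LINE_RE = re.compile(r"pid\s*=\s*(\d+),\s*Device\s*=\s*(\d+)")


def device_usage(log_text):
    """Map each device ordinal to the list of pids that reported it."""
    usage = {}
    for pid, device in LINE_RE.findall(log_text):
        usage.setdefault(int(device), []).append(int(pid))
    return usage


def shared_devices(log_text):
    """Return only the devices that more than one process reported."""
    return {dev: pids for dev, pids in device_usage(log_text).items()
            if len(pids) > 1}


if __name__ == "__main__":
    print(device_usage(LOG))
    print(shared_devices(LOG))
```

For the log above, `shared_devices` returns a single entry for device 0 listing all four pids; a run where each process picked its own GPU would yield an empty dict.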
The undefined behavior exists in the following tests:
tests/python-gpu/test_gpu_with_dask.py::TestDistributedGPU::test_dask_array
tests/distributed/runtests-gpu.sh
The behavior is "undefined" in the sense that using a different Docker container causes the tests to fail, even though they were succeeding previously.
Passing (current CI setup):
tests/ci_build/ci_build.sh gpu_build_centos6 docker --build-arg CUDA_VERSION=10.0 \
tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON \
-DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
tests/ci_build/test_python.sh mgpu
Failing (the test just hangs):
tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON \
-DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
tests/ci_build/test_python.sh mgpu