Some multi-GPU unit tests hang when running in a different Docker environment #5963

@hcho3

#5873 (comment)
Log from 4-process setup:

pid = 21916, Device = 0
pid = 21919, Device = 0
pid = 21921, Device = 0
pid = 21924, Device = 0

All four processes report Device = 0, even though this setup should have been using 4 GPUs. See #5963 (comment).
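A minimal sketch for spotting this condition automatically, assuming the workers print lines in exactly the `pid = N, Device = D` format shown in the log above. The helper name `shared_devices` is hypothetical and is not part of XGBoost or its test suite.

```python
import re
from collections import defaultdict

# Matches lines in the format from the log above, e.g. "pid = 21916, Device = 0".
# The format itself is an assumption taken from that log excerpt.
LINE_RE = re.compile(r"pid\s*=\s*(\d+),\s*Device\s*=\s*(\d+)")


def shared_devices(log_text):
    """Return {device: [pids]} for every device claimed by more than one process."""
    by_device = defaultdict(list)
    for match in LINE_RE.finditer(log_text):
        pid, device = int(match.group(1)), int(match.group(2))
        by_device[device].append(pid)
    # Keep only devices that more than one process reported.
    return {dev: pids for dev, pids in by_device.items() if len(pids) > 1}


log = """\
pid = 21916, Device = 0
pid = 21919, Device = 0
pid = 21921, Device = 0
pid = 21924, Device = 0
"""

print(shared_devices(log))  # {0: [21916, 21919, 21921, 21924]}
```

An empty result means every process reported a distinct device. A non-empty result, as here, means several workers landed on the same GPU.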

The undefined behavior occurs in the following tests:

  • tests/python-gpu/test_gpu_with_dask.py::TestDistributedGPU::test_dask_array
  • tests/distributed/runtests-gpu.sh

The behavior is "undefined" in the sense that using a different Docker container causes the tests to fail, even though they were succeeding previously.

Passing (current CI setup):

tests/ci_build/ci_build.sh gpu_build_centos6 docker --build-arg CUDA_VERSION=10.0 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu

Failing (the test just hangs; only the build container and its CUDA_VERSION differ from the passing setup):

tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/build_via_cmake.sh  -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu
