fix random generator: do not gen seed each time #9119
Conversation
@Javelinjs I've uploaded the script here https://gist.github.com/sxjscience/453605a1ea3102bc0010f9fb16df8238. Currently we should rely on the results of the "Chi Square test" and the "mean test", as the "var test" needs far more samples. The "chi square test" performs relatively the best.
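For reference, a rough C++ sketch of the kind of chi-square goodness-of-fit check described above; the linked gist is a Python script, so the bin count, sample count, and 5% critical value below are illustrative assumptions rather than its actual parameters.

```cpp
// Bin uniform samples, compute the chi-square statistic against a flat
// expectation, and compare it with the 5% critical value for kBins-1 dof.
#include <algorithm>
#include <array>
#include <cstdio>
#include <random>

int main() {
  const int kBins = 10, kSamples = 100000;
  std::array<int, kBins> observed{};
  std::mt19937 gen(0);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  for (int i = 0; i < kSamples; ++i)
    ++observed[std::min(kBins - 1, static_cast<int>(dist(gen) * kBins))];

  const double expected = static_cast<double>(kSamples) / kBins;
  double chi2 = 0.0;
  for (int b = 0; b < kBins; ++b)
    chi2 += (observed[b] - expected) * (observed[b] - expected) / expected;

  // ~16.92 is the 5% critical value for 9 degrees of freedom.
  std::printf("chi2 = %.3f (%s)\n", chi2, chi2 < 16.92 ? "pass" : "fail");
  return 0;
}
```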
src/common/random_generator.cc
Outdated
}

template<>
RandGenerator<cpu, float> *NewRandGenerator<cpu, float>() {
Return a value so you don't need DeleteRandGenerator.
But for NewRandGenerator<gpu>, it is allocated on the device.
You should return a class that has an internal pointer to the device.
@Javelinjs The updated gist here https://gist.github.com/sxjscience/453605a1ea3102bc0010f9fb16df8238 tests all the available random ops in MXNet: normal, uniform, gamma, exponential, poisson, negative_binomial, generalized_negative_binomial, multinomial
src/common/random_generator.h
Outdated
const int kGPURndStateNum = 32768;

// uniform number generation in Cuda made consistent with stl (include 0 but exclude 1)
// by using 1.0-curand_uniform(). Needed as some samplers below won't be able to deal with
Comment taken from prior code. A bit misleading as it references "samplers below" (but in this file, there are none).
src/operator/mxnet_op.h
Outdated
inline static void LaunchNativeRandomGenerator(mshadow::Stream<cpu> *,
    common::random::RandGenerator<cpu, GType> *rnd,
    const int N, Args... args) {
  // do not use openmp since it does not guarantee the output order.
It's not clear to me what "does not guarantee the output order" means. I really think we should support openmp here, as sampling on CPU is equally important to sampling on GPU and we should not leave a potential speedup of 4-8x on the table. Sampling on CPU is really slow anyway.
Wouldn't it be natural to use the exact same design pattern as for the GPU case, i.e. a set of preallocated samplers as a GlobalSampler and then assign them to the different threads?
I mean, assume the actual sequence of std::mt19937 is 0.1, 0.2, 0.3, 0.4; when it is generated for arr[4] with openmp, this could become arr = {0.2, 0.1, 0.4, 0.3}.
Since std::mt19937 is thread-safe, I didn't preallocate it for multiple threads. But you're right, the same design could be adopted.
yes I think we should use openmp.
Adopting the same design would solve this ordering issue. And we don't need to pre-allocate thousands of CPU-samplers, 256 would certainly be enough.
Nice catch and implementation. I was thinking about the same pattern (using a sufficiently large pool of pre-allocated random generators) as well in the initial implementation, but wasn't sure about blocking that much memory for the entire runtime. But in fact, it is the far better solution, not only in terms of sampling accuracy.
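To make the pattern concrete, here is a minimal CPU-side sketch of the pool-of-preallocated-generators idea discussed in this thread; the function names, the 256-state pool, and the chunking scheme are assumptions for illustration, not the eventual MXNet implementation.

```cpp
// Each OpenMP thread owns one generator from a fixed pool and fills a fixed,
// deterministic chunk of the output, so the result depends only on the pool
// seeds and the thread count, not on thread scheduling.
#include <omp.h>
#include <algorithm>
#include <random>
#include <vector>

void FillUniform(std::vector<std::mt19937> *pool, float *out, int N) {
  const int nthreads = std::min(static_cast<int>(pool->size()),
                                omp_get_max_threads());
  const int chunk = (N + nthreads - 1) / nthreads;  // fixed ownership per state
  #pragma omp parallel for num_threads(nthreads)
  for (int t = 0; t < nthreads; ++t) {
    std::mt19937 &gen = (*pool)[t];                 // one state per thread
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    const int begin = t * chunk, end = std::min(N, begin + chunk);
    for (int i = begin; i < end; ++i) out[i] = dist(gen);
  }
}

int main() {
  std::vector<std::mt19937> pool;
  for (int i = 0; i < 256; ++i) pool.emplace_back(42 + i);  // e.g. 256 states
  std::vector<float> out(10000);
  FillUniform(&pool, out.data(), static_cast<int>(out.size()));
  return 0;
}
```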
@asmushetzel I'll merge the test in #9129
fix lint fix lint fix typo fix docstring fix docstring
include/mxnet/resource.h
Outdated
kTempSpace
kTempSpace,
/*! \brief common::RandGenerator<xpu> object, which can be used in GPU kernel functions */
kNativeRandom
Why use a new enum? Can this be merged with kRandom?
kRandom returns mshadow::Random, whose behavior is different from the new one.
I think it should be called kParallelRandom
+1 (kParallelRandom would be a name that expresses what it is good for). Same for all functions that have "native" in their name.
src/common/random_generator.cc
Outdated
}

template<>
RandGenerator<cpu, float> *NewRandGenerator<cpu, float>() {
You should return a class that has an internal pointer to the device.
src/common/random_generator.h
Outdated
// (non-thread-safe) random generator stores global states,
// always use mxnet_op::LaunchNativeRandomGenerator for launching a multi-threaded kernel.
template<typename DType>
class RandGeneratorGlobal<gpu, DType> : public RandGenerator<gpu, DType> {
why do you need this?
src/operator/mxnet_op.h
Outdated
inline static void LaunchNativeRandomGenerator(mshadow::Stream<cpu> *,
    common::random::RandGenerator<cpu, GType> *rnd,
    const int N, Args... args) {
  // do not use openmp since it does not guarantee the output order.
yes I think we should use openmp.
src/operator/mxnet_op.h
Outdated
* \param args Varargs to eventually pass to the OP::Map() functoion | ||
*/ | ||
template<typename GType, typename ...Args> | ||
inline static void LaunchNativeRandomGenerator(mshadow::Stream<cpu> *, |
LaunchRNG
src/operator/mxnet_op.h
Outdated
    const int N, Args... args) {
  using namespace mshadow::cuda;
  const int nloop(1 + (N - 1) / common::random::kGPUMinRndNumberPerThread);
  int ngrid = std::min(common::random::kGPURndStateNum / kBaseThreadNum,
common::random::kGPURndStateNum / kBaseThreadNum could be 0
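A small host-side illustration of the concern (constants are hypothetical): when the state pool is smaller than one thread block, the integer division yields zero grid blocks, so a guard such as std::max(1, ...) is needed.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  const int kGPURndStateNum = 256;  // assumed pool size, smaller than a block
  const int kBaseThreadNum = 512;   // assumed threads per block
  const int unguarded = kGPURndStateNum / kBaseThreadNum;             // 0: nothing launched
  const int guarded = std::max(1, kGPURndStateNum / kBaseThreadNum);  // 1: at least one block
  std::printf("unguarded ngrid = %d, guarded ngrid = %d\n", unguarded, guarded);
  return 0;
}
```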
src/operator/mxnet_op.h
Outdated
for (int i = id * kGPUMinRndNumberPerThread;
     i < N;
     i += nthread * kGPUMinRndNumberPerThread) {
  for (int j = 0; j < kGPUMinRndNumberPerThread && i + j < N; ++j) {
These two loops look weird. Are you sure it should be i<N?
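One way to settle this is to simulate the chunked grid-stride indexing on the host and check that every element is covered exactly once; the constants below are assumptions for illustration.

```cpp
// Replays the kernel's index pattern for every simulated thread id and
// asserts that indices 0..N-1 are each written exactly once; with these
// parameters the i < N bound (where i is the chunk start) holds up.
#include <cassert>
#include <vector>

int main() {
  const int kGPUMinRndNumberPerThread = 64;  // assumed chunk size
  const int nthread = 1024;                  // assumed total GPU threads
  const int N = 100000;                      // assumed output length
  std::vector<int> hits(N, 0);
  for (int id = 0; id < nthread; ++id) {
    for (int i = id * kGPUMinRndNumberPerThread; i < N;
         i += nthread * kGPUMinRndNumberPerThread) {
      for (int j = 0; j < kGPUMinRndNumberPerThread && i + j < N; ++j) {
        ++hits[i + j];
      }
    }
  }
  for (int k = 0; k < N; ++k) assert(hits[k] == 1);
  return 0;
}
```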
src/common/random_generator.h
Outdated
class RandGenerator;

template<typename Device, typename DType MSHADOW_DEFAULT_DTYPE>
class RandGeneratorGlobal;
Looks like this should be the internal implementation of RandGenerator. It doesn't need to be a top level public class
I don't think we need LaunchRNG. Why not Launch with N = rnd->size()?
@piiswrong By using
I think
I have refactor
Adding a LaunchXX interface complicates the design.
src/common/random_generator.h
Outdated

#if MXNET_USE_CUDA

// at least how many random numbers should be generated by one GPU thread.
why do we need this?
a number of contiguous random numbers should be generated from one state
src/common/random_generator.h
Outdated
// at least how many random numbers should be generated by one CPU thread.
const int kCPUMinRndNumberPerThread = 64;
// store how many global random states for CPU.
const int kCPURndStateNum = 1024;
These should be RandGenerator<cpu, DType>::kNumRandomStates
changed.
src/common/random_generator.h
Outdated

// Free the allocated GPU memory.
// For global singleton,
// calling this in destructor may cause undefined behavior.
why?
fixed
src/common/random_generator.h
Outdated
// Will use float data type whenever instantiated for half_t or any other non
// standard real type.
template<typename Device, typename DType MSHADOW_DEFAULT_DTYPE>
class RandGeneratorImpl;
Why do you need RandGenerator and RandGeneratorImpl? Why not just one?
src/common/random_generator.h
Outdated
 public:
  // Copy state to local memory for efficiency.
  __device__ explicit RandGeneratorImpl(curandStatePhilox4_32_10_t *state)
      : state_(*state) {}
So you are copying state to state_ by value. Then wouldn't the next call of the same random operator give you the same results?
I haven't tested the case of multiple runs of the same generator. I should add that.
Two choices:
- Follow the exact same patterns in CPU and GPU, i.e. copy by value but then also ensure to save the state again back in the RandGenerator at the end of LaunchRNG (which isn't the case currently)
- Copy state by reference in the CPU case
I would prefer a consistent handling for both cases (i.e. first version).
The implementation here was correct. It did save the state back in LaunchRNG.
I have now refactored it to a more readable version.
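A minimal CPU-side sketch of the copy-then-save-back pattern under discussion, with std::mt19937 standing in for the curand state; the types and names are hypothetical, not the MXNet code.

```cpp
// Copy a pooled state into a local generator for speed, draw the samples,
// then write the advanced state back so the next launch continues the
// sequence instead of replaying it.
#include <cstdint>
#include <random>
#include <vector>

struct GeneratorPool {
  std::vector<std::mt19937> states;  // pre-allocated global states
  explicit GeneratorPool(size_t n, uint32_t seed) {
    for (size_t i = 0; i < n; ++i) states.emplace_back(seed + static_cast<uint32_t>(i));
  }
};

void SampleUniform(GeneratorPool *pool, size_t state_idx, float *out, size_t count) {
  std::mt19937 local = pool->states[state_idx];  // copy state by value
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  for (size_t i = 0; i < count; ++i) out[i] = dist(local);
  pool->states[state_idx] = local;               // save the advanced state back
}

int main() {
  GeneratorPool pool(4, 42);
  std::vector<float> a(8), b(8);
  SampleUniform(&pool, 0, a.data(), a.size());
  SampleUniform(&pool, 0, b.data(), b.size());   // continues, does not repeat a
  return 0;
}
```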
src/common/random_generator.h
Outdated
RandGenerator() {
  cudaError_t e = cudaMalloc(&states_, kGPURndStateNum * sizeof(curandStatePhilox4_32_10_t));
  if (e != cudaSuccess && e != cudaErrorCudartUnloading) {
    throw std::bad_alloc();
Why not use the existing macros for interpreting cuda errors (i.e. CUDA_CALL(cudaMalloc(...)))? That would also tell the user that something went wrong on the device, while throwing std::bad_alloc is misleading as it refers to a memory allocation on the host.
fixed.
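For illustration, a sketch of the macro-based check being suggested; CUDA_CHECK here is a hypothetical stand-in for MXNet's own CUDA_CALL, and the allocation size is arbitrary. It surfaces the device-side error string instead of a misleading host-side bad_alloc.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                         \
  do {                                                           \
    cudaError_t e = (call);                                      \
    if (e != cudaSuccess) {                                      \
      std::fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                   cudaGetErrorString(e), __FILE__, __LINE__);   \
      std::abort();                                              \
    }                                                            \
  } while (0)

int main() {
  void *states = nullptr;
  CUDA_CHECK(cudaMalloc(&states, 32768 * 64));  // sizes are illustrative
  CUDA_CHECK(cudaFree(states));
  return 0;
}
```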
src/operator/mxnet_op.h
Outdated
#ifdef _OPENMP
const int omp_threads = std::min(kCPURndStateNum,
    engine::OpenMP::Get()->GetRecommendedOMPThreadCount());
if (omp_threads < 2) {
I don't think you need special code for omp_threads < 2. The general loop below will work there as well and not create any overhead either.
refactored
This was intentional. OpenMP disables some compiler optimizations, so an omp loop with 1 thread is slower than a simple loop without omp.
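A short sketch of the branch being described (the kernel and launcher names are hypothetical): fall back to a plain loop when only one thread is recommended, and take the OpenMP path otherwise.

```cpp
#include <omp.h>
#include <algorithm>

struct SetIndex {  // trivial example op
  static void Map(int i, float *out) { out[i] = static_cast<float>(i); }
};

template <typename OP>
void LaunchCPU(int nthreads, int N, float *out) {
  if (nthreads < 2) {
    // Plain loop: avoids the optimization penalty of a 1-thread omp region.
    for (int i = 0; i < N; ++i) OP::Map(i, out);
  } else {
    #pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < N; ++i) OP::Map(i, out);
  }
}

int main() {
  float buf[1000];
  LaunchCPU<SetIndex>(std::min(4, omp_get_max_threads()), 1000, buf);
  return 0;
}
```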
src/operator/mxnet_op.h
Outdated
    common::random::RandGenerator<cpu, GType> *rnd,
    const int N, Args... args) {
  using namespace mxnet::common::random;
#ifdef _OPENMP
I don't think you need that ifdef. Even if we compile without openmp support, the functions GetRecommendedOMPThreadCount() and omp_get_thread_num() should be replaced by appropriate stubs, so your code does not need any special handling (see dmlc/omp.h and engine/openmp.cc).
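A simplified sketch of the stub approach being referenced (dmlc/omp.h is where this actually lives in MXNet): when OpenMP is not compiled in, fallbacks keep the calling code free of #ifdefs.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
inline int omp_get_thread_num() { return 0; }    // single-thread fallback
inline int omp_get_max_threads() { return 1; }
#endif

int main() {
  // The caller does not care whether OpenMP support was compiled in.
  const int tid = omp_get_thread_num();
  const int nthreads = omp_get_max_threads();
  return (tid < nthreads) ? 0 : 1;
}
```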
python/mxnet/test_utils.py
Outdated
@@ -34,6 +34,7 @@
import numpy as np
import numpy.testing as npt
import numpy.random as rnd
import scipy.stats as ss
we don't depend on scipy
How should I revise this? Move it to be inside the functions?
see mx.nd.sparse
@Javelinjs @piiswrong I've added one commit to solve the problem. I also added tests for the case in which the generator is triggered multiple times. 199fabd
merged changes from #9129
* add tests for distribution generators fix lint fix lint fix typo fix docstring fix docstring
* [Bugfix] fix random generator: do not gen seed each time
* gen samplers on gpu for test_softmax
* fix test cases
* remove unnecessary prints
* refactor RandGenerator
* get_native_random -> get_parallel_random
* revise test cases + remove dependency of scipy
* raise warning
Description
The current implementation of the random sampler (#8179) raises a new (random) seed every time it starts to generate a random number, which is incorrect. @asmushetzel
Although std::mt19937 seems to work with this approach, samplers with curand generate low-quality randomness and probably make the generated numbers correlated. Firstly, I noticed that training with SGLD collapses to low accuracy (#8958). Secondly, @sxjscience has written new test cases for the random samplers, using mean/var/chi-square tests, and none of them pass with the current implementation. According to the Nvidia documentation for the kernel random API, we need to maintain global random states and reuse them. Moreover, the random state is not thread-safe.
The implementation here maintains a fixed number of global random states, which can be accessed through Resource. In case it is accessed by multiple GPU streams, (by default) 4 independent GPU generators are created in the global Resource. I tested example/bayesian-methods and it now converges to reasonable results. It should pass @sxjscience's new test cases as well. The memory usage and speed barely change.
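To make the bug concrete, a minimal CPU-side illustration (not MXNet code; names are hypothetical) contrasting reseeding on every draw with keeping one persistent state:

```cpp
#include <random>
#include <vector>

// Buggy pattern: a fresh engine with a fresh seed for every draw, so the
// output quality is limited by (and correlated with) the seed stream.
std::vector<float> sample_reseed_each_time(std::mt19937 *seeder, int n) {
  std::vector<float> out(n);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  for (int i = 0; i < n; ++i) {
    std::mt19937 gen((*seeder)());  // new seed every time
    out[i] = dist(gen);
  }
  return out;
}

// Fixed pattern: one persistent, reused state that walks its full sequence.
std::vector<float> sample_persistent(std::mt19937 *gen, int n) {
  std::vector<float> out(n);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  for (int i = 0; i < n; ++i) out[i] = dist(*gen);
  return out;
}

int main() {
  std::mt19937 seeder(0), gen(0);
  auto a = sample_reseed_each_time(&seeder, 5);
  auto b = sample_persistent(&gen, 5);
  return (a.size() == b.size()) ? 0 : 1;
}
```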
Checklist
Essentials
Passed code style checking (make lint)
Changes
SGLD optimizer arguments.