
Conversation

@yzhliu (Member) commented Dec 18, 2017

Description

The current implementation of the random samplers (#8179) draws a new (random) seed every time it starts generating random numbers, which is incorrect. @asmushetzel

Although std::mt19937 seems to tolerate this approach, samplers backed by cuRAND produce low-quality randomness, and the generated numbers are probably correlated.

First, I noticed that training with SGLD collapses to low accuracy (#8958). Second, @sxjscience has written new test cases for the random samplers using mean/variance/chi-square tests, and none of them passes with the current implementation.

According to the NVIDIA documentation for the device-side random API, we need to maintain global random states and reuse them. Moreover, a random state is not thread-safe.

The implementation here maintains a fixed number of global random states that can be accessed through Resource. In case they are accessed by multiple GPU streams, 4 independent GPU generators are created in the global Resource by default.

I tested example/bayesian-methods and it now converges to reasonable results. It should pass @sxjscience's new test cases as well.

Memory usage and speed barely change.
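
To make the design concrete, here is a minimal host-side sketch of the idea (illustrative only, not the MXNet classes): a fixed pool of generator states is seeded once and then reused by every sampling call, so no call ever reseeds.

#include <cstdint>
#include <random>
#include <vector>

// Hypothetical pool: seeded once at startup, then handed out by index.
// Reusing these states is what replaces the old "new seed per call" behavior.
class GlobalRandPool {
 public:
  GlobalRandPool(size_t n, uint32_t seed) : states_(n) {
    std::seed_seq seq{seed};
    std::vector<uint32_t> seeds(n);
    seq.generate(seeds.begin(), seeds.end());
    for (size_t i = 0; i < n; ++i) states_[i].seed(seeds[i]);
  }
  // Each worker grabs its own state; states advance but are never reseeded.
  std::mt19937 &state(size_t i) { return states_[i % states_.size()]; }

 private:
  std::vector<std::mt19937> states_;
};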

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage: Waiting for @sxjscience 's test cases
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Global seeds for GPU & CPU sampler.
  • Fix SGLD optimizer arguments.

@yzhliu yzhliu self-assigned this Dec 18, 2017
@yzhliu yzhliu added the Bug label Dec 18, 2017
@sxjscience (Member):

@Javelinjs I've uploaded the script here: https://gist.github.com/sxjscience/453605a1ea3102bc0010f9fb16df8238. For now we should rely on the results of the chi-square test and the mean test, as the variance test needs far more samples. The chi-square test performs best of the three.

}

template<>
RandGenerator<cpu, float> *NewRandGenerator<cpu, float>() {
Contributor:

Return a value so you don't need DeleteRandGenerator.

Member Author:

But for NewRandGenerator<gpu>, the generator is allocated on the device.

Contributor:

You should return a class that holds an internal pointer to the device allocation.

@sxjscience (Member):

@Javelinjs The updated gist here https://gist.github.com/sxjscience/453605a1ea3102bc0010f9fb16df8238 tests all the available random ops in MXNet: normal, uniform, gamma, exponential, poisson, negative_binomial, generalized_negative_binomial, multinomial

const int kGPURndStateNum = 32768;

// uniform number generation in Cuda made consistent with stl (include 0 but exclude 1)
// by using 1.0-curand_uniform(). Needed as some samplers below won't be able to deal with
Contributor:

This comment was taken from prior code. It is a bit misleading, as it references "samplers below", but there are none in this file.
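
For context, a minimal device-side sketch of the convention the quoted comment describes (hypothetical helper, not MXNet code): curand_uniform returns values in (0, 1], so subtracting from 1.0f maps them to [0, 1), matching the STL convention used on the CPU path.

#include <curand_kernel.h>

// Map curand's (0, 1] output to the STL-style [0, 1) range.
__device__ float UniformStlConvention(curandStatePhilox4_32_10_t *state) {
  return 1.0f - curand_uniform(state);
}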

inline static void LaunchNativeRandomGenerator(mshadow::Stream<cpu> *,
common::random::RandGenerator<cpu, GType> *rnd,
const int N, Args... args) {
// do not use openmp since it does not guarantee the output order.
Contributor:

It is not clear to me what "does not guarantee the output order" means. I really think we should support OpenMP here, as sampling on the CPU is just as important as sampling on the GPU, and we should not leave a potential 4-8x speedup on the table. Sampling on the CPU is really slow anyway.
Wouldn't it be natural to use the exact same design pattern as in the GPU case, i.e. a set of preallocated samplers in a global pool that are then assigned to the different threads?

Member Author:

I mean, assume the actual sequence from std::mt19937 is 0.1, 0.2, 0.3, 0.4; when it is used to fill arr[4] with OpenMP, the result could become arr = {0.2, 0.1, 0.4, 0.3}.

Since std::mt19937 is thread-safe, I didn't preallocate one generator per thread. But you're right, the same design could be adopted.

Contributor:

Yes, I think we should use OpenMP.

Contributor:

Adopting the same design would solve this ordering issue. And we don't need to preallocate thousands of CPU samplers; 256 would certainly be enough.
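
A minimal sketch of the pattern being proposed, assuming a preallocated pool of CPU engines (names are illustrative, not the MXNet API): each OpenMP thread always uses the same engine and always writes the same contiguous chunk, so the output is deterministic regardless of scheduling.

#include <algorithm>
#include <random>
#include <vector>

// Thread t always consumes engine t and writes chunk t, so the result only
// depends on the engine states, not on how OpenMP schedules the threads.
void ParallelUniform(std::vector<std::mt19937> *engines, std::vector<float> *out) {
  const int nthreads = static_cast<int>(engines->size());
  const int n = static_cast<int>(out->size());
  const int chunk = (n + nthreads - 1) / nthreads;
  #pragma omp parallel for num_threads(nthreads)
  for (int t = 0; t < nthreads; ++t) {
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::mt19937 &eng = (*engines)[t];
    for (int i = t * chunk; i < std::min(n, (t + 1) * chunk); ++i) {
      (*out)[i] = dist(eng);
    }
  }
}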

@asmushetzel (Contributor):

Nice catch and implementation. I was thinking about the same pattern (using a sufficiently large pool of pre-allocated random generators) in the initial implementation as well, but wasn't sure about blocking that much memory for the entire runtime. In fact it is the far better solution, and not only in terms of sampling accuracy.
I left some more comments in the code.
One thing that should be added are unit tests in test_random.py that also verify the chi-square statistic. We need to add the test that exhibited the problem with the prior implementation and ensure that from now on we never regress again.

@yzhliu (Member Author) commented Dec 19, 2017

@asmushetzel I'll merge the test in #9129

Commits: fix lint; fix lint; fix typo; fix docstring; fix docstring

kTempSpace
kTempSpace,
/*! \brief common::RandGenerator<xpu> object, which can be used in GPU kernel functions */
kNativeRandom
Contributor:

Why use a new enum? Can this be merged with kRandom?

Member Author:

kRandom returns mshadow::Random, whose behavior is different from the new one.

Contributor:

I think it should be called kParallelRandom

@asmushetzel (Contributor) Dec 23, 2017:

+1 (kParallelRandom would be a name that expresses what it is good for). The same applies to all functions that have "native" in their name.


// (non-thread-safe) random generator stores global states,
// always use mxnet_op::LaunchNativeRandomGenerator for launching a multi-threaded kernel.
template<typename DType>
class RandGeneratorGlobal<gpu, DType> : public RandGenerator<gpu, DType> {
Contributor:

why do you need this?

* \param args Varargs to eventually pass to the OP::Map() functoion
*/
template<typename GType, typename ...Args>
inline static void LaunchNativeRandomGenerator(mshadow::Stream<cpu> *,
Contributor:

LaunchRNG

const int N, Args... args) {
using namespace mshadow::cuda;
const int nloop(1 + (N - 1) / common::random::kGPUMinRndNumberPerThread);
int ngrid = std::min(common::random::kGPURndStateNum / kBaseThreadNum,
Contributor:

common::random::kGPURndStateNum / kBaseThreadNum could be 0
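
A small self-contained sketch of the guard this comment asks for (hypothetical helper, not the MXNet code): clamp the block count so the launch never ends up with zero blocks if the state pool is smaller than one thread block.

#include <algorithm>

// Never return fewer than one block, even if num_states / threads_per_block
// rounds down to zero.
inline int SafeGridSize(int num_states, int threads_per_block, int blocks_needed) {
  const int max_blocks = std::max(1, num_states / threads_per_block);
  return std::min(max_blocks, blocks_needed);
}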

for (int i = id * kGPUMinRndNumberPerThread;
i < N;
i += nthread * kGPUMinRndNumberPerThread) {
for (int j = 0; j < kGPUMinRndNumberPerThread && i + j < N; ++j) {
Contributor:

These two loops look weird. Are you sure it should be i<N?
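
For reference, a host-side simulation of the access pattern in the quoted loops (assumed intent, not the actual kernel): each thread emits a chunk of kGPUMinRndNumberPerThread consecutive values from its own state and then strides by nthread * kGPUMinRndNumberPerThread, so the outer i < N condition terminates the stride walk while the inner i + j < N check handles the tail.

#include <vector>

// Host-side check of the chunked stride pattern: every index in [0, N)
// is written exactly once, which is why the outer bound i < N is correct.
std::vector<int> SimulateChunkedStride(int N, int nthread, int chunk) {
  std::vector<int> hits(N, 0);
  for (int id = 0; id < nthread; ++id) {                  // one pass per GPU thread
    for (int i = id * chunk; i < N; i += nthread * chunk) {
      for (int j = 0; j < chunk && i + j < N; ++j) {
        ++hits[i + j];
      }
    }
  }
  return hits;  // e.g. SimulateChunkedStride(1000, 7, 64) yields all ones
}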

class RandGenerator;

template<typename Device, typename DType MSHADOW_DEFAULT_DTYPE>
class RandGeneratorGlobal;
Contributor:

Looks like this should be the internal implementation of RandGenerator. It doesn't need to be a top level public class

@piiswrong (Contributor):

I don't think we need LaunchRNG. Why not Launch with N = rnd->size()?

@yzhliu (Member Author) commented Dec 20, 2017

@piiswrong By using LaunchRNG I want to hide the underlying implementation from users. Otherwise:

  • Users need to understand that the GPU state is not thread-safe, pick a curand state in Op::Map, and then loop over the array carefully.
  • The implementation of Launch implies that two adjacent array entries are accessed by two adjacent threads in one block. But in my understanding, we should generate as many successive numbers from one state, i.e. one thread, as we can. With Launch, users would have to compute the array element indices themselves, which is error-prone and makes the code awkward (see the sketch after this list).
  • The curand states are allocated in global memory. For efficiency, we copy a state to local memory when launching a kernel (and copy it back to global memory at the end, as suggested in NVIDIA's documentation). Users probably do not want to do this in every Op::Map function.

I think LaunchRNG is a convenient helper function. Users can still use Launch if they want to control everything.
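
A minimal CUDA sketch of the boilerplate that LaunchRNG is meant to hide (hypothetical kernel; the chunked indexing and state write-back follow the description above): load a per-thread state from global memory into a local copy, generate a contiguous chunk of numbers from it, then store the advanced state back.

#include <curand_kernel.h>

// What every Op::Map would otherwise have to repeat: state pick-up, local
// copy, chunked generation, and state write-back.
__global__ void FillUniform(curandStatePhilox4_32_10_t *states, float *out,
                            int N, int chunk) {
  const int id = blockIdx.x * blockDim.x + threadIdx.x;
  const int nthread = gridDim.x * blockDim.x;
  curandStatePhilox4_32_10_t local = states[id];   // copy state to local memory
  for (int i = id * chunk; i < N; i += nthread * chunk) {
    for (int j = 0; j < chunk && i + j < N; ++j) {
      out[i + j] = 1.0f - curand_uniform(&local);  // [0, 1) convention
    }
  }
  states[id] = local;                              // save advanced state back
}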

@yzhliu (Member Author) commented Dec 20, 2017

I have refactored RandGenerator for readability and added OpenMP support on the CPU path. Ping @piiswrong @asmushetzel
I also merged @sxjscience's PR #9129 in.

@piiswrong (Contributor):

Adding a LaunchXX interface complicates the design.
You can add a helper function in the random operators file instead.


#if MXNET_USE_CUDA

// at least how many random numbers should be generated by one GPU thread.
Contributor:

why do we need this?

Member Author:

Because a chunk of contiguous random numbers should be generated from a single state.

// at least how many random numbers should be generated by one CPU thread.
const int kCPUMinRndNumberPerThread = 64;
// store how many global random states for CPU.
const int kCPURndStateNum = 1024;
Contributor:

These should be
RandGenerator<cpu, DType>::kNumRandomStates

Member Author:

changed.


// Free the allocated GPU memory.
// For global singleton,
// calling this in destructor may cause undefined behavior.
Contributor:

why?

Member Author:

fixed

// Will use float data type whenever instantiated for half_t or any other non
// standard real type.
template<typename Device, typename DType MSHADOW_DEFAULT_DTYPE>
class RandGeneratorImpl;
Contributor:

Why do you need RandGenerator and RandGeneratorImpl? Why not just one?

public:
// Copy state to local memory for efficiency.
__device__ explicit RandGeneratorImpl(curandStatePhilox4_32_10_t *state)
: state_(*state) {}
Contributor:

So you are copying state to state_ by value. Then wouldn't the next call of the same random operator give you the same results?

@sxjscience (Member) Dec 22, 2017:

I haven't tested the case of multiple runs of the same generator. I should add that.

Contributor:

Two choices:

  • Follow the exact same pattern on CPU and GPU, i.e. copy the state by value but then also save it back into the RandGenerator at the end of LaunchRNG (which isn't the case currently).
  • Copy the state by reference in the CPU case.

I would prefer consistent handling for both cases (i.e. the first option).

@yzhliu (Member Author) Dec 27, 2017:

The implementation here was correct: it did save the state back in LaunchRNG.
I have now refactored it into a more readable version.

RandGenerator() {
cudaError_t e = cudaMalloc(&states_, kGPURndStateNum * sizeof(curandStatePhilox4_32_10_t));
if (e != cudaSuccess && e != cudaErrorCudartUnloading) {
throw std::bad_alloc();
Contributor:

Why not use the existing macros for interpreting CUDA errors (i.e. CUDA_CALL(cudaMalloc(.....)))? That would also tell the user that something went wrong on the device, whereas throwing std::bad_alloc is misleading, as it suggests a memory allocation failure on the host.

Member Author:

fixed.
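
For reference, a sketch of the suggested pattern, assuming MXNet's CUDA_CALL macro (the include path below is an assumption): the macro checks the returned cudaError_t and reports the CUDA error string instead of a host-side std::bad_alloc.

#include <curand_kernel.h>

#include "../common/cuda_utils.h"  // assumed location of CUDA_CALL

// Allocate the state pool and let CUDA_CALL surface any device-side error.
curandStatePhilox4_32_10_t *AllocStatePool(int num_states) {
  curandStatePhilox4_32_10_t *states = nullptr;
  CUDA_CALL(cudaMalloc(&states, num_states * sizeof(curandStatePhilox4_32_10_t)));
  return states;
}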

#ifdef _OPENMP
const int omp_threads = std::min(kCPURndStateNum,
engine::OpenMP::Get()->GetRecommendedOMPThreadCount());
if (omp_threads < 2) {
Contributor:

I don't think you need special code for omp_threads < 2. The general loop below would work there as well and would not create any overhead either.

Member Author:

refactored

Contributor:

This was intentional.
OpenMP disables some compiler optimizations, so an OpenMP loop with one thread is slower than a plain loop without OpenMP.
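
A minimal sketch of the guard being defended here (illustrative only): fall back to a plain serial loop when only one thread would be used, so the compiler can optimize it without the OpenMP machinery.

#include <vector>

// With fewer than two threads, skip OpenMP entirely; a single-threaded
// omp loop can be slower than the equivalent plain loop.
void ScaleInPlace(std::vector<float> *v, float factor, int omp_threads) {
  const int n = static_cast<int>(v->size());
  if (omp_threads < 2) {
    for (int i = 0; i < n; ++i) (*v)[i] *= factor;
  } else {
    #pragma omp parallel for num_threads(omp_threads)
    for (int i = 0; i < n; ++i) (*v)[i] *= factor;
  }
}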

common::random::RandGenerator<cpu, GType> *rnd,
const int N, Args... args) {
using namespace mxnet::common::random;
#ifdef _OPENMP
@asmushetzel (Contributor) Dec 23, 2017:

I don't think you need that ifdef. Even if we compile without OpenMP support, GetRecommendedOMPThreadCount() and omp_get_thread_num() are replaced by appropriate stubs, so your code does not need any special handling (see dmlc/omp.h and engine/openmp.cc).

@@ -34,6 +34,7 @@
import numpy as np
import numpy.testing as npt
import numpy.random as rnd
import scipy.stats as ss
Contributor:

we don't depend on scipy

Member:

How should I revise this? Move it to be inside the functions?

Contributor:

see mx.nd.sparse

Member:

@Javelinjs @piiswrong I've added one commit to solve the problem. I also added tests for the case in which the generator is triggered multiple times. 199fabd

@yzhliu (Member Author) commented Dec 28, 2017

merged changes from #9129

@piiswrong piiswrong merged commit 34a5195 into apache:master Dec 28, 2017
yzhliu added a commit to yzhliu/mxnet that referenced this pull request Dec 29, 2017
* add tests for distribution generators (fix lint; fix lint; fix typo; fix docstring; fix docstring)
* [Bugfix] fix random generator: do not gen seed each time
* gen samplers on gpu for test_softmax
* fix test cases
* remove unnecessary prints
* refactor RandGenerator
* get_native_random -> get_parallel_random
* revise test cases + remove dependency of scipy
* raise warning
meissnereric pushed a commit to meissnereric/incubator-mxnet that referenced this pull request Jan 2, 2018
yuxiangw pushed a commit to yuxiangw/incubator-mxnet that referenced this pull request Jan 25, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018