DOC: remove references to RandomState #13778

tupui · 2021-03-31T10:05:34Z

Status

Since NumPy 1.17, the canonical way to create random numbers is:

import numpy as np
from numpy.random import default_rng
rng = default_rng()

And If a seed is to be used. User should not use a common value such as 0 or 123456. This is to prevent bias (everyone is using the same values) and ensure sufficient entropy for the generator. More details. The good practice (from NumPy's doc) is:

from numpy.random import SeedSequence
seed = SeedSequence().entropy

But in our documentation, we present the old way of dealing with RNG and show bad things like global states of RNG.

Proposal

I propose here to remove from the documentation any reference to RandomState and wrong (not best practice) usage of random number generators. It implies changing docstrings random_state,seed and using correct RNG in examples and tutorial. The main user is stats and the tutorial has been updated with a dedicated section for random number generation.

For reference, a discussion we recently had about RNG and seeding here: #13631 (comment)

ilayn · 2021-03-31T10:35:57Z

Well it's not wrong but just the old and not optimal way of doing it. Including the majority of our test suite, there is a wild amount of seed usage so I wouldn't jump to that conclusion.

tupui · 2021-03-31T10:44:01Z

Well it's not wrong but just the old and not optimal way of doing it. Including the majority of our test suite, there is a wild amount of seed usage so I wouldn't jump to that conclusion.

Using RandomState is not "wrong" but should not be used by final users. It is explicitly written in NumPy's doc that it stays for compatibility reasons. But the generator is not optimal (and also, only one option left in new versions).

On the test side, yes we have to keep it as this is the only way, for now, which is safely providing a deterministic stream of bits accros platforms and versions. Also noted in NumPy's doc.

IMO, we should only show good practices. From my experience, practitioners take our doc for a gold standard and in that regards, we have to be careful about what we show in examples. A lot of people just copy paste code and don't touch anything if it works. So if in examples we showcase something using a bad, or even just old, way it is not good.

As for taking a common seed and using a global seed. It is definitely an issue.

ilayn · 2021-03-31T10:47:43Z

I agree with all but we have to be careful by swinging from one side to the other. A gentler approach will go a long way for convincing users. Not changing every version to another conclusion;"1.17 was wrong 1.20 was correct now also wrong 1.34 is definitely correct now!!!!1!!" is not emitting the right message here. Showing the shortcomings and letting them convince themselves is a more effective strategy.

rkern

Changing the examples to use default_rng() instead of np.random or RandomState are very welcome. Some of the docstring changes for random_state arguments need to be reverted to accurately describe the behavior of check_random_state().

rkern · 2021-03-31T14:14:44Z

doc/source/tutorial/stats.rst

+And fixing the seed can be done like this:
+
+    >>> from numpy.random import SeedSequence
+    >>> seed = SeedSequence().entropy


This example doesn't fix the seed. Rerunning the whole example gets fresh entropy every time. Instead, I would put in an explicit number literal, like the old example. However, to improve on the old example, use a quality 128-bit number and in the .. warning:: below show how you can use SeedSequence().entropy to get such a number.

Ok thanks. I was really how to deal with this one. I should have said here that you could get a number by running the code and then use it as you pleased.

Ideally I did not want to put any number so that users would not copy past the same number all the time. I also thought about generating a fix number which would change every time you reload the page.

scipy/linalg/_sketches.py

scipy/sparse/construct.py

rkern · 2021-03-31T14:32:14Z

scipy/sparse/construct.py


-    >>> class CustomRandomState(np.random.RandomState):
+    >>> class CustomRandomState(np.random.Generator):


How necessary is this subclassing to the example? It's not commented on, or described as an intended API elsewhere in the docstring. Changing the semantics of a method like this is not a practice that I would recommend. If one needs to be able to specify to this method to sample even integers, as in this example, that needs to be specified by some other means.

Also, just swapping Generator for RandomState won't work in this example. This function uses the rng_integers() compatibility shim to draw integers. If random_state is a Generator, it will call the integers() method directly instead of randint(), so this overridden method gets ignored.

Indeed, I don't think this class is needed in the example. I will remove it.

scipy/spatial/transform/rotation.pyi

scipy/spatial/transform/rotation.pyx

scipy/stats/_distn_infrastructure.py

scipy/stats/_multivariate.py

tupui · 2021-03-31T14:56:37Z

Thanks @rkern for the review.

As you see I wanted to get rid of all RandomState even if under the hood we are calling check_random_state. We had a discussion when working on QMC and the result was that I coded a variant of this (scipy.stats._qmc:check_random_state). In the docstring of the module, we do not talk at all about RandomState even if yes, the functions would accept it. This is to push the user to use the Generator interface.

The idea here was to do the same everywhere. Keep this hidden behaviour while encouraging user to use RandomState. If we keep RandomState as options in the arguments, the linters will be fine. Which is not what I wanted.

rkern · 2021-03-31T15:01:31Z

I would prefer an accurate description of the random_state arguments as they are currently processed rather than one that describes one possible future state that does not yet exist (and might not! We may change our minds on what that future state looks like, in detail).

andyfaff · 2021-03-31T23:45:04Z

I'm in agreement with @rkern here. I'd prefer the docstring for the parameters to reflect what's actually accepted by the function. For the examples, changing them all to use np.random.Generator is fine.

Half of me does wonder if a phased deprecation away from RandomState would be worthwhile, but that would need to be part of a wider conversation. It would make internal code easier.

rkern · 2021-03-31T23:46:10Z

Omitting the RandomState option when describing the seed arguments in the QMC code makes sense; it's an incidental legacy option, and repeating the full list of options each time is overly verbose. I do advocate for not having to describe every possible valid value when some of them are not recommended. You are describing the current semantics of the code, just being judicious with the limited amount of text we want to spend describing it.

But we're not using that version of check_random_state() in all of these places. RandomState is not incidental to the processing of these arguments. Rather, it's fundamental to the processing of every valid input except a full-fledged Generator instance. Omitting mention of RandomState for those valid inputs means telling the user something incorrect about the current version of the code, not just omitting a valid but disrecommended option. It would be describing a possible future state of the code, but we haven't decided when (or whether, in all particulars) we'll commit to those semantics.

mdhaber · 2021-04-01T05:22:07Z

I see now that Generator does not have randint, and RandomState does not have integers. What are we supposed to do if we want to write a function that works with both RandomState and Generator? (Do we really need to try/catch or detect what type of object it is and call the appropriate method?)

andyfaff · 2021-04-01T05:56:00Z

I see now that Generator does not have randint, and RandomState does not have integers. What are we supposed to do if we want to write a function that works with both RandomState and Generator? (Do we really need to try/catch or detect what type of object it is and call the appropriate method?)

Well, funny you should ask this question, there is functionality for this here. This was introduced in #11680.

It might be better to write a wrapper:

def rng_integers(gen, low, high=None, size=None, dtype='int64',
                 endpoint=False):
    rng = check_random_state(gen)
    
    if isinstance(rng, np.random.Generator):
        def gener(low, high=None, size=None, dtype='int64', endpoint=False)
            return rng.integers(low, high=high, size=size, dtype=dtype,
                                endpoint=endpoint)
    else:
        # RandomState
        def gener(low, high=None, size=None, dtype='int64', endpoint=False)
            if endpoint:
                # inclusive of endpoint
                # remember that low and high can be arrays, so don't modify in
                # place
                if high is None:
                    return rng.randint(low + 1, size=size, dtype=dtype)
                if high is not None:
                    return rng.randint(low, high=high + 1, size=size, dtype=dtype)

            # exclusive
            return rng.randint(low, high=high, size=size, dtype=dtype)
    
    return gener

mdhaber · 2021-04-01T06:41:33Z

Thanks @andyfaff! I think I'll be ok with the function as-is.

Do you happen to know if this was an oversight (i.e. that people might want to use RandomState and Generator the same way) or was the name randint deemed too egregious to keep around?

tupui · 2021-04-01T06:42:07Z

OK then I will see to add back RandomState where we are using check_random_state.

@andyfaff I am really in favour of planning a deprecation cycle for RandomState. We are almost in a PY2-PY3 situation here.
There is a huge legacy here and we must be careful, I get this. On the other hand, some will never change because it "still" works (cf. the recent small discussion on the mail list). But for the sake of users not willing to update some values in their tests, we are paying a high dev price. (Yes, I saw production code with seeds too... But IMO this is just another wrong practice and hence should not be of any concern on our side). Even for the user it's confusing to see that many options for a given parameter. Sometimes you have to break things for people to move on and adopt best practices.

Also, everyone, feel free to commit directly on the branch if you see something.

mdhaber · 2021-04-01T06:58:57Z

@tupui Can you point us to the TLDR of why it matters? I wonder if I'm the only maintainer who doesn't know. I'm not arguing; I believe that it's the "right" thing to do now, but I haven't come across any answers that resonate with me as to why it matters in everyday code. The documentation says:

The main difference between the two is that Generator relies on an additional BitGenerator to manage state and generate the random bits, which are then transformed into random values from useful distributions.

but this doesn't actually indicate the difference is or why it's good; it only states what Generator does.

tupui · 2021-04-01T07:10:05Z

@tupui Can you point us to the TLDR of why it matters? I wonder if I'm the only maintainer who doesn't know. I'm not arguing; I believe that it's the "right" thing to do now, but I haven't come across any answers that resonate with me as to why it matters in everyday code.

TBH, in terms of statistical properties of the RNG, I don't know as I did not dive into it. But on the practical side, there are new things like support of axis for choice, permutation, shuffle. Or just a clearer interface as there are not anymore several functions doing the same (or almost) things like random_sample and rand. Also if my reading is correct, the new generators are faster.

But @rkern would answer better here as he wrote NEP19 https://numpy.org/neps/nep-0019-rng-policy.html

mdhaber · 2021-04-01T07:22:37Z

Thanks, I'll read that. +10 for axis in those.

andyfaff · 2021-04-01T07:28:30Z

The NEP is a nice read. That definitely suggests to me the worth the deprecation cycle to remove the technical debt.

tupui · 2021-04-01T07:30:25Z

The NEP is a nice read. That definitely suggests to me the worth the deprecation cycle to remove the technical debt.

If we would go with this, I could also change the PR toward deprecating it. Let me all know 😄

rgommers · 2021-04-01T08:46:49Z

If we would go with this, I could also change the PR toward deprecating it.

Please don't include it here. I don't think we want to do that deprecation - that's a large amount of churn for users, for limited benefits for SciPy itself. But if we want to do it, it's best to discuss on a separate issue/PR.

rkern · 2021-04-01T14:27:40Z

There are a couple of things that might people think might be meant by "deprecating it". The one I think some are arguing against is to noisily warn about and eventually reject RandomState instances that are passed in.

However, I do think at some point we will want to change the behavior of the general check_random_state() to be like the version in the QMC code, to be Generator-first, processing integer and None seeds with default_rng() instead of RandomState (and probably also extracting the BitGenerator from a passed-in RandomState to wrap with Generator instead of passing the RandomState through). We don't fix bugs in the distributions on RandomState anymore, and we will often want to work with the newer APIs that are only available on Generator. Those changes probably do warrant some kind of managed change, though I'm not convinced it has to be noisy deprecation warnings.

rkern · 2021-04-01T14:29:09Z

But yeah, not in this branch. The example cleanups (and mentioning Generator in the docstrings that hadn't gotten that update) are a great contribution alone.

tupui · 2021-04-01T14:31:09Z

@rkern can you just confirm that I can use this docstring for seed using the current check_random_state:

"""
    seed : {None, int, `numpy.random.Generator`,
            `numpy.random.RandomState`}, optional
        If `seed` is None (or `np.random`), the `numpy.random.RandomState`
        singleton is used.
        If `seed` is an int, a new ``RandomState`` instance is used,
        seeded with `seed`.
        If `seed` is already a ``Generator`` or ``RandomState`` instance then
        that instance is used.
"""

This is just a slightly modified version of the one from QMC. I would use this also to update the current docstrings of check_random_state of course (if I am not mistaking, this should be the same just with links and slight re-wording). This way we have the exact same docstrings everywhere.

I would prefer not to go through everything 10 times 😅

[EDIT: I started like this, hope its fine... 🤞 ]

ev-br · 2021-04-01T15:02:42Z

One possible enhancement (in a separate PR!) could be to make check_random_state recognize a np.random.SeedSequence instance and seed a Generator with it.

tupui · 2021-04-01T15:03:16Z

One possible enhancement (in a separate PR!) could be to make check_random_state recognize a np.random.SeedSequence instance and seed a Generator with it.

I can do this afterwards sure 😉

…dn/standard_normal

tupui · 2021-04-02T14:10:19Z

@rkern @ilayn I integrated your suggestions and fixed the CI. Could you please have another look?

doc/source/tutorial/stats.rst

rkern

Just a few more very minor suggestions to consider.

doc/source/tutorial/stats.rst

scipy/linalg/decomp_svd.py

Co-authored-by: Robert Kern <robert.kern@gmail.com>

tupui · 2021-04-02T18:10:45Z

Thank you for the review @rkern! Shall we merge then?

rkern · 2021-04-02T18:30:41Z

@mdhaber

Do you happen to know if this was an oversight (i.e. that people might want to use RandomState and Generator the same way) or was the name randint deemed too egregious to keep around?

The randint() API had accumulated a lot of cruft as we tacked on functionality and fixed deficiencies of the original implementation (e.g. to correctly handle the semantics of the non-inclusive upper bound for each dtype even though that bound cannot be expressed in that dtype) while maintaining backwards compatibility in RandomState. integers() is the cleaned-up version, how we would design the API given the current set of functionality if we had a fresh start. Because the API is different, it needed a different name. If it were solely up to me, I might still have kept the randint() method there (translating its calls to integers() internally) so that Generator would be more of a drop-in replacement, but using rng_integers() (which mimics the integers() interface and translates it to randint() when RandomState is used) is a better in the long-term as RandomState disappears. We won't keep the legacy of the crufty randint() interface forever.

rkern · 2021-04-02T20:02:17Z

@tupui This final form LGTM.

tupui · 2021-04-02T20:18:41Z

@tupui This final form LGTM.

Thanks. I propose to merge this tomorrow then.

mdhaber · 2021-04-03T14:59:17Z

@rkern thanks. I hadn't noticed the API difference yet. rng_integers is a good bridge during the transition.

sturlamolden · 2021-05-10T20:00:06Z

Status

Since NumPy 1.17, the canonical way to create random numbers is:
import numpy as np
from numpy.random import default_rng
rng = default_rng()

NumPy documentation says that the default generator can change in future version.

“The Generator is the user-facing object that is nearly identical to the legacy RandomState. It accepts a bit generator instance as an argument. The default is currently PCG64 but this may change in future versions.”

https://numpy.org/doc/stable/reference/random/index.html

This “canonical” way of producing random variates can therefore break the test suite and the doctests. Generator and seed should be explicitly specified.

tupui · 2021-05-10T20:37:15Z

This “canonical” way of producing random variates can therefore break the test suite and the doctests. Generator and seed should be explicitly specified.

Robert already replied in the PR.

On my side, I stand by this change. It would be totally up to me that there would be absolutely no seeding. It can give you a fake sense of things working although they are not.

Anyway, here the examples are user facing and yes NEP19 is explicitly advocating for using what we proposed here. It's a very little prize to pay to fix a few numbers in the unlikely even that the generator changes. FYI, it took me at most 1h to replace them all once I had the logic in place. So not really any discussions to have here. Furthermore, the tests are still using RandomState as NEP19 promotes.

Good hygiene around RNG is paramount and as a pilar of the statistical environment in Python, SciPy must be exemplary.

tupui added 2 commits March 31, 2021 09:56

DOC: update stats tutorial

1ef7d4c

DOC: remove occurrence of RandomState from docstrings and doctests

f638a47

tupui added the Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org label Mar 31, 2021

tupui requested review from rgommers and ev-br March 31, 2021 10:05

rkern reviewed Mar 31, 2021

View reviewed changes

tupui added 2 commits April 1, 2021 18:10

DOC: revert RandomState for check_random_state docstrings and fix ran…

fb263d9

…dn/standard_normal

STY: pep8 adjustments

bb3432a

tupui added 2 commits April 2, 2021 09:59

STY: pep8 adjustments

26373fe

DOC: remove import numpy from examples

1472419

tupui force-pushed the random_generator branch from 50022e3 to cbd08e4 Compare April 2, 2021 12:20

DOC: fix doctests random values

62fba3f

tupui force-pushed the random_generator branch from cbd08e4 to 62fba3f Compare April 2, 2021 12:58

rkern reviewed Apr 2, 2021

View reviewed changes

doc/source/tutorial/stats.rst Outdated Show resolved Hide resolved

rkern approved these changes Apr 2, 2021

View reviewed changes

doc/source/tutorial/stats.rst Outdated Show resolved Hide resolved

scipy/linalg/decomp_svd.py Show resolved Hide resolved

Apply suggestions from code review

f068ec1

Co-authored-by: Robert Kern <robert.kern@gmail.com>

tupui merged commit 08390dc into scipy:master Apr 3, 2021

tupui deleted the random_generator branch April 3, 2021 08:32

tylerjereddy added this to the 1.7.0 milestone Apr 3, 2021

This was referenced Apr 12, 2021

DOC: remove references to RandomState astropy/astropy#11519

Closed

ENH: stats: add method parameter to differential_entropy + more accurate estimators for small samples #13845

Merged

DOC: remove seed in examples #13863

Merged

tupui mentioned this pull request Apr 21, 2021

ENH/DOC: remove references to np.random.RandomState in favour of np.random.Generator scikit-image/scikit-image#5342

Closed

sturlamolden mentioned this pull request May 10, 2021

ENH: stats: add bootstrap for estimating confidence interval and standard error of an n-sample statistic #13371

Merged

11 tasks

tupui mentioned this pull request Jun 29, 2021

ENH: refactor RNG usage to use np.random.Generator #14322

Open

This was referenced Jul 18, 2021

ENH: stats: Integrate library UNU.RAN in scipy.stats [GSoC 2021] #14215

Merged

WIP: ENH: PROPACK wrappers mckib2/scipy#9

Closed

mdhaber mentioned this pull request Nov 18, 2022

DOC: Improve example for uniform_direction distribution #17439

Merged

oyamad mentioned this pull request Nov 23, 2022

ENH: check_random_state: Accept np.random.Generator QuantEcon/QuantEcon.py#654

Merged

7 tasks

mdhaber mentioned this pull request Feb 18, 2023

DOC: incorrect example for stats.cramervonmises #17999

Closed


		>>> class CustomRandomState(np.random.RandomState):
		>>> class CustomRandomState(np.random.Generator):

Uh oh!

DOC: remove references to RandomState #13778

DOC: remove references to RandomState #13778

Uh oh!

Conversation

tupui commented Mar 31, 2021

Status

Proposal

Uh oh!

ilayn commented Mar 31, 2021

Uh oh!

tupui commented Mar 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ilayn commented Mar 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rkern left a comment

Choose a reason for hiding this comment

Uh oh!

rkern Mar 31, 2021

Choose a reason for hiding this comment

Uh oh!

tupui Mar 31, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rkern Mar 31, 2021

Choose a reason for hiding this comment

Uh oh!

tupui Apr 2, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tupui commented Mar 31, 2021

Uh oh!

rkern commented Mar 31, 2021

Uh oh!

andyfaff commented Mar 31, 2021

Uh oh!

rkern commented Mar 31, 2021

Uh oh!

mdhaber commented Apr 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andyfaff commented Apr 1, 2021

Uh oh!

mdhaber commented Apr 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tupui commented Apr 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mdhaber commented Apr 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tupui commented Apr 1, 2021

Uh oh!

mdhaber commented Apr 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andyfaff commented Apr 1, 2021

Uh oh!

tupui commented Apr 1, 2021

Uh oh!

rgommers commented Apr 1, 2021

Uh oh!

rkern commented Apr 1, 2021

Uh oh!

rkern commented Apr 1, 2021

Uh oh!

tupui commented Apr 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ev-br commented Apr 1, 2021

Uh oh!

tupui commented Mar 31, 2021 •

edited

Loading

ilayn commented Mar 31, 2021 •

edited

Loading

mdhaber commented Apr 1, 2021 •

edited

Loading

mdhaber commented Apr 1, 2021 •

edited

Loading

tupui commented Apr 1, 2021 •

edited

Loading

mdhaber commented Apr 1, 2021 •

edited

Loading

mdhaber commented Apr 1, 2021 •

edited

Loading

tupui commented Apr 1, 2021 •

edited

Loading