Skip to content

Conversation

tupui
Copy link
Member

@tupui tupui commented Mar 31, 2021

Status

Since NumPy 1.17, the canonical way to create random numbers is:

import numpy as np
from numpy.random import default_rng
rng = default_rng()

And If a seed is to be used. User should not use a common value such as 0 or 123456. This is to prevent bias (everyone is using the same values) and ensure sufficient entropy for the generator. More details. The good practice (from NumPy's doc) is:

from numpy.random import SeedSequence
seed = SeedSequence().entropy

But in our documentation, we present the old way of dealing with RNG and show bad things like global states of RNG.

Proposal

I propose here to remove from the documentation any reference to RandomState and wrong (not best practice) usage of random number generators. It implies changing docstrings random_state,seed and using correct RNG in examples and tutorial. The main user is stats and the tutorial has been updated with a dedicated section for random number generation.

For reference, a discussion we recently had about RNG and seeding here: #13631 (comment)

@tupui tupui added the Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org label Mar 31, 2021
@tupui tupui requested review from rgommers and ev-br March 31, 2021 10:05
@ilayn
Copy link
Member

ilayn commented Mar 31, 2021

Well it's not wrong but just the old and not optimal way of doing it. Including the majority of our test suite, there is a wild amount of seed usage so I wouldn't jump to that conclusion.

@tupui
Copy link
Member Author

tupui commented Mar 31, 2021

Well it's not wrong but just the old and not optimal way of doing it. Including the majority of our test suite, there is a wild amount of seed usage so I wouldn't jump to that conclusion.

Using RandomState is not "wrong" but should not be used by final users. It is explicitly written in NumPy's doc that it stays for compatibility reasons. But the generator is not optimal (and also, only one option left in new versions).

On the test side, yes we have to keep it as this is the only way, for now, which is safely providing a deterministic stream of bits accros platforms and versions. Also noted in NumPy's doc.

IMO, we should only show good practices. From my experience, practitioners take our doc for a gold standard and in that regards, we have to be careful about what we show in examples. A lot of people just copy paste code and don't touch anything if it works. So if in examples we showcase something using a bad, or even just old, way it is not good.

As for taking a common seed and using a global seed. It is definitely an issue.

@ilayn
Copy link
Member

ilayn commented Mar 31, 2021

I agree with all but we have to be careful by swinging from one side to the other. A gentler approach will go a long way for convincing users. Not changing every version to another conclusion;"1.17 was wrong 1.20 was correct now also wrong 1.34 is definitely correct now!!!!1!!" is not emitting the right message here. Showing the shortcomings and letting them convince themselves is a more effective strategy.

Copy link
Member

@rkern rkern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the examples to use default_rng() instead of np.random or RandomState are very welcome. Some of the docstring changes for random_state arguments need to be reverted to accurately describe the behavior of check_random_state().

And fixing the seed can be done like this:

>>> from numpy.random import SeedSequence
>>> seed = SeedSequence().entropy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example doesn't fix the seed. Rerunning the whole example gets fresh entropy every time. Instead, I would put in an explicit number literal, like the old example. However, to improve on the old example, use a quality 128-bit number and in the .. warning:: below show how you can use SeedSequence().entropy to get such a number.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks. I was really how to deal with this one. I should have said here that you could get a number by running the code and then use it as you pleased.

Ideally I did not want to put any number so that users would not copy past the same number all the time. I also thought about generating a fix number which would change every time you reload the page.


>>> class CustomRandomState(np.random.RandomState):
>>> class CustomRandomState(np.random.Generator):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How necessary is this subclassing to the example? It's not commented on, or described as an intended API elsewhere in the docstring. Changing the semantics of a method like this is not a practice that I would recommend. If one needs to be able to specify to this method to sample even integers, as in this example, that needs to be specified by some other means.

Also, just swapping Generator for RandomState won't work in this example. This function uses the rng_integers() compatibility shim to draw integers. If random_state is a Generator, it will call the integers() method directly instead of randint(), so this overridden method gets ignored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I don't think this class is needed in the example. I will remove it.

@tupui
Copy link
Member Author

tupui commented Mar 31, 2021

Thanks @rkern for the review.

As you see I wanted to get rid of all RandomState even if under the hood we are calling check_random_state. We had a discussion when working on QMC and the result was that I coded a variant of this (scipy.stats._qmc:check_random_state). In the docstring of the module, we do not talk at all about RandomState even if yes, the functions would accept it. This is to push the user to use the Generator interface.

The idea here was to do the same everywhere. Keep this hidden behaviour while encouraging user to use RandomState. If we keep RandomState as options in the arguments, the linters will be fine. Which is not what I wanted.

@rkern
Copy link
Member

rkern commented Mar 31, 2021

I would prefer an accurate description of the random_state arguments as they are currently processed rather than one that describes one possible future state that does not yet exist (and might not! We may change our minds on what that future state looks like, in detail).

@andyfaff
Copy link
Contributor

I'm in agreement with @rkern here. I'd prefer the docstring for the parameters to reflect what's actually accepted by the function. For the examples, changing them all to use np.random.Generator is fine.

Half of me does wonder if a phased deprecation away from RandomState would be worthwhile, but that would need to be part of a wider conversation. It would make internal code easier.

@rkern
Copy link
Member

rkern commented Mar 31, 2021

Omitting the RandomState option when describing the seed arguments in the QMC code makes sense; it's an incidental legacy option, and repeating the full list of options each time is overly verbose. I do advocate for not having to describe every possible valid value when some of them are not recommended. You are describing the current semantics of the code, just being judicious with the limited amount of text we want to spend describing it.

But we're not using that version of check_random_state() in all of these places. RandomState is not incidental to the processing of these arguments. Rather, it's fundamental to the processing of every valid input except a full-fledged Generator instance. Omitting mention of RandomState for those valid inputs means telling the user something incorrect about the current version of the code, not just omitting a valid but disrecommended option. It would be describing a possible future state of the code, but we haven't decided when (or whether, in all particulars) we'll commit to those semantics.

@mdhaber
Copy link
Contributor

mdhaber commented Apr 1, 2021

I see now that Generator does not have randint, and RandomState does not have integers. What are we supposed to do if we want to write a function that works with both RandomState and Generator? (Do we really need to try/catch or detect what type of object it is and call the appropriate method?)

@andyfaff
Copy link
Contributor

andyfaff commented Apr 1, 2021

I see now that Generator does not have randint, and RandomState does not have integers. What are we supposed to do if we want to write a function that works with both RandomState and Generator? (Do we really need to try/catch or detect what type of object it is and call the appropriate method?)

Well, funny you should ask this question, there is functionality for this here. This was introduced in #11680.

It might be better to write a wrapper:

def rng_integers(gen, low, high=None, size=None, dtype='int64',
                 endpoint=False):
    rng = check_random_state(gen)
    
    if isinstance(rng, np.random.Generator):
        def gener(low, high=None, size=None, dtype='int64', endpoint=False)
            return rng.integers(low, high=high, size=size, dtype=dtype,
                                endpoint=endpoint)
    else:
        # RandomState
        def gener(low, high=None, size=None, dtype='int64', endpoint=False)
            if endpoint:
                # inclusive of endpoint
                # remember that low and high can be arrays, so don't modify in
                # place
                if high is None:
                    return rng.randint(low + 1, size=size, dtype=dtype)
                if high is not None:
                    return rng.randint(low, high=high + 1, size=size, dtype=dtype)

            # exclusive
            return rng.randint(low, high=high, size=size, dtype=dtype)
    
    return gener

@mdhaber
Copy link
Contributor

mdhaber commented Apr 1, 2021

Thanks @andyfaff! I think I'll be ok with the function as-is.

Do you happen to know if this was an oversight (i.e. that people might want to use RandomState and Generator the same way) or was the name randint deemed too egregious to keep around?

@tupui
Copy link
Member Author

tupui commented Apr 1, 2021

OK then I will see to add back RandomState where we are using check_random_state.

@andyfaff I am really in favour of planning a deprecation cycle for RandomState. We are almost in a PY2-PY3 situation here.
There is a huge legacy here and we must be careful, I get this. On the other hand, some will never change because it "still" works (cf. the recent small discussion on the mail list). But for the sake of users not willing to update some values in their tests, we are paying a high dev price. (Yes, I saw production code with seeds too... But IMO this is just another wrong practice and hence should not be of any concern on our side). Even for the user it's confusing to see that many options for a given parameter. Sometimes you have to break things for people to move on and adopt best practices.

Also, everyone, feel free to commit directly on the branch if you see something.

@mdhaber
Copy link
Contributor

mdhaber commented Apr 1, 2021

@tupui Can you point us to the TLDR of why it matters? I wonder if I'm the only maintainer who doesn't know. I'm not arguing; I believe that it's the "right" thing to do now, but I haven't come across any answers that resonate with me as to why it matters in everyday code. The documentation says:

The main difference between the two is that Generator relies on an additional BitGenerator to manage state and generate the random bits, which are then transformed into random values from useful distributions.

but this doesn't actually indicate the difference is or why it's good; it only states what Generator does.

@tupui
Copy link
Member Author

tupui commented Apr 1, 2021

@tupui Can you point us to the TLDR of why it matters? I wonder if I'm the only maintainer who doesn't know. I'm not arguing; I believe that it's the "right" thing to do now, but I haven't come across any answers that resonate with me as to why it matters in everyday code.

TBH, in terms of statistical properties of the RNG, I don't know as I did not dive into it. But on the practical side, there are new things like support of axis for choice, permutation, shuffle. Or just a clearer interface as there are not anymore several functions doing the same (or almost) things like random_sample and rand. Also if my reading is correct, the new generators are faster.

But @rkern would answer better here as he wrote NEP19 https://numpy.org/neps/nep-0019-rng-policy.html

@mdhaber
Copy link
Contributor

mdhaber commented Apr 1, 2021

Thanks, I'll read that. +10 for axis in those.

@andyfaff
Copy link
Contributor

andyfaff commented Apr 1, 2021

The NEP is a nice read. That definitely suggests to me the worth the deprecation cycle to remove the technical debt.

@tupui
Copy link
Member Author

tupui commented Apr 1, 2021

The NEP is a nice read. That definitely suggests to me the worth the deprecation cycle to remove the technical debt.

If we would go with this, I could also change the PR toward deprecating it. Let me all know 😄

@rgommers
Copy link
Member

rgommers commented Apr 1, 2021

If we would go with this, I could also change the PR toward deprecating it.

Please don't include it here. I don't think we want to do that deprecation - that's a large amount of churn for users, for limited benefits for SciPy itself. But if we want to do it, it's best to discuss on a separate issue/PR.

@rkern
Copy link
Member

rkern commented Apr 1, 2021

There are a couple of things that might people think might be meant by "deprecating it". The one I think some are arguing against is to noisily warn about and eventually reject RandomState instances that are passed in.

However, I do think at some point we will want to change the behavior of the general check_random_state() to be like the version in the QMC code, to be Generator-first, processing integer and None seeds with default_rng() instead of RandomState (and probably also extracting the BitGenerator from a passed-in RandomState to wrap with Generator instead of passing the RandomState through). We don't fix bugs in the distributions on RandomState anymore, and we will often want to work with the newer APIs that are only available on Generator. Those changes probably do warrant some kind of managed change, though I'm not convinced it has to be noisy deprecation warnings.

@rkern
Copy link
Member

rkern commented Apr 1, 2021

But yeah, not in this branch. The example cleanups (and mentioning Generator in the docstrings that hadn't gotten that update) are a great contribution alone.

@tupui
Copy link
Member Author

tupui commented Apr 1, 2021

@rkern can you just confirm that I can use this docstring for seed using the current check_random_state:

"""
    seed : {None, int, `numpy.random.Generator`,
            `numpy.random.RandomState`}, optional
        If `seed` is None (or `np.random`), the `numpy.random.RandomState`
        singleton is used.
        If `seed` is an int, a new ``RandomState`` instance is used,
        seeded with `seed`.
        If `seed` is already a ``Generator`` or ``RandomState`` instance then
        that instance is used.
"""

This is just a slightly modified version of the one from QMC. I would use this also to update the current docstrings of check_random_state of course (if I am not mistaking, this should be the same just with links and slight re-wording). This way we have the exact same docstrings everywhere.

I would prefer not to go through everything 10 times 😅

[EDIT: I started like this, hope its fine... 🤞 ]

@ev-br
Copy link
Member

ev-br commented Apr 1, 2021

One possible enhancement (in a separate PR!) could be to make check_random_state recognize a np.random.SeedSequence instance and seed a Generator with it.

@tupui
Copy link
Member Author

tupui commented Apr 1, 2021

One possible enhancement (in a separate PR!) could be to make check_random_state recognize a np.random.SeedSequence instance and seed a Generator with it.

I can do this afterwards sure 😉

@tupui tupui force-pushed the random_generator branch from 50022e3 to cbd08e4 Compare April 2, 2021 12:20
@tupui tupui force-pushed the random_generator branch from cbd08e4 to 62fba3f Compare April 2, 2021 12:58
@tupui
Copy link
Member Author

tupui commented Apr 2, 2021

@rkern @ilayn I integrated your suggestions and fixed the CI. Could you please have another look?

Copy link
Member

@rkern rkern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more very minor suggestions to consider.

Co-authored-by: Robert Kern <robert.kern@gmail.com>
@tupui
Copy link
Member Author

tupui commented Apr 2, 2021

Thank you for the review @rkern! Shall we merge then?

@rkern
Copy link
Member

rkern commented Apr 2, 2021

@mdhaber

Do you happen to know if this was an oversight (i.e. that people might want to use RandomState and Generator the same way) or was the name randint deemed too egregious to keep around?

The randint() API had accumulated a lot of cruft as we tacked on functionality and fixed deficiencies of the original implementation (e.g. to correctly handle the semantics of the non-inclusive upper bound for each dtype even though that bound cannot be expressed in that dtype) while maintaining backwards compatibility in RandomState. integers() is the cleaned-up version, how we would design the API given the current set of functionality if we had a fresh start. Because the API is different, it needed a different name. If it were solely up to me, I might still have kept the randint() method there (translating its calls to integers() internally) so that Generator would be more of a drop-in replacement, but using rng_integers() (which mimics the integers() interface and translates it to randint() when RandomState is used) is a better in the long-term as RandomState disappears. We won't keep the legacy of the crufty randint() interface forever.

@rkern
Copy link
Member

rkern commented Apr 2, 2021

@tupui This final form LGTM.

@tupui
Copy link
Member Author

tupui commented Apr 2, 2021

@tupui This final form LGTM.

Thanks. I propose to merge this tomorrow then.

@tupui tupui merged commit 08390dc into scipy:master Apr 3, 2021
@tupui tupui deleted the random_generator branch April 3, 2021 08:32
@mdhaber
Copy link
Contributor

mdhaber commented Apr 3, 2021

@rkern thanks. I hadn't noticed the API difference yet. rng_integers is a good bridge during the transition.

@sturlamolden
Copy link
Contributor

Status

Since NumPy 1.17, the canonical way to create random numbers is:

import numpy as np
from numpy.random import default_rng
rng = default_rng()

NumPy documentation says that the default generator can change in future version.

“The Generator is the user-facing object that is nearly identical to the legacy RandomState. It accepts a bit generator instance as an argument. The default is currently PCG64 but this may change in future versions.”

https://numpy.org/doc/stable/reference/random/index.html

This “canonical” way of producing random variates can therefore break the test suite and the doctests. Generator and seed should be explicitly specified.

@tupui
Copy link
Member Author

tupui commented May 10, 2021

This “canonical” way of producing random variates can therefore break the test suite and the doctests. Generator and seed should be explicitly specified.

Robert already replied in the PR.

On my side, I stand by this change. It would be totally up to me that there would be absolutely no seeding. It can give you a fake sense of things working although they are not.

Anyway, here the examples are user facing and yes NEP19 is explicitly advocating for using what we proposed here. It's a very little prize to pay to fix a few numbers in the unlikely even that the generator changes. FYI, it took me at most 1h to replace them all once I had the logic in place. So not really any discussions to have here. Furthermore, the tests are still using RandomState as NEP19 promotes.

Good hygiene around RNG is paramount and as a pilar of the statistical environment in Python, SciPy must be exemplary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants