
Conversation

@tupui (Member) commented Apr 14, 2021

This is a follow-up to #13778.

Some examples were not updated correctly to remove the seeding.

@tupui tupui added the "Documentation" label (Issues related to the SciPy documentation; also check https://github.com/scipy/scipy.org) on Apr 14, 2021
@tupui (Member, Author) commented Apr 14, 2021

Already fixing the issues. PyCharm's Find is having trouble because the repo is huge, so it doesn't display everything all the time...

[EDIT:] I think it's OK now. I searched using the find window, which appeared to search everywhere this time. Also, CI is green, so this is ready for review 😄

@tupui tupui force-pushed the random_seed_examples branch from 951e90d to 3a6b503 on April 14, 2021 13:47
@tupui tupui added this to the 1.7.0 milestone Apr 14, 2021
@tirthasheshpatel (Member) commented:

Are these all? I remember seeing a lot of docs using seed in stats.

@ilayn (Member) commented Apr 15, 2021

Wait. It's fine to remove the seed if we have a better way of getting reproducible results. I'm -1 on leaving things dangling randomly, regardless of how simple they are. If we are going to touch 50 files, we should at least make sure that we only do it once.

@tirthasheshpatel (Member) commented Apr 15, 2021

Wait. It's fine to remove the seed if we have a better way of getting reproducible results. I'm -1 on leaving things dangling randomly.

I had this thought too, but after #13778 got merged, I thought this was the direction we were heading in. np.random.default_rng allows setting the seed, though: rng = np.random.default_rng(12345). What do you think about doing that, @tupui?

@ilayn (Member) commented Apr 15, 2021

I thought that was just the RandomState instances. I missed the seed removal in that PR, then.

@tupui (Member, Author) commented Apr 15, 2021

Wait. It's fine to remove the seed if we have a better way of getting reproducible results. I'm -1 on leaving things dangling randomly.

I had this thought too, but after #13778 got merged, I thought this was the direction we were heading in. np.random.default_rng allows setting the seed, though: rng = np.random.default_rng(12345). What do you think about doing that, @tupui?

Following the discussion in the other PR, we actually tried not to have any explicit seeding like 12345. As noted in NumPy's docs, and now in the stats tutorial of our docs, doing this initializes the generator poorly. Also, everyone uses common values like 0 or 12345, which leads to bias. Some people just copy-paste code (seen in production and research code all the time, regardless of the language/tech...), so how we write our examples is paramount.
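
For illustration, here is a minimal sketch (an editor's illustration, not code from this PR) of the seeding practice referred to above: draw a high-entropy seed once via SeedSequence and record it, instead of reusing small literals such as 0 or 12345.

import numpy as np

seed = np.random.SeedSequence().entropy  # large integer drawn from OS entropy
rng = np.random.default_rng(seed)        # reproducible once `seed` is recorded
print(seed)
print(rng.standard_normal(3))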

@tupui (Member, Author) commented Apr 15, 2021

Are these all? I remember seeing a lot of docs using seed in stats.

I think so. I did several passes of searching, trying to catch everything.

@tirthasheshpatel (Member) commented Apr 15, 2021

I think so. I did several passes of searching, trying to catch everything.

Thanks! Sorry for the tedious # random to # may vary change. I don't have any major comments other than that for now.

@tupui (Member, Author) commented Apr 15, 2021

I think so. I did several passes of searching, trying to catch everything.

Thanks! Sorry for the tedious # random to # may vary change. I don't have any major comments other than that for now.

No worries. It's easier to change than it was to add ;) I will just wait for a go from another maintainer to make sure we are all on the same page.

@ilayn (Member) commented Apr 15, 2021

I get the idea for production code. But as we discussed there, I don't care about the bias in the examples. If we cannot set seeds, or whatever the new tech calls it, I'm against any change, since that would be a big regression.

@ilayn (Member) commented Apr 15, 2021

In fact, if there is no simple alternative to numpy.random.seed(), that is an unfortunate change for all of numpy/scipy. Besides, doctests are very important for catching regressions, so this PR effectively removes tests, which is probably not what we want in the first place.

How we write tests is indeed paramount, but not checking what our tests are doing is not acceptable either.

@@ -94,10 +94,10 @@ simplicity, we'll construct a symmetric, positive-definite matrix.
>>> from scipy.linalg import eig, eigh
>>> from scipy.sparse.linalg import eigs, eigsh
>>> np.set_printoptions(suppress=True)
>>> rng = np.random.default_rng()
Member:

Can we get reproducible results by seeding the default_rng?

Member:

Yes, but see #13863 (comment).

Member (Author):

And I just found #13631 (comment) again.

Member:

I don't buy the idea that someone would copy code with random inputs. Surely they would use their own inputs if they're actually trying to achieve something other than learning the scipy function (in which case random quality is irrelevant).

However, if it's really an issue, you could always use default_rng(seed=SEED), where SEED is defined in the doctest runner and so is not part of the example.

# the namespace to run examples in
DEFAULT_NAMESPACE = {'np': np}

Member:

However, if it's really an issue, you could always use default_rng(seed=SEED), where SEED is defined in the doctest runner and so is not part of the example.

Hmm, that would be a pleasing middle ground.

Member:

Ideally, I think if you ran make clean; make html twice you'd get the same output, and with this PR you won't (better for reproducibility, smaller diffs with doc uploads to version-controlled things, etc.). At the same time, I see the point about wanting to set a good example (which I agree is more important).

Another option is for us to hack the doc building in some way to give us reproducible outputs. For example, in conf.py we could temporarily monkeypatch numpy.random.default_rng to be lambda seed=0: original_default_rng(seed). That way our examples stay static, but readers just see us use rng = numpy.random.default_rng() in examples. I can look into this and open a PR to do it separately if people agree it solves our problems.
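
A rough sketch of that monkeypatch idea, assuming it would live in the Sphinx conf.py (a hypothetical illustration of the suggestion above, not the change that was merged):

import numpy as np

_original_default_rng = np.random.default_rng

def _seeded_default_rng(seed=0):
    # Rendered examples still show `rng = np.random.default_rng()`, but the
    # doc build silently supplies a fixed seed so the output is reproducible.
    return _original_default_rng(seed)

np.random.default_rng = _seeded_default_rng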

@tupui (Member, Author) commented Apr 16, 2021

Argh, just seeing this. I started to add the SEED 😅. Well, it's not for nothing if we take your approach, @larsoner, as I would basically already have the new values for the tests... If we do this, you could commit to my branch directly. Or would we first use # random and then introduce this hack?

I will wait a bit for the decision then (+1 for @rgommers's # random suggestion; in the end, I believe it could mean less technical debt, as we would not need to maintain changing values there, plus a new hack).

Just let me know 😃

Member:

If we do this, you could commit to my branch directly.

This seems like the easiest approach. Let's wait to see if others agree it's the best compromise.

Also I do agree with @rgommers that we will cause copy-paste problems for people with the SEED approach, so I prefer the hack-the-doc-build approach.

Member:

+1 for hack-the-doc-build, good idea

Member (Author):

OK, great 😅 @larsoner, I will push what I did with SEED very soon. Basically, I added this to refguide_check.py:

# SEED is generated with
# from numpy.random import SeedSequence
# print(SeedSequence().entropy)
DEFAULT_NAMESPACE = {'np': np, 'SEED': 1638083107694713882823079058616272161}

Then, if I understand your trick correctly, please use this value when replacing >>> rng = np.random.default_rng().
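
For illustration, under that scheme an example would read roughly as follows, with SEED injected by refguide_check.py rather than written out in the docstring (a sketch of the proposal being discussed here, not the approach that was ultimately merged):

>>> rng = np.random.default_rng(SEED)  # SEED comes from DEFAULT_NAMESPACE
>>> sample = rng.random(2)             # deterministic across doc builds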

@peterbell10 (Member) left a comment

Spotted a couple more

@tupui tupui force-pushed the random_seed_examples branch 2 times, most recently from 2111fca to 3ec3192 on April 17, 2021 13:11
Commits added:
Apply suggestions from code review (Co-authored-by: peterbell10 <peterbell10@live.co.uk>)
DOC: more seeding
@tupui tupui force-pushed the random_seed_examples branch from 3ec3192 to 16e7988 on April 17, 2021 13:33
@tupui (Member, Author) commented Apr 17, 2021

Thanks @peterbell10. Originally I did not look much at the plots or at things which did not cause doctests to fail. I have fixed these now. I also went through np.random.seed/rand/randint/randn/random/uniform again.

I had to put a # may vary for multivariate_normal, as seeding is not sufficient there, as described in multiple issues (numpy/numpy#13597, numpy/numpy#8041).

[PS: removing a reviewer was not intentional; I am on the iOS app, and adding a reviewer removed some 🤷‍♂️]

@tupui tupui requested review from rgommers and peterbell10 April 17, 2021 15:28
@tupui (Member, Author) commented Apr 18, 2021

@rgommers do you want to have a final look?

@rgommers (Member) left a comment

LGTM too and CI is all green. Let's give this a go, thanks @tupui! And thanks all reviewers!

@rgommers rgommers merged commit 9492369 into scipy:master Apr 18, 2021
@tupui (Member, Author) commented Apr 18, 2021

Thanks Ralf 😃 and thanks everyone for the review and the discussion!

@tupui tupui deleted the random_seed_examples branch April 18, 2021 19:43
@mdhaber (Contributor) commented Apr 19, 2021

Anyone else on Windows having trouble running refguide_check locally? I'm consistently getting a MemoryError in the middle of the stats refguide_check.
Is there a verbose output option so I can tell which test is failing? python runtests.py -g --refguide-check -v doesn't do it.
Is there a way of running the check for a single file so I can avoid the error? python runtests.py -g --refguide-check -t scipy.stats._entropy doesn't do it.

@rgommers (Member) commented:

That doesn't sound familiar. It would be better to open a separate issue for it.

@ev-br (Member) commented Apr 19, 2021

python runtests.py -g --refguide-check -t scipy.stats._entropy doesn't do it.

Sadly, no; refguide-check only works per-submodule, so -s stats should work, but -t scipy.stats._entropy won't.

@sturlamolden (Contributor) commented:

Status

Since NumPy 1.17, the canonical way to create random numbers is:

import numpy as np
from numpy.random import default_rng
rng = default_rng()

The NumPy documentation says that the default generator can change in a future version.

“The Generator is the user-facing object that is nearly identical to the legacy RandomState. It accepts a bit generator instance as an argument. The default is currently PCG64 but this may change in future versions.”

https://numpy.org/doc/stable/reference/random/index.html

This “canonical” way of producing random variates can therefore break the test suite and the doctests. The generator and seed should be specified explicitly.
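
For illustration, one way to do what is suggested here is to pin both the bit generator and the seed explicitly (an editor's sketch, not code from this PR), so results stay stable even if NumPy changes the default generator:

import numpy as np

# Explicitly choose the bit generator (PCG64) and a recorded seed, so results
# do not depend on NumPy's choice of default generator.
rng = np.random.Generator(np.random.PCG64(1638083107694713882823079058616272161))
print(rng.standard_normal(2))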

@rkern (Member) commented May 10, 2021

We've already discussed this in this PR. I don't intend to rehash the whole conversation (you can read it at your leisure), but in brief: We don't use doctests in our test suite to test the functionality of our code. We use explicitly-written unit tests for that purpose. Our examples are written to document how to use scipy, and thus are written according to the current recommended practice. If an upstream change changes the output of the examples meaningfully, we'll just update the examples.
