-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
DOC: remove seed in examples #13863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: remove seed in examples #13863
Conversation
Already fixing the issues. Find from PyCharm is having trouble as the repo is huge, so it's not displaying everything all the time... [EDIT:] I think it's ok now. I search using the find window which appeared to be searching everywhere this time. Also CI is green so ready for review 😄 |
b07b3a7
to
951e90d
Compare
951e90d
to
3a6b503
Compare
Are these all? I remember seeing a lot of docs using |
Wait. It's fine to remove the seed if we have a better way to fix reproducible results. I'm -1 on leaving things dangling randomly regardless of how simple they are. If we are going to touch 50 files at least we should make sure that we do it once. |
I had this thought too but after #13778 got merged, I thought this is the direction we are headed to. |
I thought that was just the randomstate instances. I missed any seed removal on that PR then. |
From the discussion in the other PR we actually tried to not have any explicit seeding like 12345. As noted in NumPy's doc and now in the stats' tutorial of our doc, doing this initialize poorly the generator. Also, everyone uses common values like 0, 12345. Hence it leads to bias. Some people just copy-paste code (seen in production or research code all the time, regardless of the language/tech...). So how we write our examples is paramount. |
I think. I did several pass of search trying to catch everything. |
Thanks! Sorry for the |
No worries. It's easier to change than it was to add ;) I will just wait to have to go from another maintainer to make sure we are all on the same page. |
I get the idea in production code. But as we discussed there I don't care about the bias in the examples. If we cannot set seeds or whatever the new tech calls it, I'm against for any change since that is a big regression. |
In fact if there is no simple alternative to How we write tests are indeed paramount, but not checking what are tests are doing is not acceptable either. |
@@ -94,10 +94,10 @@ simplicity, we'll construct a symmetric, positive-definite matrix. | |||
>>> from scipy.linalg import eig, eigh | |||
>>> from scipy.sparse.linalg import eigs, eigsh | |||
>>> np.set_printoptions(suppress=True) | |||
>>> rng = np.random.default_rng() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get reproducible results by seeding the default_rng
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but see #13863 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I just found back #13631 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't buy the idea that someone would copy code with random inputs. Surely they would use their own inputs if they're actually trying to achieve something other than learning the scipy function (in which case random quality is irrelevant).
However if it's really an issue, you could always use default_rng(seed=SEED)
where SEED
is defined in the doctest runner, so not part of the example.
Lines 468 to 469 in 374e5bf
# the namespace to run examples in | |
DEFAULT_NAMESPACE = {'np': np} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However if it's really an issue, you could always use
default_rng(seed=SEED)
whereSEED
is defined in the doctest runner, so not part of the example.
Hmm, that would be a pleasing middle ground.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally I think if you ran make clean; make html
twice you'd get the same output, and with this PR you won't (better for reproducibility, smaller diffs with doc uploads to version controlled things, etc.). At the same time, I see the point about wanting to set a good example (which I agree is more important).
Another option is for for us to hack the doc building in some way to give us reproducible outputs. For example, in conf.py
we could temporarily monkeypatch numpy.random.default_rng
to be lambda seed=0: original_default_rng(seed)
. That way our examples stay static, but readers just see us use rng = numpy.random.default_rng()
in examples. I can look into this and open a PR to do this separately if people agree it solves our problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ark just seeing this. I started to add the SEED 😅. Well it's not for nothing if we take your approach @larsoner as I basically would have the new values for the tests... If we do this, you could commit to my branch directly. Or first we would use # random
and then introduce this hack?
I will wait a bit for the decision then (+1 for @rgommers with the # random
. In the end, I believe it could mean less technical debt as we would not need to maintain changing values there. Plus a new hack.).
Just let me know 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do this, you could commit to my branch directly.
This seems like the easiest approach. Let's wait to see if others agree it's the best compromise.
Also I do agree with @rgommers that we will cause copy-paste problems for people with the SEED approach, so I prefer the hack-the-doc-build approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for hack-the-doc-build, good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok great 😅 @larsoner I will push very soon what I did with SEED
. Basically I added this in the refguide_check.py
# SEED is generated with
# from numpy.random import SeedSequence
# print(SeedSequence().entropy)
DEFAULT_NAMESPACE = {'np': np, 'SEED': 1638083107694713882823079058616272161}
Then if I understand correctly your trick, please use this value when replacing >>> rng = np.random.default_rng()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spotted a couple more
2111fca
to
3ec3192
Compare
Co-authored-by: peterbell10 <peterbell10@live.co.uk> DOC: more seeding Apply suggestions from code review Co-authored-by: peterbell10 <peterbell10@live.co.uk> DOC: more seeding
3ec3192
to
16e7988
Compare
Thanks @peterbell10. Originally I did not look much as the plots or things which did not cause doctests to fail. I fixed these now. I also went through again I had to put a [ps: the removing of reviewer is not intentional, I am on the iOS app and adding a reviewer removed some 🤷♂️] |
@rgommers do you want to have a final look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too and CI is all green. Let's give this a go, thanks @tupui! And thanks all reviewers!
Thanks Ralf 😃 and thanks everyone for the review and the discussion! |
Anyone else on Windows having trouble running |
That doesn't sound familiar. Would be better to open a separate issue for it. |
Sadlly, no; refguide-check only works per-submodule, so |
NumPy documentation says that the default generator can change in future version. “The Generator is the user-facing object that is nearly identical to the legacy RandomState. It accepts a bit generator instance as an argument. The default is currently PCG64 but this may change in future versions.” https://numpy.org/doc/stable/reference/random/index.html This “canonical” way of producing random variates can therefore break the test suite and the doctests. Generator and seed should be explicitly specified. |
We've already discussed this in this PR. I don't intend to rehash the whole conversation (you can read it at your leisure), but in brief: We don't use doctests in our test suite to test the functionality of our code. We use explicitly-written unit tests for that purpose. Our examples are written to document how to use scipy, and thus are written according to the current recommended practice. If an upstream change changes the output of the examples meaningfully, we'll just update the examples. |
This is a follow up of #13778
Some example were not updated correctly to remove the seeding.