-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
MAINT: Remove self._size (and self._random_state) from stats distributions. #11394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
largely yes. (I'm using them when I want to avoid the generic overhead, but maybe not for I think the main backwards compatibility problem might be with user subclasses of the distributions. |
My ancient code in the statsmodels sandbox uses the But I guess my old code would be the standard pattern by packages that define their own distribution as subclasses of the scipy distributions. That means that the |
I found 7 examples in statsmodels:
Not only will the examples fail because they reference
For the old code to work with this PR, it would need to be updated. Not sure I see an obvious coexistence/deprecation path. On the plus side, this PR change would not cause any silent failures... |
searching for code on github is difficult because it mainly shows code of copies/forks of scipy I found two in a brief search, that subclass and override |
Looks like an api compatible copy for autograd Using leading underscore for the methods that can or need to be overridden by subclasses predates the convention of using it to signal private functions. So, I think anything that is needed to subclass scipy distributions has to be considered as part of the public API with corresponding backwards compatibility constraints. |
One possibility would be to The main disadvantage is that it will slow down user subclasses if the code has to go through the It wouldn't matter for statsmodels, because it's only in little used sandbox code, and we can adjust the code, and in vectorized calls the overhead of |
4242f73
to
7dd1df5
Compare
Updated the PR to include Instead of @WarrenWeckesser @josef-pkt You seem to have thought about this the most in the past. Thoughts on whether the change can be made? MAINT: Fixed the thread safety of Remove This change could break existing code, as the Any distributions with the new API should be thread safe and repeatable if |
FYI: Some comments from an old PR: and later |
I probably won't review this PR until the weekend. But I'd be happy if we can make the change reasonably cleanly while preserving backwards compatibility. |
Travis-CI ppc64le build failed due to failure to install numpy. |
7dd1df5
to
66d59eb
Compare
Operator error on the last commit: testing had picked up the installed (different) build which succeeded. |
@WarrenWeckesser or @josef-pkt Any further thoughts? I believe that the PR is backwards compatible for any external distributions (emitting a one-time visible DeprecationWarning) and cleans up all our internal distributions. |
Resolved one merge conflict and ran the tests locally - I see three warnings from
Looks about ready other than that. |
This looks good to me. One thing that would be good to do is to add one of those older subclasses using |
@WarrenWeckesser wrote on Jan 27
Phew, that was a long weekend. 😳 This is a great change, thanks. I tried it using a subclass of One thought: a typical
That use of
which is a bit cryptic. I get that the idea is
Then a direct call to
which is a bit more helpful. What do you think? |
I agree it is not ideal. My inclination is to leave them both as keywords, and accept that incorrect usage may not always produce an optimal message. |
Remove self._size and self._random_state from stats distributions. Add 'size' keyword to dist._rvs() rather than relying on dist._size. Added 'random_state' as a keyword argument to the _rvs() method for all distributions. Provides thread safety by passing in values rather than modifying the class. They need to be keywords rather than arguments to avoid being treated as part of '*args'. This change could break existing code, as the _rvs() method for any class inheriting from rv_generic will need to be updated. All such classes inside SciPy have been updated. Maintain backwards compatibility by a) inspecting the signature of self._rvs() when the object is created, and record whether it needs the _size set; and b) inside the call to self.rvs(), check whether the self._size and self._random_state attributes need to be set (after preserving the current values for later restoration.) Any distributions with the new API should be thread safe and repeatable if the client calls dist.rvs() passing in a random_state. (Using the default random_state will always be subject to threading issues.)
_discrete_distns.py: self._random_state -> random_state _distn_infrastructure.py: random_state.uniform(size) -> random_state.uniform(size=size)
Import Iterable from collections.abc.
Modified truncnorm._rvs signature to match current API.
… method is missing a size argument.
The expected signature is: def _rvs(self, *args, size=None, random_state=None): If using typing, it would be declared as size: Optional[Tuple[int]] = None. The parent rvs in rv_generic will always pass a tuple for the size argument, which may be empty.
Added a regression test for the older subclasses per @rgommers suggestion. Made the default for |
I agree |
This looks great, let's get it in to give it a couple of weeks in master before branching 1.5.x. Thanks @pvanmulbregt! And thanks @WarrenWeckesser and @josef-pkt for reviewing. |
Add
size
keyword todist._rvs()
rather than relying ondist._size
.Moves towards thread safety by passing in values rather than modifying the class.
gh-9900 made many stats usages more thread-safe, but didn't change
distn._size
, which gets set in every call todistn.rvs()
. This PR stops the setting thedistn._size
member of the class.distns.rvs()
is not yet completely thread-safe as the random_state is stored inself._random_state
.Some questions:
size
became a keyword argument to_rvs(self, *shape_args, size=... )
What should the default size be?
None
?1
? Something else?If not set in the parent
rv_generic.rvs()
method, previously it generated one random variate (or perhaps one for each shape args passed in.)In the actual distribution's
_rvs
method, it appears to always have a value in all the tests.Is that in fact the case, that by the time a distribution is asked for
_rvs()
, it knows explicitly the size? Either byself._size
prior to this PR, or a realsize
passed in for this PR.Should
size
become a positional argument to eachdistn._rvs()
, or would that break backwards compatibility? (Aside: Are _underscore methods part of the API that needs preserving?)