Skip to content

Conversation

pvanmulbregt
Copy link
Contributor

Add size keyword to dist._rvs() rather than relying on dist._size.
Moves towards thread safety by passing in values rather than modifying the class.

gh-9900 made many stats usages more thread-safe, but didn't change distn._size, which gets set in every call to distn.rvs(). This PR stops the setting the distn._size member of the class.
distns.rvs() is not yet completely thread-safe as the random_state is stored in self._random_state.

Some questions:
size became a keyword argument to _rvs(self, *shape_args, size=... )
What should the default size be? None? 1? Something else?
If not set in the parent rv_generic.rvs() method, previously it generated one random variate (or perhaps one for each shape args passed in.)
In the actual distribution's _rvs method, it appears to always have a value in all the tests.
Is that in fact the case, that by the time a distribution is asked for _rvs(), it knows explicitly the size? Either by self._size prior to this PR, or a real size passed in for this PR.
Should size become a positional argument to each distn._rvs(), or would that break backwards compatibility? (Aside: Are _underscore methods part of the API that needs preserving?)

@josef-pkt
Copy link
Member

Aside: Are _underscore methods part of the API that needs preserving

largely yes. (I'm using them when I want to avoid the generic overhead, but maybe not for _rvs)

I think the main backwards compatibility problem might be with user subclasses of the distributions.
That needs some test cases, because a user might use _size in their _rvs method.
(I'm not sure because I haven't looked at the details in some time.)

@josef-pkt
Copy link
Member

My ancient code in the statsmodels sandbox uses the _size attribute when defining _rvs in a subclass.
AFAICS, statsmodels does not have subclasses outside of the sandbox that override _rvs.

But I guess my old code would be the standard pattern by packages that define their own distribution as subclasses of the scipy distributions.

That means that the _size attribute has to be deprecated before it can be removed.

@pvanmulbregt
Copy link
Contributor Author

I found 7 examples in statsmodels:

 $ find . -name "*.py" | xargs grep "\b_size\b"
./statsmodels/sandbox/distributions/extras.py:        u0 = stats.norm.rvs(size=self._size)
./statsmodels/sandbox/distributions/extras.py:        u1 = delta*u0 + np.sqrt(1-delta**2)*stats.norm.rvs(size=self._size)
./statsmodels/sandbox/distributions/extras.py:        V = stats.chi2.rvs(df, size=self._size)
./statsmodels/sandbox/distributions/extras.py:        z = skewnorm.rvs(alpha, size=self._size)
./statsmodels/sandbox/distributions/extras.py:        self.kls._size = self._size
./statsmodels/sandbox/distributions/extras.py:        self.kls._size = self._size   #size attached to self, not function argument
./statsmodels/sandbox/distributions/transformed.py:        self.kls._size = self._size   #size attached to self, not function argument

Not only will the examples fail because they reference self._size, they will fail even earlier because their signature doesn't allow any keyword arguments. E.g.

class SkewNorm_gen(distributions.rv_continuous):
    def _rvs(self, alpha):   # <-- No keywords, in particular no size=...
...

For the old code to work with this PR, it would need to be updated. Not sure I see an obvious coexistence/deprecation path. On the plus side, this PR change would not cause any silent failures...

@josef-pkt
Copy link
Member

searching for code on github is difficult because it mainly shows code of copies/forks of scipy

I found two in a brief search, that subclass and override _rvs using _size
https://github.com/4everer/GZY_JData/blob/124969442336daf0893c2d4137903662329b05f9/base/LogUniform.py
https://github.com/hildeweerts/hyperimp/blob/943b40978ba86a0c0b6d6303e9b562da38fb35d0/experiment_1/hyperimp/utils/distributions.py#L57

@josef-pkt
Copy link
Member

Looks like an api compatible copy for autograd
https://github.com/pyro-ppl/numpyro/blob/f3336eb15b2f311e2eed2a1cf9a00dd9f8d82cec/numpyro/contrib/distributions/continuous.py

Using leading underscore for the methods that can or need to be overridden by subclasses predates the convention of using it to signal private functions.

So, I think anything that is needed to subclass scipy distributions has to be considered as part of the public API with corresponding backwards compatibility constraints.

@josef-pkt
Copy link
Member

josef-pkt commented Jan 21, 2020

One possibility would be to try ... except protect the call to _rvs with the size keyword, and fall back to setting the _size attribute if that fails.

The main disadvantage is that it will slow down user subclasses if the code has to go through the except part on repeated calls. The main problem with the speed of scipy distribution rvs calls was with Bayesian packages that didn't call in a vectorized way (i.e. repeated calls with size=1 instead of a large number at once which has a lot of overhead).
However, python packages that do that might not exist anymore.

It wouldn't matter for statsmodels, because it's only in little used sandbox code, and we can adjust the code, and in vectorized calls the overhead of except should be relatively small.

@pvanmulbregt pvanmulbregt changed the title MAINT: Remove self._size from stats distributions. MAINT: Remove self._size (and self._random_state) from stats distributions. Jan 27, 2020
@pvanmulbregt
Copy link
Contributor Author

Updated the PR to include self._random_state in addition to self._size. I.e. The _rvs() call must have a signature like the following
def _rvs(self, *shape_args, size=?, random_state=?)
[The default values of size and random_state don't matter as they will always be provided from scipy.stats calls.]

Instead of try...except, I added a check on the signature similar to that done for 'moment' in the .stats() signature.

@WarrenWeckesser @josef-pkt You seem to have thought about this the most in the past. Thoughts on whether the change can be made?


MAINT: Fixed the thread safety of rv_generic.rvs().

Remove self._size and self._random_state from stats distributions.
Add size keyword to dist._rvs() rather than relying on dist._size.
Added random_state' as a keyword argument to the _rvs() method for all distributions. Provides thread safety by passing in values rather than modifying the class. They need to be keywords rather than arguments to avoid being treated as part of *args`.

This change could break existing code, as the _rvs() method for any class
inheriting from rv_generic will need to be updated.
All such classes inside SciPy have been updated.
Maintain backwards compatibility by
a) inspecting the signature of self._rvs() when the object is created,
and record whether it needs the _size set; and
b) inside the call to self.rvs(), check whether the self._size and
self._random_state attributes need to be set (after preserving the current
values for later restoration.)

Any distributions with the new API should be thread safe and repeatable if
the client calls dist.rvs() passing in a random_state. (Using the default
random_state will always be subject to threading issues.)

@tylerjereddy tylerjereddy added the maintenance Items related to regular maintenance tasks label Jan 27, 2020
@WarrenWeckesser
Copy link
Member

FYI: Some comments from an old PR:

#5526 (comment)

and later

#5526 (comment)

@WarrenWeckesser
Copy link
Member

I probably won't review this PR until the weekend. But I'd be happy if we can make the change reasonably cleanly while preserving backwards compatibility.

@pvanmulbregt
Copy link
Contributor Author

Travis-CI ppc64le build failed due to failure to install numpy.

@pvanmulbregt
Copy link
Contributor Author

Rebased after CI failures from gh-11460 were addressed, and fixed the conflicts with gh-11402

@pvanmulbregt
Copy link
Contributor Author

Operator error on the last commit: testing had picked up the installed (different) build which succeeded.

@pvanmulbregt
Copy link
Contributor Author

@WarrenWeckesser or @josef-pkt Any further thoughts? I believe that the PR is backwards compatible for any external distributions (emitting a one-time visible DeprecationWarning) and cleans up all our internal distributions.

@rgommers rgommers added this to the 1.5.0 milestone Mar 17, 2020
@rgommers
Copy link
Member

Resolved one merge conflict and ran the tests locally - I see three warnings from test_continuous_basic.py:

================================== warnings summary ===================================
tests/test_continuous_basic.py::test_cont_basic[argus-arg3]
  /home/rgommers/code/scipy/scipy/stats/_distn_infrastructure.py:995: VisibleDeprecationWarning: The signature of <bound method argus_gen._rvs of <scipy.stats._continuous_distns.argus_gen object at 0x7f1bc2bf0be0>> does not contain a "size" keyword.  Such signatures are deprecated.
    np.VisibleDeprecationWarning)

tests/test_continuous_basic.py::test_cont_basic[argus-arg3]
  /home/rgommers/code/scipy/scipy/stats/_distn_infrastructure.py:995: VisibleDeprecationWarning: The signature of <bound method argus_gen._rvs of <scipy.stats._continuous_distns.argus_gen object at 0x7f1bbfb64f98>> does not contain a "size" keyword.  Such signatures are deprecated.
    np.VisibleDeprecationWarning)

tests/test_continuous_basic.py::test_cont_basic[kstwo-arg59]
  /home/rgommers/code/scipy/scipy/stats/_continuous_distns.py:152: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    return 0.5/(n if not isinstance(n, collections.Iterable) else np.asanyarray(n)), 1.0

Looks about ready other than that.

@rgommers
Copy link
Member

Instead of try...except, I added a check on the signature similar to that done for 'moment' in the .stats() signature.

This looks good to me. One thing that would be good to do is to add one of those older subclasses using _size that you found and add it as a regression test (incl assert_warns).

@WarrenWeckesser
Copy link
Member

@WarrenWeckesser wrote on Jan 27

I probably won't review this PR until the weekend. ...

Phew, that was a long weekend. 😳

This is a great change, thanks. I tried it using a subclass of rv_continuous that I've been working on that uses the _size attribute. It worked fine and gave the expected deprecation warning.

One thought: a typical _rvs implementation is now (using expon as the example):

def _rvs(self, size=1, random_state=None):
    return random_state.standard_exponential(size)

That use of random_state when the default is None is a bit funky. If someone calls _rvs() directly and doesn't supply random_state, they get the error

AttributeError: 'NoneType' object has no attribute 'standard_exponential'

which is a bit cryptic. I get that the idea is _rvs() shouldn't be called like that. Subclasses are expected to override _rvs() where necessary, but calling it is the responsibility of rvs(). So I guess it is OK. Just a thought: if we switch the order of the new arguments, and removed the default value of random_state

def _rvs(self, random_state, size=1):
    return random_state.standard_exponential(size)

Then a direct call to _rvs() without the random_state argument will give an error like:

TypeError: _rvs() missing 1 required positional argument: 'random_state'

which is a bit more helpful. What do you think?

@pvanmulbregt
Copy link
Contributor Author

I agree it is not ideal. None is definitely not a valid value for random_state.
Looking through the code I had found existing functions call with the arguments in the size/random_state order (E.g. chi.rvs calls chi2.rvs(df, size=sz, random_state=rndm), and all the existing rvs_scalar implementations do the same) and I didn't want to change that.
Additionally the _multivariate.py rvs methods use the size/random order. And even worse, they have other keyword arguments occurring before the size! E.g. multivariate_normal_gen has def rvs(self, mean=None, cov=1, size=1, random_state=None)

My inclination is to leave them both as keywords, and accept that incorrect usage may not always produce an optimal message.

Remove self._size and self._random_state from stats distributions.
Add 'size' keyword to dist._rvs() rather than relying on dist._size.
Added 'random_state' as a keyword argument to the _rvs() method for all
distributions.
Provides thread safety by passing in values rather than modifying the class.
They need to be keywords rather than arguments to avoid being treated as
part of '*args'.

This change could break existing code, as the _rvs() method for any class
inheriting from rv_generic will need to be updated.
All such classes inside SciPy have been updated.
Maintain backwards compatibility by
a) inspecting the signature of self._rvs() when the object is created,
   and record whether it needs the _size set; and
b) inside the call to self.rvs(), check whether the self._size and
   self._random_state attributes need to be set (after preserving the current
   values for later restoration.)

Any distributions with the new API should be thread safe and repeatable if
the client calls dist.rvs() passing in a random_state. (Using the default
random_state will always be subject to threading issues.)
_discrete_distns.py:
 self._random_state -> random_state
_distn_infrastructure.py:
 random_state.uniform(size) -> random_state.uniform(size=size)
Modified truncnorm._rvs signature to match current API.
The expected signature is:
def _rvs(self, *args, size=None, random_state=None):

If using typing, it would be declared as size: Optional[Tuple[int]] = None.
The parent rvs in rv_generic will always pass a tuple for the size argument,
which may be empty.
@pvanmulbregt
Copy link
Contributor Author

Added a regression test for the older subclasses per @rgommers suggestion.

Made the default for size be None in the _rvs methods.
def _rvs(self, size=None, random_state=None):
The main rvs method will never call with size=None, it will always be a tuple. But using None as the default matches the signature of rv_generic._rvs method which is being overridden, and is better than size=1 which might be seen to imply that size could be an integer.

@rgommers
Copy link
Member

rgommers commented May 8, 2020

My inclination is to leave them both as keywords, and accept that incorrect usage may not always produce an optimal message.

I agree

@rgommers rgommers merged commit 2ddbe38 into scipy:master May 8, 2020
@rgommers
Copy link
Member

rgommers commented May 8, 2020

This looks great, let's get it in to give it a couple of weeks in master before branching 1.5.x. Thanks @pvanmulbregt! And thanks @WarrenWeckesser and @josef-pkt for reviewing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Items related to regular maintenance tasks scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants