Skip to content

Conversation

WarrenWeckesser
Copy link
Member

This is a fix for #2069

I haven't added any new tests yet, but the existing stats test suite passes on my system (Mac OS X 10.9.5, python 2.7.10, numpy 1.10.1). Here are some examples using the updated rvs method:

In [1]: from scipy.stats import norm

In [2]: norm.rvs(loc=np.zeros(5))
Out[2]: array([-0.98069527, -0.75010438,  0.22639067, -0.49124641,  1.29047549])

In [3]: norm.rvs(loc=[1, 10, 100], scale=[0.1, 0.01, 1])
Out[3]: array([  1.06500076,   9.99321314,  99.47675459])

In [4]: norm.rvs(loc=[1, 10, 100], scale=[0.1, 0.01, 1], size=3)
Out[4]: array([  1.04240551,   9.9775307 ,  99.0535618 ])

In [5]: norm.rvs(loc=[1, 10, 100], scale=[0.1, 0.01, 1], size=(3,))
Out[5]: array([   1.16531601,    9.99863233,  100.38931409])

In [6]: norm.rvs(loc=[1, 10, 100], scale=[0.1, 0.01, 1], size=(2, 3))
Out[6]: 
array([[   1.01646612,    9.98911665,  100.43066246],
       [   1.11037045,   10.01307627,  102.03790766]])

Argument with incompatible shapes (e.g. norm.rvs([1, 2, 3], [4, 5]), gamma.rvs(array([2.0, 5.0, 10.0, 15.0]), size=(2,2))) will raise a ValueError. The compatibility test is the same as that used by the random variate generators in numpy.random. The rule is that, if size is given, it determines the shape of the output--broadcasting can not change the output size.

@WarrenWeckesser
Copy link
Member Author

Side note: I would really like to eliminate the use of the _size attribute of the distribution classes. It is set only in the rvs method, and accessed only in the _rvs method, so it should simply be an argument to _rvs. Making it an attribute is pointless. However, the docstring for rv_continuous includes _rvs in the methods that subclasses can override, so we're stuck with the _size attribute for backwards compatibility.

@rgommers rgommers added the maintenance Items related to regular maintenance tasks label Nov 24, 2015
@rgommers
Copy link
Member

Yep, sub-optimal design but no good way to get rid of self._size. It shouldn't get in the way of this fix though right?

@rgommers
Copy link
Member

Change is nicely local, called from only one place. So should be able to get it reviewed/merged without too large risks right before a release. Existing test coverage for non-scalar loc/scale/size is very thin at the moment though, needs expanding.

@WarrenWeckesser
Copy link
Member Author

Yep, sub-optimal design but no good way to get rid of self._size. It shouldn't get in the way of this fix though right?

Right, the code in this PR leaves the use of ._size as it is. In an early draft, I got rid of it, but then realized _rvs is effectively a public part of the API, so I undid those changes (grumble, grumble).

Existing test coverage for non-scalar loc/scale/size is very thin at the moment though, needs expanding.

Working on it...

@WarrenWeckesser
Copy link
Member Author

Updated with some tests. I'm still thinking about some more thorough tests. Suggestions are welcome!

This is still WIP. The pearson3 implementation needs some work. The Pearson Type III distribution can be expressed as a shifted gamma distribution, so I'm looking into rewriting pearson3 in terms of gamma (instead of trying to fix pearson3.rvs()).

@ev-br
Copy link
Member

ev-br commented Nov 27, 2015

I quickly drafted a test based on the stackoverflow thread, which is sensitive to the order of broadcasting and drawing variates: https://github.com/ev-br/scipy/tree/pr/5526
The relevant commit is ev-br@8c07855, the second commit tries to cover discrete distributions as well, but it likely doesn't do it.

In master, this fails all over. With this PR, I only see a couple of failures in truncnorm which does if self.a > 0 somewhere.
[and that is quite a Pandora's box].

This test is just a suggestion.

EDIT: Also array-valued shapes: ev-br@fda4518

@WarrenWeckesser
Copy link
Member Author

Thanks! It occurred to me that another simple test would be to use np.vectorize to create a function that only calls the rvs method with scalars. Then the test would compare the results of the vectorize'd method with the actual method; if the random seed is set before each call, they should return the same array.

@WarrenWeckesser
Copy link
Member Author

Then the test would compare the results of the vectorize'd method with the actual method; if the random seed is set before each call, they should return the same array.

Actually, this isn't true, because some of the rvs methods use two random numbers from the random number generator to generate one random variate of the distribution. For example, betaprime._rvs is

def _rvs(self, a, b):
    sz, rndm = self._size, self._random_state
    u1 = gamma.rvs(a, size=sz, random_state=rndm)
    u2 = gamma.rvs(b, size=sz, random_state=rndm)
    return (u1 / u2)

Calling that function several times with scalars will not produce the same variates as calling it once with broadcast arguments or with size != 1. That's not a bug in betaprime. It's just an implementation detail, but it means the idea of comparing np.vectorize'd results to actual broadcast results won't work in general. The good news is that all the continuous distributions in the dists list in test_distributions.py passed the test except for betaprime, dgamma, exponnorm, nct, dweibull, and rice (all of which use more than one random number per variate) and for pearson3 (which I am still skipping).

@WarrenWeckesser
Copy link
Member Author

With this PR, I only see a couple of failures in truncnorm which does if self.a > 0 somewhere.

truncnorm uses if self.a > 0 twice. That can be fixed by using np.where or _lazywhere. I'll include a fix for truncnorm when I update the PR.

@ev-br
Copy link
Member

ev-br commented Nov 28, 2015

It would also be helpful to have a test explicitly comparing the output of rvs and a corresponding numpy generator for several distributions, especially with those which take non-empty shape parameters. Eg. beta or binom.

@ev-br
Copy link
Member

ev-br commented Nov 28, 2015

Regarding the size handling, I guess we could (i) add size parameter to the signature of _rvs method of distributions we have in scipy.stats, (ii) keep the old code path and deprecate it in the docs and release notes with the intent of removing it in a couple of releases, (iii) in code, inspect the signature of self._rvs and use either the old or the new code paths (this can be similar to how _stats_has_moments is handled).

This can be done in a separate PR, of course :-). I just think it would be nice to change both things in one release if possible.


def test_all_rvs_broadcast_shape():
# This test only checks the *shape* of the value returned by rvs().
for dist in dists:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that dists is a hard coded list which does not include all distributions IIRC. (I actually wonder why this is so, but that's off-topic)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm moving the tests over to test_continuous_basic.py and test_discrete_basic.py, where they will use the lists distcont and distdiscrete.

(I actually wonder why this is so [...])

I do that a lot when working with the distributions code. :)

@WarrenWeckesser
Copy link
Member Author

Even without the python 3 test failures, this is still WIP. I haven't touched pearson3 yet.

@ev-br I didn't use your commit, but putting the test "kernel" in common_tests.py was definitely the way to go. I think the tests now cover what you had in your commits, and more.

@WarrenWeckesser
Copy link
Member Author

@rgommers: In case it isn't obvious: this PR won't be ready in time for 0.17. Making the rvs() arguments work correctly might end up requiring that all the distribution methods properly broadcast. In theory, that's supposed to be the case already (see, for example, http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#broadcasting). In practice, not so much.

@rgommers
Copy link
Member

So we can remove the 0.17.0 label from gh-2069?

@WarrenWeckesser
Copy link
Member Author

So we can remove the 0.17.0 label from gh-2069?

Yup, I just did.

@pwaller
Copy link

pwaller commented Mar 3, 2016

Any time scale for completing this? Just hit the bug too.

@WarrenWeckesser
Copy link
Member Author

@pwaller I've been on a bit of a hiatus from scipy development, but I plan to get this into 0.18.

@pwaller
Copy link

pwaller commented Mar 4, 2016

In the meantime, what is the best workaround which won't break once this has landed? Is there any? I'm trying to generate many rvs with many values of parameters, e.g: beta.rvs([1, 2], [1, 2], size=(10, 2)) (I want 10 rvs each for 2 parameters). But I can't find any series of parameters which make that work. I guess I have to live with a non-vectorized implementation for now?

Edit: in the end I settled with array(list(beta(a, b).rvs(size=10000) for a, b in zip(As, Bs)))which is moderately fast, taking ~500ms for ~100 sets of parameters.

@WarrenWeckesser WarrenWeckesser force-pushed the stats-rvs-broadcasting branch from d56fe59 to beb7a7a Compare May 18, 2016 18:14
@WarrenWeckesser
Copy link
Member Author

Despite my hiatus in scipy contributions, I haven't forgotten about this.

I think the tentative plan is to release 0.18 this month. Is there a more concrete timeline for 0.18?

@larsoner
Copy link
Member

larsoner commented Jun 3, 2016

The milestone page lists June 14, but judging based on the number of PRs marked for 0.18, I'm skeptical :) I don't know of anything more concrete than that, at least.

@WarrenWeckesser
Copy link
Member Author

Thanks Eric. Time to dive back into this...

@WarrenWeckesser WarrenWeckesser force-pushed the stats-rvs-broadcasting branch 2 times, most recently from e7bd655 to e7a41ca Compare June 17, 2016 23:09
@WarrenWeckesser WarrenWeckesser changed the title WIP: Fix handling of size and broadcasting in the rvs method. BUG: stats: Fix broadcasting in the rvs() method of the distributions. Jun 17, 2016
@WarrenWeckesser WarrenWeckesser force-pushed the stats-rvs-broadcasting branch from e7a41ca to ffed817 Compare June 18, 2016 00:56
@WarrenWeckesser WarrenWeckesser added this to the 0.18.0 milestone Jun 18, 2016
@WarrenWeckesser
Copy link
Member Author

Updated.

@rgommers rgommers removed the needs-work Items that are pending response from the author label Jun 18, 2016
@ev-br
Copy link
Member

ev-br commented Jun 18, 2016

This would need a mention in the release notes, in big bold letters.

@WarrenWeckesser
Copy link
Member Author

@ev-br I added a note to the release notes about the bug fix. It is in the "Backwards incompatible changes" section, mainly because there are cases with arguments that are incompatible with a given size where the function used to return values and now it raises an error. My guess is that case will be pretty rare in the wild (but I admit it is just a guess).

I refrained from using "big bold letters". 😄

@ev-br
Copy link
Member

ev-br commented Jun 18, 2016

That's big enough and bold enough for me :-). This PR looks good to me now. I'm going to merge it before branching unless there are further comments.

@WarrenWeckesser WarrenWeckesser added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Jun 19, 2016
@ev-br
Copy link
Member

ev-br commented Jun 19, 2016

Needs a rebase (again).

Here's something I don't understand:

In [22]: scipy.__version__
Out[22]: '0.18.0.dev0+36683cf'

In [23]:  stats.norm.rvs(size=4, random_state=np.random.RandomState(1234))
Out[23]: array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 ])

In [24]: stats.levy_stable.rvs(alpha=1, beta=0, size=4, random_state=np.random.RandomState(1234))
Out[24]: array([-0.12827986, -0.73108696, -2.39290307, -0.17495177])

In [25]: stats.levy_stable.rvs(alpha=1.5, beta=0.3, size=4, random_state=np.random.RandomState(1234))
Out[25]: array([-0.40182381,  1.65060746, -1.64155074,  0.49261358])

In scipy 0.17:

In [16]: scipy.__version__
Out[16]: '0.17.0'

In [17]: stats.norm.rvs(size=4, random_state=np.random.RandomState(1234))
Out[17]: array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 ])

In [18]:  stats.levy_stable.rvs(alpha=1, beta=0, size=4, random_state=np.random.RandomState(1234))
Out[18]: array([-1.13381723,  0.56711265,  3.39976097,  0.03294947])

In [19]:  stats.levy_stable.rvs(alpha=1.5, beta=0.3, size=4, random_state=np.random.RandomState(1234))
Out[19]: array([ 0.67579527, -0.76668594,  0.82701716,  1.36710987])

Notice that the random draws are the same for norm (I also spot-checked gamma and rice), as they should. For levy_stable however, they differ.

The class rv_generic was updated to handle broadcasting of the
arguments to the rvs() method.

The following distributions required additional custom changes to
support broadcasting in rvs():
    levy_stable, pearson3, rice, truncnorm,
    hypergeom, planck, randint.

The function `numpy.broadcast_to` is used in several places.  This
function was added to numpy in version 1.10, so a copy was added to
_lib/_numpy_compat.py to support older version of numpy.

Closes scipygh-2069.
…e notes.

Also added subheadings to the section of the release notes on backwards
incompatible changes.
@WarrenWeckesser
Copy link
Member Author

Looks like levy_stable._rvs() wasn't updated along with the rest of the distributions when the _random_state attribute was added. In scipy 0.17.1:

In [4]: stats.levy_stable.rvs(alpha=1, beta=0, size=4, random_state=np.random.RandomState(1234))
Out[4]: array([ 1.84349104, -2.86730347, -3.2989345 ,  0.44281769])

In [5]: stats.levy_stable.rvs(alpha=1, beta=0, size=4, random_state=np.random.RandomState(1234))
Out[5]: array([-0.07054039,  1.90011002, -3.46018551, -4.57797445])

Those outputs should be the same.

I'll fix that now.

@WarrenWeckesser WarrenWeckesser force-pushed the stats-rvs-broadcasting branch from ac350eb to 761ac90 Compare June 19, 2016 14:49
@ev-br
Copy link
Member

ev-br commented Jun 19, 2016

Ouch. levy_stable is skipped in the loop over all distributions which, among other things, calls check_random_state_property, so it was not noticed.
This actually was my mess, thank you for cleaning it up!

@WarrenWeckesser
Copy link
Member Author

Updated with fix for levy_stable.

@ev-br
Copy link
Member

ev-br commented Jun 19, 2016

LGTM, merging. Thanks Warren!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected maintenance Items related to regular maintenance tasks scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants