Skip to content

Conversation

pvanmulbregt
Copy link
Contributor

Address incorrect probabilities, broadcasting issues and thread-safety
related to distributions setting member variables inside _argcheck().
Addresses gh-9896, gh-2002, gh-1320, gh-7933.

Add method _get_support(*args) to rv_generic (the default behaviour is to
return sef.a, self.b.) Distributions which have support depending on parameters
need to override this method.
Add overrides for genextreme, kappa4, reciprocal, truncexpon, truncnorm,
binom, bernoulli, hypergeom, boltzmann, randint, rv_sample.
Move code which sets members variables from _argcheck() into _get_support()
and discard any code which actually sets member data.
As most code previously called self._argcheck() anyway before calling
_cdf()/_pdf()/_sf()/..., the impact on performance is perhaps minimal.
This fixes threading concerns, because the (global) distribution is no longer
being changed every call.
Fixes broadcasting issues because the _cdf(x, *args, **kwargs)/... calls now
actually use *args, which has the correct size.

@ev-br
Copy link
Member

ev-br commented Mar 16, 2019

Looks good on a quick browse.
Performance is probably not a big concern indeed, the overhead is huge already, it's unlikely to get significantly worse.
Note that the framework won't be fully thread-safe anyway, because of the self._size dance in rvs.

Could you expand a bit: you mention this addresses several issues, what is their status with this change? Are they fixed? In that case, it'd be helpful to add specific regression tests, with references to GH issues.
If cooking up a threading regression test is not too difficult, it'd be nice, too.

@pvanmulbregt
Copy link
Contributor Author

To be explicit, this PR fixes:

The most recent pushes added named tests for the 4 Issues. An errorful threading situation is simulated in one of these tests. It fails prior to gh-9990, passes with it.

Copy link
Member

@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to digest all the code in this PR, will take some time. In the meanwhile, inequality in my comment was broken in irwinhall pr so pointing that out.

if self.b < np.inf:
right = self.b
_a, _b = self._get_support(*args)
if _a > -np.inf:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here if arguments are array_like the truthyness of the inequality is ambiguous, similar for _b just beneath.
For example, in case of irwinhall (or even binom support and n parameter is a 2 entry array_like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Previously self.a and self.b were fixed numbers. Now they are dependent on*args, which might not be specifying a single distribution, but a family of them. That will also impact the determination of left and right in the block just below this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 36a9e03 addressed this issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 36a9e03 addressed this issue.

Yes.

@pvanmulbregt
Copy link
Contributor Author

The test failures appear to be unrelated.

  • The ci/circleci: pypy3 test failed immediately with a 404 error getting the Debian security updates.
  • The ci/circleci: build_docs target built the html and latex but failed the pdflatex with Invalid fontname [FreeSerif.otf]/OT', contains '['`

@ilayn
Copy link
Member

ilayn commented Apr 3, 2019

build_docs should go away if you rebase it

…embers.

Address incorrect probabilities, broadcasting issues and thread-safety
related to distributions setting member variables inside _argcheck().
Addresses scipygh-9896, scipygh-2002, scipygh-1320, scipygh-7933.

Add method _get_support(*args) to rv_generic (the default behaviour is to
return sef.a, self.b.)  Distributions which have support depending on parameters
need to override this method.
Add overrides for genextreme, kappa4, reciprocal, truncexpon, truncnorm,
binom, bernoulli, hypergeom, boltzmann, randint, rv_sample.
Move code which sets members variables from _argcheck() into _get_support()
and discard any code which actually sets member data.
As most code previously called self._argcheck() anyway before calling
_cdf()/_pdf()/_sf()/..., the impact on performance is perhaps minimal.
This fixes threading concerns, because the (global) distribution is no longer
being changed every call.
Fixes broadcasting issues because the _cdf(x, *args, **kwargs)/... calls now
actually use *args, which has the correct size.
rv_sample, rv_hist are not the same as rv_discrete/rv_continuous; they create
separate instances so setting of member variables is necessary.
Added stats/tests/test_distributions::test_retrieving_support() to test
<dist>.support(*args, loc, scale) for all distributions.
…inuous_basic.py.

Added test_nomodify_gh9900_regression() to simulate thread interleaving.
Added test_broadcast_gh9990_regression() to validate broadcast for
out-of-support arguments, and to check support endpoints.
…h-1320.

Added checks for broadcasting in stats/tests/test_continuous_basic.py:
test_broadcast_gh7933_regression(), test_gh2002_regression(), and
test_gh1320_regression().
@pvanmulbregt
Copy link
Contributor Author

Rebase fixed the build_docs and the pypy3 target also succeeded.

@ev-br ev-br added this to the 1.3.0 milestone Apr 14, 2019
@ev-br
Copy link
Member

ev-br commented Apr 14, 2019

Am adding the 1.3.0 milestone, if only to indicate that I'm going to try to give it a final review before branching 1.3.0. The decision is RM's, of course.

def _get_support(self, *args):
"""Return the support of the (unscaled, unshifted) distribution.

*Must* be overridden by distributions which have support dependent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intent that when a new distribution is added and the distribution's support depends on a parameter value, that the value is set in the _get_support method that is overridden in the child class?

If so maybe state that explicitly here? I read this as you must override the function with a distribution specific one, but not that the code in the _get_support should override the support limits.

If not, then where is the developer to place the overridden a and b parameters (for location and scale)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the intent to ever set self.a or self.b for such a distribution. The issue is that a single instance of the class truncexpon is shared amongst all users of the truncexpon distribution, so setting it in one circumstance sets it for all. Even a frozen distribution is not actually frozen --- what are frozen are the constructor arguments so that it can use the distribution at some later time.

self.a and self.b do not need to get changed for location and scale --- handling that happens automatically in the methods of rv_continuous, such as rv_continuous.cdf etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so I understood correctly. I'd recommend adding a line here that states explicitly not to use _get_support to set the values either. I'm worried the message will not be explicit enough for future distribution creators. Maybe something like:

"The self.a and self.b limits should not be set by this method, rather this is a utility method that returns the support limits for use in other distribution class methods. "

Or something similar if you prefer a different wording.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the docstring of rv_generic and rv_continuous to be more explicit about the need for overriding _get_support, and modified the docstring of _get_support to be explicit on not permitting any modification of any members of the class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that, that will make it easier on new developers coming into the scipy.stats submodule in the future (and me when I need to remember this again).

@rlucas7
Copy link
Member

rlucas7 commented Apr 17, 2019

Was able to look at this a bit last night, here are my findings:

confirmed on python 3.6 on my box that this fixes part of gh-1320 and all of gh-7933.

Here is session for gh-7933:

Python 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 14:01:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> from scipy.stats import truncnorm
>>> truncnorm.logpdf(
...     np.array([3.0, 2.0, 1.0]),
...     a=(1.5-np.array([6.0, 5.0, 4.0]))/3.0,
...     b=np.inf,
...     loc=np.array([6.0, 5.0, 4.0]),
...     scale=3.0
... )
array([-2.44840737, -2.38781507,        -inf])

and here is the session for gh-1320

(scipy-dev) Lucass-MacBook:scipy rlucas$ python
Python 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 14:01:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy.stats as ss
>>> c=2.62
>>> c
2.62
>>> ss.genextreme.ppf(0.5,np.array([[c],[c+0.5]]))
>>> import numpy as np
>>> ss.genextreme.ppf(0.5,np.array([[c],[c+0.5]]))
array([[0.23557562],
       [0.21836694]])
>>> ss.genextreme._ppf(0.5,np.array([c,c+0.5]))
array([0.23557562, 0.21836694])
>>> ss.genextreme._mom0_sc(2,np.array([c,c+0.5]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rlucas/scipy-dev/scipy/scipy/stats/_distn_infrastructure.py", line 1660, in _mom0_sc
    args=(m,)+args)[0]
  File "/Users/rlucas/scipy-dev/scipy/scipy/integrate/quadpack.py", line 337, in quad
    flip, a, b = b < a, min(a, b), max(a, b)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> c
2.62
>>> ss.genextreme.moment(2,np.array([c,c+0.5]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rlucas/scipy-dev/scipy/scipy/stats/_distn_infrastructure.py", line 1160, in moment
    if not (self._argcheck(*args) and (scale > 0)):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> ss.genextreme.moment(5,np.array([c,c+0.5]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rlucas/scipy-dev/scipy/scipy/stats/_distn_infrastructure.py", line 1160, in moment
    if not (self._argcheck(*args) and (scale > 0)):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> 
(scipy-dev) Lucass-MacBook:scipy rlucas$ 

So far so good, will keep looking to see what I find on other addressed issues.

Copy link
Member

@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another look at this tonight. This is a nice fix to an issue. The PR addresses the issue for univariate continuous and discrete distributions. The issue does not exist with multivariate distributions.

@pvanmulbregt LGTM, hopefully @ev-br (or another scipy member) can take a look before the 1.3.0 release drop date (which is soon). I would love to have this in the release and unblock the irwinhall dev work.

if self.b < np.inf:
right = self.b
_a, _b = self._get_support(*args)
if _a > -np.inf:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 36a9e03 addressed this issue.

Yes.

@rgommers
Copy link
Member

I have no time to look in detail before branching; I suggest to wait for @ev-br's final review for now and if he doesn't get to it then merge as is, since it's a clear improvement and all open comments have been addressed.

@ev-br
Copy link
Member

ev-br commented Apr 23, 2019

Don't wait for me, there's no point holding this PR.

@rgommers rgommers merged commit 1ed24d6 into scipy:master Apr 23, 2019
@rgommers
Copy link
Member

Okay, in it goes!

Thanks @pvanmulbregt and all reviewers!

@rgommers rgommers added the maintenance Items related to regular maintenance tasks label Apr 23, 2019
@tylerjereddy
Copy link
Contributor

I went through and manually closed the issues that were claimed to be addressed in the initial PR description, and added 1.3.0 milestones to those issues. For one of them, based on the extended discussion above, it looks like this PR only provides a partial fix, so maybe another issue should be opened to split off residual problems.

@tylerjereddy
Copy link
Contributor

Added crude "release note" to end of other changes section. Feel free to improve or just remove if not really appropriate there.

@pvanmulbregt
Copy link
Contributor Author

I went through and manually closed the issues that were claimed to be addressed in the initial PR description, and added 1.3.0 milestones to those issues. For one of them, based on the extended discussion above, it looks like this PR only provides a partial fix, so maybe another issue should be opened to split off residual problems.

To be explicit (and save me having to think too hard some time in the future), I presume you are referring to gh-1320. If so, (see #9900 (comment) this PR (and its patches) indeed only fixes the first example; the second example is still broken, and the third example broke in the intervening years.

@tylerjereddy
Copy link
Contributor

@pvanmulbregt correct

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Items related to regular maintenance tasks scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants