-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
BUG: Restrict rv_generic._argcheck() and its overrides from setting members. #9900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Looks good on a quick browse. Could you expand a bit: you mention this addresses several issues, what is their status with this change? Are they fixed? In that case, it'd be helpful to add specific regression tests, with references to GH issues. |
To be explicit, this PR fixes:
The most recent pushes added named tests for the 4 Issues. An errorful threading situation is simulated in one of these tests. It fails prior to gh-9990, passes with it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to digest all the code in this PR, will take some time. In the meanwhile, inequality in my comment was broken in irwinhall
pr so pointing that out.
if self.b < np.inf: | ||
right = self.b | ||
_a, _b = self._get_support(*args) | ||
if _a > -np.inf: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here if arguments are array_like
the truthyness of the inequality is ambiguous, similar for _b
just beneath.
For example, in case of irwinhall
(or even binom
support and n
parameter is a 2 entry array_like
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Previously self.a
and self.b
were fixed numbers. Now they are dependent on*args
, which might not be specifying a single distribution, but a family of them. That will also impact the determination of left
and right
in the block just below this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit 36a9e03 addressed this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit 36a9e03 addressed this issue.
Yes.
The test failures appear to be unrelated.
|
build_docs should go away if you rebase it |
…embers. Address incorrect probabilities, broadcasting issues and thread-safety related to distributions setting member variables inside _argcheck(). Addresses scipygh-9896, scipygh-2002, scipygh-1320, scipygh-7933. Add method _get_support(*args) to rv_generic (the default behaviour is to return sef.a, self.b.) Distributions which have support depending on parameters need to override this method. Add overrides for genextreme, kappa4, reciprocal, truncexpon, truncnorm, binom, bernoulli, hypergeom, boltzmann, randint, rv_sample. Move code which sets members variables from _argcheck() into _get_support() and discard any code which actually sets member data. As most code previously called self._argcheck() anyway before calling _cdf()/_pdf()/_sf()/..., the impact on performance is perhaps minimal. This fixes threading concerns, because the (global) distribution is no longer being changed every call. Fixes broadcasting issues because the _cdf(x, *args, **kwargs)/... calls now actually use *args, which has the correct size.
rv_sample, rv_hist are not the same as rv_discrete/rv_continuous; they create separate instances so setting of member variables is necessary.
Added stats/tests/test_distributions::test_retrieving_support() to test <dist>.support(*args, loc, scale) for all distributions.
…inuous_basic.py. Added test_nomodify_gh9900_regression() to simulate thread interleaving. Added test_broadcast_gh9990_regression() to validate broadcast for out-of-support arguments, and to check support endpoints.
…h-1320. Added checks for broadcasting in stats/tests/test_continuous_basic.py: test_broadcast_gh7933_regression(), test_gh2002_regression(), and test_gh1320_regression().
70d6e04
to
36a9e03
Compare
Rebase fixed the build_docs and the pypy3 target also succeeded. |
Am adding the 1.3.0 milestone, if only to indicate that I'm going to try to give it a final review before branching 1.3.0. The decision is RM's, of course. |
def _get_support(self, *args): | ||
"""Return the support of the (unscaled, unshifted) distribution. | ||
|
||
*Must* be overridden by distributions which have support dependent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the intent that when a new distribution is added and the distribution's support depends on a parameter value, that the value is set in the _get_support
method that is overridden in the child class?
If so maybe state that explicitly here? I read this as you must override the function with a distribution specific one, but not that the code in the _get_support
should override the support limits.
If not, then where is the developer to place the overridden a
and b
parameters (for location and scale)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not the intent to ever set self.a
or self.b
for such a distribution. The issue is that a single instance of the class truncexpon
is shared amongst all users of the truncexpon
distribution, so setting it in one circumstance sets it for all. Even a frozen distribution is not actually frozen --- what are frozen are the constructor arguments so that it can use the distribution at some later time.
self.a
and self.b
do not need to get changed for location and scale --- handling that happens automatically in the methods of rv_continuous
, such as rv_continuous.cdf
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so I understood correctly. I'd recommend adding a line here that states explicitly not to use _get_support
to set the values either. I'm worried the message will not be explicit enough for future distribution creators. Maybe something like:
"The self.a
and self.b
limits should not be set by this method, rather this is a utility method that returns the support limits for use in other distribution class methods. "
Or something similar if you prefer a different wording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified the docstring of rv_generic
and rv_continuous
to be more explicit about the need for overriding _get_support
, and modified the docstring of _get_support
to be explicit on not permitting any modification of any members of the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for that, that will make it easier on new developers coming into the scipy.stats
submodule in the future (and me when I need to remember this again).
Was able to look at this a bit last night, here are my findings: confirmed on python 3.6 on my box that this fixes part of gh-1320 and all of gh-7933. Here is session for gh-7933:
and here is the session for gh-1320
So far so good, will keep looking to see what I find on other addressed issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took another look at this tonight. This is a nice fix to an issue. The PR addresses the issue for univariate continuous and discrete distributions. The issue does not exist with multivariate distributions.
@pvanmulbregt LGTM, hopefully @ev-br (or another scipy member) can take a look before the 1.3.0 release drop date (which is soon). I would love to have this in the release and unblock the irwinhall
dev work.
if self.b < np.inf: | ||
right = self.b | ||
_a, _b = self._get_support(*args) | ||
if _a > -np.inf: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit 36a9e03 addressed this issue.
Yes.
I have no time to look in detail before branching; I suggest to wait for @ev-br's final review for now and if he doesn't get to it then merge as is, since it's a clear improvement and all open comments have been addressed. |
Don't wait for me, there's no point holding this PR. |
Okay, in it goes! Thanks @pvanmulbregt and all reviewers! |
I went through and manually closed the issues that were claimed to be addressed in the initial PR description, and added 1.3.0 milestones to those issues. For one of them, based on the extended discussion above, it looks like this PR only provides a partial fix, so maybe another issue should be opened to split off residual problems. |
Added crude "release note" to end of other changes section. Feel free to improve or just remove if not really appropriate there. |
To be explicit (and save me having to think too hard some time in the future), I presume you are referring to gh-1320. If so, (see #9900 (comment) this PR (and its patches) indeed only fixes the first example; the second example is still broken, and the third example broke in the intervening years. |
@pvanmulbregt correct |
Address incorrect probabilities, broadcasting issues and thread-safety
related to distributions setting member variables inside _argcheck().
Addresses gh-9896, gh-2002, gh-1320, gh-7933.
Add method _get_support(*args) to rv_generic (the default behaviour is to
return sef.a, self.b.) Distributions which have support depending on parameters
need to override this method.
Add overrides for genextreme, kappa4, reciprocal, truncexpon, truncnorm,
binom, bernoulli, hypergeom, boltzmann, randint, rv_sample.
Move code which sets members variables from _argcheck() into _get_support()
and discard any code which actually sets member data.
As most code previously called self._argcheck() anyway before calling
_cdf()/_pdf()/_sf()/..., the impact on performance is perhaps minimal.
This fixes threading concerns, because the (global) distribution is no longer
being changed every call.
Fixes broadcasting issues because the _cdf(x, *args, **kwargs)/... calls now
actually use *args, which has the correct size.