-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
ENH: stats.obrientransform: add array API support #21055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
if is_numpy(xp): | ||
return xp.array(arrays, dtype=object) | ||
else: | ||
return arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem super nice; alternative suggestions gratefully received
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we have the switch depend on whether SCIPY_ARRAY_API=True
rather than whether xp
is NumPy? At some point I think we'll have to emit warnings before SCIPY_ARRAY_API=1
becomes the default behavior; I don't think we will want to carry this object array return thing forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe we deprecated the function and create an obrien_transform
or obrien
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be too restrictive to require all the samples to have the same shape? Then this would be a fairly easy deprecation to remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it seems too restrictive because the transform doesn't seem to have that restriction.
In fact, the transform on each array seems independent of the others. I was going to say that we could deprecate the use of multiple arguments, since there is no advantage to passing them in together. But it looks like then we'd always be adding an extra dimension to the result : /
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see two options:
- Create a new function and deprecate this one. If there are other breaking changes we would like to make then this could be a good idea, otherwise this is quite a lot of churn.
- Add a
legacy
argument. Initially, this defaults to true, preserving the default behaviour but emitting a deprecation warning. Setting it to false triggers a new return-type behaviour. After two releases, the default changes from true to false, and then after another two releases, we remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I typically think option 1 is easier for us (not having to tiptoe around the old code) and just as easy or easier for users (e.g. add an underscore to the name rather than specifying a legacy
argument and then having to remove it later). Option 1 is a lot of churn, but I tend to think the two deprecation cycles required by Option 2 is more churn than one cycle.
There are other options:
- Add
axis
and warn that specifyingaxis
is required because the default is changing fromNone
toaxis=0
to match other stats functions. Document that if the user specifiesaxis
, the return type is a tuple rather than an array. At the end of the deprecation period, nothing needs to be removed, the return type is always a tuple, and the defaultaxis
is0
instead ofNone
. - Defer deprecation to whenever
SCIPY_ARRAY_API=1
is going to become the only behavior. A lot of other things are going to need deprecation at that time. In the meantime, whether a tuple or object array is returned depends onSCIPY_ARRAY_API
. - Leave as is
- Skip this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Add
axis
and warn that specifyingaxis
is required because the default is changing fromNone
toaxis=0
to match other stats functions. Document that if the user specifiesaxis
, the return type is a tuple rather than an array. At the end of the deprecation period, nothing needs to be removed, the return type is always a tuple, and the defaultaxis
is0
instead ofNone
.
This seems perhaps preferable as it kills multiple birds with one stone unless there are other things we would gain by rewriting.
If you are happy with this I can do this as a separate pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. We can try that, but it would need a discourse post. I can review the rest in the meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! As a follow-up, would you add axis
and - if it's easy to do in terms of xp_mean
and xp_var
- add nan_policy
?
if dtype is None or xp.isdtype(difference.dtype, dtype): | ||
dtype = difference.dtype | ||
TINY = math.sqrt(xp.finfo(dtype).eps) | ||
if abs(difference) > TINY: | ||
raise ValueError('Lack of convergence in obrientransform.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks unusual to check the output of a non-iterative algorithm in SciPy, especially with a fixed tolerance. Also, the tolerance is eps**0.5
, which should be a relative tolerance, but it is used as an absolute tolerance. (That's how it was before these changes, so nothing to do with this PR.)
|
||
for sample in samples: | ||
a = np.asarray(sample) | ||
n = len(a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I thought that the behavior of this function was like axis=None
before because the mean
and sum
functions are used without an explicit axis
argument, and I think I saw xp_size
below. I didn't notice this n=len(a)
. This means that the function was simply wrong for n-d arrays before; it did not produce a valid result equivalent to axis=None
. So I don't know if an explicit deprecation of leaving axis
unspecified is really required to axis
with a default of 0
. : /
We could still add it and require that it be specified explicitly, but I think it would just be an excuse for changing the output type, not for adding axis
with a default of 0
.
|
||
# The O'Brien transform. | ||
t = ((n - 1.5) * n * sq - 0.5 * sumsq) / ((n - 1) * (n - 2)) | ||
|
||
# Check that the mean of the transformed data is equal to the | ||
# original variance. | ||
var = sumsq / (n - 1) | ||
if abs(var - np.mean(t)) > TINY: | ||
difference = var - xp.mean(t) | ||
# avoid recomputing `TINY` if not required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic OK, but I'm not sure if it's saving us anything but a square root, at least with NumPy?
The initial calculation of these parameters is expensive and negatively impacts import times. These objects are cached, so calling finfo() repeatedly inside your functions is not a problem.
Given the requirement to check xp.isdtype
, it might not be worth the complexity. Did you compare the performance?
if dtype is None or xp.isdtype(difference.dtype, dtype): | ||
dtype = difference.dtype | ||
TINY = math.sqrt(xp.finfo(dtype).eps) | ||
if abs(difference) > TINY: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to me this should always have been abs(difference / mean)
, where mean = xp.mean(t)
.
I don't think this check has worked as intended for the past decade, and there is no corresponding unit test, so what do you think about removing it with a comment?
return xp.array(arrays, dtype=object) | ||
else: | ||
return arrays | ||
return xp.stack(arrays) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the plan is to immediately change this, but I I still fill that in the meantime the switch should be SCIPY_ARRAY_API
. At least we should not return a single xp
array when possible and a tuple of xp
-arrays otherwise because an xp
-array can't necessarily even be unpacked.
reps = np.array([5, 11, 9, 3, 2, 2]) | ||
data = np.repeat(values, reps) | ||
transformed_values = np.array([3.1828, 0.5591, 0.0344, | ||
1.6086, 5.2817, 11.0538]) | ||
expected = np.repeat(transformed_values, reps) | ||
reps = xp.asarray([5, 11, 9, 3, 2, 2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep using NumPy to compute the data and expected values and convert them to xp
-type only when needed?
Closing for now as I don't currently have the bandwidth to push this forward |
Reference issue
towards #20544
What does this implement/fix?
Additional information
The original tests seem quite weak