Skip to content

Conversation

j-bowhay
Copy link
Member

@j-bowhay j-bowhay commented Jun 26, 2024

Reference issue

towards #20544

What does this implement/fix?

Additional information

The original tests seem quite weak

@j-bowhay j-bowhay added this to the 1.15.0 milestone Jun 26, 2024
@j-bowhay j-bowhay requested a review from mdhaber June 26, 2024 10:00
@github-actions github-actions bot added scipy.stats enhancement A new feature or improvement labels Jun 26, 2024
Comment on lines +2856 to +2859
if is_numpy(xp):
return xp.array(arrays, dtype=object)
else:
return arrays
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem super nice; alternative suggestions gratefully received

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we have the switch depend on whether SCIPY_ARRAY_API=True rather than whether xp is NumPy? At some point I think we'll have to emit warnings before SCIPY_ARRAY_API=1 becomes the default behavior; I don't think we will want to carry this object array return thing forward.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe we deprecated the function and create an obrien_transform or obrien instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be too restrictive to require all the samples to have the same shape? Then this would be a fairly easy deprecation to remove

Copy link
Contributor

@mdhaber mdhaber Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it seems too restrictive because the transform doesn't seem to have that restriction.

In fact, the transform on each array seems independent of the others. I was going to say that we could deprecate the use of multiple arguments, since there is no advantage to passing them in together. But it looks like then we'd always be adding an extra dimension to the result : /

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two options:

  1. Create a new function and deprecate this one. If there are other breaking changes we would like to make then this could be a good idea, otherwise this is quite a lot of churn.
  2. Add a legacy argument. Initially, this defaults to true, preserving the default behaviour but emitting a deprecation warning. Setting it to false triggers a new return-type behaviour. After two releases, the default changes from true to false, and then after another two releases, we remove it.

Copy link
Contributor

@mdhaber mdhaber Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I typically think option 1 is easier for us (not having to tiptoe around the old code) and just as easy or easier for users (e.g. add an underscore to the name rather than specifying a legacy argument and then having to remove it later). Option 1 is a lot of churn, but I tend to think the two deprecation cycles required by Option 2 is more churn than one cycle.

There are other options:

  • Add axis and warn that specifying axis is required because the default is changing from None to axis=0 to match other stats functions. Document that if the user specifies axis, the return type is a tuple rather than an array. At the end of the deprecation period, nothing needs to be removed, the return type is always a tuple, and the default axis is 0 instead of None.
  • Defer deprecation to whenever SCIPY_ARRAY_API=1 is going to become the only behavior. A lot of other things are going to need deprecation at that time. In the meantime, whether a tuple or object array is returned depends on SCIPY_ARRAY_API.
  • Leave as is
  • Skip this function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Add axis and warn that specifying axis is required because the default is changing from None to axis=0 to match other stats functions. Document that if the user specifies axis, the return type is a tuple rather than an array. At the end of the deprecation period, nothing needs to be removed, the return type is always a tuple, and the default axis is 0 instead of None.

This seems perhaps preferable as it kills multiple birds with one stone unless there are other things we would gain by rewriting.

If you are happy with this I can do this as a separate pr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. We can try that, but it would need a discourse post. I can review the rest in the meantime.

Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! As a follow-up, would you add axis and - if it's easy to do in terms of xp_mean and xp_var - add nan_policy?

if dtype is None or xp.isdtype(difference.dtype, dtype):
dtype = difference.dtype
TINY = math.sqrt(xp.finfo(dtype).eps)
if abs(difference) > TINY:
raise ValueError('Lack of convergence in obrientransform.')
Copy link
Contributor

@mdhaber mdhaber Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks unusual to check the output of a non-iterative algorithm in SciPy, especially with a fixed tolerance. Also, the tolerance is eps**0.5, which should be a relative tolerance, but it is used as an absolute tolerance. (That's how it was before these changes, so nothing to do with this PR.)


for sample in samples:
a = np.asarray(sample)
n = len(a)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I thought that the behavior of this function was like axis=None before because the mean and sum functions are used without an explicit axis argument, and I think I saw xp_size below. I didn't notice this n=len(a). This means that the function was simply wrong for n-d arrays before; it did not produce a valid result equivalent to axis=None. So I don't know if an explicit deprecation of leaving axis unspecified is really required to axis with a default of 0. : /

We could still add it and require that it be specified explicitly, but I think it would just be an excuse for changing the output type, not for adding axis with a default of 0.


# The O'Brien transform.
t = ((n - 1.5) * n * sq - 0.5 * sumsq) / ((n - 1) * (n - 2))

# Check that the mean of the transformed data is equal to the
# original variance.
var = sumsq / (n - 1)
if abs(var - np.mean(t)) > TINY:
difference = var - xp.mean(t)
# avoid recomputing `TINY` if not required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic OK, but I'm not sure if it's saving us anything but a square root, at least with NumPy?

The initial calculation of these parameters is expensive and negatively impacts import times. These objects are cached, so calling finfo() repeatedly inside your functions is not a problem.

Given the requirement to check xp.isdtype, it might not be worth the complexity. Did you compare the performance?

if dtype is None or xp.isdtype(difference.dtype, dtype):
dtype = difference.dtype
TINY = math.sqrt(xp.finfo(dtype).eps)
if abs(difference) > TINY:
Copy link
Contributor

@mdhaber mdhaber Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me this should always have been abs(difference / mean), where mean = xp.mean(t).

I don't think this check has worked as intended for the past decade, and there is no corresponding unit test, so what do you think about removing it with a comment?

return xp.array(arrays, dtype=object)
else:
return arrays
return xp.stack(arrays)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the plan is to immediately change this, but I I still fill that in the meantime the switch should be SCIPY_ARRAY_API. At least we should not return a single xp array when possible and a tuple of xp-arrays otherwise because an xp-array can't necessarily even be unpacked.

Comment on lines -6769 to +6770
reps = np.array([5, 11, 9, 3, 2, 2])
data = np.repeat(values, reps)
transformed_values = np.array([3.1828, 0.5591, 0.0344,
1.6086, 5.2817, 11.0538])
expected = np.repeat(transformed_values, reps)
reps = xp.asarray([5, 11, 9, 3, 2, 2])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep using NumPy to compute the data and expected values and convert them to xp-type only when needed?

@j-bowhay
Copy link
Member Author

Closing for now as I don't currently have the bandwidth to push this forward

@j-bowhay j-bowhay closed this Jul 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants