ENH: stats.obrientransform: add array API support #21055

j-bowhay · 2024-06-26T10:00:27Z

Reference issue

towards #20544

What does this implement/fix?

Additional information

The original tests seem quite weak

scipy/stats/tests/test_stats.py

j-bowhay · 2024-06-26T10:07:17Z

scipy/stats/_stats_py.py

+                if is_numpy(xp):
+                    return xp.array(arrays, dtype=object)
+                else:
+                    return arrays


This doesn't seem super nice; alternative suggestions gratefully received

Maybe we have the switch depend on whether SCIPY_ARRAY_API=True rather than whether xp is NumPy? At some point I think we'll have to emit warnings before SCIPY_ARRAY_API=1 becomes the default behavior; I don't think we will want to carry this object array return thing forward.

Or maybe we deprecate the function and create an obrien_transform or obrien instead.

Would it be too restrictive to require all the samples to have the same shape? Then this would be a fairly easy deprecation to remove

Yes, it seems too restrictive because the transform doesn't seem to have that restriction.

In fact, the transform on each array seems independent of the others. I was going to say that we could deprecate the use of multiple arguments, since there is no advantage to passing them in together. But it looks like then we'd always be adding an extra dimension to the result : /

I see two options:

1. Create a new function and deprecate this one. If there are other breaking changes we would like to make then this could be a good idea, otherwise this is quite a lot of churn.

2. Add a legacy argument. Initially, this defaults to true, preserving the default behaviour but emitting a deprecation warning. Setting it to false triggers a new return-type behaviour. After two releases, the default changes from true to false, and then after another two releases, we remove it.

I typically think option 1 is easier for us (not having to tiptoe around the old code) and just as easy or easier for users (e.g. add an underscore to the name rather than specifying a legacy argument and then having to remove it later). Option 1 is a lot of churn, but I tend to think the two deprecation cycles required by Option 2 are more churn than one cycle.
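For concreteness, here is a rough sketch of what the first stage of the legacy-argument route might look like. Everything in it is hypothetical: `center_samples` is a toy stand-in for a multi-sample transform, a plain list stands in for the old array return, and the warning text is made up. It only illustrates the mechanics, not what SciPy would actually ship.

```python
import warnings


def center_samples(*samples, legacy=True):
    """Hypothetical multi-sample transform used to illustrate a `legacy` flag."""
    # Toy transform: subtract each sample's own mean from its values.
    results = [[x - sum(s) / len(s) for x in s] for s in samples]
    if legacy:
        # Stage 1 of the cycle: old return type is kept, but a warning is emitted.
        warnings.warn(
            "The list return type is deprecated; pass legacy=False to get a tuple.",
            DeprecationWarning,
            stacklevel=2,
        )
        return results  # old container type (stand-in for the array return)
    return tuple(results)  # new container type


new = center_samples([1.0, 3.0], [2.0, 4.0, 6.0], legacy=False)
```

Users who opt in early pass `legacy=False` and later have to delete the argument again, which is the second round of churn mentioned above.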

There are other options:

Add axis and warn that specifying axis is required because the default is changing from None to axis=0 to match other stats functions. Document that if the user specifies axis, the return type is a tuple rather than an array. At the end of the deprecation period, nothing needs to be removed, the return type is always a tuple, and the default axis is 0 instead of None.

Defer deprecation to whenever SCIPY_ARRAY_API=1 is going to become the only behavior. A lot of other things are going to need deprecation at that time. In the meantime, whether a tuple or object array is returned depends on SCIPY_ARRAY_API.

Leave as is

Skip this function

Add axis and warn that specifying axis is required because the default is changing from None to axis=0 to match other stats functions. Document that if the user specifies axis, the return type is a tuple rather than an array. At the end of the deprecation period, nothing needs to be removed, the return type is always a tuple, and the default axis is 0 instead of None.

This seems perhaps preferable as it kills multiple birds with one stone unless there are other things we would gain by rewriting.

If you are happy with this I can do this as a separate pr.

Ok. We can try that, but it would need a discourse post. I can review the rest in the meantime.

scipy/stats/_stats_py.py

[skip ci]

mdhaber

Thanks for this! As a follow-up, would you add axis and - if it's easy to do in terms of xp_mean and xp_var - add nan_policy?

mdhaber · 2024-06-26T14:17:27Z

scipy/stats/_stats_py.py

+        if dtype is None or xp.isdtype(difference.dtype, dtype):
+            dtype = difference.dtype
+            TINY = math.sqrt(xp.finfo(dtype).eps)
+        if abs(difference) > TINY:
            raise ValueError('Lack of convergence in obrientransform.')


It looks unusual to check the output of a non-iterative algorithm in SciPy, especially with a fixed tolerance. Also, the tolerance is eps**0.5, which should be a relative tolerance, but it is used as an absolute tolerance. (That's how it was before these changes, so nothing to do with this PR.)

mdhaber · 2024-06-27T18:44:15Z

scipy/stats/_stats_py.py


    for sample in samples:
-        a = np.asarray(sample)
-        n = len(a)


Hmm. I thought that the behavior of this function was like axis=None before because the mean and sum functions are used without an explicit axis argument, and I think I saw xp_size below. I didn't notice this n=len(a). This means that the function was simply wrong for n-d arrays before; it did not produce a valid result equivalent to axis=None. So I don't know if an explicit deprecation of leaving axis unspecified is really required to add axis with a default of 0. : /

We could still add it and require that it be specified explicitly, but I think it would just be an excuse for changing the output type, not for adding axis with a default of 0.
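A small illustration of the mismatch described above, using plain NumPy and a made-up 4x3 input. It is only a sketch of the arithmetic, not the SciPy code: `len(a)` counts rows, while `mean` and `sum` without an axis reduce over every element, so a variance built from `len(a)` does not match the axis=None sample variance.

```python
import numpy as np

# Made-up 2-D input: 4 rows, 3 columns, 12 elements in total.
a = np.arange(12.0).reshape(4, 3)

n_len = len(a)   # length of the first axis only
n_size = a.size  # total number of elements

mu = np.mean(a)             # no axis given: reduces over all 12 elements
sumsq = np.sum((a - mu) ** 2)

var_from_len = sumsq / (n_len - 1)    # what n = len(a) would produce
var_from_size = sumsq / (n_size - 1)  # consistent with axis=None
reference = np.var(a, ddof=1)         # sample variance over all elements
```

Only `var_from_size` agrees with `reference` here; for 1-D input the two counts coincide, which is presumably why this went unnoticed.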

mdhaber · 2024-06-27T18:48:57Z

scipy/stats/_stats_py.py


        # The O'Brien transform.
        t = ((n - 1.5) * n * sq - 0.5 * sumsq) / ((n - 1) * (n - 2))

        # Check that the mean of the transformed data is equal to the
        # original variance.
        var = sumsq / (n - 1)
-        if abs(var - np.mean(t)) > TINY:
+        difference = var - xp.mean(t)
+        # avoid recomputing `TINY` if not required


The logic is OK, but I'm not sure if it's saving us anything but a square root, at least with NumPy?

The initial calculation of these parameters is expensive and negatively impacts import times. These objects are cached, so calling finfo() repeatedly inside your functions is not a problem.

Given the requirement to check xp.isdtype, it might not be worth the complexity. Did you compare the performance?
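For readers who want to see the invariant that the diff hunk above is checking, here is a pure-Python sketch for a single 1-D sample. The function name `obrien_1d` and the sample values are made up for illustration; the formula is copied from the hunk, and the sketch assumes more than two observations so the denominator is nonzero.

```python
def obrien_1d(sample):
    """Illustrative O'Brien transform of one 1-D sample (needs len(sample) > 2)."""
    n = len(sample)
    mu = sum(sample) / n
    sq = [(x - mu) ** 2 for x in sample]
    sumsq = sum(sq)
    # Same formula as in the diff hunk above.
    t = [((n - 1.5) * n * s - 0.5 * sumsq) / ((n - 1) * (n - 2)) for s in sq]
    var = sumsq / (n - 1)  # sample variance of the original data
    return t, var


t, var = obrien_1d([1.0, 2.0, 3.0, 4.0, 10.0])
mean_t = sum(t) / len(t)  # should match `var` up to rounding
```

Algebraically the mean of `t` equals `var` exactly, so any difference seen in practice comes from floating-point rounding.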

mdhaber · 2024-06-27T18:50:28Z

scipy/stats/_stats_py.py

+        if dtype is None or xp.isdtype(difference.dtype, dtype):
+            dtype = difference.dtype
+            TINY = math.sqrt(xp.finfo(dtype).eps)
+        if abs(difference) > TINY:


Seems to me this should always have been abs(difference / mean), where mean = xp.mean(t).

I don't think this check has worked as intended for the past decade, and there is no corresponding unit test, so what do you think about removing it with a comment?
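To make the absolute-versus-relative point concrete, here is a hedged sketch with two hand-picked scenarios. The helper names `passes_absolute` and `passes_relative` are hypothetical, the numbers are invented, and the relative form assumes the mean of the transformed data is nonzero.

```python
import math
import sys

TINY = math.sqrt(sys.float_info.epsilon)  # roughly 1.5e-8 for float64


def passes_absolute(var, mean_t):
    # The existing style of check: difference compared to TINY directly.
    return abs(var - mean_t) <= TINY


def passes_relative(var, mean_t):
    # The suggested style: difference scaled by the mean (assumes mean_t != 0).
    return abs((var - mean_t) / mean_t) <= TINY


# Tiny-scale data: the mean is off by a factor of two, yet the absolute check passes.
small = (1e-20, 2e-20)
# Large-scale data: the mean is off by one part in 1e12, yet the absolute check fails.
large = (1e16, 1e16 * (1 + 1e-12))

results = {
    "small": (passes_absolute(*small), passes_relative(*small)),
    "large": (passes_absolute(*large), passes_relative(*large)),
}
```

In both scenarios the two checks disagree, and the relative one gives the answer you would expect from the size of the error.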

mdhaber · 2024-06-27T19:03:03Z

scipy/stats/_stats_py.py

+                    return xp.array(arrays, dtype=object)
+                else:
+                    return arrays
+    return xp.stack(arrays)


I know the plan is to immediately change this, but I still feel that in the meantime the switch should be SCIPY_ARRAY_API. At least we should not return a single xp array when possible and a tuple of xp-arrays otherwise because an xp-array can't necessarily even be unpacked.

mdhaber · 2024-06-27T19:06:49Z

scipy/stats/tests/test_stats.py

-    reps = np.array([5, 11, 9, 3, 2, 2])
-    data = np.repeat(values, reps)
-    transformed_values = np.array([3.1828, 0.5591, 0.0344,
-                                   1.6086, 5.2817, 11.0538])
-    expected = np.repeat(transformed_values, reps)
+    reps = xp.asarray([5, 11, 9, 3, 2, 2])


Can we keep using NumPy to compute the data and expected values and convert them to xp-type only when needed?
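Something along these lines is what the suggestion amounts to, as a sketch only. Here `xp` is simply bound to NumPy as a stand-in for whatever namespace the test fixture would supply, and only the `reps` and `transformed_values` arrays from the hunk above are used.

```python
import numpy as np

xp = np  # stand-in; in the test suite this would be the backend under test

# Build the reference data with NumPy, independent of the backend.
reps = np.array([5, 11, 9, 3, 2, 2])
transformed_values = np.array([3.1828, 0.5591, 0.0344,
                               1.6086, 5.2817, 11.0538])
expected_np = np.repeat(transformed_values, reps)

# Convert to the backend's array type only at the point of comparison.
expected = xp.asarray(expected_np)
```

That keeps the setup identical across backends and limits the xp-specific code to the final conversion.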

j-bowhay · 2024-07-18T14:01:23Z

Closing for now as I don't currently have the bandwidth to push this forward

j-bowhay added 2 commits June 26, 2024 09:36

ENH: stats.obrientransform: add array API support

8461557

ENH: stats.obrientransform: add array API support

1ca1cd9

j-bowhay added this to the 1.15.0 milestone Jun 26, 2024

j-bowhay requested a review from mdhaber June 26, 2024 10:00

github-actions bot added scipy.stats enhancement A new feature or improvement labels Jun 26, 2024


j-bowhay mentioned this pull request Jun 26, 2024

ENH: stats: add array API-support #20544

Open


Update scipy/stats/tests/test_stats.py

4dd50d8

[skip ci]

mdhaber reviewed Jun 26, 2024


more careful choice of TINY

46897a4

mdhaber reviewed Jun 26, 2024


mdhaber reviewed Jun 27, 2024


j-bowhay closed this Jul 18, 2024
