Skip to content

Conversation

mdhaber
Copy link
Contributor

@mdhaber mdhaber commented Feb 22, 2024

Reference issue

Closes gh-9307

What does this implement/fix?

This adds native support for an axis argument to pearsonr, greatly improving the speed of vectorized calls (e.g. pairwise comparisons).

Additional information

Failing existing tests pending resolution of gh-20136. Worked around it.
Needs new tests of axis. Added.

@mdhaber mdhaber added scipy.stats enhancement A new feature or improvement labels Feb 22, 2024
r = dtype(np.sign(x[1] - x[0])*np.sign(y[1] - y[0]))
result = PearsonRResult(statistic=r, pvalue=1.0, n=n,
alternative=alternative, x=x, y=y)
r = dtype(np.sign(x[..., 1] - x[..., 0])*np.sign(y[..., 1] - y[..., 0]))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

axis=axis elsewhere may be misleading, because really we rely on axis=-1. (I suppose we could take_along_axis if desired, but it's simpler to just move the axis, and might be faster to work along last axis if we take the time to copy.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, dtype(...) seems like a weird way to initialize the array. Why not (...).astype(dtype) or np.asarray(..., dtype=dtype)

@mdhaber mdhaber changed the title WIP: stats.pearsonr: add support for axis argument W: stats.pearsonr: add support for axis argument Feb 23, 2024
@mdhaber mdhaber changed the title W: stats.pearsonr: add support for axis argument ENH: stats.pearsonr: add support for axis argument Feb 23, 2024
@mdhaber mdhaber marked this pull request as ready for review February 23, 2024 23:05
[skip actions] [skip cirrus]
def statistic(y):
statistic, _ = pearsonr(x, y, alternative=alternative)
def statistic(y, axis):
statistic, _ = pearsonr(x, y, axis=axis, alternative=alternative)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this and a few similar code blocks don't get hit by the regular test suite. Not sure how important that is, but ran a local check just now.

Copy link
Contributor Author

@mdhaber mdhaber Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're hit by test_resampling_pvalue, which is xslowed. Probably fast enough to run every time now that this is natively vectorized. Would you prefer that the xslow be removed?

@mdhaber
Copy link
Contributor Author

mdhaber commented Mar 17, 2024

@tirthasheshpatel Any chance we can get this into 1.13.0?

Sorry for the late milestone addition, but the feature has been requested many times on S.O. (in addition to the linked issue).

@mdhaber mdhaber added this to the 1.13.0 milestone Mar 17, 2024
@tylerjereddy
Copy link
Contributor

Branching is imminent, so I'll bump the milestone, apologies.

@tylerjereddy tylerjereddy modified the milestones: 1.13.0, 1.14.0 Mar 17, 2024
@@ -4771,42 +4815,52 @@ def statistic(x, y):
# dtype is the data type for the calculations. This expression ensures
# that the data type is at least 64 bit floating point. It might have
# more precision if the input is, for example, np.longdouble.
dtype = type(1.0 + x[0] + y[0])
dtype = type(1.0 + x.ravel()[0] + y.ravel()[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was in no-think-just-translate mode. We discussed correcting the calculation to respect user-provided dtype.

Copy link
Member

@tirthasheshpatel tirthasheshpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a couple of stylistic comments.

@mdhaber
Copy link
Contributor Author

mdhaber commented Mar 20, 2024

I know we discussed letting NumPy 2.0 take care of the scalar dtype issue, but I'm going to push one more commit that fixes it for the statistic. We'll let NumPy 2.0 / NEP 50 fix the scalar dtype conversion of the confidence interval.

@mdhaber
Copy link
Contributor Author

mdhaber commented Mar 21, 2024

I think this is all set. Thanks @tirthasheshpatel!

Copy link
Member

@tirthasheshpatel tirthasheshpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good now, thanks @mdhaber!

@tirthasheshpatel tirthasheshpatel merged commit ceea3e3 into scipy:main Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature request: make scipy.stats.pearsonr accept 2-D arrays
3 participants