API: `stats._resampling`: transition to rng (SPEC 7) #21854

mdhaber · 2024-11-09T20:04:25Z

Reference issue

What does this implement/fix?

I combined the two functions in scipy.stats._resampling into one PR because they seemed like a logical unit. I'll need to do the two classes BootstrapMethod and PermutationMethod in a separate PR because their random_state attributes may require additional attention.

Additional information

I'm noticing that a lot of uses of random_state need to remain in these files because they use old distributions. Most of these are just the normal distribution, though, so it would be easy enough to come back and change these to use the new infrastructure so use of rng is more consistent (especially in documentation).

andyfaff · 2024-11-09T21:05:05Z

I'm noticing that a lot of uses of random_state need to remain in these files because they use old distributions.

Yeah, I think that has the potential to confuse a lot of people.

andyfaff · 2024-11-09T21:12:07Z

scipy/stats/_resampling.py

        def batched_perm_generator():
            indices = np.arange(n_obs_sample)
            indices = np.tile(indices, (batch, n_samples, 1))
            for k in range(0, n_permutations, batch):
                batch_actual = min(batch, n_permutations-k)
                # Don't permute in place, otherwise results depend on `batch`
-                permuted_indices = random_state.permuted(indices, axis=-1)
+                permuted_indices = rng.permuted(indices, axis=-1)


Can this section be simplified by using rng.permutation?

Maybe? Note that this section is doing different things with RandomStates and Generators for a reason. Half of this can go away when we remove support for RandomState. If there is additional simplification possible for Generators, it seems separate from this PR, right?

Update: What did you have in mind? Some things to keep in mind: We need to permute slices independently because this works with N-D arrays. We need permuted indices, not a permutation of the original data, because there is also an exhaustive permutation method. And we use a different method for RandomState than Generator because RandomState.permutation is slow.
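To make the constraints above concrete, here is a minimal sketch of generating a batch of independently permuted indices without touching the original array. The sizes, the seed, and the `data` array are hypothetical; only `np.tile` and `Generator.permuted` mirror the hunk under review, and the indexing step is an illustration rather than the code in `_resampling.py`.

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed, for illustration only
n_obs, n_samples, batch = 6, 2, 3    # hypothetical sizes

# One row of indices 0..n_obs-1 per (batch element, sample) pair
indices = np.tile(np.arange(n_obs), (batch, n_samples, 1))

# Without `out`, `permuted` returns a shuffled copy and leaves `indices` alone;
# each slice along the last axis is shuffled independently of the others.
permuted_indices = rng.permuted(indices, axis=-1)

# Apply the permuted indices to hypothetical N-D data of shape (n_samples, n_obs)
data = rng.standard_normal((n_samples, n_obs))
resampled = np.take_along_axis(data[np.newaxis], permuted_indices, axis=-1)

print(permuted_indices.shape, resampled.shape)
```

Working with indices rather than shuffled data keeps the random path compatible with a code path that enumerates permutations exhaustively, since both can hand the same kind of index array to the statistic.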

~~Both Generator and RandomState have the permutation method. I think the method for each type shuffles slices independently, and if indices is an array the shuffling is done on a copy.~~

(Also, I think Generator.permuted(indices, axis=-1) and Generator.permutation(indices, axis=-1) are somewhat equivalent. It's not clear to me why there are two Generator methods doing the same thing)

EDIT: Generator.permutation does not shuffle slices independently.

If RandomState.permutation is slow then that's probably justification for sticking with status quo.

Now I look into it further, RandomState.permutation only shuffles along the first axis.

Both Generator and RandomState have the permutation method. I think the method for each type shuffles slices independently, and if indices is an array the shuffling is done on a copy.

Rather than just thinking about it, let's refer to the documentation

and test it:

    import numpy as np
    A = np.arange(5)
    B = np.stack([A, A, A, A, A])
    np.random.RandomState().permutation(B.T)
    # array([[0, 0, 0, 0, 0],
    #        [1, 1, 1, 1, 1],
    #        [3, 3, 3, 3, 3],
    #        [4, 4, 4, 4, 4],
    #        [2, 2, 2, 2, 2]])

So no, RandomState.permutation does not shuffle slices independently.
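The same check can be run against a Generator to contrast its two methods. This is a small sketch with an arbitrary seed: `Generator.permutation` moves whole rows together, while `Generator.permuted` with an explicit `axis` shuffles each slice on its own.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
A = np.arange(5)
B = np.stack([A, A, A, A, A]).T  # every column is 0..4, so every row is constant

rows_moved = rng.permutation(B)         # shuffles whole rows along axis 0
slices_moved = rng.permuted(B, axis=0)  # shuffles each column independently

# `permutation` keeps rows intact, so each row is still a single repeated value
assert all(len(set(row)) == 1 for row in rows_moved)

# `permuted` leaves each column a permutation of 0..4, but the columns no
# longer have to agree with one another
assert np.array_equal(np.sort(slices_moved, axis=0), B)
```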

If RandomState.permutation is slow then that's probably justification for sticking with status quo.

What is the "status quo" in this context? If you mean leaving this code alone and keeping the scope of this PR to the transition from random_state to rng, I agree!

You can see in the blame that this code comes from gh-17030, which improved the performance by a factor of 25~50x relative to what it was before.

mdhaber · 2024-11-09T21:59:03Z

Yeah, I think that has the potential to confuse a lot of people.

I don't think we can change all of that until gh-21707 merges, but after that I can go through and replace uses of old distributions with new ones.

mdhaber

Thanks @andyfaff. Did you have any comments about this PR, or does it look OK?

mdhaber · 2024-11-13T04:30:21Z

scipy/stats/_resampling.py

Both Generator and RandomState have the permutation method. I think the method for each type shuffles slices independently, and if indices is an array the shuffling is done on a copy.

Rather than just thinking about it, let's refer to the documentation

and test it:

    import numpy as np
    A = np.arange(5)
    B = np.stack([A, A, A, A, A])
    np.random.RandomState().permutation(B.T)
    # array([[0, 0, 0, 0, 0],
    #        [1, 1, 1, 1, 1],
    #        [3, 3, 3, 3, 3],
    #        [4, 4, 4, 4, 4],
    #        [2, 2, 2, 2, 2]])

So no, RandomState.permutation does not shuffle slices independently.

If RandomState.permutation is slow then that's probably justification for sticking with status quo.

What is the "status quo" in this context? If you mean leaving this code alone and keeping the scope of this PR to the transition from random_state to rng, I agree!

You can see in the blame that this code comes from gh-17030, which improved the performance by a factor of 25~50x relative to what it was before.

andyfaff · 2024-11-13T04:39:18Z

scipy/stats/_resampling.py

@@ -1455,7 +1450,7 @@ def _calculate_null_both(data, statistic, n_permutations, batch,
        # can permute axis-slices independently. If this feature is
        # added in the future, batches of the desired size should be
        # generated in a single call.
-        perm_generator = (random_state.permutation(n_obs)
+        perm_generator = (rng.permutation(n_obs)


Eventually Generator.permuted should be able to do this?

You can see in the code above and in the documentation that Generator.permuted can do this now. At the time this code was merged (2021), permuted was not available in all supported versions of NumPy.
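As a rough illustration of the difference, the sketch below builds a batch of permutations both ways: one `permutation` call per row, as in the generator expression from the hunk, and a single `permuted` call over tiled indices. The sizes and seed are hypothetical, and the two approaches consume the random stream differently, so they are not expected to return the same permutations.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed
n_obs, batch = 5, 4                # hypothetical sizes

# One call per permutation, in the style of the generator expression
one_at_a_time = np.stack([rng.permutation(n_obs) for _ in range(batch)])

# A whole batch in a single call: tile the indices, then shuffle each row
tiled = np.tile(np.arange(n_obs), (batch, 1))
all_at_once = rng.permuted(tiled, axis=-1)

# Both results have one permutation of 0..n_obs-1 per row
print(one_at_a_time.shape, all_at_once.shape)
```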

mdhaber · 2024-11-13T04:56:09Z

scipy/stats/_resampling.py

You can see in the code above and in the documentation that Generator.permuted can do this now. At the time this code was merged (2021), permuted was not available in all supported versions of NumPy.

andyfaff · 2024-11-13T05:03:14Z

I think the PR is mergeable as-is.

However, I think it's also worth the time to modify the tests à la #21872, i.e.

1. Removing all occurrences of np.random.seed.
2. Removing all usages of the global RandomState that generate test data (e.g. here) and converting them to use a specific instance of an rng.
3. When 2 is carried out, the specific instance may as well be a Generator.

Perhaps it can be done in a different PR, but eventually it's going to have to be done for free-threaded compatibility (#21496).
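For reference, a test written in that style might look roughly like the sketch below. The helper name, the test name, and the seed are hypothetical and are not taken from the SciPy test suite; the point is only that the data comes from a local Generator rather than from `np.random.seed` plus the global RandomState.

```python
import numpy as np


def make_sample(rng, size=10):
    """Hypothetical helper: draw test data from an explicit Generator."""
    return rng.standard_normal(size)


def test_sample_is_reproducible():
    # A local Generator replaces np.random.seed(...) followed by
    # np.random.randn(...); no global state is read or modified.
    x = make_sample(np.random.default_rng(8675309))
    y = make_sample(np.random.default_rng(8675309))
    np.testing.assert_array_equal(x, y)


test_sample_is_reproducible()
```

Because each test owns its Generator, tests running concurrently do not share or mutate a common random state.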

mdhaber · 2024-11-13T05:34:15Z

I think the PR is mergeable as-is.

Great! I'll go ahead and do that.

However, I think it's also worth the time modifying the tests a.la. #21872, ... Perhaps it can be done in a different PR

Yes, thanks. In the interest of getting the SPEC 7 transitions done in one release cycle, I'd like to focus SPEC 7 PRs on SPEC 7. So I'll go ahead and merge this so that someone can adjust the tests without the potential to run into merge conflicts.

MAINT: stats._resampling: transition to rng

57e210a

mdhaber added the maintenance Items related to regular maintenance tasks label Nov 9, 2024

github-actions bot added the scipy.stats label Nov 9, 2024

mdhaber mentioned this pull request Nov 9, 2024

SPEC 7 Transition Tracker #21833

Closed

50 tasks

andyfaff reviewed Nov 9, 2024

mdhaber added the needs-release-note a maintainer should add a release note written by a reviewer/author to the wiki label Nov 9, 2024

lucascolley changed the title ~~MAINT: stats._resampling: transition to rng~~ API: stats._resampling: transition to rng (SPEC 7) Nov 10, 2024

lucascolley added this to the 1.15.0 milestone Nov 10, 2024

mdhaber requested a review from andyfaff November 12, 2024 18:10

mdhaber commented Nov 13, 2024

andyfaff reviewed Nov 13, 2024

mdhaber commented Nov 13, 2024

mdhaber merged commit 62e7590 into scipy:main Nov 13, 2024
38 checks passed

mdhaber removed the needs-release-note a maintainer should add a release note written by a reviewer/author to the wiki label Nov 17, 2024
