Skip to content

Conversation

mdhaber
Copy link
Contributor

@mdhaber mdhaber commented Nov 9, 2024

Reference issue

Toward gh-21833

What does this implement/fix?

I combined the two functions in scipy.stats._resampling into one PR because it seemed like a logical unit. I'll need to do the two classes BootstrapMethod and PermutationMethod in a separate PR because they have attributes random_state that may require additional attention.

Additional information

I'm noticing that a lot of uses of random_state need to remain in these files because they use old distributions. Most of these are just the normal distribution, though, so it would be easy enough to come back and change these to use the new infrastructure so use of rng is more consistent (especially in documentation).

@mdhaber mdhaber added the maintenance Items related to regular maintenance tasks label Nov 9, 2024
@mdhaber mdhaber mentioned this pull request Nov 9, 2024
50 tasks
@andyfaff
Copy link
Contributor

andyfaff commented Nov 9, 2024

I'm noticing that a lot of uses of random_state need to remain in these files because they use old distributions.

Yeah, I think that has the potential to confuse a lot of people.

def batched_perm_generator():
indices = np.arange(n_obs_sample)
indices = np.tile(indices, (batch, n_samples, 1))
for k in range(0, n_permutations, batch):
batch_actual = min(batch, n_permutations-k)
# Don't permute in place, otherwise results depend on `batch`
permuted_indices = random_state.permuted(indices, axis=-1)
permuted_indices = rng.permuted(indices, axis=-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this section be simplified by using rng.permutation?

Copy link
Contributor Author

@mdhaber mdhaber Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? Note that this section is doing different things with RandomStates and Generators for a reason. Half of this can go away when we remove support for RandomState. If there is additional simplification possible for Generators, it seems separate from this PR, right?

Update: What did you have in mind? Some things to keep in mind: We need to permute slices independently because this works with N-D arrays. We need permuted indices, not a permutation of the original data, because there is also an exhaustive permutation method. And we use a different method for RandomState than Generator because RandomState.permutation is slow.

Copy link
Contributor

@andyfaff andyfaff Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Generator and RandomState have the permutation method. I think the method for each type shuffles slices independently, and if indices is an array the shuffling is done on a copy.

(Also, I think Generator.permute(indices, axis=-1) and Generator.permutation(indices, axis=-1) are somewhat equivalent. It's not clear to me why there are two Generator methods doing the same thing)

EDIT: Generator.permutation does not shuffle slices independently.

If RandomState.permutation is slow then that's probably justification for sticking with status quo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I look into it further, RandomState.permutation only shuffles along the first axis.

Copy link
Contributor Author

@mdhaber mdhaber Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Generator and RandomState have the permutation method. I think the method for each type shuffles slices independently, and if indices is an array the shuffling is done on a copy.

Rather than just thinking about it, let's refer to the documentation

image

and test it:

import numpy as np
A = np.arange(5)
B = np.stack([A, A, A, A, A])
np.random.RandomState().permutation(B.T)
# array([[0, 0, 0, 0, 0],
#        [1, 1, 1, 1, 1],
#        [3, 3, 3, 3, 3],
#        [4, 4, 4, 4, 4],
#        [2, 2, 2, 2, 2]])

So no, RandomState.permutation does not shuffle slices independently.

If RandomState.permutation is slow then that's probably justification for sticking with status quo.

What is the "status quo" in this context? If you mean leaving this code alone and keeping this scope of this PR to the transition from random_state to rng, I agree!

You can see in the blame that this code comes from gh-17030, which improved the performance by a factor of 25~50x relative to what it was before.

@mdhaber
Copy link
Contributor Author

mdhaber commented Nov 9, 2024

Yeah, I think that has the potential to confuse a lot of people.

I don't think we can change all of that until gh-21707 merges, but after that I can go throug and replace uses of old distributions with new.

@mdhaber mdhaber added the needs-release-note a maintainer should add a release note written by a reviewer/author to the wiki label Nov 9, 2024
@lucascolley lucascolley changed the title MAINT: stats._resampling: transition to rng API: stats._resampling: transition to rng (SPEC 7) Nov 10, 2024
@lucascolley lucascolley added this to the 1.15.0 milestone Nov 10, 2024
@mdhaber mdhaber requested a review from andyfaff November 12, 2024 18:10
Copy link
Contributor Author

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andyfaff. Did you have any comments about this PR, or does it look OK?

def batched_perm_generator():
indices = np.arange(n_obs_sample)
indices = np.tile(indices, (batch, n_samples, 1))
for k in range(0, n_permutations, batch):
batch_actual = min(batch, n_permutations-k)
# Don't permute in place, otherwise results depend on `batch`
permuted_indices = random_state.permuted(indices, axis=-1)
permuted_indices = rng.permuted(indices, axis=-1)
Copy link
Contributor Author

@mdhaber mdhaber Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Generator and RandomState have the permutation method. I think the method for each type shuffles slices independently, and if indices is an array the shuffling is done on a copy.

Rather than just thinking about it, let's refer to the documentation

image

and test it:

import numpy as np
A = np.arange(5)
B = np.stack([A, A, A, A, A])
np.random.RandomState().permutation(B.T)
# array([[0, 0, 0, 0, 0],
#        [1, 1, 1, 1, 1],
#        [3, 3, 3, 3, 3],
#        [4, 4, 4, 4, 4],
#        [2, 2, 2, 2, 2]])

So no, RandomState.permutation does not shuffle slices independently.

If RandomState.permutation is slow then that's probably justification for sticking with status quo.

What is the "status quo" in this context? If you mean leaving this code alone and keeping this scope of this PR to the transition from random_state to rng, I agree!

You can see in the blame that this code comes from gh-17030, which improved the performance by a factor of 25~50x relative to what it was before.

@@ -1455,7 +1450,7 @@ def _calculate_null_both(data, statistic, n_permutations, batch,
# can permute axis-slices independently. If this feature is
# added in the future, batches of the desired size should be
# generated in a single call.
perm_generator = (random_state.permutation(n_obs)
perm_generator = (rng.permutation(n_obs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually Generator.permuted should be able to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see in the code above and in the documentation that Generator.permuted can do this now. At the time this code was merged (2021), permuted was not available in all supported versions of NumPy.

Copy link
Contributor Author

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andyfaff. Did you have any comments about this PR, or did it look good?

@@ -1455,7 +1450,7 @@ def _calculate_null_both(data, statistic, n_permutations, batch,
# can permute axis-slices independently. If this feature is
# added in the future, batches of the desired size should be
# generated in a single call.
perm_generator = (random_state.permutation(n_obs)
perm_generator = (rng.permutation(n_obs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see in the code above and in the documentation that Generator.permuted can do this now. At the time this code was merged (2021), permuted was not available in all supported versions of NumPy.

@andyfaff
Copy link
Contributor

I think the PR is mergeable as-is.

However, I think it's also worth the time modifying the tests a.la. #21872, i.e.

  1. Removing all occurrences of np.random.seed
  2. Removing all usages of the global RandomState that generate test data (e.g. here) and converting them to use a specific instance of a rng.
  3. When 2 is carried out, the specific instance may as well be a Generator.

Perhaps it can be done in a different PR, but eventually it's going to have to be done for free-threaded compatibility (#21496).

@mdhaber
Copy link
Contributor Author

mdhaber commented Nov 13, 2024

I think the PR is mergeable as-is.

Great! I'll go ahead and do that.

However, I think it's also worth the time modifying the tests a.la. #21872, ... Perhaps it can be done in a different PR

Yes, thanks. In the interest of getting the SPEC 7 transitions done in one release cycle, I'd like to focus SPEC 7 PRs on SPEC 7. So I'll go ahead and merge this so that someone can adjust the tests without the potential to run into merge conflicts.

@mdhaber mdhaber merged commit 62e7590 into scipy:main Nov 13, 2024
38 checks passed
@mdhaber mdhaber removed the needs-release-note a maintainer should add a release note written by a reviewer/author to the wiki label Nov 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Items related to regular maintenance tasks scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants