-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
API: stats._resampling
: transition to rng (SPEC 7)
#21854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Yeah, I think that has the potential to confuse a lot of people. |
def batched_perm_generator(): | ||
indices = np.arange(n_obs_sample) | ||
indices = np.tile(indices, (batch, n_samples, 1)) | ||
for k in range(0, n_permutations, batch): | ||
batch_actual = min(batch, n_permutations-k) | ||
# Don't permute in place, otherwise results depend on `batch` | ||
permuted_indices = random_state.permuted(indices, axis=-1) | ||
permuted_indices = rng.permuted(indices, axis=-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this section be simplified by using rng.permutation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? Note that this section is doing different things with RandomState
s and Generator
s for a reason. Half of this can go away when we remove support for RandomState
. If there is additional simplification possible for Generator
s, it seems separate from this PR, right?
Update: What did you have in mind? Some things to keep in mind: We need to permute slices independently because this works with N-D arrays. We need permuted indices, not a permutation of the original data, because there is also an exhaustive permutation method. And we use a different method for RandomState
than Generator
because RandomState.permutation
is slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both Generator
and RandomState
have the permutation
method. I think the method for each type shuffles slices independently, and if indices
is an array the shuffling is done on a copy.
(Also, I think Generator.permute(indices, axis=-1)
and Generator.permutation(indices, axis=-1)
are somewhat equivalent. It's not clear to me why there are two Generator methods doing the same thing)
EDIT: Generator.permutation
does not shuffle slices independently.
If RandomState.permutation
is slow then that's probably justification for sticking with status quo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I look into it further, RandomState.permutation
only shuffles along the first axis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both
Generator
andRandomState
have thepermutation
method. I think the method for each type shuffles slices independently, and if indices is an array the shuffling is done on a copy.
Rather than just thinking about it, let's refer to the documentation
and test it:
import numpy as np
A = np.arange(5)
B = np.stack([A, A, A, A, A])
np.random.RandomState().permutation(B.T)
# array([[0, 0, 0, 0, 0],
# [1, 1, 1, 1, 1],
# [3, 3, 3, 3, 3],
# [4, 4, 4, 4, 4],
# [2, 2, 2, 2, 2]])
So no, RandomState.permutation
does not shuffle slices independently.
If
RandomState.permutation
is slow then that's probably justification for sticking with status quo.
What is the "status quo" in this context? If you mean leaving this code alone and keeping this scope of this PR to the transition from random_state
to rng
, I agree!
You can see in the blame that this code comes from gh-17030, which improved the performance by a factor of 25~50x relative to what it was before.
I don't think we can change all of that until gh-21707 merges, but after that I can go throug and replace uses of old distributions with new. |
stats._resampling
: transition to rng (SPEC 7)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andyfaff. Did you have any comments about this PR, or does it look OK?
def batched_perm_generator(): | ||
indices = np.arange(n_obs_sample) | ||
indices = np.tile(indices, (batch, n_samples, 1)) | ||
for k in range(0, n_permutations, batch): | ||
batch_actual = min(batch, n_permutations-k) | ||
# Don't permute in place, otherwise results depend on `batch` | ||
permuted_indices = random_state.permuted(indices, axis=-1) | ||
permuted_indices = rng.permuted(indices, axis=-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both
Generator
andRandomState
have thepermutation
method. I think the method for each type shuffles slices independently, and if indices is an array the shuffling is done on a copy.
Rather than just thinking about it, let's refer to the documentation
and test it:
import numpy as np
A = np.arange(5)
B = np.stack([A, A, A, A, A])
np.random.RandomState().permutation(B.T)
# array([[0, 0, 0, 0, 0],
# [1, 1, 1, 1, 1],
# [3, 3, 3, 3, 3],
# [4, 4, 4, 4, 4],
# [2, 2, 2, 2, 2]])
So no, RandomState.permutation
does not shuffle slices independently.
If
RandomState.permutation
is slow then that's probably justification for sticking with status quo.
What is the "status quo" in this context? If you mean leaving this code alone and keeping this scope of this PR to the transition from random_state
to rng
, I agree!
You can see in the blame that this code comes from gh-17030, which improved the performance by a factor of 25~50x relative to what it was before.
@@ -1455,7 +1450,7 @@ def _calculate_null_both(data, statistic, n_permutations, batch, | |||
# can permute axis-slices independently. If this feature is | |||
# added in the future, batches of the desired size should be | |||
# generated in a single call. | |||
perm_generator = (random_state.permutation(n_obs) | |||
perm_generator = (rng.permutation(n_obs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually Generator.permuted
should be able to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see in the code above and in the documentation that Generator.permuted
can do this now. At the time this code was merged (2021), permuted
was not available in all supported versions of NumPy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andyfaff. Did you have any comments about this PR, or did it look good?
@@ -1455,7 +1450,7 @@ def _calculate_null_both(data, statistic, n_permutations, batch, | |||
# can permute axis-slices independently. If this feature is | |||
# added in the future, batches of the desired size should be | |||
# generated in a single call. | |||
perm_generator = (random_state.permutation(n_obs) | |||
perm_generator = (rng.permutation(n_obs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see in the code above and in the documentation that Generator.permuted
can do this now. At the time this code was merged (2021), permuted
was not available in all supported versions of NumPy.
I think the PR is mergeable as-is. However, I think it's also worth the time modifying the tests a.la. #21872, i.e.
Perhaps it can be done in a different PR, but eventually it's going to have to be done for free-threaded compatibility (#21496). |
Great! I'll go ahead and do that.
Yes, thanks. In the interest of getting the SPEC 7 transitions done in one release cycle, I'd like to focus SPEC 7 PRs on SPEC 7. So I'll go ahead and merge this so that someone can adjust the tests without the potential to run into merge conflicts. |
Reference issue
Toward gh-21833
What does this implement/fix?
I combined the two functions in
scipy.stats._resampling
into one PR because it seemed like a logical unit. I'll need to do the two classesBootstrapMethod
andPermutationMethod
in a separate PR because they have attributesrandom_state
that may require additional attention.Additional information
I'm noticing that a lot of uses of
random_state
need to remain in these files because they use old distributions. Most of these are just the normal distribution, though, so it would be easy enough to come back and change these to use the new infrastructure so use ofrng
is more consistent (especially in documentation).