
Conversation

mortonjt
Contributor

@mortonjt mortonjt commented May 7, 2015

This is a new PR continuing from #4440

@argriffing argriffing added the scipy.stats and enhancement (A new feature or improvement) labels May 7, 2015
@rgommers
Member

rgommers commented May 7, 2015

@mortonjt it looks like you lost something along the way. There are missing blank lines in the docstring that were fixed in the other PR, and I'm not sure you have the fix for unequal-length inputs here.

Do you have your other branch still with the fixes you made on top of my branch? If you point me to it I'll try to clean things up.

@mortonjt
Contributor Author

mortonjt commented May 7, 2015

I added the whitespace back in - I didn't notice those changes getting overwritten when I merged your branch.

Perhaps I'm not completely understanding the unequal lengths problem. In the last PR, you pointed out the following scenario

>>> from scipy import stats
>>> import numpy as np
>>> np.random.seed(0)
>>> rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
>>> rvs5 = stats.norm.rvs(loc=8, scale=20, size=100)
>>> stats.ttest_ind(rvs1.reshape((100, 5)), rvs5, permutations=1000)
Ttest_indResult(statistic=array([ 0.01273325,  0.39392577,  0.20826074,  0.05052847,  1.1114823 ]), pvalue=array([ 0.98985343,  0.69405975,  0.83523946,  0.9597522 ,  0.26770861]))

>>> stats.ttest_ind(rvs1.reshape((100, 5)), rvs5, permutations=1000)
Ttest_indResult(statistic=array([ 0.01273325,  0.39392577,  0.20826074,  0.05052847,  1.1114823 ]), pvalue=array([ 0.98801199,  0.69430569,  0.84015984,  0.97002997,  0.27972028]))

Now, in the above scenario, it makes sense to tile the second argument rvs5 5 times so that each column of rvs1 can be tested against rvs5, correct? This is what the new solution does.
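A quick numpy sketch of what "tiling" means here (the pooled_t helper is hypothetical, written to mirror the pooled-variance formula in scipy's _equal_var_ttest_denom): replicating the 1-D sample into one column per column of the first argument gives the same per-column statistics as broadcasting it.

```python
import numpy as np

def pooled_t(a, b, axis=0):
    # Classic equal-variance two-sample t statistic, following the
    # same pooled-variance formula as _equal_var_ttest_denom.
    n1, n2 = a.shape[axis], b.shape[axis]
    v1 = a.var(axis=axis, ddof=1)
    v2 = b.var(axis=axis, ddof=1)
    df = n1 + n2 - 2
    svar = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    denom = np.sqrt(svar * (1.0 / n1 + 1.0 / n2))
    return (a.mean(axis=axis) - b.mean(axis=axis)) / denom

rng = np.random.default_rng(0)
a = rng.normal(5, 10, size=(100, 5))  # five columns of observations
b = rng.normal(8, 20, size=100)       # one 1-D sample

# Tiling b into one copy per column of a...
b_tiled = np.tile(b[:, None], (1, a.shape[1]))
# ...is equivalent to broadcasting b against each column:
assert np.allclose(pooled_t(a, b_tiled), pooled_t(a, b[:, None]))
```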

It wouldn't make sense if rvs5 had more than one column - in fact it would break the parametric t-test.

In other words

>>> from scipy import stats
>>> import numpy as np
>>> np.random.seed(0)
>>> rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
>>> rvs5 = stats.norm.rvs(loc=8, scale=20, size=200)
>>> np_t_stats, pvalues = stats.ttest_ind(rvs1.reshape((100, 5)), rvs5.reshape((100, 2)))
ValueError                                Traceback (most recent call last)
<ipython-input-6-cf6799185f54> in <module>()
----> 1 np_t_stats, pvalues = stats.ttest_ind(rvs1.reshape((100, 5)), rvs5.reshape((100, 2)))

/Users/mortonjt/.virtualenvs/scipy/lib/python2.7/site-packages/scipy/stats/stats.pyc in ttest_ind(a, b, axis, equal_var, permutations, random_state)
   3700 
   3701         if equal_var:
-> 3702             df, denom = _equal_var_ttest_denom(v1, n1, v2, n2)
   3703         else:
   3704             df, denom = _unequal_var_ttest_denom(v1, n1, v2, n2)

/Users/mortonjt/.virtualenvs/scipy/lib/python2.7/site-packages/scipy/stats/stats.pyc in _equal_var_ttest_denom(v1, n1, v2, n2)
   3383 def _equal_var_ttest_denom(v1, n1, v2, n2):
   3384     df = n1 + n2 - 2
-> 3385     svar = ((n1 - 1) * v1 + (n2 - 1) * v2) / float(df)
   3386     denom = np.sqrt(svar * (1.0 / n1 + 1.0 / n2))
   3387     return df, denom

ValueError: operands could not be broadcast together with shapes (5,) (2,) 

Now, when I first saw your comment back in March, I jumped the gun. I was thinking that this permutation test would have to consider testing the following scenario

>>> import numpy as np
>>> from scipy import stats
>>> a = np.array([[1,2,3], [2,3]])
>>> b = np.array([[1,2], [2,3]])
>>> stats.ttest_ind(a,b)

TypeError                                 Traceback (most recent call last)
<ipython-input-15-4d314839bf04> in <module>()
----> 1 stats.ttest_ind(a,b)

/Users/mortonjt/.virtualenvs/scipy/lib/python2.7/site-packages/scipy/stats/stats.pyc in ttest_ind(a, b, axis, equal_var, permutations, random_state)
   3694 
   3695     else:
-> 3696         v1 = np.var(a, axis, ddof=1)
   3697         v2 = np.var(b, axis, ddof=1)
   3698         n1 = a.shape[axis]

/Users/mortonjt/.virtualenvs/scipy/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in var(a, axis, dtype, out, ddof, keepdims)
   2936 
   2937     return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 2938                                 keepdims=keepdims)

/Users/mortonjt/.virtualenvs/scipy/lib/python2.7/site-packages/numpy/core/_methods.pyc in _var(a, axis, dtype, out, ddof, keepdims)
     93     if isinstance(arrmean, mu.ndarray):
     94         arrmean = um.true_divide(
---> 95                 arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
     96     else:
     97         arrmean = arrmean.dtype.type(arrmean / rcount)

TypeError: unsupported operand type(s) for /: 'list' and 'int'

But the parametric t-test will break in this scenario, since a is now an array of objects, rather than an array of ints. So I think this scenario can be disregarded, unless we want to revamp the parametric t-test.

Does this make sense?
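As a quick illustration of that last point - ragged inputs never form a numeric 2-D array in the first place, so there is nothing for the parametric machinery to work on:

```python
import numpy as np

# Rows of unequal length only form an object array of lists (newer
# NumPy even requires dtype=object explicitly); var/mean then fail
# because the elements are Python lists, not numbers.
ragged = np.array([[1, 2, 3], [2, 3]], dtype=object)
even = np.array([[1, 2, 3], [2, 3, 4]])

print(ragged.shape, ragged.dtype)  # (2,) object
print(even.shape, even.dtype)      # (2, 3) with a numeric dtype
```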

STY: Refactoring uneven lengths implementation

STY: Cleaning up some pep8 issues
@mortonjt
Contributor Author

@rgommers , I double checked the diffs - all of the changes that you committed should be here now.

I'm still a bit confused about the uneven lengths problem. Would you mind taking another look at this PR when you have time? Thanks!

@ev-br
Member

ev-br commented Nov 9, 2015

What's the status of this?

@ev-br ev-br added this to the 0.17.0 milestone Nov 16, 2015
@rgommers rgommers modified the milestones: 0.18.0, 0.17.0 Nov 30, 2015
@rgommers
Member

@ev-br the status of this is that I'm being the bottleneck here and should look at the history in detail again. Realistically this is only going to happen after the release branch is cut. I'll set the milestone to 0.18.0 and self-assign it.

@rgommers rgommers self-assigned this Nov 30, 2015
@ev-br ev-br modified the milestones: 0.19.0, 0.18.0 Jun 19, 2016
@ev-br
Member

ev-br commented Jun 19, 2016

I'm bumping the milestone. If it does get into 0.18.0, all the better :-).

@sturlamolden
Contributor

For the record, the performance problems with numpy.random.RandomState.shuffle have been fixed upstream. We no longer need special Cython shuffling code or workarounds to make fast permutation tests.
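For context, a bare-bones randomized two-sample permutation test built directly on NumPy's shuffling - a sketch of the idea, not this PR's implementation (the function name and defaults here are made up):

```python
import numpy as np

def perm_test_mean_diff(x, y, n_perm=1000, seed=0):
    """Two-sided permutation test on the difference of means (illustrative)."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = x.mean() - y.mean()
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # in-place relabelling of the pooled sample
        diff = pooled[:len(x)].mean() - pooled[len(x):].mean()
        count += abs(diff) >= abs(observed)
    return count / n_perm

rng = np.random.default_rng(0)
p = perm_test_mean_diff(rng.normal(0, 1, 50), rng.normal(2, 1, 50))
# well-separated groups -> a very small p-value
```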

@mortonjt
Contributor Author

mortonjt commented Jul 7, 2016

That's very good news! I'll try to redo the benchmarks - I'd be curious to see how much of an improvement it will yield.

@ev-br ev-br modified the milestones: 1.0, 0.19.0 Feb 12, 2017
@jrubinstein

What's the status of this change? I'm looking for the equivalent of R's perm_ts() in Python. Has this been incorporated into scipy.stats? I tried to run np_t_stats, pvalues = stats.ttest_ind(rvs1.reshape((100, 5)), rvs5, permutations=1000), but got TypeError: ttest_ind() got an unexpected keyword argument 'permutations'.

@rgommers
Member

@jrubinstein it hasn't been incorporated, due to lack of reviewer time (mine in particular). If you are interested, testing the code in this PR and providing feedback on that would be useful.

@jrubinstein

@rgommers I would love to test this. Could probably do so today, even. But, alas, I have no clue how to install from a PR. I can see some instructions on installing from a branch, but that's as far as my google-fu takes me. Is there a good guide for installing from a PR?
Thanks!

@rgommers
Member

At the top of this page it says:

mortonjt wants to merge 2 commits into scipy:master from mortonjt:ttest 

That branch is https://github.com/mortonjt/scipy/tree/ttest. So it'd be (in your local repo):

git remote add mortonjt https://github.com/mortonjt/scipy.git    
git fetch mortonjt
git checkout -b ttest mortonjt/ttest   # now you have a branch with the same content
# follow the normal instructions to build from source
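If you'd rather not add the contributor's fork as a remote: GitHub also exposes every PR under a refs/pull/&lt;ID&gt;/head ref, so the branch for this PR (gh-4824) can be fetched directly. The remote name upstream is an assumption - substitute whatever your clone calls the scipy/scipy remote.

```shell
git fetch upstream pull/4824/head:ttest   # upstream -> scipy/scipy
git checkout ttest
# follow the normal instructions to build from source
```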

@rgommers rgommers removed this from the 1.0.0 milestone Sep 9, 2017
@hoechenberger

Any chances of this still getting merged? (after the merge conflicts are resolved) I'd be willing to help test the code.

@rgommers
Member

@hoechenberger yes, we could get this merged. Help with testing/reviewing would be much appreciated.

Member

@rlucas7 rlucas7 left a comment


@mdhaber thanks for the change - it might be minor, but it's more intuitive to read (to my eyes anyway). The codecov output adds a lot of noise to the PR but is unrelated. AFAICS the PR is mostly ready (sans the _validate_int method). Once #12603 is merged we can use the function.

One thought though: given that PR (#12603) is in draft mode ATM, could we cherry-pick the code for the _validate_int method here, use it and then rebase #12603 off master?

If that seems more complicated, we can wait for this one to get merged - just proposing alternative approaches to see if they stick.

@mdhaber
Contributor

mdhaber commented Dec 25, 2020

Actually @rlucas7 I don't want to use _validate_int anymore. It's too strict for my taste. I added my own input validation (one line) and added a test for it. So if it looks good to you, it's all set from my perspective.

indices = np.array([random_state.permutation(m) for i in range(n)]).T
else:
n = n_max
indices = np.array(list(permutations(range(m)))).T
Contributor


I think it's OK to merge as is, but afterwards (at some point) I will submit a PR that makes this much more efficient. We don't need all of these permutations because the t-statistic doesn't care about the order of the numbers within the groups, just which group the numbers are in. Since I added the exact test while thinking about the permutation test, I didn't think that through. Anyway, this should give the same result; it's just that taking combinations is way more efficient.
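The efficiency point can be sketched like this: an order-insensitive statistic only depends on which indices land in the first group, so itertools.combinations enumerates C(m, n1) distinct relabellings instead of m! orderings (the numbers here are illustrative):

```python
from itertools import combinations
from math import comb, factorial

m, n1 = 6, 3  # pooled sample size and size of the first group

# Which n1 of the m pooled indices form group one -- order inside a
# group never changes the t statistic, so these are the only distinct
# relabellings an exact test needs.
assignments = list(combinations(range(m), n1))
assert len(assignments) == comb(m, n1)   # 20 relabellings

# Enumerating full permutations instead takes factorial(m) = 720 draws,
# i.e. 36x redundant work even for this tiny example.
assert factorial(m) == 720
```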

Member

@rlucas7 rlucas7 left a comment


@mdhaber Thanks for clarifying the change w.r.t. the _validate_int() function.

I took another pass at this tonight. It looks fine from my perspective. I'll leave it open for the rest of week and merge on Sunday unless further comments.

@rlucas7
Member

rlucas7 commented Jan 11, 2021

@mdhaber was planning to merge tonight but will wait for the build to complete with this last tiny patch you suggested, I'll plan to merge tomorrow evening.

@mdhaber
Contributor

mdhaber commented Jan 11, 2021

Ok, thanks! Anything you want me to take a look at?
If you're interested in other resurrected PRs, gh-4933 addresses about a dozen mannwhitneyu issues.

@rlucas7
Member

rlucas7 commented Jan 11, 2021

> @mdhaber was planning to merge tonight but will wait for the build to complete with this last tiny patch you suggested, I'll plan to merge tomorrow evening.

Actually, I'm up (much) earlier than usual and there weren't any surprises in the latest patch. So squashing and merging, thanks @mdhaber and @mortonjt for the PR and to @rgommers and @sturlamolden for the reviews.

I'll add something to release notes too.

@rlucas7 rlucas7 merged commit 255d49c into scipy:master Jan 11, 2021
@rlucas7
Member

rlucas7 commented Jan 11, 2021

> Ok, thanks! Anything you want me to take a look at?
> If you're interested in other resurrected PRs, gh-4933 addresses about a dozen mannwhitneyu issues.

If I have some time next weekend I'll take a look - it would be good to finish up, or if necessary close out, some of the oldest PRs.

@rgommers rgommers added this to the 1.7.0 milestone Jan 11, 2021
@rgommers
Member

This must be just about a record - just shy of 6 years to get this merged. Happy to finally see that happen! Thanks @mortonjt, and @rlucas7, @mdhaber and all other reviewers!

@dbolser

dbolser commented Oct 2, 2024

How does the permutations flag compare to the 'permutation test proper' within scipy.stats? Are they the same under the hood? Sorry for the naive questions 😅

@mdhaber
Contributor

mdhaber commented Oct 2, 2024

It's just history - the permutation t-test was requested first, so it was added; then it became obvious that a general function would be useful.

I think the exact test (with small enough samples) will give the same results. Randomized tests might give slightly different results even with the same random state. The only conceptual difference I remember is the two-sided p-value. The permutation t-test, IIRC, counts the elements of the permutation distribution with absolute value greater than the observed statistic, whereas permutation_test doubles the minimum of the two one-sided p-values. These are two (of the three) common conventions for two-sided p-values. The absolute-value convention only works when the null distribution is expected to be symmetric, so it couldn't be used for the more general permutation_test. @steppi pointed out that doubling the smaller p-value can be interpreted as a Bonferroni correction for testing multiple hypotheses (it's equivalent to performing both one-sided tests and halving your threshold for significance).
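A small sketch of the two two-sided conventions described above, using a stand-in symmetric null distribution (all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
null = rng.normal(size=9999)   # stand-in permutation null distribution
obs = 1.5                      # stand-in observed statistic

# Convention 1 (permutation t-test style): count |null| >= |obs|.
p_abs = np.mean(np.abs(null) >= abs(obs))

# Convention 2 (permutation_test style): double the smaller one-sided p.
p_greater = np.mean(null >= obs)
p_less = np.mean(null <= obs)
p_double = min(1.0, 2 * min(p_greater, p_less))

# For a symmetric null the two conventions approximately agree.
```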

@dbolser

dbolser commented Oct 2, 2024 via email
