Towards gh-18867
This issue tracks progress toward the addition of array-API support to `scipy.stats` functions. The functions listed below look ready for conversion, and I'd be happy to review PRs for them. Priority, balancing the ease and importance of the task, is roughly in the order listed.
- `moment` (ENH: stats.moment: add array API support #20292)
- `lmoment`
- `skew` (ENH: stats.skew: add array-API support #20541 - please see this PR as an example; `kurtosis` through `directional_stats` are roughly similar)
- `kurtosis` (ENH: stats.kurtosis: add array API support #20658)
- `describe` (ENH: stats.describe: add array API support #20667)
- `entropy` (ENH: stats.entropy, special.{entr, rel_entr}: add array API support #20673)
- `variation` (ENH: stats.variation: add array-API support #20647)
- `sem` (ENH: stats.sem: add array-API support #20631)
- `kstat` (ENH: stats: add array-API support to kstat/kstatvar #20634)*
- `kstatvar` (ENH: stats: add array-API support to kstat/kstatvar #20634)*
- `circmean` (ENH: `stats.circ___`: add array-API support #20595)
- `circvar` (ENH: `stats.circ___`: add array-API support #20595)
- `circstd` (ENH: `stats.circ___`: add array-API support #20595)
- `directional_stats` (ENH: stats: add array API support for `directional_stats` #20794)
- `pearsonr` (ENH: stats.pearsonr: add array API support #20284)
- `ttest_1samp` (ENH: stats.ttest_1samp: add array-API support #20545 - please see this PR as an example for `ttest_rel` through `normaltest`)
- `ttest_rel` (ENH: stats: rewrite `ttest_rel` in terms of `ttest_1samp` #20883)
- `ttest_ind` (ENH: `stats.ttest_ind`: add array API support #20771)
- `skewtest` (ENH: stats.skewtest: add array-API support #20597)
- `kurtosistest` (ENH: stats.kurtosistest: add array API support #20715)
- `normaltest` (ENH: stats.normaltest/jarque_bera: add array-API support #20736)
- `jarque_bera` (ENH: stats.normaltest/jarque_bera: add array-API support #20736)
- `power_divergence` (ENH: stats.chisquare/power_divergence: add array API support #20753)
- `chisquare` (ENH: stats.chisquare/power_divergence: add array API support #20753)
- `combine_pvalues` (ENH: stats: add array API support to combine_pvalues #20900)
- `gstd` (ENH: stats.gstd: add array API support #22455)
- `ttest_ind_from_stats` (ENH: `stats.ttest_ind`: add array API support #20771)
- `alexandergovern` (ENH: `stats.alexandergovern`: vectorize calculation for n-D arrays #21089)
- `find_repeats` - deprecated in DEP: `stats.find_repeats`: deprecate function #21157, removed in DEP: stats: remove find_repeats #23023
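For readers new to this kind of conversion, the general pattern behind the items above can be sketched roughly as follows. This is only an illustration, not the code from any of the linked PRs: the function names `central_moment` and `skewness` are hypothetical, and the array namespace `xp` is passed explicitly (defaulting to NumPy) as a simplifying assumption, whereas a real conversion would infer the namespace from the input array.

```python
import numpy as np


def central_moment(x, order, axis=0, xp=np):
    """k-th central moment along `axis` (hypothetical helper, not SciPy API).

    Only functions that exist in both NumPy and the array API standard
    are called on `xp`, so another compliant namespace could be passed.
    """
    x = xp.asarray(x, dtype=xp.float64)
    mean = xp.mean(x, axis=axis, keepdims=True)
    return xp.mean((x - mean) ** order, axis=axis)


def skewness(x, axis=0, xp=np):
    """Biased sample skewness m3 / m2**1.5 (hypothetical helper)."""
    m2 = central_moment(x, 2, axis=axis, xp=xp)
    m3 = central_moment(x, 3, axis=axis, xp=xp)
    return m3 / m2 ** 1.5


x = np.array([[1.0, 2.0, 3.0, 10.0], [4.0, 4.0, 5.0, 5.0]])
print(skewness(x, axis=1))  # second slice is symmetric, so its skewness is 0
```

The point of the sketch is that nothing NumPy-specific (fancy indexing, in-place mutation, `np.ma`) is needed for this family of functions, which is why they sit at the top of the priority list.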
After that:
- write a function for computing weighted average. (ENH: `stats._xp_mean`, an array API compatible `mean` with `weights` and `nan_policy` #20743)
- `gmean` (ENH: stats.gmean: add array API support #20946)
- `hmean` (MAINT: stats.hmean/pmean: simplify prior to array API conversion #20954, DOC/MAINT: stats.gmean/gstd/hmean/pmean: document/treat invalid input consistently #20962, ENH: stats.hmean/pmean: add array API support #21035)
- `pmean` (MAINT: stats.hmean/pmean: simplify prior to array API conversion #20954, DOC/MAINT: stats.gmean/gstd/hmean/pmean: document/treat invalid input consistently #20962, ENH: stats.hmean/pmean: add array API support #21035)
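As a rough illustration of what a weighted average with `weights` and `nan_policy` might look like when restricted to array-API-style calls, here is a minimal sketch. It is not the implementation from #20743; the name `xp_weighted_mean`, its signature, and the explicit `xp` argument are assumptions made for the example.

```python
import numpy as np


def xp_weighted_mean(x, weights=None, axis=-1, nan_policy="propagate", xp=np):
    """Weighted mean along `axis` (hypothetical sketch, not SciPy's `_xp_mean`)."""
    x = xp.asarray(x, dtype=xp.float64)
    if weights is None:
        w = xp.ones_like(x)
    else:
        # Adding zeros broadcasts the weights to the shape of `x`.
        w = xp.asarray(weights, dtype=xp.float64) + xp.zeros_like(x)

    if nan_policy == "omit":
        # Give NaN entries zero weight instead of removing them, so every
        # slice keeps the same length and no ragged arrays are needed.
        nan_mask = xp.isnan(x) | xp.isnan(w)
        x = xp.where(nan_mask, xp.zeros_like(x), x)
        w = xp.where(nan_mask, xp.zeros_like(w), w)
    elif nan_policy != "propagate":
        raise ValueError("nan_policy must be 'propagate' or 'omit'")

    return xp.sum(w * x, axis=axis) / xp.sum(w, axis=axis)


x = np.array([1.0, 2.0, np.nan, 4.0])
w = np.array([1.0, 1.0, 1.0, 2.0])
print(xp_weighted_mean(x, w, nan_policy="omit"))  # (1 + 2 + 2*4) / 4
```

Zeroing weights rather than dropping elements is the design choice that keeps the computation vectorized over slices with different numbers of NaNs.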
After that:
- Add `_SimpleNormal` (ENH: stats: end-to-end array-API support for normality tests #20777)
- Add `_SimpleChi2` (ENH: stats: end-to-end array-API support for NHSTs with chi-squared null distribution #20782)
- Add `_SimpleBeta` (ENH: stats: end-to-end array-API support for NHSTs with beta null distribution #20793)
- Add `_SimpleStudentT` (ENH: stats: end-to-end array-API support for NHSTs with Student's t null distribution #20884)
- Dispatch to array backend `stdtrit` where possible (ENH: `special`/`stats`: implement xp-compatible `stdtrit` and use in `stats` #22222)
I'd like to implement the following using the approach of `_masked_array` (gh-20363):
- `tmean` (ENH: `stats.tmean`: add array API support #20965)
- `tvar` (ENH: stats.tvar/tstd/tsem: add array API support #21036)
- `tmin` (ENH: stats.tmin/tmax: add array API support #21028)
- `tmax` (ENH: stats.tmin/tmax: add array API support #21028)
- `tstd` (ENH: stats.tvar/tstd/tsem: add array API support #21036)
- `tsem` (ENH: stats.tvar/tstd/tsem: add array API support #21036)
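The details of the `_masked_array` approach are in gh-20363 and are not reproduced here. As a loosely related illustration only, a trimmed mean can be computed with a boolean mask and `where` instead of removing elements, which works in any namespace that lacks masked arrays. The function name `trimmed_mean`, its inclusive-limit behavior, and the explicit `xp` argument are assumptions of this sketch.

```python
import numpy as np


def trimmed_mean(x, limits=(None, None), axis=-1, xp=np):
    """Mean of values inside inclusive `limits` (hypothetical sketch).

    Values outside the limits are masked out rather than removed, so the
    array stays rectangular. NaN handling (`nan_policy`) is ignored here.
    """
    x = xp.asarray(x, dtype=xp.float64)
    lower, upper = limits
    keep = xp.full(x.shape, True)  # boolean mask, all elements kept initially
    if lower is not None:
        keep = keep & (x >= lower)
    if upper is not None:
        keep = keep & (x <= upper)

    # Masked-out entries contribute 0 to both the sum and the count.
    total = xp.sum(xp.where(keep, x, xp.zeros_like(x)), axis=axis)
    count = xp.sum(xp.where(keep, xp.ones_like(x), xp.zeros_like(x)), axis=axis)
    return total / count


x = np.array([[1.0, 2.0, 3.0, 100.0], [5.0, 6.0, 7.0, 8.0]])
print(trimmed_mean(x, limits=(0, 10), axis=1))
```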
I left the transformation functions off this list initially, but most of them should be relatively easy.
- `xp_var` (ENH: `stats.xp_var`: array-API compatible variance with `scipy.stats` interface #21034)
- `zmap` (ENH: `stats.zmap`/`zscore`/`gzscore`: add array API support #21068)
- `zscore` (ENH: `stats.zmap`/`zscore`/`gzscore`: add array API support #21068)
- `gzscore` (ENH: `stats.zmap`/`zscore`/`gzscore`: add array API support #21068)
- `obrientransform` (ENH: stats.obrientransform: add array API support #21055)
- `boxcox_llf` (ENH: `stats.boxcox_llf`: add array API support #21097; come back after `xp` `logsumexp` is done)
- `yeojohnson_llf`
- `boxcox_normmax` (Can use `_chandrupatla` or, when it merges, `optimize.elementwise.find_root`, ENH: optimize.elementwise: vectorized scalar optimization and root finding tools #20800)
- `yeojohnson_normmax`
- `boxcox`
- `yeojohnson`
- `trim1` (would benefit from an `xp.partition`, but could use `xp.sort`). See ENH: stats.quantile: methods to support trimming/Winsorizing #22644.
- `trimboth` (same). See ENH: stats.quantile: methods to support trimming/Winsorizing #22644.
- `sigmaclip` (same)
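To make the `xp.sort` fallback mentioned for the trimming functions concrete, here is a small sketch of two-sided trimming that sorts each slice and then slices off the tails. It is not the approach taken in #22644; the name `trim_both_sorted` and the explicit `xp` argument are assumptions, and unlike a partition-based version the result comes back fully sorted along `axis`.

```python
import numpy as np


def trim_both_sorted(x, proportiontocut, axis=0, xp=np):
    """Drop the lowest and highest fraction of each slice (hypothetical sketch)."""
    x = xp.asarray(x)
    n = x.shape[axis]
    lowercut = int(proportiontocut * n)
    uppercut = n - lowercut
    if lowercut >= uppercut:
        raise ValueError("Proportion too big.")

    # A full sort does more work than a partition would, but `sort` is
    # available in every array-API-compliant namespace.
    sorted_x = xp.sort(x, axis=axis)
    index = [slice(None)] * x.ndim
    index[axis] = slice(lowercut, uppercut)
    return sorted_x[tuple(index)]


x = np.array([5, 1, 9, 3, 7, 2, 8, 4, 6, 10])
print(trim_both_sorted(x, 0.2))  # removes the two smallest and two largest
```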
After that:
- add N-D support to `_array_api.cov`; consider making it public if array API won't offer it. Not really necessary. We don't need something very general, so let's not get hung up on it.
- `linregress`: add `axis` and array API support
- `expectile`: add `axis` and array API support
- `ks_2samp`: consider natively vectorizing, then adding array API support
- `mode`: consider natively vectorizing (e.g. see ENH: ndimage: majority voting filter #9873 (comment) for implementation), then adding array API support. (ENH: stats: add array API support to some of `_axis_nan_policy` decorator #22857)
- `bartlett`: consider natively vectorizing, then adding array API support (ENH: stats.bartlett: add native `axis` and array API support #20751)
- `levene`: consider natively vectorizing, then adding array API support
- `anderson_ksamp`: might be able to vectorize, then add array API support
- `wasserstein_distance`: consider natively vectorizing, then adding array API support
- `energy_distance`: consider natively vectorizing, then adding array API support
These functions are held up by `rankdata` (possibly among other things), which is waiting for improved array-API support. See gh-20639.
- `kendalltau`
- `mannwhitneyu`
- `wilcoxon`
- `kruskal`
- `cramervonmises_2samp`
- `friedmanchisquare`
- `brunnermunzel`
- `ansari`
- `fligner`
- `mood`
- `spearmanr` (also need to re-define `axis` behavior)
- `chatterjeexi`
These functions need `median`, `quantile`, or similar, either directly or via `iqr`. See data-apis/array-api#795.
- So we don't need to wait for the array API, it would be helpful to write an `_xp_quantile` function using `xp.sort`. (Done. `scipy.stats.quantile` added in ENH: stats.quantile: add array API compatible quantile function #22352.) I'd like for it to include the following features, which will be useful elsewhere:
  - Native `axis` and `nan_policy` support. `sort` will typically push all the NaNs to one end or the other; we can count the finite values in each slice rather than using `.shape[axis]` to determine the index to `take_along_axis` (see ENH: stats.rankdata: add array API standard support #20639 for an implementation).
  - Improved broadcasting behavior. (The following may not be intelligible without real-time discussion. I just wanted to record the thoughts somewhere.) The NumPy version of `quantile` accepts a 1D array of probabilities. The specified quantiles are taken for all slices and aligned along a new axis `0` of the output. While convenient in the case that the user wants all quantiles for all slices, this does not follow normal broadcasting rules, and it does not allow for different probabilities for each slice (needed by `bootstrap`, for example). In similar situations in `stats`, we would allow for an n-d array of probabilities and follow normal broadcasting rules with the additional requirement that the length of the probabilities array along `axis` must be 1. (For example, see how `ttest_1samp` handles broadcasting with `popmean`.) This allows the use of different probabilities for each slice or computation at all probabilities for all slices, depending on the alignment of the probability array. However, there is an improvement to be made. When `keepdims=True`, we can relax the rule that the length of the percentiles along `axis` must be 1, and we can accept an array of percentiles aligned along the dimension(s) specified by `axis`. The quantile is computed at all of those probabilities for the corresponding slice, and these quantiles are aligned along the `axis` dimension(s) of the output array. Compared to aligning the percentiles orthogonal to the input sample array, this has the advantage that each slice needs to be sorted (or partitioned) only once rather than once per percentile, and it offers the convenience of the existing NumPy interface. @seberg is this intelligible to you, at least, based on our conversation at the summit?
- `iqr`
- `siegelslopes`
- `theilslopes`
- `median_test`
- `median_abs_deviation`
- `epps_singleton_2samp`
- `levene` (optional)
- `fligner` (optional)
- `sen_seasonal_slopes`
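The two quantile ideas described above (counting non-NaN values after sorting, and letting the probabilities lie along `axis` so each slice is sorted only once) can be sketched together as follows. This is not the implementation of `scipy.stats.quantile` from #22352. The name `quantile_omit_nan` is hypothetical, only linear interpolation is shown, `p` is assumed to have the same number of dimensions as `x`, and the sketch relies on NaNs sorting to the end of each slice, as they do in NumPy.

```python
import numpy as np


def quantile_omit_nan(x, p, axis=-1, xp=np):
    """Linear-interpolation quantiles that ignore NaNs (hypothetical sketch).

    `p` broadcasts against `x` in every dimension except `axis`; along
    `axis` it may hold any number of probabilities (keepdims-style output).
    """
    x = xp.moveaxis(xp.asarray(x, dtype=xp.float64), axis, -1)
    p = xp.moveaxis(xp.asarray(p, dtype=xp.float64), axis, -1)

    y = xp.sort(x, axis=-1)  # one sort per slice; NaNs assumed to go last
    not_nan = xp.where(xp.isnan(y), xp.zeros_like(y), xp.ones_like(y))
    n = xp.sum(not_nan, axis=-1, keepdims=True)  # per-slice count of real values
    last = xp.maximum(n - 1.0, xp.zeros_like(n))  # index of last non-NaN value

    pos = p * last  # fractional position within the non-NaN part of the slice
    lo = xp.floor(pos)
    hi = xp.minimum(lo + 1.0, last)
    frac = pos - lo

    # In a strict array-API namespace, `xp.astype` would be the usual cast.
    lo_i = xp.asarray(lo, dtype=xp.int64)
    hi_i = xp.asarray(hi, dtype=xp.int64)
    y_lo = xp.take_along_axis(y, lo_i, axis=-1)
    y_hi = xp.take_along_axis(y, hi_i, axis=-1)
    return xp.moveaxis(y_lo + frac * (y_hi - y_lo), -1, axis)


x = np.array([[1.0, 2.0, 3.0, 4.0, np.nan],
              [10.0, np.nan, 30.0, np.nan, 20.0]])
# Same three probabilities for every slice: shape (1, 3) broadcasts to (2, 3).
print(quantile_omit_nan(x, np.array([[0.0, 0.5, 1.0]])))
# A different probability for each slice: shape (2, 1).
print(quantile_omit_nan(x, np.array([[0.5], [1.0]])))
```

Because the index into each sorted slice is derived from the per-slice count `n` rather than from `.shape[axis]`, slices with different numbers of NaNs are handled in one vectorized pass.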
I wrote the following, so I'd prefer to do the upgrades on those personally.
- All new distribution infrastructure features
- `monte_carlo_test` (ENH: stats.monte_carlo_test: add array API support #20604)
- `permutation_test` (accepts RNG; uses `random`, `permuted`, and `permutation`)
- `bootstrap` (accepts RNG; uses `randint`/`integers`)
- `goodness_of_fit` (probably not very useful until we can fit distributions with array API)
- `power` (main...mdhaber:scipy:xp_power)
- `false_discovery_control` (uses `take_along_axis`/`put_along_axis`)
- `differential_entropy` (ENH: `stats.differential_entropy`: add array API support #21076)
Toward gh-22194, we'll be adding a few new functions to `scipy.stats`, and those should be array API compatible from the start:
- quantile (ENH: stats.quantile: add discontinuous (HF 1-3) and Harrell-Davis methods; add `marray` support #22505)
- something to replace `plotting_positions`
- `trim`. See ENH: stats.quantile: methods to support trimming/Winsorizing #22644.
- `winsorize`. See ENH: stats.quantile: methods to support trimming/Winsorizing #22644.
- Possibly functions to replace `mjci`, `mquantiles_cimj`, and `rsh`
After all that, it may be worth doing:
- `CensoredData`
- `ecdf` - relies on `CensoredData`. Might be worth doing after array API standard has `diff` with prepend, append. I wrote `xp_diff` in ENH: stats.rankdata: add array API standard support #20639, but it's slow.
- `logrank` - relies on `ecdf`
- `bws_test` - needs `permutation_test`
- `tukey_hsd` - probably not too bad, but most of the time is calculating `studentized_range` SF. Could vectorize computation with `_tanhsinh`, though.
- `pointbiserialr` - deprecate or implement using shortcut specific to binary data? It's just an alias for `pearsonr` right now.
I am not interested in working on or reviewing work on the following functions:
- `bayes_mvs` (consider deprecating)
- `mvsdist` (consider deprecating)
- `weightedtau` (consider deprecating)
- `multiscale_graphcorr` (consider deprecating)
- `tiecorrect` (consider deprecating)
- `ranksums` (consider deprecating)
- `somersd`
- `page_trend_test`
- `f_oneway`
- frequency statistics
  - `cumfreq`
  - `percentileofscore`
  - `scoreatpercentile`
  - `relfreq`
  - `binned_statistic`
  - `binned_statistic_2d`
  - `binned_statistic_dd`
- plot tests
  - `ppcc_max`
  - `ppcc_plot`
  - `probplot`
  - `boxcox_normplot`
  - `yeojohnson_normplot`
- Old univariate and multivariate distribution infrastructure (`rv_continuous`, `rv_discrete`, `rv_histogram`, etc.) and distributions
Some of the `scipy.stats.contingency` functions would be feasible to work on, but some would probably need to be vectorized for it to make sense:
- `relative_risk`
- `expected_freq`
- `margins`
- `chi2_contingency`
- `association`
- `fisher_exact` - needs hypergeometric distribution
- `barnard_exact` - uses `shgo`; probably not a good candidate
- `boschloo_exact` - uses `shgo`; probably not a good candidate
- `odds_ratio` - I don't think the cost/benefit ratio looks good
- `crosstab`
I don't think these are good candidates for translation.
- `trim_mean` - would benefit from array API `partition`
- `binomtest` - could probably be made elementwise, but would be a bit of work
- `quantile_test` - probably not bad, but needs binomial distribution functions (and inverses)
- `shapiro` - compiled. I've written the Shapiro test in pure Python, but normal distribution order statistic stuff was computed via numerical integration, and p-value was just `monte_carlo_test`, so it's probably faster to convert to NumPy, perform the test, and convert back.
- `anderson` - technically not too hard, but there are interface questions to be answered
- `cramervonmises` - probably not too bad once we have array API distributions
- `ks_1samp` - probably not too bad, but we need array API distributions and array API null distribution CDF/SF
- `kstest` - dispatches to `ks_1samp` and `ks_2samp`; easy once those are done!
- `poisson_means_test` - theoretically this could be an elementwise function, but implementing would be tricky because scalar arguments are used to create `arange`, which would naturally lead to ragged arrays
- `dunnett` - statistic is easy to vectorize; p-value is not.
- `scipy.stats.qmc` - mostly compiled
- `sobol_indices` - relies on `scipy.stats.qmc`
- `scipy.stats.sampling` - all compiled. Long-term goal: rewrite vectorized versions in terms of `scipy.interpolate`.
- `scipy.stats.mstats` - deprecate; accept `marray`s in regular stats functions.
- `fit`. Relies heavily on `differential_evolution`, which would need array API support first.
- `wasserstein_distance_nd`. Uses linear programming. Unlikely to get efficient array API support any time soon.
- `gaussian_kde`. There could be an efficient array API implementation in terms of `multivariate_normal` if `Covariance` gets support for batch covariance matrices, which was drafted in ENH: stats.multivariate: introduce `Covariance` class and subclasses mdhaber/scipy#88. But it would be a complete rewrite.