
ENH: stats: add array API support #20544

@mdhaber


Towards gh-18867

This issue tracks progress toward the addition of array API support to scipy.stats functions. The functions listed below look ready for conversion, and I'd be happy to review PRs for them. Priority, balancing the ease and importance of the task, is roughly in the order listed.

After that:

I'd like to implement the following using the approach of _masked_array (gh-20363):

I left the transformation functions off this list initially, but most of them should be relatively easy.

After that:

  • add N-D support to _array_api.cov; consider making it public if the array API won't offer it. (Not really necessary; we don't need something very general, so let's not get hung up on it. A sketch of what batched covariance could look like follows this list.)
  • linregress: add axis and array API support
  • expectile: add axis and array API support
  • ks_2samp: consider natively vectorizing, then adding array API support
  • mode: consider natively vectorizing (e.g. see ENH: ndimage: majority voting filter #9873 (comment) for implementation), then adding array API support. (ENH: stats: add array API support to some of _axis_nan_policy decorator #22857)
  • bartlett: consider natively vectorizing, then adding array API support (ENH: stats.bartlett: add native axis and array API support #20751)
  • levene: consider natively vectorizing, then adding array API support
  • anderson_ksamp: might be able to vectorize, then add array API support
  • wasserstein_distance: consider natively vectorizing, then adding array API support
  • energy_distance: consider natively vectorizing, then adding array API support
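
For reference, a minimal sketch of what "N-D support" for covariance could look like. batched_cov_sketch is a hypothetical name (not _array_api.cov), xp defaults to NumPy as a stand-in for an array API namespace, and the layout assumption (variables along the second-to-last axis, observations along the last) is mine.

```python
import numpy as np


def batched_cov_sketch(x, xp=np):
    # Hypothetical batched covariance, not _array_api.cov: x has shape
    # (..., m, n) with m variables and n observations per batch;
    # the result has shape (..., m, m).
    xm = x - xp.mean(x, axis=-1, keepdims=True)              # center each variable
    return (xm @ xp.moveaxis(xm, -1, -2)) / (x.shape[-1] - 1)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 100))                         # 4 batches, 3 variables
print(np.allclose(batched_cov_sketch(x)[0], np.cov(x[0])))   # True
```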

These functions are held up by rankdata (possibly among other things), which is waiting for improved array API support. See gh-20639.

  • kendalltau
  • mannwhitneyu
  • wilcoxon
  • kruskal
  • cramervonmises_2samp
  • friedmanchisquare
  • brunnermunzel
  • ansari
  • fligner
  • mood
  • spearmanr (axis behavior also needs to be redefined)
  • chatterjeexi

These functions need median, quantile, or similar, either directly or via iqr. See data-apis/array-api#795.

  • So that we don't need to wait for the array API, it would be helpful to write an _xp_quantile function using xp.sort. (Done: scipy.stats.quantile was added in ENH: stats.quantile: add array API compatible quantile function #22352.) I'd like it to include the following features, which will be useful elsewhere:
    • Native axis and nan_policy support. sort will typically push all the NaNs to one end or the other, so we can count the finite values in each slice rather than using .shape[axis] to determine the index to pass to take_along_axis (see ENH: stats.rankdata: add array API standard support #20639 for an implementation, and the sketch after this list).
    • Improved broadcasting behavior. (The following may not be intelligible without real-time discussion; I just wanted to record the thoughts somewhere.) The NumPy version of quantile accepts a 1D array of probabilities. The specified quantiles are taken for all slices and aligned along a new axis 0 of the output. While convenient in the case that the user wants all quantiles for all slices, this does not follow normal broadcasting rules, and it does not allow for different probabilities for each slice (needed by bootstrap, for example). In similar situations in stats, we would allow an n-d array of probabilities and follow normal broadcasting rules with the additional requirement that the length of the probabilities array along axis must be 1. (For example, see how ttest_1samp handles broadcasting with popmean.) This allows the use of different probabilities for each slice or computation at all probabilities for all slices, depending on the alignment of the probability array. However, there is an improvement to be made. When keepdims=True, we can relax the rule that the length of the probabilities along axis must be 1, and we can accept an array of probabilities aligned along the dimension(s) specified by axis. The quantile is computed at all of those probabilities for the corresponding slice, and these quantiles are aligned along the axis dimension(s) of the output array. Compared to aligning the probabilities orthogonal to the input sample array, this has the advantage that each slice needs to be sorted (or partitioned) only once rather than once per probability, and it offers the convenience of the existing NumPy interface. @seberg is this intelligible to you, at least, based on our conversation at the summit?
  • iqr
  • siegelslopes
  • theilslopes
  • median_test
  • median_abs_deviation
  • epps_singleton_2samp
  • levene (optional)
  • fligner (optional)
  • sen_seasonal_slopes
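
As a rough illustration of the sort-and-count NaN handling mentioned in the quantile item above, here is a minimal sketch (hypothetical helper name, 'lower' method only, scalar p assumed). It is not the implementation added in #22352, and it uses NumPy >= 2.0 functions (astype, take_along_axis) as stand-ins for the array API namespace.

```python
import numpy as np


def _nan_quantile_sketch(x, p, axis=-1, xp=np):
    # Hypothetical helper, not scipy.stats.quantile: quantile along `axis`
    # with NaNs ignored per slice, using a single sort and the 'lower' method.
    y = xp.sort(xp.asarray(x, dtype=xp.float64), axis=axis)  # NaNs sort to the high end
    n = xp.sum(~xp.isnan(y), axis=axis, keepdims=True)       # finite count per slice
    idx = xp.astype(p * (n - 1), xp.int64)                   # 'lower' index; truncation == floor for p >= 0
    return xp.take_along_axis(y, idx, axis=axis)


rng = np.random.default_rng(0)
a = rng.standard_normal((3, 10))
a[0, :3] = np.nan                                            # slice 0 has only 7 finite values
print(_nan_quantile_sketch(a, 0.5)[:, 0])                    # per-row medians, NaNs ignored
print(np.nanquantile(a, 0.5, axis=-1, method='lower'))       # reference
```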

I wrote the following myself, so I'd prefer to do those upgrades personally.

Toward gh-22194, we'll be adding a few new functions to scipy.stats, and those should be array API compatible from the start:

After all that, it may be worth doing:

  • CensoredData
  • ecdf - relies on CensoredData. Might be worth doing after the array API standard has diff with prepend/append (see the sketch after this list). I wrote xp_diff in ENH: stats.rankdata: add array API standard support #20639, but it's slow.
  • logrank - relies on ecdf
  • bws_test - needs permutation_test
  • tukey_hsd - probably not too bad, but most of the time is spent calculating the studentized_range SF. Could vectorize that computation with _tanhsinh, though.
  • pointbiserialr - deprecate, or implement using a shortcut specific to binary data? It's just an alias for pearsonr right now.
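
Regarding the ecdf item above: until the array API standard grows diff, prepend/append can be emulated with concat and slicing. The sketch below is not the xp_diff from gh-20639 (the helper name and signature are mine), and xp.concat follows the standard's spelling (np.concat in NumPy >= 2.0).

```python
import numpy as np


def xp_diff_sketch(x, axis=-1, prepend=None, append=None, xp=np):
    # Hypothetical first-difference helper built only from concat and slicing;
    # prepend/append must already be arrays with the same ndim as x.
    parts = [x] if prepend is None else [xp.asarray(prepend, dtype=x.dtype), x]
    if append is not None:
        parts.append(xp.asarray(append, dtype=x.dtype))
    y = xp.concat(parts, axis=axis) if len(parts) > 1 else x
    hi = [slice(None)] * y.ndim
    lo = [slice(None)] * y.ndim
    hi[axis], lo[axis] = slice(1, None), slice(None, -1)      # y[..., 1:] - y[..., :-1]
    return y[tuple(hi)] - y[tuple(lo)]


x = np.array([1.0, 4.0, 9.0, 16.0])
print(xp_diff_sketch(x, prepend=np.array([0.0])))             # [1. 3. 5. 7.]
print(np.diff(x, prepend=0.0))                                # reference
```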

I am not interested in working on or reviewing work on the following functions:

  • bayes_mvs (consider deprecating)
  • mvsdist (consider deprecating)
  • weightedtau (consider deprecating)
  • multiscale_graphcorr (consider deprecating)
  • tiecorrect (consider deprecating)
  • ranksums (consider deprecating)
  • somersd
  • page_trend_test
  • f_oneway
  • frequency statistics
    • cumfreq
    • percentileofscore
    • scoreatpercentile
    • relfreq
    • binned_statistic
    • binned_statistic_2d
    • binned_statistic_dd
  • plot tests
    • ppcc_max
    • ppcc_plot
    • probplot
    • boxcox_normplot
    • yeojohnson_normplot
  • Old univariate and multivariate distribution infrastructure (rv_continuous, rv_discrete, rv_histogram, etc.) and distributions

Some of the scipy.stats.contingency functions would be feasible to work on, but some would probably need to be vectorized for it to make sense:

  • relative_risk
  • expected_freq
  • margins
  • chi2_contingency
  • association
  • fisher_exact - needs hypergeometric distribution
  • barnard_exact - uses shgo; probably not a good candidate
  • boschloo_exact - uses shgo; probably not a good candidate
  • odds_ratio - I don't think the cost/benefit ratio looks good
  • crosstab

I don't think these are good candidates for translation.

  • trim_mean - would benefit from array API partition
  • binomtest - could probably be made elementwise, but would be a bit of work
  • quantile_test - probably not bad, but needs binomial distribution functions (and inverses)
  • shapiro - compiled. I've written the Shapiro test in pure Python, but the normal distribution order statistic values were computed via numerical integration, and the p-value was just monte_carlo_test, so it's probably faster to convert to NumPy, perform the test, and convert back (see the sketch after this list).
  • anderson - technically not too hard, but there are interface questions to be answered
  • cramervonmises - probably not too bad once we have array API distributions
  • ks_1samp - probably not too bad, but we need array API distributions and array API null distribution CDF/SF
  • kstest - dispatches to ks_1samp and ks_2samp; easy once those are done!
  • poisson_means_test - theoretically this could be an elementwise function, but implementing it would be tricky because scalar arguments are used to create an arange, which would naturally lead to ragged arrays
  • dunnett - statistic is easy to vectorize; p-value is not.
  • scipy.stats.qmc - mostly compiled
  • sobol_indices - relies on scipy.stats.qmc
  • scipy.stats.sampling - all compiled. Long-term goal: rewrite vectorized versions in terms of scipy.interpolate.
  • scipy.stats.mstats - deprecate; accept marrays in regular stats functions.
  • fit - relies heavily on differential_evolution, which would need array API support first.
  • wasserstein_distance_nd - uses linear programming; unlikely to get efficient array API support any time soon.
  • gaussian_kde - there could be an efficient array API implementation in terms of multivariate_normal if Covariance gets support for batch covariance matrices, which was drafted in ENH: stats.multivariate: introduce Covariance class and subclasses mdhaber/scipy#88, but it would be a complete rewrite.
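
For shapiro above (and similar compiled routines), the fallback is the usual round trip: convert to NumPy, run the compiled code, and wrap the results back in the caller's namespace. A minimal sketch of that pattern follows; the helper name is hypothetical, only scipy.stats.shapiro is real, and device transfer for GPU arrays is glossed over.

```python
import numpy as np
from scipy import stats


def shapiro_via_numpy(x, xp):
    # Sketch of the 'convert to NumPy, compute, convert back' fallback.
    # `xp` is the namespace of the input array; the round trip is the point
    # here, not the exact helpers SciPy uses internally.
    x_np = np.asarray(x)                       # GPU arrays would need an explicit transfer
    res = stats.shapiro(x_np)                  # compiled, NumPy-only routine
    return xp.asarray(res.statistic), xp.asarray(res.pvalue)


rng = np.random.default_rng(1)
stat, p = shapiro_via_numpy(rng.standard_normal(50), xp=np)
print(float(stat), float(p))
```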
