
ENH: stats: add array API support #20544

@mdhaber


Towards gh-18867

This issue tracks progress toward the addition of array API support to scipy.stats functions. The functions listed below look ready for conversion, and I'd be happy to review PRs for them. Priority, balancing the ease and importance of the task, is roughly in the order listed.

After that:

I'd like to implement the following using the approach of _masked_array (gh-20363):

I left the transformation functions off this list initially, but most of them should be relatively easy.

After that:

  • add N-D support to _array_api.cov; consider making it public if the array API won't offer it. (Not really necessary; we don't need something very general, so let's not get hung up on it. A sketch of what batched covariance could look like follows this list.)
  • linregress: add axis and array API support
  • expectile: add axis and array API support
  • ks_2samp: consider natively vectorizing, then adding array API support
  • mode: consider natively vectorizing (e.g. see ENH: ndimage: majority voting filter #9873 (comment) for implementation), then adding array API support. (ENH: stats: add array API support to some of _axis_nan_policy decorator #22857)
  • bartlett: consider natively vectorizing, then adding array API support (ENH: stats.bartlett: add native axis and array API support #20751)
  • levene: consider natively vectorizing, then adding array API support
  • anderson_ksamp: might be able to vectorize, then add array API support
  • wasserstein_distance: consider natively vectorizing, then adding array API support
  • energy_distance: consider natively vectorizing, then adding array API support
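
For reference, a minimal sketch of what "N-D support" for covariance could look like. batched_cov_sketch is a hypothetical name (not _array_api.cov), xp defaults to NumPy as a stand-in for an array API namespace, and the layout assumption (variables along the second-to-last axis, observations along the last) is mine.

```python
import numpy as np


def batched_cov_sketch(x, xp=np):
    # Hypothetical batched covariance, not _array_api.cov: x has shape
    # (..., m, n) with m variables and n observations per batch;
    # the result has shape (..., m, m).
    xm = x - xp.mean(x, axis=-1, keepdims=True)              # center each variable
    return (xm @ xp.moveaxis(xm, -1, -2)) / (x.shape[-1] - 1)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 100))                         # 4 batches, 3 variables
print(np.allclose(batched_cov_sketch(x)[0], np.cov(x[0])))   # True
```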

These functions are held up by rankdata (possibly among other things), which is waiting for improved array API support. See gh-20639.

  • kendalltau
  • mannwhitneyu
  • wilcoxon
  • kruskal
  • cramervonmises_2samp
  • friedmanchisquare
  • brunnermunzel
  • ansari
  • fligner
  • mood
  • spearmanr (axis behavior also needs to be redefined)
  • chatterjeexi

These functions need median, quantile, or similar, either directly or via iqr. See data-apis/array-api#795.

  • So that we don't need to wait for the array API, it would be helpful to write an _xp_quantile function using xp.sort. (Done: scipy.stats.quantile was added in ENH: stats.quantile: add array API compatible quantile function #22352.) I'd like it to include the following features, which will be useful elsewhere:
    • Native axis and nan_policy support. sort will typically push all the NaNs to one end or the other, so we can count the finite values in each slice rather than using .shape[axis] to determine the index to pass to take_along_axis (see ENH: stats.rankdata: add array API standard support #20639 for an implementation, and the sketch after this list).
    • Improved broadcasting behavior. (The following may not be intelligible without real-time discussion; I just wanted to record the thoughts somewhere.) The NumPy version of quantile accepts a 1D array of probabilities. The specified quantiles are taken for all slices and aligned along a new axis 0 of the output. While convenient in the case that the user wants all quantiles for all slices, this does not follow normal broadcasting rules, and it does not allow for different probabilities for each slice (needed by bootstrap, for example). In similar situations in stats, we would allow an n-d array of probabilities and follow normal broadcasting rules with the additional requirement that the length of the probabilities array along axis must be 1. (For example, see how ttest_1samp handles broadcasting with popmean.) This allows the use of different probabilities for each slice or computation at all probabilities for all slices, depending on the alignment of the probability array. However, there is an improvement to be made. When keepdims=True, we can relax the rule that the length of the probabilities along axis must be 1, and we can accept an array of probabilities aligned along the dimension(s) specified by axis. The quantile is computed at all of those probabilities for the corresponding slice, and these quantiles are aligned along the axis dimension(s) of the output array. Compared to aligning the probabilities orthogonal to the input sample array, this has the advantage that each slice needs to be sorted (or partitioned) only once rather than once per probability, and it offers the convenience of the existing NumPy interface. @seberg is this intelligible to you, at least, based on our conversation at the summit?
  • iqr
  • siegelslopes
  • theilslopes
  • median_test
  • median_abs_deviation
  • epps_singleton_2samp
  • levene (optional)
  • fligner (optional)
  • sen_seasonal_slopes
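
As a rough illustration of the sort-and-count NaN handling mentioned in the quantile item above, here is a minimal sketch (hypothetical helper name, 'lower' method only, scalar p assumed). It is not the implementation added in #22352, and it uses NumPy >= 2.0 functions (astype, take_along_axis) as stand-ins for the array API namespace.

```python
import numpy as np


def _nan_quantile_sketch(x, p, axis=-1, xp=np):
    # Hypothetical helper, not scipy.stats.quantile: quantile along `axis`
    # with NaNs ignored per slice, using a single sort and the 'lower' method.
    y = xp.sort(xp.asarray(x, dtype=xp.float64), axis=axis)  # NaNs sort to the high end
    n = xp.sum(~xp.isnan(y), axis=axis, keepdims=True)       # finite count per slice
    idx = xp.astype(p * (n - 1), xp.int64)                   # 'lower' index; truncation == floor for p >= 0
    return xp.take_along_axis(y, idx, axis=axis)


rng = np.random.default_rng(0)
a = rng.standard_normal((3, 10))
a[0, :3] = np.nan                                            # slice 0 has only 7 finite values
print(_nan_quantile_sketch(a, 0.5)[:, 0])                    # per-row medians, NaNs ignored
print(np.nanquantile(a, 0.5, axis=-1, method='lower'))       # reference
```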

I wrote the following myself, so I'd prefer to do those upgrades personally.

Toward gh-22194, we'll be adding a few new functions to scipy.stats, and those should be array API compatible from the start:

After all that, it may be worth doing:

  • CensoredData
  • ecdf - relies on CensoredData. Might be worth doing after the array API standard has diff with prepend/append (see the sketch after this list). I wrote xp_diff in ENH: stats.rankdata: add array API standard support #20639, but it's slow.
  • logrank - relies on ecdf
  • bws_test - needs permutation_test
  • tukey_hsd - probably not too bad, but most of the time is spent calculating the studentized_range SF. Could vectorize that computation with _tanhsinh, though.
  • pointbiserialr - deprecate, or implement using a shortcut specific to binary data? It's just an alias for pearsonr right now.
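
Regarding the ecdf item above: until the array API standard grows diff, prepend/append can be emulated with concat and slicing. The sketch below is not the xp_diff from gh-20639 (the helper name and signature are mine), and xp.concat follows the standard's spelling (np.concat in NumPy >= 2.0).

```python
import numpy as np


def xp_diff_sketch(x, axis=-1, prepend=None, append=None, xp=np):
    # Hypothetical first-difference helper built only from concat and slicing;
    # prepend/append must already be arrays with the same ndim as x.
    parts = [x] if prepend is None else [xp.asarray(prepend, dtype=x.dtype), x]
    if append is not None:
        parts.append(xp.asarray(append, dtype=x.dtype))
    y = xp.concat(parts, axis=axis) if len(parts) > 1 else x
    hi = [slice(None)] * y.ndim
    lo = [slice(None)] * y.ndim
    hi[axis], lo[axis] = slice(1, None), slice(None, -1)      # y[..., 1:] - y[..., :-1]
    return y[tuple(hi)] - y[tuple(lo)]


x = np.array([1.0, 4.0, 9.0, 16.0])
print(xp_diff_sketch(x, prepend=np.array([0.0])))             # [1. 3. 5. 7.]
print(np.diff(x, prepend=0.0))                                # reference
```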

I am not interested in working on or reviewing work on the following functions:

  • bayes_mvs (consider deprecating)
  • mvsdist (consider deprecating)
  • weightedtau (consider deprecating)
  • multiscale_graphcorr (consider deprecating)
  • tiecorrect (consider deprecating)
  • ranksums (consider deprecating)
  • somersd
  • page_trend_test
  • f_oneway
  • frequency statistics
    • cumfreq
    • percentileofscore
    • scoreatpercentile
    • relfreq
    • binned_statistic
    • binned_statistic_2d
    • binned_statistic_dd
  • plot tests
    • ppcc_max
    • ppcc_plot
    • probplot
    • boxcox_normplot
    • yeojohnson_normplot
  • Old univariate and multivariate distribution infrastructure (rv_continuous, rv_discrete, rv_histogram, etc.) and distributions

Some of the scipy.stats.contingency functions would be feasible to work on, but some would probably need to be vectorized for it to make sense:

  • relative_risk
  • expected_freq
  • margins
  • chi2_contingency
  • association
  • fisher_exact - needs hypergeometric distribution
  • barnard_exact - uses shgo; probably not a good candidate
  • boschloo_exact - uses shgo; probably not a good candidate
  • odds_ratio - I don't think the cost/benefit ratio looks good
  • crosstab

I don't think these are good candidates for translation.

  • trim_mean - would benefit from array API partition
  • binomtest - could probably be made elementwise, but would be a bit of work
  • quantile_test - probably not bad, but needs binomial distribution functions (and inverses)
  • shapiro - compiled. I've written the Shapiro test in pure Python, but the normal distribution order statistic values were computed via numerical integration, and the p-value was just monte_carlo_test, so it's probably faster to convert to NumPy, perform the test, and convert back (see the sketch after this list).
  • anderson - technically not too hard, but there are interface questions to be answered
  • cramervonmises - probably not too bad once we have array API distributions
  • ks_1samp - probably not too bad, but we need array API distributions and array API null distribution CDF/SF
  • kstest - dispatches to ks_1samp and ks_2samp; easy once those are done!
  • poisson_means_test - theoretically this could be an elementwise function, but implementing it would be tricky because scalar arguments are used to create an arange, which would naturally lead to ragged arrays
  • dunnett - statistic is easy to vectorize; p-value is not.
  • scipy.stats.qmc - mostly compiled
  • sobol_indices - relies on scipy.stats.qmc
  • scipy.stats.sampling - all compiled. Long-term goal: rewrite vectorized versions in terms of scipy.interpolate.
  • scipy.stats.mstats - deprecate; accept marrays in regular stats functions.
  • fit - relies heavily on differential_evolution, which would need array API support first.
  • wasserstein_distance_nd - uses linear programming; unlikely to get efficient array API support any time soon.
  • gaussian_kde - there could be an efficient array API implementation in terms of multivariate_normal if Covariance gets support for batch covariance matrices, which was drafted in ENH: stats.multivariate: introduce Covariance class and subclasses mdhaber/scipy#88, but it would be a complete rewrite.
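
For shapiro above (and similar compiled routines), the fallback is the usual round trip: convert to NumPy, run the compiled code, and wrap the results back in the caller's namespace. A minimal sketch of that pattern follows; the helper name is hypothetical, only scipy.stats.shapiro is real, and device transfer for GPU arrays is glossed over.

```python
import numpy as np
from scipy import stats


def shapiro_via_numpy(x, xp):
    # Sketch of the 'convert to NumPy, compute, convert back' fallback.
    # `xp` is the namespace of the input array; the round trip is the point
    # here, not the exact helpers SciPy uses internally.
    x_np = np.asarray(x)                       # GPU arrays would need an explicit transfer
    res = stats.shapiro(x_np)                  # compiled, NumPy-only routine
    return xp.asarray(res.statistic), xp.asarray(res.pvalue)


rng = np.random.default_rng(1)
stat, p = shapiro_via_numpy(rng.standard_normal(50), xp=np)
print(float(stat), float(p))
```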
