RFC: stats: sunsetting scipy.stats.mstats #22194

@mdhaber

scipy.stats.mstats is a mostly-separate re-implementation of scipy.stats with support for NumPy masked arrays. Masked values are treated as missing: for 1-D slices, the result is typically the same as if the masked value were not present.

While there seems to be demand for statistical functions to support missing values, I'd suggest that having two separate implementations of these functions is not the best way to satisfy the need.

  • Maintaining two implementations is approximately twice the work of maintaining a single implementation that supports masked and non-masked arrays.
  • The two implementations have fallen out of sync and will inevitably continue to do so. This would seem to introduce an unfortunate choice between the more capable scipy.stats function and the masked capabilities of its scipy.stats.mstats counterpart [1].

I have seen the opinion that we can combine the implementations but must maintain a separate scipy.stats.mstats namespace. While this does not double the workload, maintaining two interfaces is more work than maintaining one. For instance, many scipy.stats.mstats functions are missing the "Returns" (#22065 (comment)) and "Examples" (gh-7168) sections of their documentation. Also, having separate interfaces for essentially identical functionality is unnecessarily complicated for users.

I see two other reasons why a namespace should not be devoted to NumPy masked arrays.

  • While not actually deprecated, NumPy masked arrays themselves are problematic and mostly unmaintained.
  • NumPy masked arrays are not compatible with the Python Array API and are explicitly rejected by our array_namespace function.

Fortunately, many scipy.stats functions already offer the same functionality as their scipy.stats.mstats counterparts, making the separate namespace redundant. There are actually two obvious ways [2] to ignore missing values in most scipy.stats functions with a scipy.stats.mstats counterpart:

  • Replace the masked values with nan and use nan_policy='omit'.
  • Simply pass the masked array to the scipy.stats function. This behavior has been handled by the _axis_nan_policy decorator for several years.
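
As a rough illustration of the two options above, the sketch below applies both to scipy.stats.skew and compares them with dropping the missing observations by hand. The choice of skew, the random data, and the mask positions are assumptions made for the example; passing a masked array directly relies on a reasonably recent SciPy with SCIPY_ARRAY_API unset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=20)

# Pretend two observations are missing (illustrative positions).
mask = np.zeros(x.shape, dtype=bool)
mask[[3, 7]] = True
xm = np.ma.masked_array(x, mask=mask)

# Baseline: drop the missing observations by hand.
res_manual = float(stats.skew(x[~mask]))

# Option 1: replace masked values with NaN and use nan_policy='omit'.
x_nan = xm.filled(np.nan)
res_nan = float(stats.skew(x_nan, nan_policy='omit'))

# Option 2: pass the masked array directly to the scipy.stats function.
res_masked = float(stats.skew(xm))

print(res_manual, res_nan, res_masked)
```

If both options behave as described, all three printed values agree to within floating-point error.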

Both of these avoid a common pitfall of NumPy masked arrays, which mask non-finite values that arise during calculations. This behavior is problematic because NaNs and infinities should not always be treated the same as missing data.
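
A small sketch of this pitfall, using only NumPy: the input has nothing masked, yet np.ma.log masks the element whose logarithm is not finite, so a later reduction silently skips it. The data values are chosen purely for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, np.e])

# Plain ndarray: log(0) is -inf, and it propagates into the mean.
with np.errstate(divide='ignore'):
    plain_mean = np.log(x).mean()

# Masked array with no masked elements on input.
xm = np.ma.masked_array(x)
logged = np.ma.log(xm)        # the log(0) entry is masked automatically
masked_mean = logged.mean()   # mean over the two remaining entries only

print(plain_mean)
print(np.ma.getmaskarray(logged))
print(masked_mean)
```

The plain result signals that something went wrong in the calculation, whereas the masked result treats the problematic value as if it had been missing from the data all along.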


Update April 2025: The specific plan suggested here has changed; see #22194 (comment) for an update.

Here is the proposed alternative:

  • Decide which scipy.stats.mstats functions to add to scipy.stats, and add them. These new functions should be subject to the same level of review as any other new scipy.stats function, as standards have changed since they were introduced to mstats.
  • Ensure that all scipy.stats functions with a scipy.stats.mstats counterpart support the following:
    • NumPy arrays with NaNs and nan_policy='omit' (always)
    • NumPy masked arrays (when SCIPY_ARRAY_API is unspecified)
  • Deprecate scipy.stats.mstats functions with a helpful warning for transitioning to the scipy.stats equivalent.
  • Remove scipy.stats.mstats functions after the usual deprecation period.
  • Remove the scipy.stats.mstats namespace either a) along with the last scipy.stats.mstats functions (preferably) or b) in SciPy 2.0.0 (if it is necessary to wait for some reason).
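
To make the deprecation step concrete, here is a minimal, standard-library-only sketch of a wrapper that emits a transition hint. The decorator deprecated_in_favor_of and the functions legacy_mean and new_mean are hypothetical names invented for this example; they are not part of SciPy, and SciPy's actual deprecation machinery may look quite different.

```python
import functools
import warnings


def deprecated_in_favor_of(replacement, hint):
    """Hypothetical decorator: warn on every call, then run the old function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"`{func.__name__}` is deprecated; use `{replacement}` "
                f"instead. {hint}",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_in_favor_of(
    "new_mean",  # hypothetical replacement name
    "Replace missing values with NaN and pass nan_policy='omit'.",
)
def legacy_mean(values):
    """Stand-in for an old function: mean of the entries that are not None."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)
```

Calling legacy_mean([1.0, None, 3.0]) still returns the old result, but the caller also sees a DeprecationWarning naming the replacement and the suggested migration path.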

Looking beyond this, I would also suggest that as scipy.stats functions are translated to use the Python Array API, they can also be adapted to natively support marray, which adds masks to any Python Array API compatible backend. In most cases, the only special consideration for MArrays is that the count of non-masked elements along axis should be used in place of the length of the array along axis.
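
The point about counts can be sketched with plain NumPy. This example does not use the marray package itself; a separate data array and boolean mask stand in for an MArray, and masked_mean is a hypothetical helper written only for illustration.

```python
import numpy as np


def masked_mean(data, mask, axis=-1):
    """Mean along `axis`, ignoring elements where `mask` is True."""
    data = np.asarray(data, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Zero out masked entries so they do not contribute to the sum.
    total = np.where(mask, 0.0, data).sum(axis=axis)
    # Divide by the count of non-masked elements, not data.shape[axis].
    count = (~mask).sum(axis=axis)
    return total / count


data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [5.0, 6.0, 7.0, 8.0]])
mask = np.array([[False, False, True, False],
                 [False, True, True, False]])

result = masked_mean(data, mask, axis=1)

# Cross-check against NumPy's masked-array mean.
reference = np.ma.masked_array(data, mask=mask).mean(axis=1)
print(result, reference)
```

Dividing by data.shape[axis] instead of count would shrink both row means, which is the kind of error the count-based adjustment avoids.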


Closing this will close gh-5474

Footnotes

  1. This choice arises only if the scipy.stats version does not already support masked arrays, and many do. (Addressed below.)

  2. Ideally, nan_policy='omit' could also be eliminated, and the same behavior could be achieved by passing an MArray (discussed below) to the function. MArrays do not automatically mask non-finite values that arise during calculations.

Labels: RFC, deprecated, scipy.stats