Description
scipy.stats.mstats is a mostly-separate re-implementation of scipy.stats with support for NumPy masked arrays. Masked values are treated as missing: for 1-D slices, the result is typically the same as if the masked value were not present.
While there seems to be demand for statistical functions to support missing values, I'd suggest that having two separate implementations of these functions is not the best way to satisfy the need.
- Maintaining two implementations is approximately twice the work of maintaining a single implementation that supports masked and non-masked arrays.
- The two implementations have fallen out of sync and will inevitably continue to do so. This would seem to introduce an unfortunate choice between the more capable scipy.stats function and the masked capabilities of its scipy.stats.mstats counterpart [1].
I have seen the opinion that we can combine the implementations but must maintain a separate scipy.stats.mstats namespace. While this does not double the workload, maintaining two interfaces is more work than maintaining one. For instance, many scipy.stats.mstats functions are missing the "Returns" (#22065 (comment)) and "Examples" (gh-7168) sections of their documentation. Also, having separate interfaces for essentially identical functionality is unnecessarily complicated for users.
I see two other reasons why a namespace should not be devoted to NumPy masked arrays.
- While not actually deprecated, NumPy masked arrays themselves are problematic and mostly unmaintained.
- NumPy masked arrays are not compatible with the Python Array API and are explicitly rejected by our array_namespace function.
Fortunately, many scipy.stats functions already offer the same functionality as their scipy.stats.mstats counterparts, making the separate namespace redundant. There are actually two obvious ways [2] to ignore missing values in most scipy.stats functions with a scipy.stats.mstats counterpart:
- Replace the masked values with nan and use nan_policy='omit'.
- Simply pass the masked array to the scipy.stats function. This behavior has been handled by the _axis_nan_policy decorator for several years.
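As a rough illustration of the two approaches listed above, here is a minimal sketch using scipy.stats.skew. The choice of skew and the sample data are assumptions made for the example, and the exact behavior may depend on the installed SciPy version.

```python
import numpy as np
from scipy import stats

# Illustrative data; the third and last observations are "missing".
data = np.array([2.0, 8.0, 0.0, 4.0, 1.0, 9.0, 9.0, 0.0])
mask = np.array([False, False, True, False, False, False, False, True])

# Reference: drop the missing observations by hand.
reference = stats.skew(data[~mask])

# Way 1: replace masked values with NaN and use nan_policy='omit'.
with_nans = np.where(mask, np.nan, data)
via_nan_policy = stats.skew(with_nans, nan_policy='omit')

# Way 2: pass the masked array directly to the scipy.stats function.
masked = np.ma.masked_array(data, mask=mask)
via_masked_array = stats.skew(masked)

print(float(reference), float(via_nan_policy), float(via_masked_array))
```

For this 1-D input, all three values should agree, which is the sense in which the masked values are treated as if they were not present.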
Both of these avoid a common pitfall of NumPy masked arrays, which mask non-finite values that arise during calculations. This behavior is problematic because NaNs and infinities should not always be treated the same as missing data.
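The pitfall can be seen with NumPy alone. In the sketch below (the arrays are made up for illustration), only one element is masked on input, yet a division by zero causes a second element to be masked in the result, whereas the same operation on plain arrays yields inf.

```python
import numpy as np

# Only the third element is masked in the inputs.
numerator = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, False, True])
denominator = np.ma.masked_array([0.0, 1.0, 1.0])

# The division by zero in the first element is silently masked as well.
ratio = numerator / denominator
mask_after = np.ma.getmaskarray(ratio)
print(mask_after)

# With plain arrays, the non-finite value is preserved instead of hidden.
with np.errstate(divide='ignore'):
    plain = np.array([1.0, 2.0]) / np.array([0.0, 1.0])
print(plain)
```

After this operation, the first element is indistinguishable from data that was missing to begin with.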
Update April 2025: The specific plan suggested here has changed; see #22194 (comment) for an update.
Here is the proposed alternative:
- Decide which scipy.stats.mstats functions to add to scipy.stats, and add them. These new functions should be subject to the same level of review as any other new scipy.stats function, as standards have changed since they were introduced to mstats.
- Ensure that all scipy.stats functions with a scipy.stats.mstats counterpart support the following:
  - NumPy arrays with NaNs and nan_policy='omit' (always)
  - NumPy masked arrays (when SCIPY_ARRAY_API is unspecified)
- Deprecate scipy.stats.mstats functions with a helpful warning for transitioning to the scipy.stats equivalent.
- Remove scipy.stats.mstats functions after the usual deprecation period.
- Remove the scipy.stats.mstats namespace either a) along with the last scipy.stats.mstats functions (preferably) or b) in SciPy 2.0.0 (if it is necessary to wait for some reason).
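To make the "helpful warning" step concrete, here is one possible shape for it, sketched with the standard library only. The decorator deprecated_in_favor_of, the function example_statistic, and the replacement name passed to the decorator are all hypothetical; this is not SciPy's actual deprecation machinery.

```python
import functools
import warnings


def deprecated_in_favor_of(replacement):
    """Hypothetical decorator: warn and point users at `replacement`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"`{func.__name__}` is deprecated; use `{replacement}` "
                "instead. Masked arrays can be passed directly, or masked "
                "values can be replaced with NaN and `nan_policy='omit'` "
                "used.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Stand-in for an mstats function; the replacement name is made up.
@deprecated_in_favor_of("scipy.stats.example_statistic")
def example_statistic(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)
```

Calling example_statistic([1, 2, 3]) still returns the result but emits a DeprecationWarning that names the replacement and both migration routes.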
Looking beyond this, I would also suggest that as scipy.stats functions are translated to use the Python Array API, they can also be adapted to natively support MArray, which adds masks to any Python Array API compatible backend. In most cases, the only special consideration for MArrays is that the count of non-masked elements along axis should be used in place of the length of the array along axis.
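The point about counts can be sketched with a mean computed from a data array and a separate boolean mask. This uses plain NumPy rather than MArray itself, and the helper masked_mean is a hypothetical name introduced only for illustration.

```python
import numpy as np


def masked_mean(data, mask, axis):
    """Mean along `axis`, ignoring elements where `mask` is True."""
    data = np.asarray(data, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Masked elements contribute nothing to the sum.
    total = np.where(mask, 0.0, data).sum(axis=axis)
    # Count of non-masked elements, used in place of data.shape[axis].
    count = (~mask).sum(axis=axis)
    return total / count


data = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
mask = np.array([[False, False, True], [False, False, False]])
result = masked_mean(data, mask, axis=1)
print(result)
```

The only departure from an unmasked mean is the denominator: the first row is divided by 2 rather than 3 because one of its elements is masked.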
Closing this will close gh-5474
Footnotes
1. If the scipy.stats version did not already support masked arrays - but many do. (Addressed below.)
2. Ideally, nan_policy='omit' could also be eliminated, and the same behavior could be achieved by passing an MArray (discussed below) to the function. MArrays do not automatically mask non-finite values that arise during calculations.