Skip to content

Conversation

mdhaber
Copy link
Contributor

@mdhaber mdhaber commented Mar 31, 2024

Reference issue

gh-18286

What does this implement/fix?

Presumably the ecosystem will still have a need for masked arrays once we turn on array API support by default. This is a draft of a function that accepts an array API namespace and returns a corresponding array API compatible(ish?) masked array namespace.

from scipy._lib.array_api_compat import numpy as xp
from scipy.stats import masked_array
ma = masked_array(xp)
A = ma.eye(3)
A.mask[0, ...] = True
x = ma.asarray([1, 2, 3], mask=[False, False, True])
A @ x
# masked_array(data=[--, 2.0, 0.0],
#              mask=[ True, False, False],
#        fill_value=1e+20)

from scipy._lib.array_api_compat import cupy as xp
ma = masked_array(xp)
A = ma.asarray(A, mask=A.mask)  # If we hadn't changed the underlying array type,
x = ma.asarray(x, mask=x.mask)  # we wouldn't need to pass the mask separately
b = A @ x
# __str__ and __repr__ don't work for CuPy yet
b.get()  # array([0., 2., 0.])
b.mask  # array([ True, False, False])
type(b.mask)  # cupy.ndarray

Additional information

Do we want this in SciPy, or should this be a separate library?

This works with NumPy and CuPy (except for __repr__ and __str__) right now, but the only difference for other backends would probably be how the array is subclassed. Perhaps these should not be subclasses of the underlying arrays at all (because results would silently be wrong if they are implicitly treated as regular arrays), but that is easy enough to change.

Most features are drafted; all need tests. masked_array function needs documentation, and docstrings need to be attached to returned attributes. There are a few decisions to be made about what the corresponding masked array behavior should be; I'll list these later.

@mdhaber mdhaber added scipy.stats enhancement A new feature or improvement labels Mar 31, 2024
@github-actions github-actions bot added the Meson Items related to the introduction of Meson as the new build system for SciPy label Mar 31, 2024
@lucascolley lucascolley removed the Meson Items related to the introduction of Meson as the new build system for SciPy label Mar 31, 2024
@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 10, 2024

@rgommers I am not particularly interested in masked arrays, but if the alternative is for there to be scipy.stats clones that add support, I would prefer to support them natively. In gh-18286, it sounded like the plan did not currently involve masked arrays, but mostly because it was a complication that could be put off until later. Is now a good time to discuss it? If so, what are your thoughts on whether SciPy should provide a backend-agnostic implementation (e.g. this PR)?

@rgommers
Copy link
Member

Thanks for the question and prototype Matt. This went in a direction I didn't expect, so let me first summarize what I had in mind:

  • stats.mstats functions accept numpy.ma.MaskedArray instances, and that should stay unchanged by design
  • stats functions (and functions in other scipy submodules) may accidentally support numpy.ma.MaskedArray instances too, but that is very patchy and accidental / mostly untested.
    • This can be deprecated/removed, because partial support and partially functions that just silently return the wrong results is a bad user experience.
    • This can be done even when the actual implementation of some function is shared between stats and mstats. Such a situation just requires a very thin public function that validates input arrays as appropriate, and then calls the shared private implementation.
  • NumPy's masked array support is in a poor state, and I certainly had not thought about extending its coverage in SciPy or adding support for other array libraries to it. It requires finishing the full rewrite that is floating around somewhere at ~90% completeness.

Now for this PR: I think you're starting yet another rewrite of masked arrays from scratch here? That's a large endeavor, and is not necessary from my point of view. For sure there'll be some potential users if it's done right, but I don't think the priority of doing so is high. And if this does happen, it should not live inside SciPy. It'd be a new thing anyway, so it's not that relevant then to how we'd treat np.ma.MaskedArray instances.

disclaimer: I wrote this in all of 10 minutes, it's very possible I missed something important here.

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 10, 2024

Now for this PR: I think you're starting yet another rewrite of masked arrays from scratch here? That's a large endeavor, and is not necessary from my point of view.

I think the point of this PR is to suggest that rewriting masked arrays does not need to be a large endeavor. There are probably bugs in this PR, but I think you'll find that it implements masked versions of almost all array API functionality with light wrappers around the functions of the provided backend. The code is deceptively short because most of the wrappers can be shared by many functions with compatible signatures. (Admittedly, 90% of the work - testing and documentation - is left, but I still think it shows that the implementation need not be a big undertaking.) I think in most cases, the performance gain of a low-level implementation is unnecessary (or it might not be substantial) and the ecosystem would do just fine with an implementation of masked arrays composed only of high-level array API functionality.

It requires finishing the full rewrite that is floating around somewhere at ~90% completeness.

It might not require that, though. What about something like this PR that implements masked versions of array API functionality for any array API backend, not just NumPy?

The approach - which masked arrays might already use, but I didn't look - is not to skip the calculations on masked elements, but to replace masked elements with values that don't affect the result of the operation. For example, before performing sum, replace masked elements with 0, then perform the sum on everything.

@rgommers
Copy link
Member

It may well be true that it's less work than I think. And this code does look quite clean. Supporting only the array API standard is certainly easier than all of numpy.ma. That said, a few things that immediately stood out to me:

  • Subclassing is not supported by the array API standard, nor by all libraries. NumPy/CuPy allow it; PyTorch does too (but through a very different mechanism), JAX does not AFAIK.
  • One of the largest problems of numpy's masked arrays is correctness bugs when the mask gets silently dropped - this prototype has the same issue:
>>> ma = masked_array(np)
>>> x = ma.asarray([1, 2, 3], mask=[False, False, True])
>>> x
masked_array(data=[1, 2, --],
             mask=[False, False,  True],
       fill_value=999999)
>>> np.asarray(x)
array([1, 2, 3])
  • Avoiding that mask dropping makes it much harder to work with any libraries that aren't fully compatible with the array API standard, as well as with libraries having compiled code internally (including SciPy).

It might not require that, though. What about something like this PR that implements masked versions of array API functionality for any array API backend, not just NumPy?

This is a new array type/library - could be cool, and if it supports __array_namespace__ then some subset of SciPy may work with it out of the box even (I expect that to be a small-ish fraction though). I don't see how that changes things for SciPy's treatment of numpy.ma.MaskedArray instances however? These two topics are orthogonal.

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 10, 2024

Subclassing is not supported by the array API standard, nor by all libraries...One of the largest problems of numpy's masked arrays is correctness bugs when the mask gets silently dropped

Yup, I mentioned this in the top post. The simple solution to both issues is just not trying to be a subclass. There are only a few array API features that this is inheriting directly from the underlying type, and those are easy to replace.

Avoiding that mask dropping makes it much harder to work with any libraries that aren't fully compatible with the array API standard,

Much harder or impossible? I expected impossible. Seems like an inherent conflict.

But I think it's exactly the same problem as the fact that old code written with NumPy in mind doesn't immediately work with other array backends. We are in the process of converting old NumPy code to be array API compatible. After that happens, these masjed arrays would immediately "work" with that code - there would just be a few reasons why the results would not be correct. The main thing is that we commonly use shape to count the number of elements in each axis-slice, and we assume it's the same for all slices. We need to adjust that logic with a count of the number of non-masked elements, but that's pretty easy. We might find other cases where modifications are necessary, but that is the thing that comes to mind.

I don't see how that changes things for SciPy's treatment of numpy.ma.MaskedArray instances however?

I don't either? I didn't mean to write anything that suggests that we'd increase support for numpy.ma.MaskedArray. I'm just proposing the idea of:

  • backend agnostic array API compatible masked arrays like in this PR, whether in SciPy or released as a separate library
  • support for these (not np.ma.MaskedArray or a rewrite of it) in SciPy stats as the way forward for masked arrays and potentially nan_policy support.

I suppose I interpreted your comments about masked arrays in gh-18286 to be in the broader sense of masked arrays, not np.ma.MaskedArray in particular.

I submitted this as a PR to SciPy because it could conceivably be within SciPy's purview as a type of data structure, and the fact that stats might be a significant user of it suggests that it might be nice for SciPy to have control over it. We could ensure that it allows us to implement nan_policy='omit' for non-NumPy backends, for instance. That said, with control comes responsibility, so I can see why SciPy might not want to assume the burden.

@rgommers
Copy link
Member

Yup, I mentioned this in the top post. The simple solution to both issues is just not trying to be a subclass.

Yes, that makes sense.

There are only a few array API features that this is inheriting directly from the underlying type, and those are easy to replace.

Mostly yes indeed. There will be some exceptions I think, like __dlpack__ can't work unless there's no mask. And the .shape thing isn't that easy I think, because what you get instead are ragged arrays.

We are in the process of converting old NumPy code to be array API compatible. After that happens, these masjed arrays would immediately "work" with that code - there would just be a few reasons why the results would not be correct

I'll note that it will only work for pure Python code, which will be a small subset of all of SciPy. With some extra effort and going outside of the standard, it may be possible to support element-wise functions like we have in scipy.special. But that's about it.

And it's not only compiled code. If you think about modules like signal and ndimage, those inherently are assuming regularly-spaced data. You can't support masked arrays in that type of code without thinking really hard about infill strategies (Pandas time series functionality supports that kind of thing).

Much harder or impossible? I expected impossible. Seems like an inherent conflict.

Impossible unless that package adds specific support for handling masked arrays somehow (separate code path).

I don't either? I didn't mean to write anything that suggests that we'd increase support for numpy.ma.MaskedArray. I'm just proposing the idea of:
...
I suppose I interpreted your comments about masked arrays in gh-18286 to be in the broader sense of masked arrays, not np.ma.MaskedArray in particular.

Okay fair enough. My comments in gh-18286 were only about numpy.ma.MaskedArray, because that's all we have today and need to worry about in Scipy backwards-compat wise.

A new masked array library could gain traction, but I would not try to include it in SciPy - it is better for it to be its own thing.

Presumably the ecosystem will still have a need for masked arrays once we turn on array API support by default.

NumPy's stance on this effective has been to punt on this and to leave generic missing value support to dataframe libraries. That could change of course, and if someone wanted to rewrite numpy.ma that'd be great (inside numpy itself or in a separate package) - but I think you're the first one to put real work into it in a long time.

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 11, 2024

Ok, thanks for the the thoughts. Re: pure Python code, IIUC, that is assumed for array API code in general; for code like signal/ndimage, sure, missing data presents inherent difficulty to those subpackages. I'm mostly trying to find a better way for stats to continue providing nan_policy='omit' and masked array support as we switch to array API; the rest is icing o the cake.

I don't know of anyone else in SciPy who's interested enough, so I'll go ahead and close it here. We'll see if NumPy would consider it, and if not, maybe I'll release it separately or just use it privately in SciPy for nan_policy='omit'.

@mdhaber mdhaber closed this Apr 11, 2024
@rgommers
Copy link
Member

I'm mostly trying to find a better way for stats to continue providing nan_policy='omit' and masked array support as we switch to array API

This is where I am still missing something. I looked at the implementation of the machinery in stats/_axis_nan_policy.py, and it does not use numpy.ma directly at the moment. There is a bit of code that explicitly checks for a mask attribute, but not much else - the implementation still uses regular numpy functions, and seems implementable in terms of functions in the array API standard. There's a couple of functions used now that aren't in it though, like np.insert and np.delete. Is that the problem here, you're trying to rewrite that and are missing some functionality?

My expectation was that you could mostly do np -> xp and it'd work. And np.ma.MaskedArray support would continue to have a separate code path (and a separate module in stats.mstats).

@mdhaber
Copy link
Contributor Author

mdhaber commented Apr 11, 2024

My expectation was that you could mostly do np -> xp and it'd work.

We can, and that's what we've planned on, but as we transition to array API, I thought it would be nice to have something that's more performant. Currently, it loops over each axis-slice to support nan_policy='omit'. That is fine when compared against the old masked array implementations, but it is pretty wasteful if we're working on a GPU-based array.

@rgommers
Copy link
Member

Ah okay, that is the context I was missing, thanks. If this can be a performance boost then I have no concerns if it lives as private machinery in SciPy somewhere. Or is a vendored copy of an independent package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants