WIP: stats.masked_array: array API compatible masked arrays #20363
Conversation
@rgommers I am not particularly interested in masked arrays, but if the alternative is for there to be |
Thanks for the question and prototype Matt. This went in a direction I didn't expect, so let me first summarize what I had in mind:
Now for this PR: I think you're starting yet another rewrite of masked arrays from scratch here? That's a large endeavor, and is not necessary from my point of view. For sure there'll be some potential users if it's done right, but I don't think the priority of doing so is high. And if this does happen, it should not live inside SciPy. It'd be a new thing anyway, so it's not that relevant then to how we'd treat disclaimer: I wrote this in all of 10 minutes, it's very possible I missed something important here. |
I think the point of this PR is to suggest that rewriting masked arrays does not need to be a large endeavor. There are probably bugs in this PR, but I think you'll find that it implements masked versions of almost all array API functionality with light wrappers around the functions of the provided backend. The code is deceptively short because most of the wrappers can be shared by many functions with compatible signatures. (Admittedly, 90% of the work - testing and documentation - is left, but I still think it shows that the implementation need not be a big undertaking.) I think in most cases, the performance gain of a low-level implementation is unnecessary (or it might not be substantial) and the ecosystem would do just fine with an implementation of masked arrays composed only of high-level array API functionality.
It might not require that, though. What about something like this PR that implements masked versions of array API functionality for any array API backend, not just NumPy? The approach - which masked arrays might already use, but I didn't look - is not to skip the calculations on masked elements, but to replace masked elements with values that don't affect the result of the operation. For example, before performing |
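A minimal sketch of the fill-then-reduce idea described above, not code from this PR. The helper names `masked_sum` and `masked_max` are hypothetical, and the sketch assumes a floating-point array and a boolean mask in which `True` marks a masked element. Each masked element is replaced by a value that cannot affect the reduction (0 for a sum, negative infinity for a maximum), and then the backend's ordinary function is called.

```python
import numpy as np


def masked_sum(xp, data, mask, axis=None):
    """Sum over unmasked elements by filling masked ones with 0."""
    # Hypothetical helper: 0 is the identity of addition, so it cannot change the sum.
    zero = xp.zeros((), dtype=data.dtype)
    filled = xp.where(mask, zero, data)
    return xp.sum(filled, axis=axis)


def masked_max(xp, data, mask, axis=None):
    """Max over unmasked elements by filling masked ones with -inf (floats only)."""
    # Hypothetical helper: -inf never exceeds any unmasked value.
    fill = xp.asarray(-xp.inf, dtype=data.dtype)
    filled = xp.where(mask, fill, data)
    return xp.max(filled, axis=axis)


data = np.asarray([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
mask = np.asarray([[False, False, True], [True, False, False]])

row_sums = masked_sum(np, data, mask, axis=1)  # sums of (1, 2) and (5, 6)
row_maxes = masked_max(np, data, mask, axis=1)  # maxima of (1, 2) and (5, 6)
print(row_sums, row_maxes)
```

A fully masked slice would return the fill value itself here, so a real implementation would also need to compute the mask of the result.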
It may well be true that it's less work than I think. And this code does look quite clean. Supporting only the array API standard is certainly easier than all of
>>> ma = masked_array(np)
>>> x = ma.asarray([1, 2, 3], mask=[False, False, True])
>>> x
masked_array(data=[1, 2, --],
             mask=[False, False, True],
       fill_value=999999)
>>> np.asarray(x)
array([1, 2, 3])
This is a new array type/library - could be cool, and if it supports |
Yup, I mentioned this in the top post. The simple solution to both issues is just not trying to be a subclass. There are only a few array API features that this is inheriting directly from the underlying type, and those are easy to replace.
Much harder or impossible? I expected impossible. Seems like an inherent conflict. But I think it's exactly the same problem as the fact that old code written with NumPy in mind doesn't immediately work with other array backends. We are in the process of converting old NumPy code to be array API compatible. After that happens, these masked arrays would immediately "work" with that code - there would just be a few reasons why the results would not be correct. The main thing is that we commonly use
I don't either? I didn't mean to write anything that suggests that we'd increase support for
I suppose I interpreted your comments about masked arrays in gh-18286 to be in the broader sense of masked arrays, not I submitted this as a PR to SciPy because it could conceivably be within SciPy's purview as a type of data structure, and the fact that |
Yes, that makes sense.
Mostly yes indeed. There will be some exceptions I think, like
I'll note that it will only work for pure Python code, which will be a small subset of all of SciPy. With some extra effort and going outside of the standard, it may be possible to support element-wise functions like we have in And it's not only compiled code. If you think about modules like
Impossible unless that package adds specific support for handling masked arrays somehow (separate code path).
Okay fair enough. My comments in gh-18286 were only about A new masked array library could gain traction, but I would not try to include it in SciPy - it is better for it to be its own thing.
NumPy's stance has effectively been to punt on this and to leave generic missing value support to dataframe libraries. That could change of course, and if someone wanted to rewrite |
Ok, thanks for the thoughts. Re: pure Python code, IIUC, that is assumed for array API code in general; for code like I don't know of anyone else in SciPy who's interested enough, so I'll go ahead and close it here. We'll see if NumPy would consider it, and if not, maybe I'll release it separately or just use it privately in SciPy for |
This is where I am still missing something. I looked at the implementation of the machinery in My expectation was that you could mostly do |
We can, and that's what we've planned on, but as we transition to array API, I thought it would be nice to have something that's more performant. Currently, it loops over each axis-slice to support |
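To illustrate the performance point, here is a hedged sketch that contrasts a per-slice loop with a single vectorized fill-and-reduce pass. It is not SciPy's actual machinery. NaN omission is used purely as an illustrative case, and both function names are hypothetical.

```python
import numpy as np


def mean_omit_nan_loop(x):
    """Per-slice loop: drop NaNs from each row, then reduce that row."""
    out = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        row = x[i]
        out[i] = np.mean(row[~np.isnan(row)])
    return out


def mean_omit_nan_vectorized(x):
    """Fill-and-reduce: one vectorized pass over the whole array."""
    mask = np.isnan(x)
    # Masked (NaN) entries contribute 0 to the total and nothing to the count.
    total = np.sum(np.where(mask, 0.0, x), axis=-1)
    count = np.sum(~mask, axis=-1)
    return total / count


x = np.asarray([[1.0, np.nan, 3.0], [4.0, 5.0, np.nan]])
loop_result = mean_omit_nan_loop(x)
vectorized_result = mean_omit_nan_vectorized(x)
print(loop_result, vectorized_result)
```

Both functions give the same row means, but the second avoids a Python-level loop over slices, which is where the speedup would come from on large inputs.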
Ah okay, that is the context I was missing, thanks. If this can be a performance boost then I have no concerns if it lives as private machinery in SciPy somewhere. Or is a vendored copy of an independent package. |
Reference issue
gh-18286
What does this implement/fix?
Presumably the ecosystem will still have a need for masked arrays once we turn on array API support by default. This is a draft of a function that accepts an array API namespace and returns a corresponding array API compatible(ish?) masked array namespace.
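As a rough sketch of what such a factory could look like (this is not the code in the PR): the names `Masked`, `make_masked_namespace`, and `wrap_elementwise` are hypothetical, and the only mask rule assumed is that an element-wise result is masked wherever any input is masked. One shared wrapper covers every element-wise function with a compatible signature.

```python
import types

import numpy as np


class Masked:
    """Hypothetical container pairing data with a boolean mask (not a subclass)."""

    def __init__(self, data, mask):
        self.data = data
        self.mask = mask


def make_masked_namespace(xp):
    """Return a tiny namespace of masked functions built on the backend `xp`."""

    def asarray(obj, mask=False):
        data = xp.asarray(obj)
        # Broadcasting lets a scalar False stand for "nothing masked".
        return Masked(data, xp.broadcast_to(xp.asarray(mask), data.shape))

    def wrap_elementwise(name):
        func = getattr(xp, name)

        def wrapped(*args):
            # Assumed rule: the result is masked where any input is masked.
            mask = args[0].mask
            for arg in args[1:]:
                mask = xp.logical_or(mask, arg.mask)
            return Masked(func(*[arg.data for arg in args]), mask)

        return wrapped

    ns = types.SimpleNamespace(asarray=asarray)
    for name in ("add", "multiply", "sqrt", "abs"):
        setattr(ns, name, wrap_elementwise(name))
    return ns


ma = make_masked_namespace(np)
x = ma.asarray([1.0, 4.0, 9.0], mask=[False, False, True])
y = ma.asarray([1.0, 1.0, 1.0], mask=[True, False, False])
z = ma.add(ma.sqrt(x), y)
print(z.data, z.mask)
```

Because `Masked` is a plain container rather than a subclass of the backend array, it cannot be silently treated as a regular array, which is the concern raised under "Additional information".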
Additional information
Do we want this in SciPy, or should this be a separate library?
This works with NumPy and CuPy (except for __repr__ and __str__) right now, but the only difference for other backends would probably be how the array is subclassed. Perhaps these should not be subclasses of the underlying arrays at all (because results would silently be wrong if they are implicitly treated as regular arrays), but that is easy enough to change.
Most features are drafted; all need tests.
The masked_array function needs documentation, and docstrings need to be attached to returned attributes. There are a few decisions to be made about what the corresponding masked array behavior should be; I'll list these later.