-
-
Notifications
You must be signed in to change notification settings - Fork 11.2k
ENH: Improve performance of np.count_nonzero for float arrays #27523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
close/reopen for testing. |
@tylerjereddy Would you me able to review this PR? (there is also #27519 which is related) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add asv
benchmarks to guard the performance improvement? When I run asv continuous -e -b "time_count_nonzero.*" main count_nonzero_float
locally, I don't see any sign of improvements, probably because I don't see any benchmark parametrization over float types, so maybe we could include them.
The floating code path might be less common than the integer code paths since checking if a float is exactly zero may be a bit more annoying? I suppose that might be a point of potential resistance from the other core devs? Though the shims seem fairly small-ish.
switch(dtype_num) { | ||
/**begin repeat | ||
* | ||
* #dtype = npy_bool, npy_byte, npy_byte, npy_uint16, npy_int16, npy_uint32, npy_int32, npy_uint64, npy_int64, npy_float, npy_double# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can purge this line since there is no substitution target for dtype
in the repeated block? At least when I do that locally a clean build from source still passes the full suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me, thanks.
Please add a test for byte swapped floats (or point to one if it is fine). The code doesn't look like it deals with that correctly (it is completely fine to just take the slow path for all byte-swapped dtypes, although it is OK for integers of course).
/**end repeat**/ | ||
} | ||
return -1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the only reason for putting it here is that the other file isn't c.src
(or .cpp
for that matter).
(OK with me)
} | ||
/**end repeat**/ | ||
|
||
npy_intp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
npy_intp | |
NPY_NO_EXPORT npy_intp |
not sure we fixed that... Once that happens we should just remove this throughout (rather than stop adding it), IMO.
I forgot to point out that the important thing about byte swapping is a byte-swapped |
@@ -2724,6 +2724,15 @@ PyArray_CountNonzero(PyArrayObject *self) | |||
} | |||
} | |||
else { | |||
/* Special low-overhead version specific to the float types (and some others) */ | |||
if (PyArray_ISNOTSWAPPED(self)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized (for reasons): Basically every time you check something like this you also need to check whether the array is aligned. The above code does that, as you can see.
(Please also check in the other PR. One way to test in CI could be to use a "?d"
structured dtype, another to just view
with a byte shift. The structured dtype just doesn't check if there is a contiguous only special loop.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we should replace PyArray_ISNOTSWAPPED(self)
with PyArray_ISALIGNED(self) && PyArray_ISNOTSWAPPED(self)
? What would be the reason for this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dereferencing an unaligned pointer like *(float64 *)ptr
will just kill with a SIGBUS or something on come hardware (it may be slow on others).
It isn't a problem on typical hardware...
Dunno, I think some snaitizer (UBSAN?) would find it, otherwise the typical thing is that debian opens an issue eventually because there is a SIGBUS on sparc or so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference: the behavior for unaligned dereferencing is indeed unaligned. See C17 [ISO/IEC 9899:2018] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf section 6.3.2.7
numpy/_core/tests/test_numeric.py
Outdated
|
||
def test_count_nonzero_non_aligned_array(self): | ||
sz = 64 | ||
b = np.zeros(64 + 1).view(np.int8)[1:-(np.intp(0).itemsize - 1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intp
itemsize != float64 itemsize. Just hardcode the 8/7 bytes, seems easier to read. Can also create the int8/uint8
array initially directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's put this in, thanks. Looks right now, and clean enough. Thanks for the new tests also, which are important!
I still feel it would be nice to merge this with the bool/int "simple path" above, although it is slightly different I suppose (but it may actually be faster for the trivially iterable cases, although it may be that we are calling sum()
at least for bools there for exactly this optimization).
Improve performance of
np.count_nonzero
for float arays. The PR is similar to #18183, but uses a simple loop instead of SIMD instructions.arraytypes.c.src
has more types than is strictly required (the integers have their own fast path), these can be removed if needed.Quick benchmark:
Main:
PR