Skip to content

ENH: Improve performance of np.count_nonzero for float arrays #27523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Apr 22, 2025

Conversation

eendebakpt
Copy link
Contributor

@eendebakpt eendebakpt commented Oct 7, 2024

Improve performance of np.count_nonzero for float arays. The PR is similar to #18183, but uses a simple loop instead of SIMD instructions.

  • The template in arraytypes.c.src has more types than is strictly required (the integers have their own fast path), these can be removed if needed.

Quick benchmark:

import sys
import timeit
import numpy as np

b = np.arange(10_000).astype(bool)
x = b.astype(int)
f = b.astype(np.float64)
f32 = b.astype(np.float32)
n=10

for A in [b, x, f, f32]:
    t = timeit.timeit('np.count_nonzero(A)', globals=globals(), number=n)
    print(f'{A.dtype}: {1e6*t/n:.1f} [us]')

Main:

bool: 1.5 [us]
int64: 5.1 [us]
float64: 24.8 [us]
float32: 28.0 [us]

PR

bool: 1.5 [us]
int64: 5.1 [us]
float64: 7.8 [us]
float32: 7.7 [us]

@eendebakpt eendebakpt marked this pull request as draft October 7, 2024 11:49
@eendebakpt eendebakpt marked this pull request as ready for review October 7, 2024 17:05
@charris
Copy link
Member

charris commented Oct 8, 2024

close/reopen for testing.

@eendebakpt
Copy link
Contributor Author

eendebakpt commented Feb 10, 2025

@tylerjereddy Would you me able to review this PR? (there is also #27519 which is related)

Copy link
Contributor

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add asv benchmarks to guard the performance improvement? When I run asv continuous -e -b "time_count_nonzero.*" main count_nonzero_float locally, I don't see any sign of improvements, probably because I don't see any benchmark parametrization over float types, so maybe we could include them.

The floating code path might be less common than the integer code paths since checking if a float is exactly zero may be a bit more annoying? I suppose that might be a point of potential resistance from the other core devs? Though the shims seem fairly small-ish.

switch(dtype_num) {
/**begin repeat
*
* #dtype = npy_bool, npy_byte, npy_byte, npy_uint16, npy_int16, npy_uint32, npy_int32, npy_uint64, npy_int64, npy_float, npy_double#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can purge this line since there is no substitution target for dtype in the repeated block? At least when I do that locally a clean build from source still passes the full suite.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, thanks.

Please add a test for byte swapped floats (or point to one if it is fine). The code doesn't look like it deals with that correctly (it is completely fine to just take the slow path for all byte-swapped dtypes, although it is OK for integers of course).

/**end repeat**/
}
return -1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the only reason for putting it here is that the other file isn't c.src (or .cpp for that matter).

(OK with me)

}
/**end repeat**/

npy_intp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
npy_intp
NPY_NO_EXPORT npy_intp

not sure we fixed that... Once that happens we should just remove this throughout (rather than stop adding it), IMO.

@seberg
Copy link
Member

seberg commented Feb 11, 2025

I forgot to point out that the important thing about byte swapping is a byte-swapped -0.0 not a 0, since it actually is all zero.

@@ -2724,6 +2724,15 @@ PyArray_CountNonzero(PyArrayObject *self)
}
}
else {
/* Special low-overhead version specific to the float types (and some others) */
if (PyArray_ISNOTSWAPPED(self)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized (for reasons): Basically every time you check something like this you also need to check whether the array is aligned. The above code does that, as you can see.

(Please also check in the other PR. One way to test in CI could be to use a "?d" structured dtype, another to just view with a byte shift. The structured dtype just doesn't check if there is a contiguous only special loop.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should replace PyArray_ISNOTSWAPPED(self) with PyArray_ISALIGNED(self) && PyArray_ISNOTSWAPPED(self)? What would be the reason for this check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dereferencing an unaligned pointer like *(float64 *)ptr will just kill with a SIGBUS or something on come hardware (it may be slow on others).
It isn't a problem on typical hardware...

Dunno, I think some snaitizer (UBSAN?) would find it, otherwise the typical thing is that debian opens an issue eventually because there is a SIGBUS on sparc or so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference: the behavior for unaligned dereferencing is indeed unaligned. See C17 [ISO/IEC 9899:2018] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf section 6.3.2.7


def test_count_nonzero_non_aligned_array(self):
sz = 64
b = np.zeros(64 + 1).view(np.int8)[1:-(np.intp(0).itemsize - 1)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intp itemsize != float64 itemsize. Just hardcode the 8/7 bytes, seems easier to read. Can also create the int8/uint8 array initially directly.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's put this in, thanks. Looks right now, and clean enough. Thanks for the new tests also, which are important!

I still feel it would be nice to merge this with the bool/int "simple path" above, although it is slightly different I suppose (but it may actually be faster for the trivially iterable cases, although it may be that we are calling sum() at least for bools there for exactly this optimization).

@seberg seberg merged commit e68ac76 into numpy:main Apr 22, 2025
71 of 72 checks passed
@seberg seberg added this to the 2.3.0 release milestone Apr 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants