Description
Hi all,
I compiled example.c with and without NEON support on my Raspberry Pi 4 and got these results (using the same 2GB test file for every run):
- sha1sum file: 12.7s
- sha256sum file: 18.9s
- cat file | example program portable: 12s
- cat file | example program NEON: 9.5s
- md5sum file: 9.8s
- xxhsum file: 1.9s
- cat file | xxhsum: 4s
I also installed Rust and b3sum and got these results:
- b3sum 1 thread no mmap: 8s
- b3sum 4 threads no mmap: 8s
- b3sum 1 thread: 7.9s
- b3sum 2 threads: 4s
- b3sum 3 threads: 2.7s
- b3sum 4 threads: 2s
- b3sum 16 threads: 2s
- cat file | b3sum: 10s
The running time is clearly not I/O-dominated: xxhsum hashed the file in about 2 seconds, while the NEON-compiled example took 9.5 seconds. Piping the file into b3sum instead of calling b3sum on the file directly adds about 2 seconds (10s vs. 7.9s). But even if we subtract those 2 seconds of pipe overhead, it's clear that most of the time is spent on the CPU rather than on I/O.
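To put numbers on that, here is a rough sketch that converts the timings above into average throughput. It assumes "2GB" means 2 * 10^9 bytes (an assumption; the exact file size is not stated), and the function and variable names are just illustrative.

```python
# Timings in seconds, copied from the lists above.
FILE_BYTES = 2 * 10**9  # assumption: "2GB" read as 2e9 bytes

timings = {
    "xxhsum file": 1.9,
    "cat file | xxhsum": 4.0,
    "b3sum 1 thread": 7.9,
    "cat file | b3sum": 10.0,
    "cat file | example NEON": 9.5,
}


def throughput_mb_s(seconds, size_bytes=FILE_BYTES):
    """Average throughput in MB/s (1 MB = 1e6 bytes)."""
    return size_bytes / seconds / 1e6


for name, secs in timings.items():
    print(f"{name:26s} {throughput_mb_s(secs):7.0f} MB/s")

# Extra wall-clock time attributable to the pipe, per tool.
pipe_cost_xxh = timings["cat file | xxhsum"] - timings["xxhsum file"]
pipe_cost_b3 = timings["cat file | b3sum"] - timings["b3sum 1 thread"]
print(f"pipe overhead: xxhsum {pipe_cost_xxh:.1f}s, b3sum {pipe_cost_b3:.1f}s")
```

Under that size assumption, the storage can evidently deliver the file in about 2 seconds, so anything that takes much longer than that is waiting on the CPU, not the disk.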
So the results show that the NEON version of BLAKE3 is only about 26% faster than the portable version (12s vs. 9.5s).
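For reference, this is how I arrived at that percentage. It is only a quick sketch using the two piped timings from the list above; the helper names are mine.

```python
def speedup_percent(baseline_s, optimized_s):
    """Throughput gain of the optimized run over the baseline, in percent."""
    return (baseline_s / optimized_s - 1.0) * 100.0


def time_saved_percent(baseline_s, optimized_s):
    """Reduction in wall-clock time relative to the baseline, in percent."""
    return (baseline_s - optimized_s) / baseline_s * 100.0


portable_s = 12.0  # cat file | example program portable
neon_s = 9.5       # cat file | example program NEON

print(f"NEON is {speedup_percent(portable_s, neon_s):.1f}% faster")
print(f"NEON takes {time_saved_percent(portable_s, neon_s):.1f}% less time")
```

Both figures describe the same measurement: about 26% more throughput, or about 21% less wall-clock time. Since both runs read from a pipe, the fixed pipe overhead is included in each number, so the gain in pure hashing time may be somewhat larger than this.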
I don't understand why compiling with and without NEON doesn't seem to make that much of a difference.
I would have assumed that the NEON version would be at least 400% faster than the portable one. Is this expected?
Maybe it is due to GCC producing bad NEON code? Is there an assembly version?
I am using GCC 10.2.1.
Thanks a lot!
EDIT: Compiling with clang 11.0.1-2 instead of GCC improved performance by about 7% (9.5s -> 8.9s on average). I did not notice a difference after profile-guided optimization (PGO) with either GCC or clang.