NEON version is only 26% faster than portable on Raspberry Pi 4 #310

@1f604

Hi all,

I compiled example.c with and without NEON support on my Raspberry Pi 4 and got these results (using the same 2GB test file for every run):

  • sha1sum file: 12.7s
  • sha256sum file: 18.9s
  • cat file | example program portable: 12s
  • cat file | example program NEON: 9.5s
  • md5sum file: 9.8s
  • xxhsum file: 1.9s
  • cat file | xxhsum: 4s

I also installed Rust and b3sum and got these results:

  • b3sum 1 thread no mmap: 8s
  • b3sum 4 threads no mmap: 8s
  • b3sum 1 thread: 7.9s
  • b3sum 2 threads: 4s
  • b3sum 3 threads: 2.7s
  • b3sum 4 threads: 2s
  • b3sum 16 threads: 2s
  • cat file | b3sum: 10s

The running time is clearly not I/O-dominated: xxhsum hashed the file in 1.9 seconds, while the NEON-compiled example took 9.5 seconds. Piping the file into b3sum instead of calling b3sum on the file directly adds about 2 seconds (10s vs. 7.9s single-threaded). But even after subtracting those 2 seconds for piping into stdin, it's clear that most of the time is spent on the CPU rather than on I/O.
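To make the I/O argument concrete, here is a rough sketch that converts the timings above into throughput. It assumes the "2GB" file is exactly 2 GiB, which I have not verified, and the helper name `throughput_mib_s` is just something made up for this illustration.

```python
# Assumption: the "2GB" test file is exactly 2 GiB; adjust FILE_BYTES if not.
FILE_BYTES = 2 * 1024**3

def throughput_mib_s(num_bytes, seconds):
    """Average throughput in MiB/s for hashing num_bytes in the given time."""
    return num_bytes / (1024**2) / seconds

# Wall-clock timings copied from the measurements listed above.
timings_s = {
    "xxhsum file": 1.9,
    "example NEON (stdin)": 9.5,
    "example portable (stdin)": 12.0,
    "b3sum 1 thread": 7.9,
}

for name, secs in timings_s.items():
    print(f"{name:26s} {throughput_mib_s(FILE_BYTES, secs):7.0f} MiB/s")
```

Since xxhsum gets through the same file several times faster, the disk can evidently deliver data much quicker than the example program consumes it.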

So the results show that the NEON version of BLAKE3 is only about 26% faster than the portable version.
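For reference, this is how the 26% figure falls out of the two timings. It is a minimal sketch; `speedup_percent` is a hypothetical helper, and "faster" is taken to mean the gain in throughput (baseline time divided by optimized time).

```python
def speedup_percent(baseline_s, optimized_s):
    """Throughput gain of the optimized run over the baseline, in percent."""
    return (baseline_s / optimized_s - 1.0) * 100.0

portable_s = 12.0  # cat file | example program portable
neon_s = 9.5       # cat file | example program NEON

gain = speedup_percent(portable_s, neon_s)
print(f"NEON vs portable: {gain:.0f}% faster")  # prints "NEON vs portable: 26% faster"
```

By the same measure, a 400% gain would require the NEON build to finish the file in about 2.4s.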

I don't understand why compiling with and without NEON doesn't seem to make that much of a difference.

I would have assumed that the NEON version would be at least 400% faster than the portable one. Is this expected?

Maybe it is due to GCC producing bad NEON code? Is there an assembly version?

I am using GCC 10.2.1.

Thanks a lot!

EDIT: Compiling with clang 11.0.1-2 instead of GCC improved performance by about 7% (9.5s -> 8.9s average). I did not notice a difference after PGO with either GCC or clang.
