Add SSE4 optimized SHA256 #10821
src/crypto/sha256.cpp
#if defined(__x86_64__) || defined(__amd64__)
uint32_t eax, ebx, ecx, edx;
if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx >> 20) & 1) {
I'd prefer to do this setup explicitly during initialization; this also avoids having to use an atomic pointer, which seems overkill (why would it ever change during runtime?) and may be inefficient on some platforms.
(also the detection might be more involved on some platforms, so it's better for clarity to drive it from an init function instead of magically at first call).
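A minimal sketch of what driving detection from an explicit init function could look like. The names `AutoDetect`, `g_transform`, `TransformGeneric` and `TransformSSE4` are hypothetical and not taken from the patch, and both transforms are empty placeholders. Because the pointer is assumed to be written once during startup, before any hashing happens, a plain function pointer is used instead of an atomic one.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#if defined(__x86_64__) || defined(__amd64__)
#include <cpuid.h> // GCC/Clang helper providing __get_cpuid
#endif

namespace sketch {

using TransformFn = void (*)(uint32_t* state, const unsigned char* data, size_t blocks);

// Placeholder for the portable C++ transform (hypothetical, does no hashing).
void TransformGeneric(uint32_t*, const unsigned char*, size_t) {}
// Placeholder for the SSE4 assembly transform (hypothetical, does no hashing).
void TransformSSE4(uint32_t*, const unsigned char*, size_t) {}

// Plain pointer: assumed to be set once at startup, before other threads hash.
TransformFn g_transform = TransformGeneric;

// Call once during initialization. Returns the name of the selected implementation.
std::string AutoDetect()
{
#if defined(__x86_64__) || defined(__amd64__)
    unsigned int eax, ebx, ecx, edx;
    // CPUID leaf 1, ECX bit 20 is the SSE4.2 flag, mirroring the check quoted above.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && ((ecx >> 20) & 1)) {
        g_transform = TransformSSE4;
        return "sse4";
    }
#endif
    g_transform = TransformGeneric;
    return "standard";
}

} // namespace sketch
```

With this shape, the application would call `sketch::AutoDetect()` from its startup path and could log the returned name, so the chosen implementation is visible rather than selected implicitly at first use.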
We also have the option of using the ifunc attribute, supported on recent binutils with at least gcc and clang.
Though it's non-standard and afaik elf-specific, it's worth considering where possible.
do we have constructors with hashing in them?
@laanwj Fixed.
Even with inline assembly, there are build complications unfortunately. The compile will fail if the target doesn't support it.
@luke-jr There are system macros to test whether you're compiling for x86_64 or not.
You said almost every x86_64 CPU. Are we going to drop support for the outliers then?
One of the travis builds obviously has an issue with it too:
The clang/osx build succeeds when -fomit-frame-pointer is used. I don't speak enough asm to know if a register can be freed up.
No it won't: these files are compiled without -msse4.2 already. The only thing required is that it's x86_64, which the build tests for.
@luke-jr There is runtime detection to see if the CPU supports the extension. The only requirement is that the target is x86_64.
Gitian OSX build is broken (https://bitcoin.jonasschnelli.ch/build/216):
No problem on Win/
@jonasschnelli @theuni figured it out - clang isn't compiling with
Updated the code to use one fewer register. The original YASM code used the
423f30d to dc1fa84
src/crypto/sha256_sse42.cpp
; documentation and/or other materials provided with the
; distribution.
;
; * Neither the name of the Intel Corporation nor the names of its
We're gonna have to do something to meet this condition, though it doesn't appear we'd have to do much.
This is the standard three clause BSD license, it is GPL and whatnot compatible. The source code to Bitcoin, which contains this notice, is part of the "documentation and/or other materials" we provide.
We ship sans-source all the time? I figured we'd just put a "contains software copyright Intel" in the --help output or a README somewhere.
Marking as WIP, as this does not seem to produce correct hashes on OSX (cc @theuni).
I poked at this for hours and came up empty-handed. I'll wait for someone else to confirm my osx breakage isn't just local.
two more data points:
db8ef97 to 08b7438
Tested ACK 08b7438f73236fc738fb655f766e77a81e6b7311. Good on OSX now! Edit: Though I'd prefer to have the cpu check done separately.
Removing WIP tag, I believe we solved the OSX problem.
utACK 6b8d872, though I extensively tested earlier revisions.
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)
Pull request description:
This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and uses a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.
In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.
This gives around a 50% speedup on the SHA256 benchmark for me.
It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.
Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
Port of Core PR #10821: Add SSE4 optimized SHA256
Backport of Core #10821 and #11176
For future reference, as of #11176 this is now enabled by default.
66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)
Pull request description:
Based on #13191. This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.
In addition to #13191, two extra implementations are provided:
* (a) A variable-length SHA256 implementation using SHA extensions.
* (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.
Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
* Using generic C++ code (pre-#10821): 6.1ms
* Using SSE4 (master, #10821): 4.6ms
* Using 4-way SSE4 specialized for 64-byte inputs (#13191): 2.8ms
* Using 8-way AVX2 specialized for 64-byte inputs (#13191): 2.1ms
* Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms
Benchmarks for 32-byte SHA256 on the same system:
* Using SSE4 (master, #10821): 190ns
* Using SHA-NI (this PR): 53ns
Benchmarks for 1000000-byte SHA256 on the same system:
* Using SSE4 (master, #10821): 2.5ms
* Using SHA-NI (this PR): 0.51ms
Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
Signed-off-by: Pasta <pasta@dashboost.org>
Reference: bitcoin#10821
This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and uses a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with --enable-experimental-asm.
In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.
This gives around a 50% speedup on the SHA256 benchmark for me.
It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.
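The commit list mentions multi-block transforms and a selftest for the transform. As an illustration only, here is a sketch of a portable multi-block SHA-256 compression function and a selftest that checks a candidate transform against the well-known digest of "abc". The names `sha256_sketch`, `Transform`, `TransformFn` and `SelfTest` are hypothetical; this is not the code from the patch, and an assembly implementation would be checked through the same function-pointer interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace sha256_sketch {

using TransformFn = void (*)(uint32_t* s, const unsigned char* data, size_t blocks);

static const uint32_t K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

inline uint32_t Rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

// Process `blocks` consecutive 64-byte blocks, updating the 8-word state in place.
void Transform(uint32_t* s, const unsigned char* data, size_t blocks)
{
    while (blocks--) {
        uint32_t w[64];
        // Load the block as 16 big-endian words, then expand the message schedule.
        for (int i = 0; i < 16; ++i) {
            w[i] = (uint32_t(data[4 * i]) << 24) | (uint32_t(data[4 * i + 1]) << 16) |
                   (uint32_t(data[4 * i + 2]) << 8) | uint32_t(data[4 * i + 3]);
        }
        for (int i = 16; i < 64; ++i) {
            uint32_t s0 = Rotr(w[i - 15], 7) ^ Rotr(w[i - 15], 18) ^ (w[i - 15] >> 3);
            uint32_t s1 = Rotr(w[i - 2], 17) ^ Rotr(w[i - 2], 19) ^ (w[i - 2] >> 10);
            w[i] = w[i - 16] + s0 + w[i - 7] + s1;
        }
        uint32_t a = s[0], b = s[1], c = s[2], d = s[3], e = s[4], f = s[5], g = s[6], h = s[7];
        for (int i = 0; i < 64; ++i) {
            uint32_t t1 = h + (Rotr(e, 6) ^ Rotr(e, 11) ^ Rotr(e, 25)) + ((e & f) ^ (~e & g)) + K[i] + w[i];
            uint32_t t2 = (Rotr(a, 2) ^ Rotr(a, 13) ^ Rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c));
            h = g; g = f; f = e; e = d + t1;
            d = c; c = b; b = a; a = t1 + t2;
        }
        s[0] += a; s[1] += b; s[2] += c; s[3] += d;
        s[4] += e; s[5] += f; s[6] += g; s[7] += h;
        data += 64;
    }
}

// Run `fn` on the single padded block for "abc" and compare with the known digest.
bool SelfTest(TransformFn fn)
{
    unsigned char block[64] = {'a', 'b', 'c', 0x80}; // message, padding bit, zeros
    block[63] = 24;                                  // message length in bits
    uint32_t state[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
                         0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
    static const uint32_t expected[8] = {0xba7816bf, 0x8f01cfea, 0x414140de, 0x5dae2223,
                                         0xb00361a3, 0x96177a9c, 0xb410ff61, 0xf20015ad};
    fn(state, block, 1);
    return std::memcmp(state, expected, sizeof(state)) == 0;
}

} // namespace sha256_sketch
```

Running such a check at startup against whichever transform the dispatcher selected would catch a miscompiled or mistranslated assembly routine before it produces wrong hashes, which is the kind of failure discussed for OSX above.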