Skip to content

Conversation

sipa
Copy link
Member

@sipa sipa commented May 8, 2018

This introduces a framework for specialized double-SHA256 with 64 byte inputs. 4 different implementations are provided:

  • Generic C++ (reusing the normal SHA256 code)
  • Specialized C++ for 64-byte inputs, but no special instructions
  • 4-way using SSE4.1 intrinsics
  • 8-way using AVX2 intrinsics

On my own system (AVX2 capable), I get these benchmarks for computing the Merkle root of 9001 leaves (supported lengths / special instructions / parallellism):

  • 7.2 ms with varsize/naive/1way (master, non-SSE4 hardware)
  • 5.8 ms with size64/naive/1way (this PR, non-SSE4 capable systems)
  • 4.8 ms with varsize/SSE4/1way (master, SSE4 hardware)
  • 2.9 ms with size64/SSE4/4way (this PR, SSE4 hardware)
  • 1.1 ms with size64/AVX2/8way (this PR, AVX2 hardware)

@sipa sipa force-pushed the 201709_dsha256_64 branch 8 times, most recently from 7207c42 to 6904dc1 Compare May 8, 2018 22:53
@laanwj
Copy link
Member

laanwj commented May 9, 2018

Looks like these whole chains of functions are unused after this, except in the merkle tests:

BlockMerkleBranch → ComputeMerkleBranch → MerkleComputation
ComputeMerkleRootFromBranch

Might want to move some functions there.

h = Add(t1, t2);
}

__m256i inline Read8(const unsigned char* chunk, int offset) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read8 and Write8 appear to read and write values respectively in opposite order, I suppose this is intentional?

unrelated: I also wonder if this could be done with a parallel instruction instead of calling into ReadBE32/WriteBE32 for each component, as we know the host endianness.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Indeed, there exist byte-shuffle intrinsics for SSE4 and AVX2; I've used those instead of individual byteswaps.

@sipa sipa force-pushed the 201709_dsha256_64 branch from 6904dc1 to b36eac0 Compare May 9, 2018 15:27
@@ -52,6 +52,14 @@ static void SHA256_32b(benchmark::State& state)
}
}

static void DSHA256_64b(benchmark::State& state)
{
std::vector<uint8_t> in(64 * 1024,0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be a space between "1024," and "0"?

@theuni
Copy link
Member

theuni commented May 9, 2018

Concept ACK!

@sipa Please see the build-system comments on #13203, I didn't realize this was a different PR. I'll follow-up here for non-power changes.

#if defined(USE_ASM) && (defined(__x86_64__) || defined(__amd64__))
uint32_t eax, ebx, ecx, edx;
if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx >> 19) & 1) {
Transform = sha256_sse4::Transform;
TransformD64 = TransformD64Wrapper<sha256_sse4::Transform>;
assert(SelfTest(Transform));
return "sse4";
#if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because libbitcoinconsensus doesn't take advantage of 4way? Or some buildsystem limitation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both.

I didn't feel like creating copies of all the architecture-specialized libs for use within libbitcoinconsensus - especially as it doesn't benefit it.

@ghost
Copy link

ghost commented May 10, 2018

Great job!
Any plans to add support for AVX512?

I know there are not so many people with AVX512 CPU's but why not add it from now?

All Intel CPU's will have support for AVX512 in coming months (mainstream + server).

@ghost
Copy link

ghost commented May 10, 2018

@laanwj
Copy link
Member

laanwj commented May 10, 2018

I know there are not so many people with AVX512 CPU's but why not add it from now?

That is not how open source development works. A PR is for reviewing the code. Future improvements can be done in future PRs. For example @TheBlueMatt adds support for POWER8 instructions in #13203. I might add ARM intrinsics support at some point.

@sipa
Copy link
Member Author

sipa commented May 11, 2018

@laanwj The algorithm in MerkleComputation is actually potentially more efficient (better memory locality) than the one in ComputeMerkleRoot now, apart from the fact that it doesn't take advantage of multi-way hashing. I'd like to keep it around for a bit and see if I can adapt it to use multi-way instead, in which case it could be used for everything. I can also move it and move it back if used. What do you think?

@Kick1986 AVX512 is cool, but it's low-impact (even machines that support it right now do it with reduced clock rate), and I'm unable to benchmark it. Feel free to add in follow-up work yoursel, though.

@ghost
Copy link

ghost commented May 12, 2018

@sipa
Thanks for answering me!
You are right about AVX512 clock speed, it will go mainstream sometime 2020.
I will do my best to add AVX512 to BTC code before that.
Thanks

@sipa sipa force-pushed the 201709_dsha256_64 branch from b36eac0 to 6b90da8 Compare May 12, 2018 19:11
@sipa
Copy link
Member Author

sipa commented May 12, 2018

@laanwj I moved the unused Merkle branch functions to the test code. Also:

I might add ARM intrinsics support at some point.

👍 That sounds far more impactful than POWER8, to be honest ;)

@theuni I've addressed some of your build system comments from #13203, but left the ENABLE_AVX2 and ENABLE_SSE41 macros, because there isn't a clean platform independent way to use compiler defines. __AVX2__ exists in both GCC and MSVC, but there is no __SSE4_1__ in MSVC (and worse, there is no way to test for SSE4 at all there; you have to test whether the FP code is x87 based or SSE based).

@kristapsk Ok, I've added a space.

@gmaxwell
Copy link
Contributor

Concept ACK. Lightly tested ACK.

We should open a issue to track doing a specialized 1-way 64-byte SSE4 function for this later as that is a pretty much guaranteed performance gain (as the non-specialized 1-way SSE4 that does more work is faster than the specialized non-SSE4 1-way code) which can be done by someone who knows assembly but knows little about Bitcoin.

We might want to add a note to explain the specialization along the lines of: the 64-byte input (and 32-byte input in the second SHA256 invocation) mean that most of the input to the message expansion is zeros, which lets us drop out a lot of additions.

@sipa
Copy link
Member Author

sipa commented May 17, 2018

@gmaxwell Yes, perhaps. But the impact of that would be pretty low at best, as every system in which that optimized 1-way code can be used also supports 4-way 64-byte optimized code already; meaning such an implementation would only be used for up to 3 of the last hashes in each level of a Merkle tree.

@sipa
Copy link
Member Author

sipa commented May 20, 2018

@theuni Could you have a look at the build system changes again?

@theuni
Copy link
Member

theuni commented May 29, 2018

@sipa thanks for the fixups. utACK build-system changes.

@@ -131,9 +131,23 @@ static void MerkleComputation(const std::vector<uint256>& leaves, uint256* proot
}

uint256 ComputeMerkleRoot(const std::vector<uint256>& leaves, bool* mutated) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: pass leaves by value. It can be std::move'd from BlockMerkleRoot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@droark
Copy link
Contributor

droark commented Jun 6, 2018

After-the-fact tACK

The code's running nicely on my AVX2-enabled machine. Good job!

I do have one question. Is this going to affect Gitian builds for people who are on older machines? Maybe I'm missing something silly but it seems like this could cause mismatches. I remember that the SSE4-enabled SHA-256 code initially required --enable-experimental-asm and was eventually enabled by default.

Thanks.

@sipa
Copy link
Member Author

sipa commented Jun 6, 2018

@droark It does require an AVX2 compatible compiler, but GCC 4.7 is sufficient for that (which we already required for release builds). You can perfectly well build AVX2 code even if your own hardware doesn't support AVX2.

@droark
Copy link
Contributor

droark commented Jun 6, 2018

@sipa - Thanks. I had a bad understanding of compilers from long ago that I somehow never shook.

laanwj added a commit that referenced this pull request Jun 12, 2018
57ba401 Enable double-SHA256-for-64-byte code on 32-bit x86 (Pieter Wuille)

Pull request description:

  The SSE4 and AVX2 double-SHA256-for-64-byte input code from #13191 compiles fine on 32-bit x86 systems, but the autodetection logic in sha256.cpp doesn't enable it. Fix this.

  Note that these instruction sets are only available on CPUs that support 64-bit mode as well, so it is only beneficial in the (perhaps unlikely) scenario where a 64-bit CPU is running a 32-bit Bitcoin Core binary.

Tree-SHA512: 39d5963c1ba8c33932549d5fe98bd184932689a40aeba95043eca31dd6824f566197c546b60905555eccaf407408a5f0f200247bb0907450d309b0a70b245102
laanwj added a commit that referenced this pull request Jun 18, 2018
1e1eb63 Improve coverage of SHA256 SelfTest code (Pieter Wuille)

Pull request description:

  The existing SelfTest code does not cover the specialized double-SHA256-for-64-byte-inputs transforms added in #13191. Fix this.

Tree-SHA512: 593c7ee5dc9e77fc4c89e0a7753a63529b0d3d32ddbc015ae3895b52be77bee8a80bf16b754b30a22c01625a68db83fb77fa945a543143542bebb5b0f017ec5b
laanwj added a commit that referenced this pull request Jul 9, 2018
66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on #13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to #13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-#10821): 6.1ms
  * Using SSE4 (master, #10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, #10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, #10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
@rollmeister
Copy link
Contributor

rollmeister commented Apr 13, 2019

Can you please provide any documentation as to how you calculated the custom constants in Transform 2? I would like to adapt it for a 112 byte input. Could it be made to work if Transform 1 uses the state calculated from the first 64 bytes and the remaining 48 bytes (with appropriate padding) as the input message digest along with said calculated custom constants for Transform 2?

codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
…th SSE4.1 and AVX2

4defdfa [MOVEONLY] Move unused Merkle branch code to tests (Pieter Wuille)
4437d6e 8-way AVX2 implementation for double SHA256 on 64-byte inputs (Pieter Wuille)
230294b 4-way SSE4.1 implementation for double SHA256 on 64-byte inputs (Pieter Wuille)
1f0e7ca Use SHA256D64 in Merkle root computation (Pieter Wuille)
d0c9632 Specialized double sha256 for 64 byte inputs (Pieter Wuille)
57f3463 Refactor SHA256 code (Pieter Wuille)
0df0178 Benchmark Merkle root computation (Pieter Wuille)

Pull request description:

  This introduces a framework for specialized double-SHA256 with 64 byte inputs. 4 different implementations are provided:
  * Generic C++ (reusing the normal SHA256 code)
  * Specialized C++ for 64-byte inputs, but no special instructions
  * 4-way using SSE4.1 intrinsics
  * 8-way using AVX2 intrinsics

  On my own system (AVX2 capable), I get these benchmarks for computing the Merkle root of 9001 leaves (supported lengths / special instructions / parallellism):
  * 7.2 ms with varsize/naive/1way (master, non-SSE4 hardware)
  * 5.8 ms with size64/naive/1way (this PR, non-SSE4 capable systems)
  * 4.8 ms with varsize/SSE4/1way (master, SSE4 hardware)
  * 2.9 ms with size64/SSE4/4way (this PR, SSE4 hardware)
  * 1.1 ms with size64/AVX2/8way (this PR, AVX2 hardware)

Tree-SHA512: efa32d48b32820d9ce788ead4eb583949265be8c2e5f538c94bc914e92d131a57f8c1ee26c6f998e81fb0e30675d4e2eddc3360bcf632676249036018cff343e
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
1e1eb63 Improve coverage of SHA256 SelfTest code (Pieter Wuille)

Pull request description:

  The existing SelfTest code does not cover the specialized double-SHA256-for-64-byte-inputs transforms added in bitcoin#13191. Fix this.

Tree-SHA512: 593c7ee5dc9e77fc4c89e0a7753a63529b0d3d32ddbc015ae3895b52be77bee8a80bf16b754b30a22c01625a68db83fb77fa945a543143542bebb5b0f017ec5b
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
57ba401 Enable double-SHA256-for-64-byte code on 32-bit x86 (Pieter Wuille)

Pull request description:

  The SSE4 and AVX2 double-SHA256-for-64-byte input code from bitcoin#13191 compiles fine on 32-bit x86 systems, but the autodetection logic in sha256.cpp doesn't enable it. Fix this.

  Note that these instruction sets are only available on CPUs that support 64-bit mode as well, so it is only beneficial in the (perhaps unlikely) scenario where a 64-bit CPU is running a 32-bit Bitcoin Core binary.

Tree-SHA512: 39d5963c1ba8c33932549d5fe98bd184932689a40aeba95043eca31dd6824f566197c546b60905555eccaf407408a5f0f200247bb0907450d309b0a70b245102
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
…ions

66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on bitcoin#13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to bitcoin#13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-bitcoin#10821): 6.1ms
  * Using SSE4 (master, bitcoin#10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (bitcoin#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (bitcoin#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
…th SSE4.1 and AVX2

4defdfa [MOVEONLY] Move unused Merkle branch code to tests (Pieter Wuille)
4437d6e 8-way AVX2 implementation for double SHA256 on 64-byte inputs (Pieter Wuille)
230294b 4-way SSE4.1 implementation for double SHA256 on 64-byte inputs (Pieter Wuille)
1f0e7ca Use SHA256D64 in Merkle root computation (Pieter Wuille)
d0c9632 Specialized double sha256 for 64 byte inputs (Pieter Wuille)
57f3463 Refactor SHA256 code (Pieter Wuille)
0df0178 Benchmark Merkle root computation (Pieter Wuille)

Pull request description:

  This introduces a framework for specialized double-SHA256 with 64 byte inputs. 4 different implementations are provided:
  * Generic C++ (reusing the normal SHA256 code)
  * Specialized C++ for 64-byte inputs, but no special instructions
  * 4-way using SSE4.1 intrinsics
  * 8-way using AVX2 intrinsics

  On my own system (AVX2 capable), I get these benchmarks for computing the Merkle root of 9001 leaves (supported lengths / special instructions / parallellism):
  * 7.2 ms with varsize/naive/1way (master, non-SSE4 hardware)
  * 5.8 ms with size64/naive/1way (this PR, non-SSE4 capable systems)
  * 4.8 ms with varsize/SSE4/1way (master, SSE4 hardware)
  * 2.9 ms with size64/SSE4/4way (this PR, SSE4 hardware)
  * 1.1 ms with size64/AVX2/8way (this PR, AVX2 hardware)

Tree-SHA512: efa32d48b32820d9ce788ead4eb583949265be8c2e5f538c94bc914e92d131a57f8c1ee26c6f998e81fb0e30675d4e2eddc3360bcf632676249036018cff343e
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
1e1eb63 Improve coverage of SHA256 SelfTest code (Pieter Wuille)

Pull request description:

  The existing SelfTest code does not cover the specialized double-SHA256-for-64-byte-inputs transforms added in bitcoin#13191. Fix this.

Tree-SHA512: 593c7ee5dc9e77fc4c89e0a7753a63529b0d3d32ddbc015ae3895b52be77bee8a80bf16b754b30a22c01625a68db83fb77fa945a543143542bebb5b0f017ec5b
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
57ba401 Enable double-SHA256-for-64-byte code on 32-bit x86 (Pieter Wuille)

Pull request description:

  The SSE4 and AVX2 double-SHA256-for-64-byte input code from bitcoin#13191 compiles fine on 32-bit x86 systems, but the autodetection logic in sha256.cpp doesn't enable it. Fix this.

  Note that these instruction sets are only available on CPUs that support 64-bit mode as well, so it is only beneficial in the (perhaps unlikely) scenario where a 64-bit CPU is running a 32-bit Bitcoin Core binary.

Tree-SHA512: 39d5963c1ba8c33932549d5fe98bd184932689a40aeba95043eca31dd6824f566197c546b60905555eccaf407408a5f0f200247bb0907450d309b0a70b245102
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
…ions

66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on bitcoin#13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to bitcoin#13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-bitcoin#10821): 6.1ms
  * Using SSE4 (master, bitcoin#10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (bitcoin#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (bitcoin#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
barrystyle pushed a commit to PACGlobalOfficial/PAC that referenced this pull request Jan 22, 2020
…th SSE4.1 and AVX2

4defdfa [MOVEONLY] Move unused Merkle branch code to tests (Pieter Wuille)
4437d6e 8-way AVX2 implementation for double SHA256 on 64-byte inputs (Pieter Wuille)
230294b 4-way SSE4.1 implementation for double SHA256 on 64-byte inputs (Pieter Wuille)
1f0e7ca Use SHA256D64 in Merkle root computation (Pieter Wuille)
d0c9632 Specialized double sha256 for 64 byte inputs (Pieter Wuille)
57f3463 Refactor SHA256 code (Pieter Wuille)
0df0178 Benchmark Merkle root computation (Pieter Wuille)

Pull request description:

  This introduces a framework for specialized double-SHA256 with 64 byte inputs. 4 different implementations are provided:
  * Generic C++ (reusing the normal SHA256 code)
  * Specialized C++ for 64-byte inputs, but no special instructions
  * 4-way using SSE4.1 intrinsics
  * 8-way using AVX2 intrinsics

  On my own system (AVX2 capable), I get these benchmarks for computing the Merkle root of 9001 leaves (supported lengths / special instructions / parallellism):
  * 7.2 ms with varsize/naive/1way (master, non-SSE4 hardware)
  * 5.8 ms with size64/naive/1way (this PR, non-SSE4 capable systems)
  * 4.8 ms with varsize/SSE4/1way (master, SSE4 hardware)
  * 2.9 ms with size64/SSE4/4way (this PR, SSE4 hardware)
  * 1.1 ms with size64/AVX2/8way (this PR, AVX2 hardware)

Tree-SHA512: efa32d48b32820d9ce788ead4eb583949265be8c2e5f538c94bc914e92d131a57f8c1ee26c6f998e81fb0e30675d4e2eddc3360bcf632676249036018cff343e
barrystyle pushed a commit to PACGlobalOfficial/PAC that referenced this pull request Jan 22, 2020
1e1eb63 Improve coverage of SHA256 SelfTest code (Pieter Wuille)

Pull request description:

  The existing SelfTest code does not cover the specialized double-SHA256-for-64-byte-inputs transforms added in bitcoin#13191. Fix this.

Tree-SHA512: 593c7ee5dc9e77fc4c89e0a7753a63529b0d3d32ddbc015ae3895b52be77bee8a80bf16b754b30a22c01625a68db83fb77fa945a543143542bebb5b0f017ec5b
barrystyle pushed a commit to PACGlobalOfficial/PAC that referenced this pull request Jan 22, 2020
57ba401 Enable double-SHA256-for-64-byte code on 32-bit x86 (Pieter Wuille)

Pull request description:

  The SSE4 and AVX2 double-SHA256-for-64-byte input code from bitcoin#13191 compiles fine on 32-bit x86 systems, but the autodetection logic in sha256.cpp doesn't enable it. Fix this.

  Note that these instruction sets are only available on CPUs that support 64-bit mode as well, so it is only beneficial in the (perhaps unlikely) scenario where a 64-bit CPU is running a 32-bit Bitcoin Core binary.

Tree-SHA512: 39d5963c1ba8c33932549d5fe98bd184932689a40aeba95043eca31dd6824f566197c546b60905555eccaf407408a5f0f200247bb0907450d309b0a70b245102
barrystyle pushed a commit to PACGlobalOfficial/PAC that referenced this pull request Jan 22, 2020
…ions

66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on bitcoin#13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to bitcoin#13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-bitcoin#10821): 6.1ms
  * Using SSE4 (master, bitcoin#10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (bitcoin#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (bitcoin#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants