Skip to content

Conversation

sipa
Copy link
Member

@sipa sipa commented Jul 14, 2017

This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with --enable-experimental-asm.

In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

This gives around a 50% speedup on the SHA256 benchmark for me.

It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.


#if defined(__x86_64__) || defined(__amd64__)
uint32_t eax, ebx, ecx, edx;
if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx >> 20) & 1) {
Copy link
Member

@laanwj laanwj Jul 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to do this setup explicitly during initialization; this also avoids having to use an atomic pointer, which seems overkill (why would it ever change during runtime?) and may be inefficient on some platforms.
(also the detection might be more involved on some platforms, so it's better for clarity to drive it from an init function instead of magically at first call).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have the option of using the ifunc attribute, supported on recent binutils with at least gcc and clang.

Though it's non-standard and afaik elf-specific, it's worth considering where possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have constructors with hashing in them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laanwj Fixed.

@luke-jr
Copy link
Member

luke-jr commented Jul 14, 2017

Even with inline assembly, there are build complications unfortunately. The compile will fail if the target doesn't support it..

@sipa
Copy link
Member Author

sipa commented Jul 14, 2017

@luke-jr There are system macros to test whether you're compiling for x86_64 or not.

@luke-jr
Copy link
Member

luke-jr commented Jul 14, 2017

You said almost every x86_64 CPU. Are we going to drop support for the outliers then?

@meshcollider
Copy link
Contributor

One of the travis builds obviously has an issue with it too:
crypto/sha256_sse42.cpp:42:9: error: inline assembly requires more registers than available

@theuni
Copy link
Member

theuni commented Jul 14, 2017

The clang/osx build succeeds when -fomit-frame-pointer is used. I don't speak enough asm to know if a register can be freed up.

@gmaxwell
Copy link
Contributor

Even with inline assembly, there are build complications unfortunately. The compile will fail if the target doesn't support it..

No it won't-- these files are compiled without -msse4.2 already. The only thing required is that its x86_64, which the build tests for.

@sipa
Copy link
Member Author

sipa commented Jul 14, 2017

@luke-jr There is runtime detection to see if the CPU supports the extension. The only requirement is that the target is x86_64.

@jonasschnelli
Copy link
Contributor

jonasschnelli commented Jul 14, 2017

Gitian OSX build is broken (https://bitcoin.jonasschnelli.ch/build/216):

Generated test/data/base58_keys_invalid.json.h
crypto/sha256_sse42.cpp:42:9: error: inline assembly requires more registers than available
        "shl    $0x6,%2;"
        ^
1 error generated.

No problem on Win/ OSX Linux

@sipa
Copy link
Member Author

sipa commented Jul 14, 2017

@jonasschnelli @theuni figured it out - clang isn't compiling with -fomit-frame-pointer, and thus there is one fewer register available. Unfortunately, omitting the frame pointer still makes this code not work...

@sipa sipa force-pushed the 20170713_shasse branch from 22b2311 to 2af11d0 Compare July 14, 2017 22:10
@sipa
Copy link
Member Author

sipa commented Jul 14, 2017

Updated the code to use one fewer register. The original YASM code used the dx register for two purposes, which I had separated out into two separate registers. They're merged now.

@sipa sipa force-pushed the 20170713_shasse branch 2 times, most recently from 423f30d to dc1fa84 Compare July 14, 2017 22:54
; documentation and/or other materials provided with the
; distribution.
;
; * Neither the name of the Intel Corporation nor the names of its
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're gonna have to do something to meet this condition, though it doesnt appear we'd have to do much.

Copy link
Contributor

@gmaxwell gmaxwell Jul 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the standard three clause BSD license, it is GPL and whatnot compatible. The source code to Bitcoin, which contains this notice, is part of the "documentation and/or other materials" we provide.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ship sans-source all the time? I figured we'd just put a "contains softare copyright Intel" in the --help output or a README somewhere.

@sipa sipa force-pushed the 20170713_shasse branch from dc1fa84 to ef029f1 Compare July 15, 2017 00:06
@sipa sipa changed the title Add SSE 4.2 optimized SHA256 [WIP] Add SSE 4.2 optimized SHA256 Jul 15, 2017
@sipa
Copy link
Member Author

sipa commented Jul 15, 2017

Marking as WIP, as this does not seem to produce correct hashes on OSX (cc @theuni).

@sipa sipa force-pushed the 20170713_shasse branch from ef029f1 to d0d68be Compare July 15, 2017 03:38
@theuni
Copy link
Member

theuni commented Jul 15, 2017

I poked at this for hours and came up empty-handed. I'll wait for someone else to confirm my osx breakage isn't just local.

@theuni
Copy link
Member

theuni commented Jul 15, 2017

two more data points:

  1. @fanquake verified that this crashes on osx for him as well.

  2. I managed to reproduce a crash on Linux with an old clang (3.2), and it's even uglier, crashing gdb as well:

Starting program: /home/cory/dev/bitcoin2/src/bitcoind
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
/build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion dwarf2_per_objfile->ranges.readin' failed. A problem internal to GDB has been detected, further debugging may prove unreliable. Quit this debugging session? (y or n) n /build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion dwarf2_per_objfile->ranges.readin' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
0x000000000074c910 in sha256_sse42::Transform (
/build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion dwarf2_per_objfile->ranges.readin' failed. A problem internal to GDB has been detected, further debugging may prove unreliable. Quit this debugging session? (y or n) n /build/buildd/gdb-7.6~20130417/gdb/dwarf2read.c:10350: internal-error: dwarf2_record_block_ranges: Assertion dwarf2_per_objfile->ranges.readin' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
Segmentation fault (core dumped)

@sipa sipa force-pushed the 20170713_shasse branch 3 times, most recently from db8ef97 to 08b7438 Compare July 16, 2017 06:37
@theuni
Copy link
Member

theuni commented Jul 16, 2017

Tested ACK 08b7438f73236fc738fb655f766e77a81e6b7311. Good on OSX now!

Edit: Though I'd prefer to have the cpu check done separately.

@sipa sipa changed the title [WIP] Add SSE 4.2 optimized SHA256 Add SSE 4.2 optimized SHA256 Jul 16, 2017
@sipa
Copy link
Member Author

sipa commented Jul 16, 2017

Removing WIP tag, I believe we solved the OSX problem.

@theuni
Copy link
Member

theuni commented Jul 20, 2017

utACK 6b8d872, though I extensively tested earlier revisions.

@laanwj laanwj merged commit 6b8d872 into bitcoin:master Jul 20, 2017
laanwj added a commit that referenced this pull request Jul 20, 2017
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
@jnewbery jnewbery mentioned this pull request Jul 31, 2017
12 tasks
sickpig referenced this pull request in sickpig/BitcoinUnlimited Sep 13, 2017
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
gandrewstone referenced this pull request in gandrewstone/BitcoinUnlimited Sep 13, 2017
Port of Core PR  #10821: Add SSE4 optimized SHA256
sickpig referenced this pull request in sickpig/BitcoinUnlimited Oct 13, 2017
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
gandrewstone referenced this pull request in BitcoinUnlimited/BitcoinUnlimited Oct 19, 2017
Backport of Core #10821 and #11176
@Sjors
Copy link
Member

Sjors commented Oct 20, 2017

For future reference, as of #11176 this is now enabled by default.

laanwj added a commit that referenced this pull request Jul 9, 2018
66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on #13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to #13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-#10821): 6.1ms
  * Using SSE4 (master, #10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, #10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, #10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jul 17, 2019
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jul 17, 2019
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jul 23, 2019
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jul 24, 2019
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
PastaPastaPasta added a commit to PastaPastaPasta/dash that referenced this pull request Jul 24, 2019
Signed-off-by: Pasta <pasta@dashboost.org>
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
…ions

66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on bitcoin#13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to bitcoin#13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-bitcoin#10821): 6.1ms
  * Using SSE4 (master, bitcoin#10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (bitcoin#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (bitcoin#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
…ions

66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on bitcoin#13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to bitcoin#13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-bitcoin#10821): 6.1ms
  * Using SSE4 (master, bitcoin#10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (bitcoin#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (bitcoin#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
wqking added a commit to wqking-misc/Phore that referenced this pull request Jan 22, 2020
barrystyle pushed a commit to PACGlobalOfficial/PAC that referenced this pull request Jan 22, 2020
6b8d872 Protect SSE4 code behind a compile-time flag (Pieter Wuille)
fa9be90 Add selftest for SHA256 transform (Pieter Wuille)
c1ccb15 Add SSE4 based SHA256 (Pieter Wuille)
2991c91 Add SHA256 dispatcher (Pieter Wuille)
4d50f38 Support multi-block SHA256 transforms (Pieter Wuille)

Pull request description:

  This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with `--enable-experimental-asm`.

  In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.

  This gives around a 50% speedup on the SHA256 benchmark for me.

  It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.

Tree-SHA512: d31c50695ceb45264291537b93c0d7497670be38edf021ca5402eaa7d4e1e0e1ae492326e28d4e93979d066168129e62d1825e0384b1b906d36f85d93dfcb43c
barrystyle pushed a commit to PACGlobalOfficial/PAC that referenced this pull request Jan 22, 2020
Signed-off-by: Pasta <pasta@dashboost.org>
barrystyle pushed a commit to PACGlobalOfficial/PAC that referenced this pull request Jan 22, 2020
…ions

66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on bitcoin#13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to bitcoin#13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-bitcoin#10821): 6.1ms
  * Using SSE4 (master, bitcoin#10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (bitcoin#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (bitcoin#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants