Convert the 1-way SSE4 SHA256 code from asm to intrinsics #13442

sipa · 2018-06-11T23:36:44Z

Currently, master contains 2 implementations of SHA256 for SSE4:

A generic one written using GCC inline assembly (converted from Intel NASM code), added in #10821 ("Add SSE4 optimized SHA256").
A specialized double-SHA256 for 64-byte inputs written using intrinsics, added in #13191 ("Specialized double-SHA256 with 64 byte inputs with SSE4.1 and AVX2").

The advantage of the inline assembly is that its performance is not affected by compiler optimizations (and doesn't even need compiler support for SSE4). The downside is that it is an opaque, unreadable, non-reusable blob of code.

This patch converts the former to intrinsics as well, making its operation clearer and hopefully making it easier to adapt for other specialized implementations.

The resulting implementation is slightly faster on my system (i7-7820HQ) when compiled with GCC 7.3. Small variations in the code can affect the optimizer, though, and change speed by as much as a few percent.

theuni · 2018-06-12T03:00:22Z

Nice!

@sipa See theuni@d79fb1d for clang compile fixes, and theuni@4ee6fbb for a change that may or may not be needed to avoid a performance hit on AMD.

theuni · 2018-06-12T03:33:18Z

src/crypto/sha256_sse41.cpp

+    Round(a, b, c, d, e, f, g, h, Ws[0]);
+    XTMP0 = _mm_alignr_epi8(X3, X2, 4);
+    XTMP0 = _mm_add_epi32(XTMP0, X0);
+    XTMP3 = XTMP2 = XTMP1 = _mm_alignr_epi8(X1, X0, 4);


Is there some voodoo here? Why not just use XTMP3 below? Does this avoid a pipeline stall or something?

sipa · 2018-06-12

No idea. It's just a translation of the existing assembly code.
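
For readers unfamiliar with `_mm_alignr_epi8`, here is a rough scalar model of what the quoted calls compute. It is not code from this PR. `Vec4` and `AlignR4` are hypothetical names, and the mapping to message words assumes that X0..X3 hold sixteen consecutive message words, four per register with lane 0 the oldest, which the thread does not state explicitly.

```cpp
#include <array>
#include <cstdint>

// Four 32-bit lanes of a 128-bit register; index 0 is the least significant lane.
using Vec4 = std::array<std::uint32_t, 4>;

// Scalar model of _mm_alignr_epi8(hi, lo, 4): concatenate hi:lo into a
// 256-bit value, shift it right by 4 bytes (one 32-bit lane), and keep the
// low 128 bits. The result is the top three lanes of lo followed by the
// bottom lane of hi.
inline Vec4 AlignR4(const Vec4& hi, const Vec4& lo)
{
    return Vec4{{lo[1], lo[2], lo[3], hi[0]}};
}
```

Under that assumption, with X0 = {W0, W1, W2, W3} and X1 = {W4, W5, W6, W7}, `AlignR4(X1, X0)` gives {W1, W2, W3, W4}, a window that slides across the register boundary by one word.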

sipa · 2018-06-12T16:47:06Z

@theuni Included the clang compile fixes. I'm going to benchmark to see whether to include the other changes.

sipa · 2018-06-14T19:43:54Z

It would be worthwhile to benchmark this on reasonably recent clang versions as well - the performance impact may be very different depending on how good the compiler is at ordering parallel instruction paths.

sipa · 2018-06-18T22:20:28Z

Some more benchmarks, comparing GCC 7.3 and clang 6.0, for the SHA256 benchmark (i7-7820HQ, fixed to 2.2 GHz).

GCC, master: 4.4 ms
GCC, this PR: 4.3 ms
clang, master: 4.4 ms
clang, this PR: 4.8 ms

Unfortunately, it seems that clang is not as good at producing performant code from intrinsics: with this PR it is about 9% slower than master (4.8 ms vs. 4.4 ms), while GCC is about 2% faster.

theuni · 2018-07-19T19:59:01Z

@sipa Mind rebasing? I'd like to add the lib-per-cpu changes on top of this.

sipa · 2018-07-19T21:12:34Z

Rebased, though I don't think this PR is acceptable until we have a way to avoid the performance loss in clang.

DrahtBot · 2018-07-28T20:56:59Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#13789 (crypto/sha256: Use pragmas to enforce necessary intrinsics for GCC and Clang by luke-jr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

practicalswift · 2018-10-04T20:40:25Z

src/crypto/sha256_sse41.cpp

+    t2 = _mm_srli_epi32(t2, 7);
+    t1 = _mm_or_si128(_mm_slli_epi32(t1, 32 - 7), t2);
+
+    Round(h, a, b, c, d, e, f, g, w32[1]);


Same here and throughout this function :-)
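
As background on the quoted lines: SSE has no 32-bit rotate instruction, so a shift-right / shift-left / or sequence like this one acts as a rotate right by 7 when both temporaries start out holding the same value. The sketch below is not code from this PR. It assumes an x86 target with SSE2 and that `t1` and `t2` are copies of the same input, and `RotR7`, `RotR7x4`, and `RotR7Matches` are hypothetical names.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Scalar reference: rotate one 32-bit word right by 7 bits.
inline std::uint32_t RotR7(std::uint32_t x)
{
    return (x >> 7) | (x << (32 - 7));
}

// The same rotation on four 32-bit lanes at once, mirroring the quoted
// shift/or sequence (assumption: t1 and t2 start as copies of x).
inline __m128i RotR7x4(__m128i x)
{
    __m128i t1 = x;
    __m128i t2 = x;
    t2 = _mm_srli_epi32(t2, 7);
    t1 = _mm_or_si128(_mm_slli_epi32(t1, 32 - 7), t2);
    return t1;
}

// Hypothetical check helper: true if every lane agrees with the scalar version.
inline bool RotR7Matches(const std::uint32_t in[4])
{
    std::uint32_t out[4];
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), RotR7x4(v));
    for (int i = 0; i < 4; ++i) {
        if (out[i] != RotR7(in[i])) return false;
    }
    return true;
}
```

The need for two copies of the input would also explain why the translated assembly assigns the same value to several temporaries before shifting them.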

practicalswift · 2018-10-04T20:44:06Z

src/crypto/sha256.cpp

@@ -615,12 +615,9 @@ std::string SHA256AutoDetect()
 #endif

    if (have_sse4) {


Move the if statement inside the #if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL) block to remove the possibility of an empty if statement.
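
A minimal sketch of the suggested restructuring, not the actual SHA256AutoDetect() code. `SelectImplementationSketch` and the returned strings are hypothetical placeholders. Only the two macro names come from the comment above.

```cpp
#include <string>

// Hypothetical stand-in for the dispatch logic; the strings are
// illustrative labels, not the values the real function returns.
inline std::string SelectImplementationSketch(bool have_sse4)
{
    std::string ret = "standard";
#if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL)
    // The runtime check sits inside the guard, so a build without the
    // SSE4.1 code never compiles an if statement with an empty body.
    if (have_sse4) {
        ret = "sse41";
    }
#else
    (void)have_sse4; // CPU flag is unused when the SSE4.1 code is not built
#endif
    return ret;
}
```

With the if statement outside the guard, a build where the guard is false would still contain the if statement, but with nothing in its body.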

maflcko · 2019-05-20T17:53:28Z

Are you still working on this?

sipa · 2019-05-20T17:55:05Z

What version of clang are we using now? It's probably not a good idea to proceed with this unless it can be shown it doesn't have negative impact on performance on all release platforms.

maflcko · 2019-05-20T18:12:41Z

Gitian: clang 3.7 (see #16052, "depends: Issue cross compiling for macOS on Debian Buster")
FreeBSD 12: clang version 6.0.1
macOS: not sure what the default is when self-compiled

sipa · 2019-05-20T18:24:15Z

I'll close this for now, then.

fanquake · 2019-05-21T04:55:18Z

@MarcoFalke, Clang on my macOS machine (Xcode 10.2.1) is:

Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

DesWurstes · 2019-05-21T12:46:04Z

...which is actually Clang 5.

fanquake · 2019-05-23T18:34:34Z

I'm going to reopen this, as we will be switching to a newer version of Clang in gitian.

sipa · 2019-05-23T19:05:13Z

I'll benchmark again in clang-7.

maflcko · 2019-05-23T19:16:41Z

I see a slowdown in SHA256 and SHA256_32b with both gcc-9 and clang-8

fanquake · 2019-06-24T08:09:31Z

Further benchmarking, reported here and outside this PR, has shown that this change likely causes slowdowns with recent versions of Clang. Closing again for now.

hebasto · 2023-09-24T14:19:17Z

Picked up for MSVC builds in #28526.

fanquake added the Validation label Jun 12, 2018

sipa force-pushed the 201806_sse4intrin branch from 86e04f0 to 4f5e45a Compare June 12, 2018 03:14

theuni reviewed Jun 12, 2018

sipa force-pushed the 201806_sse4intrin branch 3 times, most recently from b0c24e2 to 5f4c79e Compare June 12, 2018 16:46

sipa force-pushed the 201806_sse4intrin branch from 5f4c79e to 9fe51b4 Compare June 12, 2018 18:48

DrahtBot mentioned this pull request Jun 14, 2018

SHA256 implementations based on Intel SHA Extensions #13386

Merged

DrahtBot mentioned this pull request Jul 8, 2018

[bugfix] Use __cpuid_count for gnu C to avoid gitian build fail. #13611

Merged

DrahtBot added the Needs rebase label Jul 9, 2018

sipa force-pushed the 201806_sse4intrin branch from 9fe51b4 to 8655e78 Compare July 19, 2018 21:11

DrahtBot removed the Needs rebase label Jul 20, 2018

DrahtBot mentioned this pull request Jul 28, 2018

crypto/sha256: Use pragmas to enforce necessary intrinsics for GCC and Clang #13789

Closed

practicalswift reviewed Oct 4, 2018

practicalswift reviewed Oct 4, 2018

bitcoin deleted a comment from STALININST Oct 4, 2018

sipa added 3 commits October 12, 2018 16:40

Add 1-way SSE4 SHA256 implementation using intrinsics

797f6c1

Switch 1-way SSE4 SHA256 to intrinsics based implementation

7ebae4d

Remove SSE4 assembly implementation

4a221ce

sipa force-pushed the 201806_sse4intrin branch from 8655e78 to 4a221ce Compare October 12, 2018 23:40

maflcko closed this May 20, 2019

maflcko reopened this May 20, 2019

sipa closed this May 20, 2019

fanquake reopened this May 23, 2019

fanquake closed this Jun 24, 2019

fanquake added the Future label Jun 24, 2019

laanwj added this to the Future milestone Sep 30, 2019

laanwj removed the Future label Sep 30, 2019

bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021

maflcko removed this from the Future milestone Jul 23, 2025

maflcko added the Up for grabs label Jul 23, 2025
