sipa
Member

@sipa sipa commented Jun 11, 2018

Currently, master contains 2 implementations of SHA256 for SSE4:

The advantage of the inline assembly is that its performance is not affected by compiler optimizations (and doesn't even need compiler support for SSE4). The downside is that it is an opaque, unreadable, non-reusable blob of code.

This patch converts the former to intrinsics as well, making its operation clearer and hopefully making it easier to adapt for other specialized implementations.

The resulting implementation is slightly faster on my system (i7-7820HQ) when compiled with GCC 7.3. Small variations in the code can affect the optimizer though, and have as much as a few % impact on speed.

@theuni
Member

theuni commented Jun 12, 2018

Nice!

@sipa See theuni@d79fb1d for clang compile fixes, and theuni@4ee6fbb for a change that may or may not be needed to avoid a performance hit on AMD.

@sipa sipa force-pushed the 201806_sse4intrin branch from 86e04f0 to 4f5e45a Compare June 12, 2018 03:14
Round(a, b, c, d, e, f, g, h, Ws[0]);
XTMP0 = _mm_alignr_epi8(X3, X2, 4);
XTMP0 = _mm_add_epi32(XTMP0, X0);
XTMP3 = XTMP2 = XTMP1 = _mm_alignr_epi8(X1, X0, 4);
Member

Is there some voodoo here, why not just use XTMP3 below? Does this avoid a pipeline stall or something?

Member Author

No idea. It's just a translation of the existing assembly code.
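
For readers unfamiliar with the intrinsics in the quoted lines, here is a portable scalar sketch of what `_mm_alignr_epi8(hi, lo, 4)` and `_mm_add_epi32` compute on four 32-bit lanes. It assumes that X0 holds the four oldest message-schedule words and X1 the next four; the type `Lanes` and the helpers `AlignR4` and `Add32` are hypothetical names introduced for this note and are not part of the PR.

```cpp
#include <array>
#include <cstdint>

// Four 32-bit lanes; index 0 is the least significant lane, as in an __m128i.
using Lanes = std::array<uint32_t, 4>;

// Scalar model of _mm_alignr_epi8(hi, lo, 4): concatenate hi:lo into eight
// lanes, shift right by one 32-bit lane (4 bytes), and keep the low four lanes.
Lanes AlignR4(const Lanes& hi, const Lanes& lo)
{
    return {lo[1], lo[2], lo[3], hi[0]};
}

// Scalar model of _mm_add_epi32: lane-wise addition that wraps modulo 2^32.
Lanes Add32(const Lanes& a, const Lanes& b)
{
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}
```

Under that assumption, with X0 = {W0, W1, W2, W3} and X1 = {W4, W5, W6, W7}, `AlignR4(X1, X0)` yields {W1, W2, W3, W4}, that is, the same window shifted by one word.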

@sipa sipa force-pushed the 201806_sse4intrin branch 3 times, most recently from b0c24e2 to 5f4c79e Compare June 12, 2018 16:46
@sipa
Member Author

sipa commented Jun 12, 2018

@theuni Included the clang compile fixes. I'm going to benchmark to see whether to include the other changes.

@sipa sipa force-pushed the 201806_sse4intrin branch from 5f4c79e to 9fe51b4 Compare June 12, 2018 18:48
@sipa
Member Author

sipa commented Jun 14, 2018

It would be worthwhile to benchmark this on reasonably recent clang versions as well - the performance impact may be very different depending on how good the compiler is at ordering parallel instruction paths.

@sipa
Member Author

sipa commented Jun 18, 2018

Some more benchmarks, comparing GCC 7.3 and clang 6.0, for the SHA256 benchmark (i7-7820HQ, fixed to 2.2 GHz).

  • GCC, master: 4.4 ms
  • GCC, this PR: 4.3 ms
  • clang, master: 4.4 ms
  • clang, this PR: 4.8 ms

Unfortunately, it seems that clang isn't as good at producing performant code from intrinsics.

@theuni
Member

theuni commented Jul 19, 2018

@sipa Mind rebasing? I'd like to add the lib-per-cpu changes on top of this.

@sipa sipa force-pushed the 201806_sse4intrin branch from 9fe51b4 to 8655e78 Compare July 19, 2018 21:11
@sipa
Member Author

sipa commented Jul 19, 2018

Rebased, though I don't think this PR is acceptable until we have a way to avoid the performance loss in clang.

@DrahtBot
Contributor

DrahtBot commented Jul 28, 2018

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #13789 (crypto/sha256: Use pragmas to enforce necessary intrinsics for GCC and Clang by luke-jr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

t2 = _mm_srli_epi32(t2, 7);
t1 = _mm_or_si128(_mm_slli_epi32(t1, 32 - 7), t2);

Round(h, a, b, c, d, e, f, g, w32[1]);
Contributor

Same here and throughout this function :-)
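
As a side note on the quoted diff context: a left shift by `32 - 7` OR-ed with a right shift by 7 is the usual way to express a 32-bit rotate-right by 7 when no rotate instruction is available per lane. The sketch below is a scalar model of that idiom for a single lane. It assumes `t1` and `t2` held the same value before the two quoted lines, which the excerpt does not show; `RotR7` is a hypothetical name.

```cpp
#include <cstdint>

// Rotate a 32-bit word right by 7 bits using two shifts and an OR.
// This mirrors, for one lane, the pattern
//   _mm_or_si128(_mm_slli_epi32(x, 32 - 7), _mm_srli_epi32(x, 7))
// under the assumption that both shifts start from the same input value.
constexpr uint32_t RotR7(uint32_t x)
{
    return (x << (32 - 7)) | (x >> 7);
}
```

The bits shifted out on the right reappear at the top of the word, which is what distinguishes a rotation from a plain shift.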

@@ -615,12 +615,9 @@ std::string SHA256AutoDetect()
#endif

if (have_sse4) {
Contributor

Move the if statement inside of the #if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL) to remove the possibility of an empty if statement.
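
A minimal sketch of the suggested layout, assuming a simplified function; `DescribeImpl` and its return strings are hypothetical and only stand in for the real dispatch code in SHA256AutoDetect(). The macro names are the ones quoted in the comment above.

```cpp
#include <string>

// Hypothetical helper: reports which implementation would be selected.
// have_sse4 stands in for the result of a runtime CPU feature check.
std::string DescribeImpl(bool have_sse4)
{
    std::string ret = "standard";
#if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL)
    // The runtime check sits inside the guard, so when the macro condition is
    // false the preprocessor drops the whole statement instead of leaving an
    // `if` with an empty body.
    if (have_sse4) {
        ret = "sse4";
    }
#else
    (void)have_sse4; // silence the unused-parameter warning in this branch
#endif
    return ret;
}
```

When the guard is false, the function compiles down to the fallback path only, and no empty conditional remains for the compiler or a reader to stumble over.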

@bitcoin bitcoin deleted a comment from STALININST Oct 4, 2018
@sipa sipa force-pushed the 201806_sse4intrin branch from 8655e78 to 4a221ce Compare October 12, 2018 23:40
@maflcko
Member

maflcko commented May 20, 2019

Are you still working on this?

@maflcko maflcko closed this May 20, 2019
@maflcko maflcko reopened this May 20, 2019
@sipa
Member Author

sipa commented May 20, 2019

What version of clang are we using now? It's probably not a good idea to proceed with this unless it can be shown to have no negative performance impact on any release platform.

@maflcko
Member

maflcko commented May 20, 2019

@sipa
Member Author

sipa commented May 20, 2019

I'll close this for now, then.

@sipa sipa closed this May 20, 2019
@fanquake
Member

@MarcoFalke, Clang on my macOS machine (Xcode 10.2.1) is:

Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

@DesWurstes
Contributor

...which is actually Clang 5.

@fanquake
Member

I'm going to reopen this, as we will be switching to a newer version of Clang in gitian.

@fanquake fanquake reopened this May 23, 2019
@sipa
Member Author

sipa commented May 23, 2019

I'll benchmark again in clang-7.

@maflcko
Member

maflcko commented May 23, 2019

I see a slowdown in SHA256 and SHA256_32b with both gcc-9 and clang-8

@fanquake
Member

Further benchmarking, reported here and outside this PR, has shown that there are likely slowdown issues with this change and recent versions of Clang. Closing again for now.

@fanquake fanquake closed this Jun 24, 2019
@laanwj laanwj added this to the Future milestone Sep 30, 2019
@laanwj laanwj removed the Future label Sep 30, 2019
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021
@hebasto
Member

hebasto commented Sep 24, 2023

Picked up for MSVC builds in #28526.

@maflcko maflcko removed this from the Future milestone Jul 23, 2025