Convert the 1-way SSE4 SHA256 code from asm to intrinsics #13442
Nice! @sipa See theuni@d79fb1d for clang compile fixes, and theuni@4ee6fbb for a change that may or may not be needed to avoid a performance hit on AMD.
src/crypto/sha256_sse41.cpp
Round(a, b, c, d, e, f, g, h, Ws[0]);
XTMP0 = _mm_alignr_epi8(X3, X2, 4);
XTMP0 = _mm_add_epi32(XTMP0, X0);
XTMP3 = XTMP2 = XTMP1 = _mm_alignr_epi8(X1, X0, 4);
Is there some voodoo here, why not just use XTMP3 below? Does this avoid a pipeline stall or something?
No idea. It's just a translation of the existing assembly code.
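For readers following the thread, here is a minimal sketch of what `_mm_alignr_epi8(X1, X0, 4)` produces. It assumes `X0` and `X1` hold eight consecutive 32-bit schedule words (an assumption, since the surrounding code is not quoted here); `Words` and `AlignrOneWord` are hypothetical names used only for illustration. The intrinsic path is used when the compiler targets SSSE3, with an equivalent scalar fallback otherwise.

```cpp
#include <array>
#include <cstdint>
#if defined(__SSSE3__)
#include <tmmintrin.h> // _mm_alignr_epi8 (SSSE3)
#endif

using Words = std::array<uint32_t, 4>;

// Hypothetical helper: treats lo[0..3], hi[0..3] as one 8-word sequence and
// returns the four words starting at index 1, i.e. {lo[1], lo[2], lo[3], hi[0]}.
inline Words AlignrOneWord(const Words& hi, const Words& lo)
{
#if defined(__SSSE3__)
    const __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hi.data()));
    const __m128i l = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lo.data()));
    // Concatenate h:l (h in the high half), shift right by 4 bytes,
    // and keep the low 128 bits.
    const __m128i r = _mm_alignr_epi8(h, l, 4);
    Words out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), r);
    return out;
#else
    // Scalar equivalent of the byte-wise shift above.
    return Words{{lo[1], lo[2], lo[3], hi[0]}};
#endif
}
```

Under that assumption, the triple assignment in the quoted line simply seeds three temporaries with the same four-word window; whether the copies matter for scheduling is not settled in this thread.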
b0c24e2 to 5f4c79e
@theuni Included the clang compile fixes. I'm going to benchmark to see whether to include the other changes.
It would be worthwhile to benchmark this on reasonably recent clang versions as well - the performance impact may be very different depending on how good the compiler is at ordering parallel instruction paths.
Some more benchmarks, comparing GCC 7.3 and clang 6.0, for the
Unfortunately, it seems that clang isn't as good at producing performant code from intrinsics.
@sipa Mind rebasing? I'd like to add the lib-per-cpu changes on top of this.
Rebased, though I don't think this PR is acceptable until we have a way to avoid the performance loss in clang.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Conflicts
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
t2 = _mm_srli_epi32(t2, 7);
t1 = _mm_or_si128(_mm_slli_epi32(t1, 32 - 7), t2);

Round(h, a, b, c, d, e, f, g, w32[1]);
Same here and throughout this function :-)
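The shift/or pair in the excerpt above is the usual way to express a 32-bit rotate with SSE, which has no packed rotate instruction. A minimal sketch of a rotate-right by 7 across four lanes follows; `RotR7` and `RotR7x4` are hypothetical names, not functions from the patch, and a scalar fallback is used when SSE2 is not available.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 shifts and bitwise ops
#endif

// Scalar reference: rotate one 32-bit word right by 7 bits.
inline uint32_t RotR7(uint32_t x)
{
    return (x >> 7) | (x << (32 - 7));
}

// Hypothetical helper: rotate each of four 32-bit lanes right by 7 bits.
inline void RotR7x4(const uint32_t in[4], uint32_t out[4])
{
#if defined(__SSE2__)
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    // (v << 25) | (v >> 7), applied independently to every lane.
    const __m128i r = _mm_or_si128(_mm_slli_epi32(v, 32 - 7), _mm_srli_epi32(v, 7));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
#else
    for (int i = 0; i < 4; ++i) out[i] = RotR7(in[i]);
#endif
}
```

In SHA-256 a rotate by 7 is one of the terms of the small sigma0 function used in the message schedule, which is presumably what this fragment is building.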
@@ -615,12 +615,9 @@ std::string SHA256AutoDetect()
#endif

if (have_sse4) {
Move the if statement inside of the #if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL) block to remove the possibility of an empty if statement.
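A minimal sketch of the suggested layout, assuming the body of the `if` is itself only compiled under that preprocessor condition. `PickBackend` and `Sse4BackendName` are hypothetical stand-ins, not the project's functions, and the returned strings are placeholders.

```cpp
#include <string>

#if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL)
// Hypothetical stand-in for selecting the SSE4 implementation.
inline const char* Sse4BackendName() { return "sse4"; }
#endif

// Hypothetical selector: the whole `if` lives inside the #if block, so a
// build without ENABLE_SSE41 never compiles an `if` with an empty body.
inline std::string PickBackend(bool have_sse4)
{
    std::string ret = "standard";
#if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL)
    if (have_sse4) {
        ret = Sse4BackendName();
    }
#else
    (void)have_sse4; // unused in this configuration
#endif
    return ret;
}
```

With the guard wrapped around the statement rather than placed inside it, builds that lack the macro see no conditional at all.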
8655e78 to 4a221ce
Are you still working on this?
What version of clang are we using now? It's probably not a good idea to proceed with this unless it can be shown that it doesn't have a negative impact on performance on any release platform.
I'll close this for now, then.
@MarcoFalke, Clang on my macOS machine (Xcode 10.2.1) is:
...which is actually Clang 5.
I'm going to reopen this, as we will be switching to a newer version of
I'll benchmark again in clang-7.
I see a slowdown in
Further benchmarking, reported here and outside this PR, has shown that this change likely causes slowdowns with recent versions of Clang. Closing again for now.
Picked up for MSVC builds in #28526.
Currently, master contains 2 implementations of SHA256 for SSE4:
The advantage of the inline assembly is that its performance is not affected by compiler optimizations (and doesn't even need compiler support for SSE4). The downside is that it is an opaque, unreadable, non-reusable blob of code.
This patch converts the former to intrinsics as well, making its operation clearer while hopefully making it easier to adapt for other specialized implementations.
The resulting implementation is slightly faster on my system (i7-7820HQ) when compiled with GCC 7.3. Small variations in the code can affect the optimizer though, and can change the speed by as much as a few percent.
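For context on the `Round(...)` calls quoted in the review excerpts, here is a scalar sketch of one SHA-256 compression round as defined in FIPS 180-4. It is not taken from the patch: the helper names are illustrative, and `kw` is assumed to be the round constant already added to the schedule word, which is what the `Ws[0]` and `w32[1]` arguments appear to carry.

```cpp
#include <cstdint>

inline uint32_t RotR(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }
inline uint32_t Ch(uint32_t x, uint32_t y, uint32_t z) { return z ^ (x & (y ^ z)); }
inline uint32_t Maj(uint32_t x, uint32_t y, uint32_t z) { return (x & y) | (z & (x | y)); }
inline uint32_t Sigma0(uint32_t x) { return RotR(x, 2) ^ RotR(x, 13) ^ RotR(x, 22); }
inline uint32_t Sigma1(uint32_t x) { return RotR(x, 6) ^ RotR(x, 11) ^ RotR(x, 25); }

// One round. Only d and h are written; the caller rotates the argument
// order on the next call instead of moving values between variables.
inline void Round(uint32_t a, uint32_t b, uint32_t c, uint32_t& d,
                  uint32_t e, uint32_t f, uint32_t g, uint32_t& h,
                  uint32_t kw) // assumed: round constant + schedule word
{
    const uint32_t t1 = h + Sigma1(e) + Ch(e, f, g) + kw;
    const uint32_t t2 = Sigma0(a) + Maj(a, b, c);
    d += t1;
    h = t1 + t2;
}
```

This is why consecutive calls in the excerpts read `Round(a, b, c, d, e, f, g, h, ...)` followed by `Round(h, a, b, c, d, e, f, g, ...)`: the eight working variables keep their storage and only their roles shift by one position per round.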