-
Notifications
You must be signed in to change notification settings - Fork 37.7k
Add POWER8 ASM for 4-way SHA256 #13203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Cool! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be surprised if this works as-is; either way, some cleanup needed to avoid breaking non-POWER builds.
configure.ac
Outdated
@@ -762,6 +762,29 @@ AC_LINK_IFELSE([AC_LANG_SOURCE([ | |||
) | |||
LDFLAGS="$TEMP_LDFLAGS" | |||
|
|||
TEMP_LDFLAGS="$LDFLAGS" | |||
LDFLAGS="$TEMP_LDFLAGS -mpower8-vector" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be CPPFLAGS or CXXFLAGS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, but this should be assigned to POWER_VSX_CXXFLAGS or so above, in an AX_CHECK_LINK_FLAG test.
src/Makefile.am
Outdated
@@ -309,6 +310,16 @@ crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2 | |||
endif | |||
crypto_libbitcoin_crypto_avx2_a_SOURCES = crypto/sha256_avx2.cpp | |||
|
|||
if HAVE_POWER8 | |||
crypto_libbitcoin_crypto_power8_a_CPPFLAGS = $(AM_CPPFLAGS) -mpower8-vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably needed on CXXFLAGS, since the man page says it changes the code generated.
src/crypto/sha256.cpp
Outdated
@@ -29,6 +29,15 @@ namespace sha256d64_avx2 | |||
void Transform_8way(unsigned char* out, const unsigned char* in); | |||
} | |||
|
|||
#if defined(__linux__) && defined(HAVE_POWER8) | |||
#include <sys/auxv.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this here, and POWER vectoring stuff seems to screw with standard C++ stuff, so maybe not a good idea to have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auxv isnt POWER- or vector-specific, its to get getauxval which is a libc function.
src/crypto/sha256.cpp
Outdated
@@ -511,6 +520,12 @@ std::string SHA256AutoDetect() | |||
ret = "sse4"; | |||
#endif | |||
} | |||
#elif (defined(__linux__)) && defined(HAVE_POWER8) | |||
if (getauxval(AT_HWCAP2) & 0x02000000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this check to a function in the power8-specific file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its right below the x86-specific support check, and its a short file, it seems like overkill.
src/crypto/sha256_power8.cpp
Outdated
{ | ||
|
||
typedef __vector uint32_t uint32x4_p8; | ||
typedef __vector uint8_t uint8x16_p8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__vector
is a non-standard GCC extension; any reason we can't use vector
here directly?
src/crypto/sha256_power8.cpp
Outdated
#include <config/bitcoin-config.h> | ||
#endif | ||
|
||
#ifdef HAVE_POWER8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this file is only compiled with HAVE_POWER8
, I think we don't need this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK!
First set of build-system comments, there are few things to sort out. Will continue reviewing after the first round of cleanups.
src/Makefile.am
Outdated
crypto_libbitcoin_crypto_avx2_a_CPPFLAGS = $(CPPFLAGS) | ||
if ENABLE_AVX2 | ||
crypto_libbitcoin_crypto_avx2_a_CXXFLAGS += $(AVX2_CXXFLAGS) | ||
crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't compiler defines be used instead so that non-autoconf builds can compile with optims too?
Specifically for this one, rather than ifdef(ENABLE_AVX2)
, use ifdef(__AVX2__)
configure.ac
Outdated
@@ -762,6 +762,29 @@ AC_LINK_IFELSE([AC_LANG_SOURCE([ | |||
) | |||
LDFLAGS="$TEMP_LDFLAGS" | |||
|
|||
TEMP_LDFLAGS="$LDFLAGS" | |||
LDFLAGS="$TEMP_LDFLAGS -mpower8-vector" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, but this should be assigned to POWER_VSX_CXXFLAGS or so above, in an AX_CHECK_LINK_FLAG test.
configure.ac
Outdated
@@ -327,6 +330,45 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[ | |||
) | |||
CXXFLAGS="$TEMP_CXXFLAGS" | |||
|
|||
AC_DEFINE(ENABLE_AVX2, 1, [Define this symbol to build code that uses AVX2 intrinsics]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only define these in their respective check's success cases.
configure.ac
Outdated
} | ||
]])], | ||
[ | ||
AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the defines/conditionals down with the other asm flags for consistency.
src/Makefile.am
Outdated
@@ -289,6 +294,32 @@ if USE_ASM | |||
crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp | |||
endif | |||
|
|||
crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should all be
src/Makefile.am
Outdated
@@ -289,6 +294,32 @@ if USE_ASM | |||
crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp | |||
endif | |||
|
|||
crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS) | |||
crypto_libbitcoin_crypto_sse41_a_CPPFLAGS = $(CPPFLAGS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for the CPPFLAGS.
src/Makefile.am
Outdated
crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp | ||
|
||
LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8) | ||
EXTRA_LIBRARIES += $(LIBBITCOIN_CRYPTO_POWER8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
append to EXTRA_LIBRARIES outside of the if HAVE_POWER8
. These libraries are only built if something depends on them.
src/Makefile.am
Outdated
crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS) | ||
crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp | ||
|
||
LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether this works as intended or not, but we don't want to be doing it either way. Add LIBBITCOIN_CRYPTO_POWER8 to bitcoind_LDADD like the others, we can potentially aggregate them as a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? Its shorthand for "crypto depends on this" instead of listing it as a dep in every target that may use it. Its also much less britle, which is important given we dont have a POWER travis target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with aggregating, but as a clean follow-up, or as part of #13191. I'd just prefer not to have one-off's. It's not clear, for example, if EXTRA_LIBRARIES is appended twice, as it depends on when LIBBITCOIN_CRYPTO is calculated.
@luke-jr No, it definitely works, but its apparently /only/ tested on POWER, not "not yet tested" :p. |
184d626
to
9329896
Compare
configure.ac
Outdated
] | ||
) | ||
AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8]) | ||
LDFLAGS="$TEMP_LDFLAGS" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should reset CXXFLAGS, not LDFLAGS.
667816c
to
84c7c7d
Compare
configure.ac
Outdated
]])], | ||
[ | ||
AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.]) | ||
AC_MSG_RESULT(yes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
POWER8_ASM_CXXFLAGS="-mpower8-vector"
AC_MSG_RESULT(yes)
Then at the bottom, with the others:
AC_SUBST(POWER8_ASM_CXXFLAGS)
Then in the Makefile.am:
crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS) $(POWER8_ASM_CXXFLAGS)
configure.ac
Outdated
AC_MSG_RESULT(no) | ||
] | ||
) | ||
AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HAVE_POWER8 is a define that will be handled by autoheader, it can't be treated like a variable here. Need to set one separately.
spiffy! will test soon |
Rebased. |
46b601d
to
6174209
Compare
This speeds up 4-way SHA256 by about 3.75x over the C impl.
Previously checkqueue bench would essentially just test cross-core latency with more threads the more threads a host had. Not only did this make the result worse on hosts with more CPU cores, but it doesn't mimic how we use the checkqueue in real life, and made bench_bitcoin almost unusably slow on high-core-count systems, especially dual-CPU systems. Instead, we use the same thread count that we use for the real validation queue, with a 75-microsecond busy-wait on each thread, which should at least vaguely mimic how we use the checkqueue.
Rebased. Should be ready for merge +/- some more review, though luckily the tests should mostly cover it. |
Note that I also had to fix the checkqueue bench as otherwise bench_bitcoin is unsuably slow on multi-cpu (or many-core) systems. |
I'm unable to test this at the moment. Can someone please post the benchmark results before/after this? |
Needs rebase |
// file COPYING or http://www.opensource.org/licenses/mit-license.php. | ||
// | ||
// This is a translation to GCC extended asm syntax from YASM code by Intel | ||
// (available at the bottom of this file). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any YASM code at the bottom; also, really from Intel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like that was copied from the sse4 code.
#include <stdint.h> | ||
]], [[ | ||
unsigned char src[16]; | ||
__builtin_crypto_vshasigmaw((__vector uint32_t)vec_vsx_ld(0, src), 1, 0xf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use vector
here. We don't use [the non-standard] __vector
in the code itself.
@@ -314,6 +314,7 @@ fi | |||
AX_CHECK_COMPILE_FLAG([-msse4.2],[[SSE42_CXXFLAGS="-msse4.2"]],,[[$CXXFLAG_WERROR]]) | |||
AX_CHECK_COMPILE_FLAG([-msse4.1],[[SSE41_CXXFLAGS="-msse4.1"]],,[[$CXXFLAG_WERROR]]) | |||
AX_CHECK_COMPILE_FLAG([-mavx -mavx2],[[AVX2_CXXFLAGS="-mavx -mavx2"]],,[[$CXXFLAG_WERROR]]) | |||
AX_CHECK_COMPILE_FLAG([-mpower8-vector],[[POWER8_CXXFLAGS="-mpower8-vector"]],,[[$CXXFLAG_WERROR]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think -faltivec
is needed too?
This needs a trivial rebase but with that works fine against master. ACK. Will post benchmarks after IBD finishes. |
Before: On power9. Interesting reindex speed was about the same for both. |
@TheBlueMatt Rebase. |
Hmm, well if it doesnt speed up rebase then there is something funky we should look into before merging new code, no? |
Gonna go ahead and close it, if someone wants to adopt it and figure out why IBD didn't improve where the equivalent x86 change improved it significantly, they should adopt it. |
The comparable change made a big impact on x86 many months ago-- who knows now... Synchronization primitives are more expensive on power, so it also wouldn't be surprising if our glib use of atomics all over the place weren't moving around the limiting factors in performance on power compared to x86. |
Based on #13191, This speeds up 4-way SHA256 by about 3.75x over the C impl.