Skip to content

Conversation

TheBlueMatt
Copy link
Contributor

Based on #13191, This speeds up 4-way SHA256 by about 3.75x over the C impl.

@laanwj
Copy link
Member

laanwj commented May 9, 2018

Cool!
Ping @luke-jr

Copy link
Member

@luke-jr luke-jr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if this works as-is; either way, some cleanup needed to avoid breaking non-POWER builds.

configure.ac Outdated
@@ -762,6 +762,29 @@ AC_LINK_IFELSE([AC_LANG_SOURCE([
)
LDFLAGS="$TEMP_LDFLAGS"

TEMP_LDFLAGS="$LDFLAGS"
LDFLAGS="$TEMP_LDFLAGS -mpower8-vector"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be CPPFLAGS or CXXFLAGS?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, but this should be assigned to POWER_VSX_CXXFLAGS or so above, in an AX_CHECK_LINK_FLAG test.

src/Makefile.am Outdated
@@ -309,6 +310,16 @@ crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2
endif
crypto_libbitcoin_crypto_avx2_a_SOURCES = crypto/sha256_avx2.cpp

if HAVE_POWER8
crypto_libbitcoin_crypto_power8_a_CPPFLAGS = $(AM_CPPFLAGS) -mpower8-vector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably needed on CXXFLAGS, since the man page says it changes the code generated.

@@ -29,6 +29,15 @@ namespace sha256d64_avx2
void Transform_8way(unsigned char* out, const unsigned char* in);
}

#if defined(__linux__) && defined(HAVE_POWER8)
#include <sys/auxv.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this here, and POWER vectoring stuff seems to screw with standard C++ stuff, so maybe not a good idea to have it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auxv isnt POWER- or vector-specific, its to get getauxval which is a libc function.

@@ -511,6 +520,12 @@ std::string SHA256AutoDetect()
ret = "sse4";
#endif
}
#elif (defined(__linux__)) && defined(HAVE_POWER8)
if (getauxval(AT_HWCAP2) & 0x02000000) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this check to a function in the power8-specific file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its right below the x86-specific support check, and its a short file, it seems like overkill.

{

typedef __vector uint32_t uint32x4_p8;
typedef __vector uint8_t uint8x16_p8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__vector is a non-standard GCC extension; any reason we can't use vector here directly?

#include <config/bitcoin-config.h>
#endif

#ifdef HAVE_POWER8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this file is only compiled with HAVE_POWER8, I think we don't need this here.

Copy link
Member

@theuni theuni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK!

First set of build-system comments, there are few things to sort out. Will continue reviewing after the first round of cleanups.

src/Makefile.am Outdated
crypto_libbitcoin_crypto_avx2_a_CPPFLAGS = $(CPPFLAGS)
if ENABLE_AVX2
crypto_libbitcoin_crypto_avx2_a_CXXFLAGS += $(AVX2_CXXFLAGS)
crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't compiler defines be used instead so that non-autoconf builds can compile with optims too?

Specifically for this one, rather than ifdef(ENABLE_AVX2), use ifdef(__AVX2__)

configure.ac Outdated
@@ -762,6 +762,29 @@ AC_LINK_IFELSE([AC_LANG_SOURCE([
)
LDFLAGS="$TEMP_LDFLAGS"

TEMP_LDFLAGS="$LDFLAGS"
LDFLAGS="$TEMP_LDFLAGS -mpower8-vector"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, but this should be assigned to POWER_VSX_CXXFLAGS or so above, in an AX_CHECK_LINK_FLAG test.

configure.ac Outdated
@@ -327,6 +330,45 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
)
CXXFLAGS="$TEMP_CXXFLAGS"

AC_DEFINE(ENABLE_AVX2, 1, [Define this symbol to build code that uses AVX2 intrinsics])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only define these in their respective check's success cases.

configure.ac Outdated
}
]])],
[
AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the defines/conditionals down with the other asm flags for consistency.

src/Makefile.am Outdated
@@ -289,6 +294,32 @@ if USE_ASM
crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp
endif

crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should all be $(AM_CXXFLAGS) $(PIE_FLAGS).

src/Makefile.am Outdated
@@ -289,6 +294,32 @@ if USE_ASM
crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp
endif

crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS)
crypto_libbitcoin_crypto_sse41_a_CPPFLAGS = $(CPPFLAGS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for the CPPFLAGS.

src/Makefile.am Outdated
crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp

LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8)
EXTRA_LIBRARIES += $(LIBBITCOIN_CRYPTO_POWER8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append to EXTRA_LIBRARIES outside of the if HAVE_POWER8. These libraries are only built if something depends on them.

src/Makefile.am Outdated
crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS)
crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp

LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether this works as intended or not, but we don't want to be doing it either way. Add LIBBITCOIN_CRYPTO_POWER8 to bitcoind_LDADD like the others, we can potentially aggregate them as a follow-up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? Its shorthand for "crypto depends on this" instead of listing it as a dep in every target that may use it. Its also much less britle, which is important given we dont have a POWER travis target.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with aggregating, but as a clean follow-up, or as part of #13191. I'd just prefer not to have one-off's. It's not clear, for example, if EXTRA_LIBRARIES is appended twice, as it depends on when LIBBITCOIN_CRYPTO is calculated.

@TheBlueMatt
Copy link
Contributor Author

@luke-jr No, it definitely works, but its apparently /only/ tested on POWER, not "not yet tested" :p.

@TheBlueMatt TheBlueMatt force-pushed the 2018-05-asm branch 2 times, most recently from 184d626 to 9329896 Compare May 9, 2018 21:05
configure.ac Outdated
]
)
AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8])
LDFLAGS="$TEMP_LDFLAGS"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should reset CXXFLAGS, not LDFLAGS.

@TheBlueMatt TheBlueMatt force-pushed the 2018-05-asm branch 2 times, most recently from 667816c to 84c7c7d Compare May 9, 2018 22:27
configure.ac Outdated
]])],
[
AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.])
AC_MSG_RESULT(yes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

POWER8_ASM_CXXFLAGS="-mpower8-vector"
AC_MSG_RESULT(yes)

Then at the bottom, with the others:

AC_SUBST(POWER8_ASM_CXXFLAGS)

Then in the Makefile.am:

crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS) $(POWER8_ASM_CXXFLAGS)

configure.ac Outdated
AC_MSG_RESULT(no)
]
)
AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HAVE_POWER8 is a define that will be handled by autoheader, it can't be treated like a variable here. Need to set one separately.

@gmaxwell
Copy link
Contributor

spiffy! will test soon

@TheBlueMatt
Copy link
Contributor Author

Rebased.

This speeds up 4-way SHA256 by about 3.75x over the C impl.
Previously checkqueue bench would essentially just test cross-core
latency with more threads the more threads a host had. Not only did
this make the result worse on hosts with more CPU cores, but it
doesn't mimic how we use the checkqueue in real life, and made
bench_bitcoin almost unusably slow on high-core-count systems,
especially dual-CPU systems.

Instead, we use the same thread count that we use for the real
validation queue, with a 75-microsecond busy-wait on each thread,
which should at least vaguely mimic how we use the checkqueue.
@TheBlueMatt
Copy link
Contributor Author

Rebased. Should be ready for merge +/- some more review, though luckily the tests should mostly cover it.

@TheBlueMatt
Copy link
Contributor Author

Note that I also had to fix the checkqueue bench as otherwise bench_bitcoin is unsuably slow on multi-cpu (or many-core) systems.

@laanwj
Copy link
Member

laanwj commented Jun 14, 2018

I'm unable to test this at the moment. Can someone please post the benchmark results before/after this?

@DrahtBot
Copy link
Contributor

DrahtBot commented Jul 9, 2018

Needs rebase

// file COPYING or http://www.opensource.org/licenses/mit-license.php.
//
// This is a translation to GCC extended asm syntax from YASM code by Intel
// (available at the bottom of this file).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any YASM code at the bottom; also, really from Intel?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like that was copied from the sse4 code.

#include <stdint.h>
]], [[
unsigned char src[16];
__builtin_crypto_vshasigmaw((__vector uint32_t)vec_vsx_ld(0, src), 1, 0xf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use vector here. We don't use [the non-standard] __vector in the code itself.

@@ -314,6 +314,7 @@ fi
AX_CHECK_COMPILE_FLAG([-msse4.2],[[SSE42_CXXFLAGS="-msse4.2"]],,[[$CXXFLAG_WERROR]])
AX_CHECK_COMPILE_FLAG([-msse4.1],[[SSE41_CXXFLAGS="-msse4.1"]],,[[$CXXFLAG_WERROR]])
AX_CHECK_COMPILE_FLAG([-mavx -mavx2],[[AVX2_CXXFLAGS="-mavx -mavx2"]],,[[$CXXFLAG_WERROR]])
AX_CHECK_COMPILE_FLAG([-mpower8-vector],[[POWER8_CXXFLAGS="-mpower8-vector"]],,[[$CXXFLAG_WERROR]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think -faltivec is needed too?

@gmaxwell
Copy link
Contributor

This needs a trivial rebase but with that works fine against master. ACK. Will post benchmarks after IBD finishes.

@gmaxwell
Copy link
Contributor

Before:
MerkleRoot, 5, 800, 21.6763, 0.00541869, 0.00541984, 0.00541886
After:
MerkleRoot, 5, 800, 6.32571, 0.00158138, 0.00158148, 0.00158142

On power9.

Interesting reindex speed was about the same for both.

@gmaxwell
Copy link
Contributor

@TheBlueMatt Rebase.

@TheBlueMatt
Copy link
Contributor Author

Hmm, well if it doesnt speed up rebase then there is something funky we should look into before merging new code, no?

@TheBlueMatt
Copy link
Contributor Author

Gonna go ahead and close it, if someone wants to adopt it and figure out why IBD didn't improve where the equivalent x86 change improved it significantly, they should adopt it.

@gmaxwell
Copy link
Contributor

The comparable change made a big impact on x86 many months ago-- who knows now... Synchronization primitives are more expensive on power, so it also wouldn't be surprising if our glib use of atomics all over the place weren't moving around the limiting factors in performance on power compared to x86.

@bitcoin bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants