-
Notifications
You must be signed in to change notification settings - Fork 37.7k
[POC] C++20 std::endian
#28674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[POC] C++20 std::endian
#28674
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
88aec3a
to
c9b1702
Compare
Concept ACK |
1 similar comment
Concept ACK |
Concept ACK |
c9b1702
to
f3c2a2a
Compare
f3c2a2a
to
aebee53
Compare
I've been experimenting with benchmark comparisons between pulls and master on my test coverage tool corecheck (https://corecheck.dev/bitcoin/bitcoin/pulls/28674). Note that the benchmarks are being run on ARM64 CPUs, but I tested locally on an x86 and the performance loss is roughly the same. |
cc @theuni we might not always be getting bswaps |
@aureleoules Thanks for testing! I tested this by comparing the x86_64 asm output of a simple test file and confirmed that it compiled down to bswaps as necessary. I didn't do the same for other arches. Perhaps others are missing the critical optimizations? :( Could you say more about your compiler/flags? I would expect to see the nasty behavior you're seeing:
Would you mind pasting the flags use for compile? Without |
I'd presume it is https://github.com/corecheck/coverage-worker/blob/7d4767493be390399b54ea3cd8cafc2f068c19e2/entrypoint.sh#L58 (default flags |
Yes that is correct. Also the compiler installed is |
Here is the output of the configure script of the corecheck job if this helps:
|
Same bench result here, compiling with
|
It would be useful to have compiler/version used/C{XX} flags/configure options (even if just mentioning it's the default) displayed next to the benchmarking information, as I assume this will always be one of the first questions asked in relation to benchmark output. |
aebee53
to
38bad58
Compare
Rebased on master/the current C++20 PR, which should improve the CI here.
Maybe something was being cached, as this seems to have updated now. |
Yes it may take some time for the sonarcloud worker to finish, as well as sonarcloud to refresh the results. |
Hmm, playing around with this code on godbolt, it hardly ever compiles down to bswaps. I'm not sure what changed from when I was initially investigating. Will have a look. |
Ok, interesting, the problem is pretty easy to see here: https://godbolt.org/z/d5EGs8Ybs GCC and MSVC both do the same thing: the optimization is done on the small static function, but it doesn't carry through when inlined. clang's output looks as expected. That explains why my test cases look good, but real world performance took a nosedive. Also interesting, switching the function to a macro doesn't change anything. @aureleoules How hard would it be to switch to using clang for a test-run to compare? I wonder if this is something GCC would be interested in taking a look at. clang can obviously do the right thing. |
Are you sure? When I tried yesterday, it did not (#28674 (comment)) |
Absolutely not! I'm making this up as I go.
Thanks, I guess that's the test I just asked for. Ok, I'll keep poking. But it seems like this approach likely won't work :( |
What about hiding the function inside a compilation unit, where they are turned into |
- use of std::endian means that we no longer have to rely on configure checks and compiler macros for compile-time endianness checks. - countl_zero replaces the need for __builtin_clz* - trust the compiler to optimize bswaps and drop platform-dependent implementations
38bad58
to
51e4a8f
Compare
++ret; | ||
} | ||
return ret; | ||
return (sizeof(x) * 8 )- std::countl_zero(x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could split this up from the other changes?
nit: Also missing clang-format on this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are essentially 3 changes here:
- using
std::endian
- using
std::countl_zero
- removing the non-standard byteswaps
The first two are useful regardless, though obviously much less so without doing the third.
I'll update this to break up the commits.
As an alternative to what's here, I'm going to try switching our byteswaps to compiler builtins for gcc/clang/msvc. That still leaves us with compiler-specific behavior, but that would at least be an improvement over the current required autoconf/cmake tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on what happens here, given we only use CountBits
in two places, if this is going to basically become a std lib call, we could just use it directly where needed, and remove our CountBits
unit/fuzz tests, so that we aren't nearly unit testing / fuzzing the standard library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer to leave this with its comment. Note that we don't use std::counl_zero
directly, we want kinda the opposite: (sizeof(x) * 8 )- std::countl_zero(x);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worked on a branch last week that left me very confused: https://github.com/theuni/bitcoin/commits/pr28674-rebase2/
It's the last commit (which I would expect to be a no-op) which causes the slowdown. For whatever reason, the built-in functions like be64toh
are much faster than my hand-written internal ones which implement the same code that glibc does in its headers! I'm still scratching my head trying to work out what's happening.
But for now, now that we have c++20, I'll go ahead and PR the first two parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer to leave this with its comment. Note that we don't use
std::counl_zero
directly, we want kinda the opposite:(sizeof(x) * 8 )- std::countl_zero(x);
Actually, as it turns out, std::bit_width is exactly what we want (and it's even described in terms of countl_zero
). I'll see if it works to use your suggestion with that function instead.
Hah, that's fun. Too fun in fact, so I'm not going to try because it would make things even more complicated if it worked :p |
Hmm. I have something working, but as usual, MSVC is the odd case. Looking at the current code, it's not clear to me how MSVC ends up doing byteswaps because we don't have any detection for it. But I also doubt that it's finding the functions in So I think it's possible that MSVC is always taking the slow path? If that's the case, fixing this here would give it a big speedup. @hebasto could you check with MSVC? It should be easy to check with master with something like: diff --git a/src/compat/byteswap.h b/src/compat/byteswap.h
index 9ee71ef267d..49e17c68828 100644
--- a/src/compat/byteswap.h
+++ b/src/compat/byteswap.h
@@ -43,6 +43,7 @@ inline uint32_t bswap_32(uint32_t x)
#if HAVE_DECL_BSWAP_64 == 0
inline uint64_t bswap_64(uint64_t x)
{
+#error using slow path
return (((x & 0xff00000000000000ull) >> 56)
| ((x & 0x00ff000000000000ull) >> 40)
| ((x & 0x0000ff0000000000ull) >> 24) |
bitcoin/build_msvc/bitcoin_config.h.in Line 69 in 3e69125
|
Thanks! theuni@bbc7203 ready for PR if it indeed provides a speedup. I don't see how it couldn't :) |
return x ? 8 * sizeof(unsigned long) - __builtin_clzl(x) : 0; | ||
} | ||
#endif | ||
#if HAVE_BUILTIN_CLZLL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can remove this from configure.ac now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it. Will do in the updated PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be possible after we pull the changes from minisketch: bitcoin-core/minisketch#80.
86b7f28 serialization: use internal endian conversion functions (Cory Fields) 432b18c serialization: detect byteswap builtins without autoconf tests (Cory Fields) 297367b crypto: replace CountBits with std::bit_width (Cory Fields) 52f9bba crypto: replace non-standard CLZ builtins with c++20's bit_width (Cory Fields) Pull request description: This replaces #28674, #29036, and #29057. Now ready for testing and review. Replaces platform-specific endian and byteswap functions. This is especially useful for kernel, as it means that our deep serialization code no longer requires bitcoin-config.h. I apologize for the size of the last commit, but it's hard to avoid making those changes at once. All platforms now use our internal functions rather than libc or platform-specific ones, with the exception of MSVC. Sadly, benchmarking showed that not all compilers are capable of detecting and optimizing byteswap functions, so compiler builtins are instead used where possible. However, they're now detected via macros rather than autoconf checks. This[ matches how libc++ implements std::byteswap for c++23](https://github.com/llvm/llvm-project/blob/main/libcxx/include/__bit/byteswap.h#L26). I suggest we move/rename `compat/endian.h`, but I left that out of this PR to avoid bikeshedding. #29057 pointed out some irregularities in benchmarks. After messing with various compilers and configs for a few weeks with these changes, I'm of the opinion that we can't win on every platform every time, so we should take the code that makes sense going forward. That said, if any real-world slowdowns are caused here, we should obviously investigate. ACKs for top commit: maflcko: ACK 86b7f28 📘 fanquake: ACK 86b7f28 - we can finish pruning out the __builtin_clz* checks/usage once the minisketch code has been updated. This is more good cleanup pre-CMake & for the kernal. Tree-SHA512: 715a32ec190c70505ffbce70bfe81fc7b6aa33e376b60292e801f60cf17025aabfcab4e8c53ebb2e28ffc5cf4c20b74fe3dd8548371ad772085c13aec8b7970e
Using
std::endian
from C++20 allows us to drop some amount of own code, including infra in the build system (which means we don't have to port and review it for CMake, only to delete it shortly after switching to C++20).Note that C++23 would take this even further, with the introduction of
std::byteswap
.