-
Notifications
You must be signed in to change notification settings - Fork 56
charconv improvements #77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
6f9b843
to
ff61ad9
Compare
[NOTE: always open the benchmark links in a new tab, or github will not go to the linked line, HTH]. @fargies tagging you here because you already know a bit about this part of the code and suggested some optimizations. I removed the branching when appending characters in But while looking closer at the benchmark results, I noticed that For example int32_t MSVC has In GCC the relationship is much better but still basically the same: uint32_t GCC has atou() 3-4x slower than read_dec() and int32_t GCC is 3-4x slower. I haven't profiled the code yet as my time is extremely limited, but looking at the code for atoi() or for atou() I'm not seeing anything egregious to cause this sort of performance degradation from the main worker functions Since you had some ideas in #75 , I'm asking here if you see anything that may be contributing to this cost. |
I also saw that whilst working on overflow, and tried to mitigate by using switch/case having a single return in the functions (helps the compiler in pipelining stuff), but enhancements were not that big. I've ran the code through cachegrind and callgrind, but due to a lot of inlining it's quite hard to identify where time is spent (also I was focused on read_x implementation). The How did the check on buffer length made a 10/20% enhancement ? is the benchmark being fed with incompatible numbers (too large for buffer) ? meaning when you benchmark you do only with nominal cases, not erroneous ones right ? I'll have a closer look, thanks for involving me 👍 |
Also, before this changeset, the benchmarks had some inconsistencies and were testing different number sets; the figures I linked above are now consistent across benchmarks. |
Found it 🎉 Simply because atou is not inlined, the jumps are expensive. Before modification:
After modification (see patch, inlining revealed some issues in overflow detection, value is left untouched in error cases):
diff --git a/src/c4/charconv.hpp b/src/c4/charconv.hpp
index 8af02a3..297abc3 100644
--- a/src/c4/charconv.hpp
+++ b/src/c4/charconv.hpp
@@ -880,6 +880,7 @@ inline size_t atoi_first(csubstr str, T * C4_RESTRICT v)
*
* @see atou_first() if the string is not trimmed to the value to read. */
template<class T>
+C4_ALWAYS_INLINE
bool atou(csubstr str, T * C4_RESTRICT v)
{
C4_STATIC_ASSERT(std::is_integral<T>::value);
diff --git a/test/test_charconv.cpp b/test/test_charconv.cpp
index 423ca6b..c98d713 100644
--- a/test/test_charconv.cpp
+++ b/test/test_charconv.cpp
@@ -1216,7 +1216,7 @@ void test_no_overflows(std::initializer_list<const char *> args)
template<class T>
void test_overflows_hex()
{
- T x;
+ T x = 0;
std::string str;
if (std::is_unsigned<T>::value)
{
@@ -1259,7 +1259,7 @@ void test_overflows_hex()
template<class T>
void test_overflows_bin()
{
- T x;
+ T x = 0;
std::string str;
if (std::is_unsigned<T>::value)
{
@@ -1383,7 +1383,7 @@ TEST_CASE("overflows.u64")
{ /* with leading zeroes */
std::string str;
- uint64_t x;
+ uint64_t x = 1;
str = "0o01" + std::string(21, '7');
CHECK_MESSAGE(!overflows<uint64_t>(to_csubstr(str)), "num=" << str);
str = "0o02" + std::string(21, '0'); |
Amazing! Thanks! I'm going to try that. |
I have tried it and now everything is consistent: And the results are really nice: eg, for u32/i32 I am getting a speedup of 3x over 🥳 EDIT @fargies turns out these results were inflated because the benchmarks were optimized out. See below for up-to-date results. |
How would you explain that regarding |
I guess it's because of how much stuff is in there... every single character results in a virtual call, and it uses the locale which must be acquired from the OS, resulting in a mutex. |
Really interesting 👀 , thanks |
d2fd9f5
to
cb78b54
Compare
@fargies While looking closer at the results, all the c4 atox functions were suspiciously consistent (even with overflow check!). It turns out that the benchmark was optimized out of existence for those functions. I refactored the benchmark to create some side-effects. The results are still very good, but no longer out-of-this-world as they seemed above;
Eg, see this CI benchmark run (but open the link in another browser tab). |
That's still really decent results |
@fargies I'm revisiting the This is very likely to be a significant speedup, and therefore a very compelling reason to compute the digits as you were suggesting. Plus, it comes with the added advantage that the buffer size is no longer overestimated as you were objecting to before. But of course I'm going to try it first (with intrinsics and everything) and see how faster or slower it is. So I'm writing code to compute the number of digits, and I came back to the notes you made above:
I've tagged you again because I'm trying to figure out the octal expression you came to. Can you explain the reasoning? |
BTW, I also had to scratch my head for decimal; this is what I came up with, and seems to be working in preliminary testing: template<class T>
C4_CONSTEXPR14 unsigned digits_dec(T v) noexcept
{
C4_STATIC_ASSERT(std::is_integral<T>::value);
C4_ASSERT(v >= 0);
if(v)
{
// https://stackoverflow.com/a/18054857/5875572
unsigned t = ((unsigned)msb<T>(v) + 1u) * 1233u >> 12;
C4_ASSERT(t < detail::powers_of_10<T>::values_size);
return 1u + t - (v < detail::powers_of_10<T>::values[t]);
}
return 1u;
} |
That would remove the
For octal my assumptions were that there's simply 3 bits per symbol ? but now you made me doubting. For the decimals I had in mind doing the same computation than for hex which gives the most significant byte, and then bucket this for the 8 possible combinations, having max 2 comparisons per bucket: switch (msb<T>(v) / 8)
{
case 0:
if (v >= 100)
return 3;
return (v >= 10) ? 2 : 1;
case 1:
if (v >= 10000)
return 5;
return (v >= 1000) ? 4 : 3;
... But that may be a bit naïve. Regarding the code you've referenced, I'm not sure on how the obvious way compares to this (article is not so clear about it). |
78877e4
to
ec485bb
Compare
Codecov Report
@@ Coverage Diff @@
## master #77 +/- ##
==========================================
+ Coverage 94.09% 94.34% +0.24%
==========================================
Files 53 54 +1
Lines 11699 12251 +552
==========================================
+ Hits 11008 11558 +550
- Misses 691 693 +2
Continue to review full report at Codecov.
|
8fd9527
to
2fd5583
Compare
This last part was super hard. Glad to see finally see all tests green. |
No description provided.