Skip to content

Conversation

elichai
Copy link
Contributor

@elichai elichai commented Feb 3, 2020

Hi,
This is kinda chasing Concept ACK if it's worth the review effort.* (noticed it while reviewing a PR)

Right now both DecodeBase32 and DecodeBase64 are only ever being called with std::string but still process via char* and so they run strlen(O(n)) twice, once to validate it doesn't contain random zeros(ValidAsCString) another to get the length.

Instead we can iterate it as a std::string and then if we see 0 in the middle of the string we can fail the process.

* if it is, there's also DecodeBase58 .

And as a by-product saw 2 other places where we can pre-allocate memory.

EDIT: So the by-product turned out to be the only interesting part of this PR :) (see #18061 (comment))
According to the benchmarks this can cut the time it takes to parse a hex by half.

@elichai elichai changed the title util: Reimplement DecodeBase32/64 with std::string to remove strlen util: Change DecodeBase32/64 to use std::string to not call strlen twice Feb 3, 2020
@sanjaykdragon
Copy link
Contributor

ut: i think the compiler would optimize strlen being called twice into once

@elichai
Copy link
Contributor Author

elichai commented Feb 3, 2020

@sanjaykdragon good question. I'll check godbolt later. But even if you're right it's A. compiler dependent. B. still one time more than this.

EDIT: even with simplification, and force inlining everything it still calls strlen twice https://godbolt.org/z/ZBpdWa

Copy link
Contributor

@promag promag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what about the check done in ValidAsCString?

I think you need to show improvements otherwise it doesn't pay off. FWIW having ValidAsCString avoids .reserve().

@@ -104,7 +106,7 @@ std::vector<unsigned char> ParseHex(const char* psz)

std::vector<unsigned char> ParseHex(const std::string& str)
{
return ParseHex(str.c_str());
return ParseHex(str.c_str(), str.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3fad849

NACK this change. Instead you can drop ParseHex(const char*) and move the code here with the proper changes - I've done it and make check ran fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we have places where it's called with const char*. probably by dropping that it just calls the std::string constructor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggestion: can't we remove the std::vector<unsigned char> ParseHex(const char* psz); variant completely?
Does this cause any actual issues if we did?
I'd like to move away from C-strings where possible and it seems more of a simplification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only "problem" is a degregation in speed for places where we call this with C strings which I think is mostly(if not only?) tests(their raw strings usage)
It's probably worth the simplicity though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is copying a string actually relevant performance wise here? Even if it was, parsing hex should only happen on user or rpc inputs, so I don't think we need to over-optimize this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoFalke in my benchmarks if you have char* and you convert it into std::string it spends almost the same time as the code before this PR (X2 slower). I think it's not about copying, it's about heap allocation.
But if you prefer I'll be fine with removing the raw char* support because we probably only ever benefit from it in tests.

I agree that there's no need to over-optimize this but this is basically free optimization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Echoing the request of simply dropping ParseHex(const char* psz).

Generally I think we should try move away from C-strings where possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elichai Is there a single call site that needs the raw string interface? If not, and you have performance concerns, you may explicitly delete the interface, and thus force all callers to explicitly call ParseHex(std::string{psz}) and be aware of the performance impact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked now, it seems like ParseHex(const char* psz) is only used in tests except for 2 places:
https://github.com/bitcoin/bitcoin/blob/master/src/chainparams.cpp#L55
https://github.com/bitcoin/bitcoin/blob/master/src/net_processing.cpp#L2101

Both should almost never really run, so I'll just drop it

@DrahtBot
Copy link
Contributor

DrahtBot commented Feb 3, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@elichai
Copy link
Contributor Author

elichai commented Feb 4, 2020

So added some benchmarks, and it seems to be within the margin of noise except for ParseHex.(which is twice as fast for the 32 byte case) so I think I'll rebase to only do the ParseHex pre-allocation
Before:

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.77748,  3.49936e-07, 3.60382e-07, 3.55597e-07
Base64DecodeVec, 5,       1000000,   1.69569,  3.35597e-07, 3.41911e-07, 3.39301e-07
HexParse,        5,       1000000,   0.599282, 1.1813e-07,  1.2082e-07,  1.19942e-07

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.84349,  3.6175e-07,  3.79702e-07, 3.65207e-07
Base64DecodeVec, 5,       1000000,   1.69212,  3.35303e-07, 3.43094e-07, 3.36332e-07
HexParse,        5,       1000000,   0.592378, 1.17877e-07, 1.19108e-07, 1.18541e-07

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.84158,  3.59822e-07, 3.83746e-07, 3.67486e-07
Base64DecodeVec, 5,       1000000,   1.7276,   3.37973e-07, 3.51833e-07, 3.46255e-07
HexParse,        5,       1000000,   0.614073, 1.18379e-07, 1.26827e-07, 1.22456e-07

After:

# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.81012,  3.55877e-07, 3.68592e-07, 3.63113e-07
Base64DecodeVec, 5,       1000000,   1.84868,  3.66254e-07, 3.72895e-07, 3.69377e-07
HexParse,        5,       1000000,   0.281846, 5.55183e-08, 5.71883e-08, 5.63974e-08

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.82559, 3.58871e-07, 3.68576e-07, 3.65803e-07
Base64DecodeVec, 5,       1000000,   1.84521, 3.6735e-07,  3.71235e-07, 3.68715e-07
HexParse,        5,       1000000,   0.29672, 5.62958e-08, 6.46866e-08, 5.86538e-08

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.89398,  3.7271e-07,  3.83227e-07, 3.78387e-07
Base64DecodeVec, 5,       1000000,   1.90284,  3.73173e-07, 3.89818e-07, 3.81565e-07
HexParse,        5,       1000000,   0.288767, 5.70536e-08, 5.92714e-08, 5.73934e-08

@elichai elichai force-pushed the 2020-02-remove-strlen branch from 349816a to 42af035 Compare February 4, 2020 15:56
@elichai elichai changed the title util: Change DecodeBase32/64 to use std::string to not call strlen twice util: Pass size to ParseHex to assist preallocation Feb 4, 2020
@sanjaykdragon
Copy link
Contributor

So added some benchmarks, and it seems to be within the margin of noise except for ParseHex.(which is twice as fast for the 32 byte case) so I think I'll rebase to only do the ParseHex pre-allocation
Before:

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.77748,  3.49936e-07, 3.60382e-07, 3.55597e-07
Base64DecodeVec, 5,       1000000,   1.69569,  3.35597e-07, 3.41911e-07, 3.39301e-07
HexParse,        5,       1000000,   0.599282, 1.1813e-07,  1.2082e-07,  1.19942e-07

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.84349,  3.6175e-07,  3.79702e-07, 3.65207e-07
Base64DecodeVec, 5,       1000000,   1.69212,  3.35303e-07, 3.43094e-07, 3.36332e-07
HexParse,        5,       1000000,   0.592378, 1.17877e-07, 1.19108e-07, 1.18541e-07

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.84158,  3.59822e-07, 3.83746e-07, 3.67486e-07
Base64DecodeVec, 5,       1000000,   1.7276,   3.37973e-07, 3.51833e-07, 3.46255e-07
HexParse,        5,       1000000,   0.614073, 1.18379e-07, 1.26827e-07, 1.22456e-07

After:

# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.81012,  3.55877e-07, 3.68592e-07, 3.63113e-07
Base64DecodeVec, 5,       1000000,   1.84868,  3.66254e-07, 3.72895e-07, 3.69377e-07
HexParse,        5,       1000000,   0.281846, 5.55183e-08, 5.71883e-08, 5.63974e-08

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.82559, 3.58871e-07, 3.68576e-07, 3.65803e-07
Base64DecodeVec, 5,       1000000,   1.84521, 3.6735e-07,  3.71235e-07, 3.68715e-07
HexParse,        5,       1000000,   0.29672, 5.62958e-08, 6.46866e-08, 5.86538e-08

$ ./src/bench/bench_bitcoin -filter="Base64Decode|Base64DecodeVec|HexParse"
# Benchmark,    evals,   iterations,  total,      min,         max,       median
Base64Decode,    5,       1000000,   1.89398,  3.7271e-07,  3.83227e-07, 3.78387e-07
Base64DecodeVec, 5,       1000000,   1.90284,  3.73173e-07, 3.89818e-07, 3.81565e-07
HexParse,        5,       1000000,   0.288767, 5.70536e-08, 5.92714e-08, 5.73934e-08

utACK: if these numbers are real and true, this is a nice improvement

@DrahtBot
Copy link
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a "draft".

@DrahtBot
Copy link
Contributor

There hasn't been much activity lately and the patch still needs rebase. What is the status here?
  • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
  • Is it no longer relevant? ➡️ Please close.
  • Did the author lose interest or time to work on this? ➡️ Please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.

1 similar comment
@DrahtBot
Copy link
Contributor

There hasn't been much activity lately and the patch still needs rebase. What is the status here?
  • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
  • Is it no longer relevant? ➡️ Please close.
  • Did the author lose interest or time to work on this? ➡️ Please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.

@maflcko
Copy link
Member

maflcko commented Mar 21, 2022

Added "up for grabs", see also https://github.com/bitcoin/bitcoin/pull/23595/commits

@maflcko
Copy link
Member

maflcko commented Apr 4, 2022

I've split up the common changes (base refactoring work) into #24751

@fanquake
Copy link
Member

Closing this in favour of #24764 for now.

@maflcko
Copy link
Member

maflcko commented Apr 29, 2022

I don't think this was fixed by #24764

@maflcko maflcko reopened this Apr 29, 2022
@fanquake
Copy link
Member

I don't think this was fixed by #24764

I closed it because the PR has been inactive for a long time, the up for grabs label had already been applied, and it was just generating noise by conflicting with on-going changes. If someone wants to continue these changes they could rebase, and open a new PR, with the changes that are still relevant.

@fanquake fanquake closed this Apr 29, 2022
@bitcoin bitcoin locked and limited conversation to collaborators Apr 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants