Skip to content

Conversation

l2a5b1
Copy link
Contributor

@l2a5b1 l2a5b1 commented Jul 15, 2018

This pull request removes the boost/algorithm/string/case_conv.hpp dependency from the project.

boost/algorithm/string/case_conv.hpp is included for the boost::to_lower and boost::to_upper template functions.

We can replace the calls to these functions with straightforward alternative implementations that use the C++ Standard Library, because the functions are called with std::string objects that use standard 7-bit ASCII characters as argument.

The refactored implementation should work without the explicit static_cast<unsigned char> cast and unsigned char lambda return type. Both have been added defensively and to be explicit. Especially in case of the former, behaviour is undefined (potentially result in a crash) if the std::toupper argument is not an unsigned char.

A potential alternative, maybe even preferred, implementation to address the boost::to_lower function call in ParseNetwork(std::string) could have been:

if (net == "ipv4" || net == "IPv4") return NET_IPV4;
if (net == "ipv6" || net == "IPv6") return NET_IPV6;

This alternative implementation would however change the external behaviour of ParseNetwork(std::string).

This pull requests includes a unit test to validate the implementation of ParseNetwork(std::string) prior and after the removal of the case_conv.hpp dependency.

boost/algorithm/string/case_conv.hpp has been removed from the EXPECTED_BOOST_INCLUDES in test/lint/lint-includes.sh because it is no longer required.

@l2a5b1 l2a5b1 force-pushed the patch/remove_boost_case_conv.hpp branch from c125aee to e5085d3 Compare July 15, 2018 23:19
@practicalswift
Copy link
Contributor

Concept ACK modulo …

  • Nice to see the EXPECTED_BOOST_INCLUDES array in test/lint/lint-includes.sh get shorter
  • This is a good opportunity to also shorten the KNOWN_VIOLATIONS array in test/lint/lint-locale-dependence.sh – please replace the use of locale dependent functions std::lower and std::upper with locale independent alternatives

Copy link
Contributor

@Empact Empact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to include <algorithm>/<ctype.h> where appropriate.

std::string firstLetter = category.substr(0,1);
boost::to_upper(firstLetter);
strRet += "== " + firstLetter + category.substr(1) + " ==\n";
unsigned char firstLetter = std::toupper(static_cast<unsigned char>(category[0]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::toupper(category.front()) should work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Empact, I have addressed your feedback. The code to capitalize the first letter of a string has been isolated in void Capitalize(std::string& str) in utilstrencodings.cpp.

src/netbase.cpp Outdated
@@ -38,7 +37,7 @@ static const int SOCKS5_RECV_TIMEOUT = 20 * 1000;
static std::atomic<bool> interruptSocks5Recv(false);

enum Network ParseNetwork(std::string net) {
boost::to_lower(net);
std::transform(net.begin(), net.end(), net.begin(), [](unsigned char c) -> unsigned char {return std::tolower(c);});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can pass tolower directly like so: std::transform(net.begin(), net.end(), net.begin(), tolower);

The same doesn't work with std::tolower fyi.

Copy link
Contributor

@Empact Empact Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could cast like so, I believe that has the same effect: std::transform(net.begin(), net.end(), net.begin(), (int (*)(int)) std::tolower);

Note tolower accepts an int but handles it as an unsigned char: https://en.cppreference.com/w/c/string/byte/tolower

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Empact!

I opted for the lambda expression, because it allowed me to pass unsigned char characters to the tolower function.

I understood from the referenced documentation that it doesn't handle the int as an unsigned char (character to be converted. If the value of ch is not representable as unsigned char and does not equal EOF, the behavior is undefined.).

I hope you like the new implementation better.

@Empact
Copy link
Contributor

Empact commented Jul 16, 2018

Concept ACK

@laanwj
Copy link
Member

laanwj commented Jul 16, 2018

  • is std::tolower locale-independent? (I don't think so, as @practicalswift mentions, so let's use this opportunity to write a locale-independent ASCII deterministic one)
  • I'd prefer to factor this out to our own strtolower function in utilstrencodings.cpp, instead of writing this out here

@DrahtBot
Copy link
Contributor

DrahtBot commented Jul 16, 2018

Note to reviewers: This pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@l2a5b1
Copy link
Contributor Author

l2a5b1 commented Jul 16, 2018

Many thanks for the review comments!

@laanwj

is std::tolower locale-independent?

As used in this project at the moment both std::tolower and std::toupper should be locale-independent and fully deterministic given that the arguments are std::string objects with characters in the standard 7-bit ASCII range.

  • In one instance std::toupper is used to uppercase the first character of CRPCCommand::category, which gets its value from arrays declared like the one below where category is a std::string that uses standard 7-bit ASCII characters:

bitcoin/src/rpc/blockchain.cpp

Lines 1923 to 1929 in 171b03d

static const CRPCCommand commands[] =
{ // category name actor (function) argNames
// --------------------- ------------------------ ----------------------- ----------
{ "blockchain", "getblockchaininfo", &getblockchaininfo, {} },
{ "blockchain", "getchaintxstats", &getchaintxstats, {"nblocks", "blockhash"} },
{ "blockchain", "getblockstats", &getblockstats, {"hash_or_height", "stats"} },
{ "blockchain", "getbestblockhash", &getbestblockhash, {} },

  • In the second instance it lowercases network names where the accepted values are limited to ["ipv4", "ipv6", "tor", "onion"] to catch a potential "IPv4" and "IPv6" input.

In this particular case I have also raised the suggestion to drop boost::to_lower(net) and replace it with:

if (net == "ipv4" || net == "IPv4") return NET_IPV4;
if (net == "ipv6" || net == "IPv6") return NET_IPV6;

I would even prefer this alternative if the modified external behaviour is acceptable.

@laanwj

I'd prefer to factor this out to our own strtolower function in utilstrencodings.cpp, instead of writing this out here

I had definitely considered this but felt that it would be a potential rabbit hole without guidance.

Implementing utility functions strtolower and strtoupper without the use of the std::tolower or std::toupper should be trivial if it is acceptable to limit their use to std::string objects with standard 7-bit ASCII characters.

Is this good enough for our purposes?

If the objective is to write custom functions that accept std::string objects with extended 8-bit ASCII characters, but can't use std::tolower or std::toupper, then I am not sure this is doable unless we simplify things tremendously. If the most important thing is to be deterministic, one thing that come to mind is doing conversions based on a single character encoding like ISO-8859-15 and use conversion tables that literally transform characters without considering different alphabets and language specific cases.

Either way, I am happy to give this another shot by addressing @Empact 's comments; or pursuing alternatives proposed by @practicalswift and @laanwj if I can get further guidance in which direction to take this.

@practicalswift
Copy link
Contributor

practicalswift commented Jul 16, 2018

@251labs For say the custom locale independent std::toupper function you would simply return any character outside of the ASCII range as is:

In the default "C" locale, the following lowercase letters abcdefghijklmnopqrstuvwxyz are replaced with respective uppercase letters ABCDEFGHIJKLMNOPQRSTUVWXYZ.

More generally: as long as your custom functions return the same value as the standard library equivalents would have done under the default "C" locale everything is fine.

@l2a5b1 l2a5b1 force-pushed the patch/remove_boost_case_conv.hpp branch 4 times, most recently from afd6b5e to 4388ad5 Compare July 17, 2018 23:53
@l2a5b1
Copy link
Contributor Author

l2a5b1 commented Jul 18, 2018

Thanks again for the feedback!

Per the review comments, I have implemented custom ToLower and ToUpper functions in 6b2d814 that are locale independent and ASCII deterministic.

I have tried to make the ToLower(unsigned char) and ToUpper(unsigned char) functions defined in utilstrencodings.h as efficient as possible. I hope I did not miss something (obvious) that makes it a failed attempt.

@@ -1202,4 +1202,63 @@ BOOST_AUTO_TEST_CASE(test_DirIsWritable)
fs::remove(tmpdirname);
}

BOOST_AUTO_TEST_CASE(test_ToLower)
{
BOOST_CHECK(ToLower('@') == '@');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use BOOST_CHECK_EQUAL(ToLower('@'), '@') (it gives a clearer diagnostic message in case of failure).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, addressed in 0234af5

*/
inline unsigned char ToLower(unsigned char c)
{
static const unsigned char mappings[256] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of code!

What about return (c >= 'A' && c <= 'Z' ? (c - 'A') + 'a' : c);?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sipa,

I asked myself the same question and I am happy to change it!

Just to share why I decided to use a mappings array:

Support for extended ASCII

Should there be a time where it becomes important to have support for characters in the extended ASCII range then it will be trivial to implement it. It will then just be a matter of adjusting entries in mappings array vs extending the implementation with control statements like if (c == 'Ü') return 'ü'; or alternative logic.

Performance

I wanted our locale independent functions to be a good citizen in terms of performance for code that requires it.

I did some naive tests to get an impression what the code would compile to and it seemed that the compiler was able to generate more efficient code in the case return mappings[c];

At all optimization levels I would get a pointer to the mappings array where mappings are performed on:

lea rax, [rip + __ZZ7ToLowerhE15tolowermappings]
movzx   eax, byte ptr [rdi + rax]

Alternatively in the case of (c >= 'A' && c <= 'Z' ? (c - 'A') + 'a' : c) the compiler would generate something like this at O3 optimization:

    mov eax, edi 
    add al, -65 
    cmp al, 26
    jae LBB3_2
    add dil, 32
LBB3_2:
    movzx   eax, dil

Particularly in the case of iterations I felt the compiler was able to do better optimizations, for example minimizing iterations by operating in sequences. In the case of the former it could do 4 mapping operations per sequence; while in the latter of at most two mappings per iteration.

On this basis of this I assumed the code would also perform better, but I realize I can be very wrong because I am no expert at this.

I hope this feedback is useful. It it isn't, I will gladly change the code!

Copy link
Contributor

@practicalswift practicalswift Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly prefer @sipa:s more readable shorter version. I'd bet any potential performance difference is insignificant in the grand scheme of things :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @practicalswift, will update the code per @sipa's feedback!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @sipa. If we need to add extended ascii support, we can address that then, and this is not a performance-critical operation that calls for extra code to optimize it. Simple, repetitive code adds visual noise and can hide bugs as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Empact, I will update the pull request!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the feedback. I have updated the pull request and addressed your comments in 53b3bfa.

* characters in the standard 7-bit ASCII range.
* @param[in,out] str the string to convert to uppercase.
*/
void ToUpper(std::string& str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it's slightly confusing to have a version of ToUpper that operates on chars (and does not modify the argument) while the version operating on vectors does modify. Perhaps call this ConvertToUpper or so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, addressed in 0234af5

@l2a5b1 l2a5b1 force-pushed the patch/remove_boost_case_conv.hpp branch from 4388ad5 to f296a93 Compare July 18, 2018 12:13
strRet += "== " + firstLetter + category.substr(1) + " ==\n";
std::string heading = category;
Capitalize(heading);
strRet += "== " + heading + " ==\n";
Copy link
Contributor

@Empact Empact Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would return the string from Capitalize and inline that into the string construction.

std::transform(str.begin(), str.end(), str.begin(), [](unsigned char c){return ToLower(c);});
}

void ConvertToUpper(std::string& str)
Copy link
Contributor

@Empact Empact Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mu-nit: Ruby uses the terms Downcase and Upcase for these operations, which I like. Just a suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should define ConvertToUpper if we're not using it. The construction can be trivially re-created from ConvertToLower if we need it, and in the meantime it's just dead code.

* @return the uppercase equivalent of c; or the argument
* if no conversion is possible.
*/
inline unsigned char ToUpper(unsigned char c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: constexpr

* @return the lowercase equivalent of c; or the argument
* if no conversion is possible.
*/
inline unsigned char ToLower(unsigned char c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: constexpr

@l2a5b1 l2a5b1 force-pushed the patch/remove_boost_case_conv.hpp branch from f296a93 to 400dadf Compare July 18, 2018 15:03
BOOST_CHECK_EQUAL(ToLower('Z'), 'z');
BOOST_CHECK_EQUAL(ToLower('['), '[');
BOOST_CHECK_EQUAL(ToLower(0), 0);
BOOST_CHECK_EQUAL(ToLower(255), 255);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some ToUpper(...) tests as well :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @practicalswift! I was a bit too eager when I removed ToUpper(std::string&) and its tests (facepalm). Fixed in dff017f.

@l2a5b1 l2a5b1 force-pushed the patch/remove_boost_case_conv.hpp branch from 400dadf to 76d70e5 Compare July 19, 2018 10:39
std::string firstLetter = category.substr(0,1);
boost::to_upper(firstLetter);
strRet += "== " + firstLetter + category.substr(1) + " ==\n";
strRet += "== " + Capitalize(category) + " ==\n";
Copy link
Contributor

@Empact Empact Jul 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this modifies the category in-place, I think this breaks the comparison logic, e.g. on line 189.
Sorry, I think my original nit was misguided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize(…) (std::string Capitalize(std::string str)) does not modify category in-place :-)

Copy link
Contributor Author

@l2a5b1 l2a5b1 Jul 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Empact, we should be good. When I addressed your feedback, I did change the method signature from void Capitalize(std::string&) to std::string Capitalize(std::string), which passes the argument by value (category is copied into str).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I misread - agree the argument is passed by value, so it's fine. 👍

@Empact
Copy link
Contributor

Empact commented Jul 19, 2018

utACK 76d70e5fb1923cdbc4771f438f10ae0039beccaa

@l2a5b1 l2a5b1 force-pushed the patch/remove_boost_case_conv.hpp branch from 76d70e5 to aa32fe3 Compare July 25, 2018 09:46
l2a5b1 added 3 commits August 28, 2018 18:37
This commit implements a unit test that validates the `ParseNetwork(std::string)` implementation in `netbase.cpp`.
This commit implements custom equivalents for the C and C++ `tolower` and `toupper` Standard Library functions.
In addition it implements a utility function to capitalize the first letter of a string.
This commit removes the `boost/algorithm/string/case_conv.hpp` dependency from the project. It replaces the `boost::to_lower` and `boost::to_upper` functions with custom functions that are locale independent and ASCII deterministic.
@l2a5b1 l2a5b1 force-pushed the patch/remove_boost_case_conv.hpp branch from aa32fe3 to b193d5a Compare August 28, 2018 16:45
@l2a5b1
Copy link
Contributor Author

l2a5b1 commented Aug 28, 2018

b193d5a has been rebased on master (aa39ca7), woot!

BOOST_CHECK_EQUAL(ParseNetwork("TOR"), NET_ONION);

BOOST_CHECK_EQUAL(ParseNetwork(":)"), NET_UNROUTABLE);
BOOST_CHECK_EQUAL(ParseNetwork("tÖr"), NET_UNROUTABLE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, tÖr, that's like the Swedish version of Tor right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol @laanwj, best review comment ever! I do hidden messages or easter eggs from time to time, but this isn't one of them 😂😂

@laanwj
Copy link
Member

laanwj commented Aug 29, 2018

utACK b193d5a
certainly an improvement

@laanwj laanwj merged commit b193d5a into bitcoin:master Aug 29, 2018
kwvg pushed a commit to kwvg/dash that referenced this pull request May 22, 2021
Munkybooty pushed a commit to Munkybooty/dash that referenced this pull request Jun 30, 2021
…dependency

b193d5a Removes the Boost case_conv.hpp dependency. (251)
7a208d9 Implements custom tolower and toupper functions. (251)
e2ba043 Implements ParseNetwork unit test. (251)

Pull request description:

  This pull request removes the `boost/algorithm/string/case_conv.hpp` dependency from the project.

  `boost/algorithm/string/case_conv.hpp` is included for the `boost::to_lower` and `boost::to_upper` template functions.

  We can replace the calls to these functions with straightforward alternative implementations that use the C++ Standard Library, because the functions are called with `std::string` objects that use standard 7-bit ASCII characters as argument.

  The refactored implementation should work without the explicit `static_cast<unsigned char>` cast and `unsigned char` lambda return type. Both have been added defensively and to be explicit. Especially in case of the former, behaviour is undefined (potentially result in a crash) if the `std::toupper` argument is not an `unsigned char`.

  A potential alternative, maybe even preferred, implementation to address the `boost::to_lower` function call in `ParseNetwork(std::string)` could have been:

  ```c++
  if (net == "ipv4" || net == "IPv4") return NET_IPV4;
  if (net == "ipv6" || net == "IPv6") return NET_IPV6;
  ```
  This alternative implementation would however change the external behaviour of `ParseNetwork(std::string)`.

  This pull requests includes a unit test to validate the implementation of `ParseNetwork(std::string)`  prior and after the removal of the `case_conv.hpp` dependency.

  `boost/algorithm/string/case_conv.hpp` has been removed from the `EXPECTED_BOOST_INCLUDES` in `test/lint/lint-includes.sh` because it is no longer required.

Tree-SHA512: d803ae709f2368a3efb223097384a722436955bce0c44a1a5cffd0abb3164be0cce85ba0e9ebd9408166df3f1a95ea0c0d29e3a2534af2fae206c0419d67fde9

# Conflicts:
#	src/netbase.cpp
#	src/test/util_tests.cpp
#	src/util/strencodings.h
#	test/lint/lint-locale-dependence.sh
kwvg added a commit to kwvg/dash that referenced this pull request Jul 13, 2021
kwvg added a commit to kwvg/dash that referenced this pull request Jul 13, 2021
kwvg added a commit to kwvg/dash that referenced this pull request Jul 15, 2021
kwvg added a commit to kwvg/dash that referenced this pull request Jul 15, 2021
kwvg added a commit to kwvg/dash that referenced this pull request Jul 16, 2021
UdjinM6 added a commit to dashpay/dash that referenced this pull request Jul 16, 2021
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants