-
Notifications
You must be signed in to change notification settings - Fork 37.7k
doc: Correct uint256 hex string endianness #30526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
7f71275
to
f01848f
Compare
I'm not convinced about this change. The current code does have a number of confusing comments, but I think this confuses some things further. "Big endian" and "little endian" describe ways to encode/decode numbers as bytes. Byte arrays themselves do not have an endianness (only their interpretation as number does), and it's even ambiguous to apply the term to hex encodings of bytes. Given an array of bytes being converted to hexadecimal, we can imagine:
The hex format used for
None of these descriptions are what I imagine when I read "big-endian hex". I believe the most straightforward description is to treat
On the other hand,
|
Thanks for your detailed reasoning, @sipa! Documenting the endianness was my idea, and I appreciate your insights. I will investigate your claims in more detail later as I am currently traveling. |
After further consideration and research, I believe I've identified a key point that may help clarify our discussion about endianness in the codebase. Currently, it seems we have a form of double endian encoding in our codebase, where:
This double encoding might be the root of why we viewed this differently. While I agree that your proposed approach could lead to more precise terminology here, upon digging into the codebase, it appears that endianness terms are used more broadly than your strict interpretation would suggest: Endianness terminology used for hexadecimal representations:
Endianness applied to non-numeric data:
uint256 treated as having specific endianness: It also doesn't help that the The problem doesn't seem to be as simple as just changing the terminology, it seems to be deeply ingrained in the codebase. I assume that we can't actually simplify the code in most places (maybe for SipHash-type usages where the result is local and temporary), so we have to either refactor or document, right? How do you suggest we tackle this? |
f01848f
to
07de45a
Compare
Thank you for your feedback @paplorinc & @sipa! I am convinced by sipa's argument that endianness is primarily an in-memory representation order. Numbers like 4607 (= 0x11FF) do not have an endianness to them, neither should strings containing them, even though it is very tempting for hexadecimal. 07de45a contains much of sipa's input. Not sure what to do with pre-existing code @paplorinc found that uses endianness terminology imprecisely.
The RPC specifies txid as "transaction id encoded in little-endian hexadecimal". I dug into the behavior of the diff --git a/src/test/hash_tests.cpp b/src/test/hash_tests.cpp
index 51f1d4c840..590af57ff6 100644
--- a/src/test/hash_tests.cpp
+++ b/src/test/hash_tests.cpp
@@ -2,6 +2,7 @@
// Distributed under the MIT software license, see the accompanying
// file COPYING or http://www.opensource.org/licenses/mit-license.php.
+#include "uint256.h"
#include <clientversion.h>
#include <crypto/siphash.h>
#include <hash.h>
@@ -146,6 +147,10 @@ BOOST_AUTO_TEST_CASE(siphash)
BOOST_CHECK_EQUAL(SipHashUint256(k1, k2, x), sip256.Finalize());
BOOST_CHECK_EQUAL(SipHashUint256Extra(k1, k2, x, n), sip288.Finalize());
}
+
+ HashWriter writer{};
+ writer.write(std::vector<std::byte>{std::byte{'1'}, std::byte{'2'}, std::byte{'3'}});
+ BOOST_CHECK_EQUAL(writer.GetSHA256(), uint256S("a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"));
}
BOOST_AUTO_TEST_SUITE_END() The "a665a45.."-string was provided by 2 online SHA256 calculators hashing "123" as the code above. Result:
Conclusion: Byte order is unnecessarily reversed in this case, but now hard to change. Should probably be documented as "transaction id encoded in reverse-order hexadecimal" but this PR is just concerned with shortcomings of #30436. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 07de45a. I do think this is an improvement over the status quo, but I think a lot of things were less clear than they could be so I left a lot of suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK 07de45a
I love that we're striving to making the campground cleaner than we found it, we need more of these kind of changes <3
Agree with @ryanofsky, some of the comments could still use some clarification, but I'm fine with it as-is as well.
src/uint256.h
Outdated
/** Unlike FromHex this accepts any invalid input, thus it is fragile and deprecated */ | ||
/** Unlike FromHex this accepts any invalid input, thus it is fragile and deprecated! | ||
* | ||
* Hex strings that don't specify enough bytes to fill the internal array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the hex strings don't actually provide bytes, the bytes are computed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They specify which bytes are to be computed? :) Maybe there's some other way of making it clearer.
f352d9c
to
0828a36
Compare
🚧 At least one of the CI tasks failed. HintsMake sure to run all tests locally, according to the documentation. The failure may happen due to a number of reasons, for example:
Leave a comment here, if you need help tracking down a confusing failure. |
Latest force push to fix CI failure: https://github.com/bitcoin/bitcoin/pull/30526/checks?check_run_id=28237349796 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 0828a36. Thanks for considering the suggestions, and I like the way you now grouped the reverse-hex functions together and added an overall description of the representation.
I do think it would be better if the base_uint class try to didn't try to document the array by describing its memory layout on little endian machines. I think it would be clearer if it just said what order the array elements are in.
ACK 0828a36 |
Follow-up to bitcoin#30436. uint256 string representation was wrongfully documented as little-endian due to them being reversed by GetHex() etc, and base_blob::Compare() giving most significance to the beginning of the internal array. They are closer to "big-endian", but this commit tries to be even more precise than that. uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. setup_common.cpp - Skip needless ArithToUint256-conversion.
0828a36
to
73e3fa1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 73e3fa1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 73e3fa1
Just tweaked code comments since last review. I think this can be merged even though from the review comments it seems like there could be more more followups later. Everything here now seems like an improvement over the status quo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged with only 2 acks, since this is a documentation & test change.
It does seem like there are some possible followups here:
- Providing more context for reverse-byte hex representation #30526 (comment)
- Improving RPC documentation as suggested by paplorinc in #30526 (comment)
- Documenting base_uint::pn array #30526 (comment)
- Maybe adding stricter tests for comparison functions #30526 (comment) and #30526 (comment)
re: #30526 (comment)
While I agree that your proposed approach could lead to more precise terminology here, upon digging into the codebase, it appears that endianness terms are used more broadly than your strict interpretation would suggest:
Endianness terminology used for hexadecimal representations:
These are good finds and I think hodlinator's suggestion to document these as "transaction id encoded in reverse-order hexadecimal" could an improvement, although maybe it is ambiguous. "Reverse-order hexadecimal" sounds like the hexadecimal string might be reversed, rather than the bytes represented by the hexadecimal string being reversed. I might suggest changing documentation of these fields to:
- txid hash with bytes reversed, encoded in hex
- wtxid hash with bytes reversed, encoded in hex
but maybe something else would be better.
Endianness applied to non-numeric data:
- Base58 encoding:
Line 98 in 5d28013
// Allocate enough space in big-endian base58 representation.
I think this comment is using "big-endian" correctly, though maybe not very clearly. The idea behind base58 encoding is that you take a byte array, interpret the byte array as a big-endian number, and display that number in base-58 (with a special prefix of 1's if the original byte array began with 0's). (The python version of this function is much easier to understand, for reference.)
- CRC32:
bitcoin/src/crc32c/src/crc32c_read_le.h
Line 17 in 5d28013
// Reads a little-endian 32-bit integer from a 32-bit-aligned buffer.
This seems like a totally correct comment. This function is taking bytes and converting them to an integer. Seems like a straightforward case of using "endianness" to describe byte order of numeric data.
- SipHash:
Line 25 in 5d28013
* It is treated as if this was the little-endian interpretation of 8 bytes.
This also seems correct. The siphash class was two Write methods. There is a Write method take bytes as input, and another Write method that takes 64 bit numbers as input. The documentation is just saying if you give that one a number, the little endian representation of the number is what is hashed.
uint256 treated as having specific endianness:
- In policy code:
Line 93 in 5d28013
* wtxids as little endian encoded uint256, smallest to largest). */
This seems like a borderline case and probably would be good to clarify. It's just describing the way the wtxids are sorted before they are hashed. Instead of the implementation sorting the wtxid so they are in lexigraphic order, it sorts them so that the reversed bytes of the wtxid's would be in lexicographic order. This is equivalent to treating the wtxid bytes as little endian numbers and sorting the numbers.
It also doesn't help that the
uint256
getters (and testers likeArrayToString
) themselves reverse the byte order before returning it ()Lines 11 to 18 in 5d28013
std::string base_blob<BITS>::GetHex() const { uint8_t m_data_rev[WIDTH]; for (int i = 0; i < WIDTH; ++i) { m_data_rev[i] = m_data[WIDTH - 1 - i]; } return HexStr(m_data_rev); }
This isn't great but I think it's baked in and the best thing we can do is document it.
and that reading
uint64
chunks even considers the native endianness.Line 79 in 5d28013
constexpr uint64_t GetUint64(int pos) const { return ReadLE64(m_data.data() + pos * 8); }
This should be fine, because it is only using native endianness as an optimization, and does the same thing on big and little endian systems. It is just interpreting bytes of the blob as a little endian number, and on little endian systems this can be done efficiently with memcpy.
The problem doesn't seem to be as simple as just changing the terminology, it seems to be deeply ingrained in the codebase. I assume that we can't actually simplify the code in most places (maybe for SipHash-type usages where the result is local and temporary), so we have to either refactor or document, right? How do you suggest we tackle this?
I think we can clean up documentation various places, name functions better, and maybe in the long term move towards sipa' suggestion to replace these classes with using uint256 = std::array<std::byte, 32>
.
// Hex string representations are little-endian. | ||
/** @name Hex representation | ||
* | ||
* The reverse-byte hex representation is a convenient way to view the blob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "doc + test: Correct uint256 hex string endianness" (73e3fa1)
I think this documentation could use an introduction that gives some context. Would change to:
-
The hex representation used by
GetHex()
,ToString()
,FromHex()
andParseHexDeprecated()
is unconventional, as it shows bytes in reverse order. For instance, a 32-bit blob{0x12, 0x34, 0x56, 0x78}
is represented as"78563412"
instead of the more typical"12345678"
format used by tools like Unix'sxxd
or Python'sbytes.hex()
andbytes.fromhex()
functions to represent the same binary data.The reverse-byte hex representation of the blob is equivalent to treating the blob as a little-endian number, and then displaying that number in base 16. So it is consistent with the way the base_uint class converts blobs to numbers, and could be a convenient way to view blobs that are numbers.
This is a documentation-only change following up on suggestions made in the bitcoin#30526 review. Motivation for this change is that I was recently reviewing bitcoin#31583, which reminded me how confusing the arithmetic blob code was and made me want to write better comments.
This is a documentation-only change following up on suggestions made in the bitcoin#30526 review. Motivation for this change is that I was recently reviewing bitcoin#31583, which reminded me how confusing the arithmetic blob code was and made me want to write better comments.
3e0a992 doc: Clarify comments about endianness after #30526 (Ryan Ofsky) Pull request description: This is a documentation-only change following up on suggestions made in the #30526 review. Motivation for this change is that I was recently reviewing #31583, which reminded me how confusing the arithmetic blob code was and made me want to write better comments. ACKs for top commit: achow101: ACK 3e0a992 TheCharlatan: ACK 3e0a992 Sjors: ACK 3e0a992 BrandonOdiwuor: LGTM ACK 3e0a992 Tree-SHA512: 90d5582a25a51fc406d83ca6b8c4e5e4d3aee828a0497f4b226b2024ff89e29b9b50d0ae8ddeac6abf2757fe78548d58cf3dd54df4b6d623f634a2106048091d
This is a documentation-only change following up on suggestions made in the bitcoin#30526 review. Motivation for this change is that I was recently reviewing bitcoin#31583, which reminded me how confusing the arithmetic blob code was and made me want to write better comments.
This PR is a follow-up to #30436.
Only changes test-code and modifies/adds comments.
Byte order of hex string representation was wrongfully documented as little-endian, but are in fact closer to "big-endian" (endianness is a memory-order concept rather than a numeric concept).
[arith_]uint256
both store their data in arrays with little-endian byte order (arith_uint256
has host byte order within eachuint32_t
element).uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. Credits to @maflcko: #30436 (comment)
setup_common.cpp - Skip needless ArithToUint256-conversion. Credits to @stickies-v: #30436 (comment)
Logical reasoning for endianness
arith_uint256
(base_uint<256>
) to auint64_t
compares the beginning of the array, and verifies the remaining elements are zero....that is consistent with little endian ordering of the array.
arith_*
has host-ordering of eachuint32_t
element):String conversions
The reversal of order which happens when converting hex-strings <=> uint256 means strings are actually closer to big-endian, see the end of
base_blob<BITS>::SetHexDeprecated
:Same reversal here:
It now makes sense to me that
SetHexDeprecated
, upon receiving a shorter hex string that requires zero-padding, would pad as if the missing hex chars where towards the end of the little-endian byte array, as they are the most significant bytes. "Big-endian" string representation is also consistent with the case whereSetHexDeprecated
receives too many hex digits and discards the leftmost ones, as a form of integer narrowing takes place.How I got it wrong in #30436
Previously I used the less than (
<
) comparison to prove endianness, but foruint256
it usesmemcmp
and thereby gives priority to the lower bytes at the beginning of the array.arith_uint256
is different in that it begins by comparing the bytes from the end, as it is using little endian representation, where the bytes toward the end are more significant.(The commit documents that
base_blob::Compare()
is doing lexicographic ordering unlike thearith_*
-variant which is doing numeric ordering).