Skip to content

Conversation

hodlinator
Copy link
Contributor

@hodlinator hodlinator commented Jul 25, 2024

This PR is a follow-up to #30436.

Only changes test-code and modifies/adds comments.

Byte order of hex string representation was wrongfully documented as little-endian, but are in fact closer to "big-endian" (endianness is a memory-order concept rather than a numeric concept). [arith_]uint256 both store their data in arrays with little-endian byte order (arith_uint256 has host byte order within each uint32_t element).

uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. Credits to @maflcko: #30436 (comment)

setup_common.cpp - Skip needless ArithToUint256-conversion. Credits to @stickies-v: #30436 (comment)


Logical reasoning for endianness

  1. Comparing an arith_uint256 (base_uint<256>) to a uint64_t compares the beginning of the array, and verifies the remaining elements are zero.
template <unsigned int BITS>
bool base_uint<BITS>::EqualTo(uint64_t b) const
{
    for (int i = WIDTH - 1; i >= 2; i--) {
        if (pn[i])
            return false;
    }
    if (pn[1] != (b >> 32))
        return false;
    if (pn[0] != (b & 0xfffffffful))
        return false;
    return true;
}

...that is consistent with little endian ordering of the array.

  1. They have the same endianness (but arith_* has host-ordering of each uint32_t element):
arith_uint256 UintToArith256(const uint256 &a)
{
    arith_uint256 b;
    for(int x=0; x<b.WIDTH; ++x)
        b.pn[x] = ReadLE32(a.begin() + x*4);
    return b;
}

String conversions

The reversal of order which happens when converting hex-strings <=> uint256 means strings are actually closer to big-endian, see the end of base_blob<BITS>::SetHexDeprecated:

    unsigned char* p1 = m_data.data();
    unsigned char* pend = p1 + WIDTH;
    while (digits > 0 && p1 < pend) {
        *p1 = ::HexDigit(trimmed[--digits]);
        if (digits > 0) {
            *p1 |= ((unsigned char)::HexDigit(trimmed[--digits]) << 4);
            p1++;
        }
    }

Same reversal here:

template <unsigned int BITS>
std::string base_blob<BITS>::GetHex() const
{
    uint8_t m_data_rev[WIDTH];
    for (int i = 0; i < WIDTH; ++i) {
        m_data_rev[i] = m_data[WIDTH - 1 - i];
    }
    return HexStr(m_data_rev);
}

It now makes sense to me that SetHexDeprecated, upon receiving a shorter hex string that requires zero-padding, would pad as if the missing hex chars where towards the end of the little-endian byte array, as they are the most significant bytes. "Big-endian" string representation is also consistent with the case where SetHexDeprecated receives too many hex digits and discards the leftmost ones, as a form of integer narrowing takes place.

How I got it wrong in #30436

Previously I used the less than (<) comparison to prove endianness, but for uint256 it uses memcmp and thereby gives priority to the lower bytes at the beginning of the array.

    constexpr int Compare(const base_blob& other) const { return std::memcmp(m_data.data(), other.m_data.data(), WIDTH); }

arith_uint256 is different in that it begins by comparing the bytes from the end, as it is using little endian representation, where the bytes toward the end are more significant.

template <unsigned int BITS>
int base_uint<BITS>::CompareTo(const base_uint<BITS>& b) const
{
    for (int i = WIDTH - 1; i >= 0; i--) {
        if (pn[i] < b.pn[i])
            return -1;
        if (pn[i] > b.pn[i])
            return 1;
    }
    return 0;
}

(The commit documents that base_blob::Compare() is doing lexicographic ordering unlike the arith_*-variant which is doing numeric ordering).

@DrahtBot
Copy link
Contributor

DrahtBot commented Jul 25, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK paplorinc, ryanofsky

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #30560 (refactor: Add consteval uint256 constructor by hodlinator)
  • #30377 (refactor: Add consteval uint256{"str"} by hodlinator)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@sipa
Copy link
Member

sipa commented Jul 27, 2024

I'm not convinced about this change. The current code does have a number of confusing comments, but I think this confuses some things further.

"Big endian" and "little endian" describe ways to encode/decode numbers as bytes. Byte arrays themselves do not have an endianness (only their interpretation as number does), and it's even ambiguous to apply the term to hex encodings of bytes. Given an array of bytes being converted to hexadecimal, we can imagine:

  • A. Those bytes being interpreted as a number (according to some endianness), and then that number is converted to hexadecimal for human consumption (typically most significant hex character first).
  • B. Those bytes being interpreted as a sequence of single-byte numbers, which are individually converted to hex characters (typically most significant hex character first) and just concatenated.

The hex format used for uint256 could be described as:

  • (A) Converting to a number by interpreting as little-endian bytes, and then converting that number to hex with most significant character first.
  • The bytes are reversed, and then (B) individually converted to hex with most significant character first and concatenating.
  • (B) individually converting the bytes to hex with least significant character first and concatenating, followed by reversing the characters.

None of these descriptions are what I imagine when I read "big-endian hex".


I believe the most straightforward description is to treat uint256 (despite its name) not as representing a number at all, but as a simple wrapper around 32-byte arrays (if we'd move some logic of it to separate functions, I think it would be totally reasonable to replace it with using uint256 = std::array<std::byte, 32> even), and endianness is then just an inapplicable term.

  • The hex encoding for uint256 I would then describe as "byte-reversed hex encoding" (perhaps with an explainer "matching the interpretation of the bytes in little-endian as arith_uint256 does, followed by converting that to hex").
  • The comparison between uint256 would then be best described as "lexicographic ordering (note: this does not match the ordering on the corresponding arith_uint256 interpretation)".

On the other hand, arith_uint256 actually does implement a big integer type, which means:

  • Its byte encoding can properly be described as little-endian.
  • Its hex encoding can be described as "hex encoding of the number (with the most significant characters first)".
  • Its comparison is numeric comparison.

@l0rinc
Copy link
Contributor

l0rinc commented Jul 27, 2024

Thanks for your detailed reasoning, @sipa!

Documenting the endianness was my idea, and I appreciate your insights. I will investigate your claims in more detail later as I am currently traveling.

@l0rinc
Copy link
Contributor

l0rinc commented Jul 28, 2024

After further consideration and research, I believe I've identified a key point that may help clarify our discussion about endianness in the codebase.
We seem to agree that the base16 to base256 conversion can indeed have an endianness.
Based on the Wikipedia definition of endianness as "the order in which bytes within a word of digital data are transmitted over a data communication medium or addressed (by rising addresses) in computer memory, counting only byte significance compared to earliness", the resulting bytes (representing any data, especially if it's fixed size) can also be stored forward or backward, which aligns with this concept of significance compared to earliness. So according to this, arrays can also have endianness.

Currently, it seems we have a form of double endian encoding in our codebase, where:

  • The base16 to base256 conversion has an endianness.
  • The storage of the resulting bytes can also reflect endianness.

This double encoding might be the root of why we viewed this differently.

While I agree that your proposed approach could lead to more precise terminology here, upon digging into the codebase, it appears that endianness terms are used more broadly than your strict interpretation would suggest:

Endianness terminology used for hexadecimal representations:

Endianness applied to non-numeric data:

uint256 treated as having specific endianness:

It also doesn't help that the uint256 getters (and testers like ArrayToString) themselves reverse the byte order before returning it (https://github.com/bitcoin/bitcoin/blob/master/src/uint256.cpp#L11-L18) and that reading uint64 chunks even considers the native endianness https://github.com/bitcoin/bitcoin/blob/master/src/uint256.h#L79.

The problem doesn't seem to be as simple as just changing the terminology, it seems to be deeply ingrained in the codebase. I assume that we can't actually simplify the code in most places (maybe for SipHash-type usages where the result is local and temporary), so we have to either refactor or document, right? How do you suggest we tackle this?

@hodlinator hodlinator force-pushed the 2024-07_PR30436_follow_up branch from f01848f to 07de45a Compare July 29, 2024 16:21
@hodlinator
Copy link
Contributor Author

hodlinator commented Jul 29, 2024

Thank you for your feedback @paplorinc & @sipa!

I am convinced by sipa's argument that endianness is primarily an in-memory representation order. Numbers like 4607 (= 0x11FF) do not have an endianness to them, neither should strings containing them, even though it is very tempting for hexadecimal.

07de45a contains much of sipa's input. Not sure what to do with pre-existing code @paplorinc found that uses endianness terminology imprecisely.

Transaction IDs: https://github.com/bitcoin/bitcoin/blob/master/src/rpc/mining.cpp#L620

The RPC specifies txid as "transaction id encoded in little-endian hexadecimal".

I dug into the behavior of the HashWriter used in CTransaction::ComputeHash() to compute txids:

diff --git a/src/test/hash_tests.cpp b/src/test/hash_tests.cpp
index 51f1d4c840..590af57ff6 100644
--- a/src/test/hash_tests.cpp
+++ b/src/test/hash_tests.cpp
@@ -2,6 +2,7 @@
 // Distributed under the MIT software license, see the accompanying
 // file COPYING or http://www.opensource.org/licenses/mit-license.php.
 
+#include "uint256.h"
 #include <clientversion.h>
 #include <crypto/siphash.h>
 #include <hash.h>
@@ -146,6 +147,10 @@ BOOST_AUTO_TEST_CASE(siphash)
         BOOST_CHECK_EQUAL(SipHashUint256(k1, k2, x), sip256.Finalize());
         BOOST_CHECK_EQUAL(SipHashUint256Extra(k1, k2, x, n), sip288.Finalize());
     }
+
+    HashWriter writer{};
+    writer.write(std::vector<std::byte>{std::byte{'1'}, std::byte{'2'}, std::byte{'3'}});
+    BOOST_CHECK_EQUAL(writer.GetSHA256(), uint256S("a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"));
 }
 
 BOOST_AUTO_TEST_SUITE_END()

The "a665a45.."-string was provided by 2 online SHA256 calculators hashing "123" as the code above.

Result:

test/hash_tests.cpp(153): error: in "hash_tests/siphash": check writer.GetSHA256() == uint256S("a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3") has failed
[e37aa2f7f7868e997ea01fff3f1f4aa0b84fdcef67487e419d2f422059a465a6 !=
 a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3]

Conclusion: Byte order is unnecessarily reversed in this case, but now hard to change. Should probably be documented as "transaction id encoded in reverse-order hexadecimal" but this PR is just concerned with shortcomings of #30436.

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 07de45a. I do think this is an improvement over the status quo, but I think a lot of things were less clear than they could be so I left a lot of suggestions.

Copy link
Contributor

@l0rinc l0rinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK 07de45a

I love that we're striving to making the campground cleaner than we found it, we need more of these kind of changes <3
Agree with @ryanofsky, some of the comments could still use some clarification, but I'm fine with it as-is as well.

src/uint256.h Outdated
/** Unlike FromHex this accepts any invalid input, thus it is fragile and deprecated */
/** Unlike FromHex this accepts any invalid input, thus it is fragile and deprecated!
*
* Hex strings that don't specify enough bytes to fill the internal array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the hex strings don't actually provide bytes, the bytes are computed, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They specify which bytes are to be computed? :) Maybe there's some other way of making it clearer.

@hodlinator hodlinator force-pushed the 2024-07_PR30436_follow_up branch 2 times, most recently from f352d9c to 0828a36 Compare August 1, 2024 21:20
@DrahtBot
Copy link
Contributor

DrahtBot commented Aug 1, 2024

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/28237349796

Hints

Make sure to run all tests locally, according to the documentation.

The failure may happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@hodlinator
Copy link
Contributor Author

Latest force push to fix CI failure: https://github.com/bitcoin/bitcoin/pull/30526/checks?check_run_id=28237349796

@DrahtBot DrahtBot removed the CI failed label Aug 1, 2024
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 0828a36. Thanks for considering the suggestions, and I like the way you now grouped the reverse-hex functions together and added an overall description of the representation.

I do think it would be better if the base_uint class try to didn't try to document the array by describing its memory layout on little endian machines. I think it would be clearer if it just said what order the array elements are in.

@DrahtBot DrahtBot requested a review from l0rinc August 1, 2024 22:55
@l0rinc
Copy link
Contributor

l0rinc commented Aug 2, 2024

ACK 0828a36

Follow-up to bitcoin#30436.

uint256 string representation was wrongfully documented as little-endian due to them being reversed by GetHex() etc, and base_blob::Compare() giving most significance to the beginning of the internal array. They are closer to "big-endian", but this commit tries to be even more precise than that.

uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side.

setup_common.cpp - Skip needless ArithToUint256-conversion.
@hodlinator hodlinator force-pushed the 2024-07_PR30436_follow_up branch from 0828a36 to 73e3fa1 Compare August 3, 2024 20:02
Copy link
Contributor

@l0rinc l0rinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 73e3fa1

@DrahtBot DrahtBot requested a review from ryanofsky August 3, 2024 20:11
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK 73e3fa1

Just tweaked code comments since last review. I think this can be merged even though from the review comments it seems like there could be more more followups later. Everything here now seems like an improvement over the status quo.

@ryanofsky ryanofsky merged commit 1a7d205 into bitcoin:master Aug 5, 2024
16 checks passed
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged with only 2 acks, since this is a documentation & test change.

It does seem like there are some possible followups here:

  1. Providing more context for reverse-byte hex representation #30526 (comment)
  2. Improving RPC documentation as suggested by paplorinc in #30526 (comment)
  3. Documenting base_uint::pn array #30526 (comment)
  4. Maybe adding stricter tests for comparison functions #30526 (comment) and #30526 (comment)

re: #30526 (comment)

While I agree that your proposed approach could lead to more precise terminology here, upon digging into the codebase, it appears that endianness terms are used more broadly than your strict interpretation would suggest:

Endianness terminology used for hexadecimal representations:

These are good finds and I think hodlinator's suggestion to document these as "transaction id encoded in reverse-order hexadecimal" could an improvement, although maybe it is ambiguous. "Reverse-order hexadecimal" sounds like the hexadecimal string might be reversed, rather than the bytes represented by the hexadecimal string being reversed. I might suggest changing documentation of these fields to:

  • txid hash with bytes reversed, encoded in hex
  • wtxid hash with bytes reversed, encoded in hex

but maybe something else would be better.

Endianness applied to non-numeric data:

I think this comment is using "big-endian" correctly, though maybe not very clearly. The idea behind base58 encoding is that you take a byte array, interpret the byte array as a big-endian number, and display that number in base-58 (with a special prefix of 1's if the original byte array began with 0's). (The python version of this function is much easier to understand, for reference.)

This seems like a totally correct comment. This function is taking bytes and converting them to an integer. Seems like a straightforward case of using "endianness" to describe byte order of numeric data.

This also seems correct. The siphash class was two Write methods. There is a Write method take bytes as input, and another Write method that takes 64 bit numbers as input. The documentation is just saying if you give that one a number, the little endian representation of the number is what is hashed.

uint256 treated as having specific endianness:

This seems like a borderline case and probably would be good to clarify. It's just describing the way the wtxids are sorted before they are hashed. Instead of the implementation sorting the wtxid so they are in lexigraphic order, it sorts them so that the reversed bytes of the wtxid's would be in lexicographic order. This is equivalent to treating the wtxid bytes as little endian numbers and sorting the numbers.

It also doesn't help that the uint256 getters (and testers like ArrayToString) themselves reverse the byte order before returning it (

bitcoin/src/uint256.cpp

Lines 11 to 18 in 5d28013

std::string base_blob<BITS>::GetHex() const
{
uint8_t m_data_rev[WIDTH];
for (int i = 0; i < WIDTH; ++i) {
m_data_rev[i] = m_data[WIDTH - 1 - i];
}
return HexStr(m_data_rev);
}
)

This isn't great but I think it's baked in and the best thing we can do is document it.

and that reading uint64 chunks even considers the native endianness

constexpr uint64_t GetUint64(int pos) const { return ReadLE64(m_data.data() + pos * 8); }
.

This should be fine, because it is only using native endianness as an optimization, and does the same thing on big and little endian systems. It is just interpreting bytes of the blob as a little endian number, and on little endian systems this can be done efficiently with memcpy.

The problem doesn't seem to be as simple as just changing the terminology, it seems to be deeply ingrained in the codebase. I assume that we can't actually simplify the code in most places (maybe for SipHash-type usages where the result is local and temporary), so we have to either refactor or document, right? How do you suggest we tackle this?

I think we can clean up documentation various places, name functions better, and maybe in the long term move towards sipa' suggestion to replace these classes with using uint256 = std::array<std::byte, 32>.

// Hex string representations are little-endian.
/** @name Hex representation
*
* The reverse-byte hex representation is a convenient way to view the blob
Copy link
Contributor

@ryanofsky ryanofsky Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "doc + test: Correct uint256 hex string endianness" (73e3fa1)

I think this documentation could use an introduction that gives some context. Would change to:

  • The hex representation used by GetHex(), ToString(), FromHex() and ParseHexDeprecated() is unconventional, as it shows bytes in reverse order. For instance, a 32-bit blob {0x12, 0x34, 0x56, 0x78} is represented as "78563412" instead of the more typical "12345678" format used by tools like Unix's xxd or Python's bytes.hex() and bytes.fromhex() functions to represent the same binary data.

    The reverse-byte hex representation of the blob is equivalent to treating the blob as a little-endian number, and then displaying that number in base 16. So it is consistent with the way the base_uint class converts blobs to numbers, and could be a convenient way to view blobs that are numbers.

@hodlinator hodlinator deleted the 2024-07_PR30436_follow_up branch August 5, 2024 06:46
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jan 2, 2025
This is a documentation-only change following up on suggestions made in the
bitcoin#30526 review.

Motivation for this change is that I was recently reviewing bitcoin#31583, which
reminded me how confusing the arithmetic blob code was and made me want to
write better comments.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jan 3, 2025
This is a documentation-only change following up on suggestions made in the
bitcoin#30526 review.

Motivation for this change is that I was recently reviewing bitcoin#31583, which
reminded me how confusing the arithmetic blob code was and made me want to
write better comments.
achow101 added a commit that referenced this pull request Jan 6, 2025
3e0a992 doc: Clarify comments about endianness after #30526 (Ryan Ofsky)

Pull request description:

  This is a documentation-only change following up on suggestions made in the #30526 review.

  Motivation for this change is that I was recently reviewing #31583, which reminded me how confusing the arithmetic blob code was and made me want to write better comments.

ACKs for top commit:
  achow101:
    ACK 3e0a992
  TheCharlatan:
    ACK 3e0a992
  Sjors:
    ACK 3e0a992
  BrandonOdiwuor:
    LGTM ACK 3e0a992

Tree-SHA512: 90d5582a25a51fc406d83ca6b8c4e5e4d3aee828a0497f4b226b2024ff89e29b9b50d0ae8ddeac6abf2757fe78548d58cf3dd54df4b6d623f634a2106048091d
ismaelsadeeq pushed a commit to ismaelsadeeq/bitcoin that referenced this pull request Jan 7, 2025
This is a documentation-only change following up on suggestions made in the
bitcoin#30526 review.

Motivation for this change is that I was recently reviewing bitcoin#31583, which
reminded me how confusing the arithmetic blob code was and made me want to
write better comments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants