Conversation

martinus
Contributor

This is a draft PR to show various possible optimizations for CCoinsMap. In my benchmark, all these changes lead to a statistically significant speed improvement for -reindex-chainstate.

Benchmark 1: ./build/bin/bitcoind -stopatheight=600000 -dbcache=5000 -printtoconsole=0 -reindex-chainstate -noconnect -connect=192.168.8.118 (COMMIT = ac188b573c8)
  Time (mean ± σ):     2089.253 s ± 23.737 s    [User: 2110.111 s, System: 299.197 s]
  Range (min … max):   2062.751 s … 2108.561 s    3 runs

Benchmark 2: ./build/bin/bitcoind -stopatheight=600000 -dbcache=5000 -printtoconsole=0 -reindex-chainstate -noconnect -connect=192.168.8.118 (COMMIT = 7113095b3cd)
  Time (mean ± σ):     2431.028 s ± 17.544 s    [User: 2439.746 s, System: 284.994 s]
  Range (min … max):   2415.408 s … 2450.009 s    3 runs

Benchmark 3: ./build/bin/bitcoind -stopatheight=600000 -dbcache=5000 -printtoconsole=0 -reindex-chainstate -noconnect -connect=192.168.8.118 (COMMIT = 7370e13d93d)
  Time (mean ± σ):     2530.955 s ± 21.575 s    [User: 2539.882 s, System: 285.890 s]
  Range (min … max):   2512.525 s … 2554.687 s    3 runs

Benchmark 4: ./build/bin/bitcoind -stopatheight=600000 -dbcache=5000 -printtoconsole=0 -reindex-chainstate -noconnect -connect=192.168.8.118 (COMMIT = e568c1dd134)
  Time (mean ± σ):     2839.864 s ±  9.993 s    [User: 2850.112 s, System: 287.916 s]
  Range (min … max):   2830.073 s … 2850.047 s    3 runs

Summary
  ./build/bin/bitcoind -stopatheight=600000 -dbcache=5000 -printtoconsole=0 -reindex-chainstate -noconnect -connect=192.168.8.118 (COMMIT = ac188b573c8) ran
    1.16 ± 0.02 times faster than ./build/bin/bitcoind -stopatheight=600000 -dbcache=5000 -printtoconsole=0 -reindex-chainstate -noconnect -connect=192.168.8.118 (COMMIT = 7113095b3cd)
    1.21 ± 0.02 times faster than ./build/bin/bitcoind -stopatheight=600000 -dbcache=5000 -printtoconsole=0 -reindex-chainstate -noconnect -connect=192.168.8.118 (COMMIT = 7370e13d93d)
    1.36 ± 0.02 times faster than ./build/bin/bitcoind -stopatheight=600000 -dbcache=5000 -printtoconsole=0 -reindex-chainstate -noconnect -connect=192.168.8.118 (COMMIT = e568c1dd134)

SipHashUint256Extra is rather slow. For the purpose of generating a hash
from a COutPoint and some seed that is only used inside a hashmap, it is
sufficient to use a non-cryptographic hash. rapidhash [1] is a well tested
and very fast hash function. This implementation strips down this hash
function, originally implemented for a memory buffer, to be used with
the COutPoint + 2*64bit seed as the input.

[1] https://github.com/Nicoshev/rapidhash
In the case when the parent cache does not have an entry but the child cache does,
this replaces the double hash lookup (find + try_emplace) with a single
try_emplace.
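The pattern can be sketched with `std::unordered_map` (a hypothetical simplified cache, not the actual `BatchWrite` code; the names `GetOrCreateTwoLookups`/`GetOrCreateOneLookup` are illustrative): a single `try_emplace` both locates an existing entry and default-constructs a new one on a miss, in one hash lookup instead of two.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical simplified cache; int keys stand in for COutPoint.
using Cache = std::unordered_map<int, std::string>;

// Before: two hash lookups on a miss (find, then try_emplace).
std::string& GetOrCreateTwoLookups(Cache& cache, int key)
{
    auto it = cache.find(key);                   // first hash lookup
    if (it != cache.end()) return it->second;
    return cache.try_emplace(key).first->second; // second hash lookup
}

// After: one try_emplace locates or default-constructs in a single lookup.
// The bool reports whether the entry was newly inserted.
std::pair<std::string*, bool> GetOrCreateOneLookup(Cache& cache, int key)
{
    auto [it, inserted] = cache.try_emplace(key); // one hash lookup
    return {&it->second, inserted};
}
```

Both functions have the same observable behavior; the second simply avoids hashing the key twice on the miss path.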
@DrahtBot
Contributor

DrahtBot commented Mar 24, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32128.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32313 (coins: fix cachedCoinsUsage accounting in CCoinsViewCache by l0rinc)
  • #32043 ([IBD] Tracking PR for speeding up Initial Block Download by l0rinc)
  • #31895 (doc: Improve dependencies.md by NicolaLS)
  • #30442 ([IBD] precalculate SipHash constant salt calculations by l0rinc)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/39270918766

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

Contributor

@l0rinc l0rinc left a comment


Welcome back @martinus, we missed you! :)
I will measure these changes separately until 890k blocks soon.
I have included this change in a tracking PR where we have other similar experiments: #32043

Edit: is the name of the branch an easter egg for what to expect next?

std::equal_to<COutPoint>,
PoolAllocator<CoinsCachePair,
sizeof(CoinsCachePair) + sizeof(void*) * 4>>;
using CCoinsMap = boost::unordered_node_map<COutPoint,
Contributor


My understanding is that we're trying to get away from external dependencies, so if this does improve performance, we may want to copy it over.

For the record, similar attempts have been made before: #22640

@andrewtoth suggested we try your unordered_dense here as well.

My understanding is that this will likely require retesting our memory-related assumptions so far.

Contributor Author


I think boost maps might be a better choice: I believe they are better tested, and they might be faster as well. Also, boost::unordered_node_map has stable references, which is not the case for unordered_dense, so it is easier to use as a drop-in replacement without worrying about pointer stability.
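The pointer-stability point can be demonstrated with `std::unordered_map`, which offers the same node-based guarantee as `boost::unordered_node_map` (a sketch of the property, not Bitcoin Core code): references to mapped values survive rehashing, which flat/dense maps such as `ankerl::unordered_dense::map` do not guarantee.

```cpp
#include <unordered_map>

// Node-based maps (std::unordered_map, boost::unordered_node_map) keep
// references to elements valid across rehashes; only iterators are
// invalidated. Flat maps move elements on rehash, so a held pointer
// would dangle there.
inline bool ReferencesSurviveRehash()
{
    std::unordered_map<int, int> m;
    m[42] = 7;
    int* p = &m[42];                               // hold a reference
    for (int i = 1000; i < 11000; ++i) m[i] = i;   // force several rehashes
    return p == &m[42] && *p == 7;                 // same node, same value
}
```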

CCoinsMap::iterator itUs;
bool isInserted = false;
if (!(it->second.IsFresh() && it->second.coin.IsSpent())) {
    std::tie(itUs, isInserted) = cacheCoins.try_emplace(it->first);
Contributor


This seems similar to #30326 - I'm surprised this has a measurable effect (and that I haven't found it so far :D)

Contributor Author


It seems it has far less of an effect in your test; maybe I should also run the benchmark on my machine until block 890k for comparison.

Contributor


Or maybe these are only measurable after the other changes are in place - I have compared all commits independently.

@@ -44,5 +44,6 @@ class CSipHasher
*/
uint64_t SipHashUint256(uint64_t k0, uint64_t k1, const uint256& val);
uint64_t SipHashUint256Extra(uint64_t k0, uint64_t k1, const uint256& val, uint32_t extra);
uint64_t ModifiedRapidHash(uint64_t k0, uint64_t k1, const uint256& val, uint32_t extra);
Contributor

@l0rinc l0rinc Mar 24, 2025


I have also measured this being a bottleneck very often (#30442), and @sipa mentioned originally that we can swap it out with something better later: #8020 (comment)

Do we need to account for collision attacks here, given that inclusion into the map is not for free?

Contributor Author


I think collision attacks can't be an issue because the hash implementation still uses the random seed, and the hash itself is of good quality.

Member

@sipa sipa Mar 27, 2025


I'm pretty skeptical about that claim. If the hash isn't of cryptographic quality, I don't know how I'd reason about an attacker's ability to control the buckets, even with a secret random seed.

To be clear, it's entirely reasonable that this may be safe, but I don't know how I'd convince myself that it is definitely safe.

Contributor Author


@sipa thanks, you are of course right; I should have read a bit more about that before... Nevertheless I think it's at least interesting to see what difference a fast hash would make.

Contributor


SipHash is decent (I tried several other candidates and most were slower), but not particularly fast (even a general SHA256-shani is in the same ball-park).
RapidHash is much simpler and, on short inputs, dramatically faster:

ns/byte  byte/s              err%  total  benchmark
0.06     16,529,344,089.83   0.2%  11.00  RapidHash_32b
0.96      1,043,368,577.00   0.1%  11.01  Sha256Shani_32b using the 'arm_shani(1way,2way)' SHA256 implementation
0.39      2,547,686,515.00   0.2%  10.98  SipHash_32b
Benchmarks
diff --git a/src/bench/crypto_hash.cpp b/src/bench/crypto_hash.cpp
index 2f1ff56438..cab34169a5 100644
--- a/src/bench/crypto_hash.cpp
+++ b/src/bench/crypto_hash.cpp
@@ -195,11 +195,108 @@ static void SipHash_32b(benchmark::Bench& bench)
     FastRandomContext rng{/*fDeterministic=*/true};
     auto k0{rng.rand64()}, k1{rng.rand64()};
     auto val{rng.rand256()};
+    auto extra{rng.rand32()};
     auto i{0U};
-    bench.run([&] {
-        ankerl::nanobench::doNotOptimizeAway(SipHashUint256(k0, k1, val));
+    bench.batch(uint256::size() + sizeof(uint32_t)).unit("byte").run([&] {
+        ankerl::nanobench::doNotOptimizeAway(SipHashUint256Extra(k0, k1, val, extra));
         ++k0;
         ++k1;
+        ++extra;
+        ++i;
+        val.data()[i % uint256::size()] ^= i & 0xFF;
+    });
+}
+
+static void Sha256Shani_32b(benchmark::Bench& bench)
+{
+    FastRandomContext rng{/*fDeterministic=*/true};
+    auto k0{rng.rand64()}, k1{rng.rand64()};
+    auto val{rng.rand256()};
+    auto extra{rng.rand32()};
+    auto i{0U};
+    uint8_t hash[CSHA256::OUTPUT_SIZE];
+
+    bench.name(strprintf("%s using the '%s' SHA256 implementation", __func__, SHA256AutoDetect(sha256_implementation::USE_SSE4_AND_SHANI)));
+    bench.batch(uint256::size() + sizeof(uint32_t)).unit("byte").run([&] {
+        CSHA256()
+            .Write(reinterpret_cast<const uint8_t*>(&k0), sizeof(k0))
+            .Write(reinterpret_cast<const uint8_t*>(&k1), sizeof(k1))
+            .Write(val.data(), val.size())
+            .Write(reinterpret_cast<const uint8_t*>(&extra), sizeof(extra))
+            .Finalize(hash);
+
+        ankerl::nanobench::doNotOptimizeAway(hash[0]);
+        ++k0;
+        ++k1;
+        ++extra;
+        ++i;
+        val.data()[i % uint256::size()] ^= i & 0xFF;
+    });
+    SHA256AutoDetect();
+}
+
+uint64_t ModifiedRapidHash(uint64_t k0, uint64_t k1, const uint256& val, uint32_t extra)
+{
+    auto const rapid_mum = [](uint64_t* a, uint64_t* b) {
+#if defined(__SIZEOF_INT128__)
+        __uint128_t r = *a;
+        r *= *b;
+        *a = (uint64_t)r;
+        *b = (uint64_t)(r >> 64);
+#elif defined(_MSC_VER) && (defined(_WIN64) || defined(_M_HYBRID_CHPE_ARM64))
+#if defined(_M_X64)
+        *a = _umul128(*a, *b, b);
+#else
+        uint64_t c = __umulh(*a, *b);
+        *a = *a * *b;
+        *b = c;
+#endif
+#else
+        uint64_t ha = *a >> 32, hb = *b >> 32, la = (uint32_t)*a, lb = (uint32_t)*b, hi, lo;
+        uint64_t rh = ha * hb, rm0 = ha * lb, rm1 = hb * la, rl = la * lb, t = rl + (rm0 << 32), c = t < rl;
+        lo = t + (rm1 << 32);
+        c += lo < t;
+        hi = rh + (rm0 >> 32) + (rm1 >> 32) + c;
+        *a = lo;
+        *b = hi;
+#endif
+    };
+
+    auto const rapid_mix = [&rapid_mum](uint64_t a, uint64_t b) -> uint64_t {
+        rapid_mum(&a, &b);
+        return a ^ b;
+    };
+
+    // This effectively behaves like rapidhash with the following input:
+    // seed: k0, data: [val, k1, extra]
+    // So it hashes 32+8+4 = 44 bytes, plus uses the 8-byte k0 as the seed.
+
+    // Default secret parameters.
+    static constexpr uint64_t secret[3] = {0x2d358dccaa6c78a5ull, 0x8bb84b93962eacc9ull, 0x4b33a62ed433d4a3ull};
+
+    // no need to mix the seed itself, as it is purely random.
+    uint64_t seed = k0;
+    seed = rapid_mix(val.GetUint64(0) ^ secret[2], val.GetUint64(1) ^ seed ^ secret[1]);
+    seed = rapid_mix(val.GetUint64(2) ^ secret[2], val.GetUint64(3) ^ seed);
+    uint64_t a = k1 ^ secret[1];
+    uint64_t b = extra ^ seed;
+    rapid_mum(&a, &b);
+    return rapid_mix(a ^ secret[0], b ^ secret[1]);
+}
+
+static void RapidHash_32b(benchmark::Bench& bench)
+{
+    FastRandomContext rng{/*fDeterministic=*/true};
+    auto k0{rng.rand64()}, k1{rng.rand64()};
+    auto val{rng.rand256()};
+    auto extra{rng.rand32()};
+    auto i{0U};
+    bench.batch(uint256::size() + sizeof(uint32_t)).unit("byte").run([&] {
+        ankerl::nanobench::doNotOptimizeAway(ModifiedRapidHash(k0, k1, val, extra));
+        ++k0;
+        ++k1;
+        ++extra;
         ++i;
         val.data()[i % uint256::size()] ^= i & 0xFF;
     });
@@ -276,6 +373,8 @@ BENCHMARK(SHA256_32b_SSE4, benchmark::PriorityLevel::HIGH);
 BENCHMARK(SHA256_32b_AVX2, benchmark::PriorityLevel::HIGH);
 BENCHMARK(SHA256_32b_SHANI, benchmark::PriorityLevel::HIGH);
 BENCHMARK(SipHash_32b, benchmark::PriorityLevel::HIGH);
+BENCHMARK(Sha256Shani_32b, benchmark::PriorityLevel::HIGH);
+BENCHMARK(RapidHash_32b, benchmark::PriorityLevel::HIGH);
 BENCHMARK(SHA256D64_1024_STANDARD, benchmark::PriorityLevel::HIGH);
 BENCHMARK(SHA256D64_1024_SSE4, benchmark::PriorityLevel::HIGH);
 BENCHMARK(SHA256D64_1024_AVX2, benchmark::PriorityLevel::HIGH);
@@ -286,3 +385,4 @@ BENCHMARK(MuHashMul, benchmark::PriorityLevel::HIGH);
 BENCHMARK(MuHashDiv, benchmark::PriorityLevel::HIGH);
 BENCHMARK(MuHashPrecompute, benchmark::PriorityLevel::HIGH);
 BENCHMARK(MuHashFinalize, benchmark::PriorityLevel::HIGH);
+

RapidHash v3 was released only two days ago, so numbers may improve further.

If the hash isn't of cryptographic quality, I don't know how I'd reason about an attacker's ability to control the buckets

How do we reason about this currently, given that some sources state SipHash is not of cryptographic quality either:

Although the SipHash algorithm is considered to be generally strong, it is not intended for cryptographic purposes. As such, all cryptographic uses of this implementation are strongly discouraged.

The RapidHash benchmarks and studies also indicate that it has an ideal avalanche score - is there any other test we could do that would increase our confidence? Would a more seasoned hash be more welcome, such as HighwayHash or other ones with no known attacks and SIMD-enabled design, e.g. the ones from https://www.haikel-fazzani.eu.org/blog/post/siphash-alternatives-hash-algorithms with "Good" security?

I would like to understand the exact threat model here, is my understanding correct that:

  • we fear collisions because of linear bucket iteration for identical hashes, in case of a brute-force attack, right? If this is just a worst-case scenario (not a likely outcome), we could probably choose a map which stores colliding entries in a balanced tree instead of a linked list;
  • we expect natural collisions to be rare, and deliberate ones to require significant effort;
  • as long as the attackers cannot pre-compute exactly which coins will end up in the exact same bucket, we would be fine;
  • we don't store unconfirmed transactions in the coins cache, only outputs that are already mined;
  • the inputs are really difficult to tweak, given that one of them is already a dSHA256, not a value the attacker can cheaply set to any custom value;
  • colliding entries will likely depend on the coins-cache size - randomizing the default slightly might help;
  • most likely (worst-case?) outcome is a local slowdown;
  • once an attack is detected, we will likely have time to mitigate it in the next version.

But zooming out, isn't the main problem here that the dbcache is resized too often, triggering many unnecessary rehashes?
Wouldn't SipHash suffice here if we just pre-sized the coins cache during IBD more precisely and only recreated and shrank it after we've reached the tip (to reclaim memory)?
Also, the SaltedOutpointHasher noexcept change seems to have caused many of these rehashes; I have tried reverting it, and all my measurements show that we'd be faster without that change - see my preliminary measurements: bitcoin-dev-tools#163
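The pre-sizing idea can be illustrated with `std::unordered_map::reserve` (a sketch of the concept only, not the actual dbcache logic): reserving capacity up front guarantees that inserting up to that many elements never triggers a rehash, so the keys already stored are never re-hashed.

```cpp
#include <cstddef>
#include <unordered_map>

// Sketch: pre-size a hash map so that inserting up to `expected`
// elements never grows the bucket array (and thus never re-hashes
// existing keys). Not the actual coins-cache code.
inline bool NoRehashWhenPreSized(std::size_t expected)
{
    std::unordered_map<int, int> m;
    m.reserve(expected);                        // pre-size for `expected` entries
    const auto buckets_before = m.bucket_count();
    for (std::size_t i = 0; i < expected; ++i) {
        m.emplace(static_cast<int>(i), 0);
    }
    return m.bucket_count() == buckets_before;  // bucket array never grew
}
```

The standard guarantees rehashing only occurs when the load factor would exceed `max_load_factor()`, which `reserve(n)` rules out for up to `n` insertions.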

Member

@sipa sipa May 29, 2025


Perhaps I shouldn't have used the term "cryptographic"; I see it's referred to as "secure" instead of "cryptographic" in some places.

There are two properties that a (keyed) hash function needs to be cryptographic:

  • Indistinguishable from a random function
  • Have a large enough output to resist practical brute-force preimage or collision attacks.

RapidHash has neither. SipHash has the first (it was designed by cryptographers, and investigated cryptanalytically). HMAC-SHA256 has both.

However, for the purpose of hash table indexing, the second property is irrelevant: the result is taken modulo the hash table size (or otherwise mapped into the range of the table size), so a large output is a lost cause anyway. On the other hand, the attacker does not get to observe the key or the hash output, which makes this much less of a concern.

So in a way, I think of SipHash as a hash function that has all properties of a cryptographic one, except for the fact that it's restricted to a 64-bit output.
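The point about output size can be made concrete with a sketch (hypothetical helper, not Bitcoin Core code): a table with N buckets only ever consumes about log2(N) bits of the hash, so two 64-bit hashes that agree in their low bits collide even if the full outputs differ.

```cpp
#include <cstdint>

// Sketch: whatever the hash width, mapping into N buckets discards
// all but ~log2(N) bits of the output. Hashes that agree modulo N
// land in the same bucket regardless of the remaining bits.
inline std::uint64_t BucketIndex(std::uint64_t hash, std::uint64_t num_buckets)
{
    return hash % num_buckets; // remaining output bits are discarded
}
```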

@martinus
Contributor Author

Welcome back @martinus, we missed you! :) I will measure these changes separately until 890k blocks soon. I have included this change in a tracking PR where we have other similar experiments: #32043

I have not done any programming in half a year, looking forward to getting back :)

@l0rinc
Contributor

l0rinc commented Mar 26, 2025

I have measured these commits separately against a baseline (undoing each commit, to measure them independently).

  • 1d281da Merge bitcoin/bitcoin#32095: doc: clarify that testnet min-difficulty is the baseline
  • 1112433 (SaltedOutpointHasher uses rapidhash) is 5% faster than 1d281da
  • e00d828 (CCoinsViewCache::BatchWrite lookup optimization) is 0.5% faster than 1d281da
  • 1630e36 (Use boost::unordered_node_map for CCoinsMap) is 7.4% faster than 1d281da
Reindex-chainstate until block 888888
1d281daf86 (HEAD, origin/master, origin/HEAD) Merge bitcoin/bitcoin#32095: doc: clarify that testnet min-difficulty is not optional
1112433318 SaltedOutpointHasher uses rapidhash
e00d828362 CCoinsViewCache::BatchWrite lookup optimization
1630e368d1 (l0rinc/detached186) Use boost::unordered_node_map for CCoinsMap
Benchmark 1: COMPILER=gcc COMMIT=1d281daf8613a3cb7a26759c351ffb34dab0f656 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
  Time (abs ≡):        25746.852 s               [User: 24476.673 s, System: 1096.263 s]
 
Benchmark 2: COMPILER=gcc COMMIT=1112433318cbb096fec0cc83d2c424db652c126f ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
  Time (abs ≡):        24511.624 s               [User: 23046.157 s, System: 1074.909 s]
 
Benchmark 3: COMPILER=gcc COMMIT=e00d8283624f3d937f1222cb2c1540655045b095 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
  Time (abs ≡):        25616.379 s               [User: 24209.459 s, System: 1088.005 s]
 
Benchmark 4: COMPILER=gcc COMMIT=1630e368d190a75870399cc38af3c6023faaf2d9 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
  Time (abs ≡):        23966.508 s               [User: 22294.996 s, System: 1082.040 s]
 
Summary
  COMPILER=gcc COMMIT=1630e368d190a75870399cc38af3c6023faaf2d9 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0 ran
    1.02 times faster than COMPILER=gcc COMMIT=1112433318cbb096fec0cc83d2c424db652c126f ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
    1.07 times faster than COMPILER=gcc COMMIT=e00d8283624f3d937f1222cb2c1540655045b095 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
    1.07 times faster than COMPILER=gcc COMMIT=1d281daf8613a3cb7a26759c351ffb34dab0f656 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0

boost::unordered_node_map is a highly optimized hashmap, available since
boost 1.82, that is API-compatible with std::unordered_map for our use case.
It can also use the existing PoolAllocator.
@DrahtBot
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

@DrahtBot
Contributor

⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?

  • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
  • Is it no longer relevant? ➡️ Please close.
  • Did the author lose interest or time to work on this? ➡️ Please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.
