Draft: CCoinMap Experiments #32128
Conversation
SipHashUint256Extra is rather slow. For the purpose of generating a hash from a COutPoint and a seed that is only used inside a hashmap, a non-cryptographic hash is sufficient. rapidhash [1] is a well-tested and very fast hash function. This implementation strips down that hash function, originally written for a generic memory buffer, to take the COutPoint plus a 2×64-bit seed as its input. [1] https://github.com/Nicoshev/rapidhash
In the case where the parent cache does not have an entry while the child cache does, this replaces the double hash lookup (find + try_emplace) with a single try_emplace.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks: for details see https://corecheck.dev/bitcoin/bitcoin/pulls/32128.

Reviews: see the guideline for information on the review process.

Conflicts: reviewers, this pull request conflicts with the following ones. If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

🚧 At least one of the CI tasks failed.
Hints: try to run the tests locally, according to the documentation. However, a CI failure may still be unrelated to the change itself. Leave a comment here if you need help tracking down a confusing failure.
std::equal_to<COutPoint>,
PoolAllocator<CoinsCachePair,
              sizeof(CoinsCachePair) + sizeof(void*) * 4>>;
using CCoinsMap = boost::unordered_node_map<COutPoint,
My understanding is that we're trying to get away from external dependencies, so if this does improve performance, we may want to copy it over.
For the record, similar attempts have been made before: #22640
@andrewtoth suggested we try your unordered_dense here as well.
My understanding is that this will likely require retesting our memory-related assumptions so far.
I think the boost maps might be a better choice: I believe they are better tested, and might be faster as well. Also, for boost::unordered_node_map references are stable, which is not the case for unordered_dense, so it is easier to use it as a drop-in replacement without worrying about pointer stability.
CCoinsMap::iterator itUs;
bool isInserted = false;
if (!(it->second.IsFresh() && it->second.coin.IsSpent())) {
    std::tie(itUs, isInserted) = cacheCoins.try_emplace(it->first);
This seems similar to #30326 - I'm surprised this has a measurable effect (and that I haven't found it so far :D)
It seems it has far less of an effect in your test; maybe I should also run the benchmark on my machine until block 890k for comparison.
Or maybe these are only measurable after the other changes are in place - I have compared all commits independently.
@@ -44,5 +44,6 @@ class CSipHasher
 */
uint64_t SipHashUint256(uint64_t k0, uint64_t k1, const uint256& val);
uint64_t SipHashUint256Extra(uint64_t k0, uint64_t k1, const uint256& val, uint32_t extra);
uint64_t ModifiedRapidHash(uint64_t k0, uint64_t k1, const uint256& val, uint32_t extra);
I have also measured this being a bottleneck very often (#30442), and @sipa mentioned originally that we can swap it out with something better later: #8020 (comment)
Do we need to account for collision attacks here, given that inclusion into the map is not free?
I think collision attacks can't be an issue because the hash implementation still uses the random seed, and the hash itself is of good quality.
I'm pretty skeptical about that claim. If the hash isn't of cryptographic quality, I don't know how I'd reason about an attacker's ability to control the buckets, even with a secret random seed.
To be clear, it's entirely reasonable that this may be safe, but I don't know how I'd convince myself that it is definitely safe.
@sipa thanks, you are of course right; I should have read a bit more about that before... Nevertheless I think it's at least interesting to see what difference a fast hash would make.
SipHash is decent (I tried several other candidates and most were slower), but not particularly fast (even a general SHA256-shani is in the same ballpark). RapidHash is much simpler and, on short inputs, dramatically faster:
| ns/byte | byte/s | err% | total | benchmark |
|---:|---:|---:|---:|:---|
| 0.06 | 16,529,344,089.83 | 0.2% | 11.00 | RapidHash_32b |
| 0.96 | 1,043,368,577.00 | 0.1% | 11.01 | Sha256Shani_32b using the 'arm_shani(1way,2way)' SHA256 implementation |
| 0.39 | 2,547,686,515.00 | 0.2% | 10.98 | SipHash_32b |
Benchmarks
diff --git a/src/bench/crypto_hash.cpp b/src/bench/crypto_hash.cpp
index 2f1ff56438..cab34169a5 100644
--- a/src/bench/crypto_hash.cpp
+++ b/src/bench/crypto_hash.cpp
@@ -195,11 +195,108 @@ static void SipHash_32b(benchmark::Bench& bench)
FastRandomContext rng{/*fDeterministic=*/true};
auto k0{rng.rand64()}, k1{rng.rand64()};
auto val{rng.rand256()};
+ auto extra{rng.rand32()};
auto i{0U};
- bench.run([&] {
- ankerl::nanobench::doNotOptimizeAway(SipHashUint256(k0, k1, val));
+    bench.batch(uint256::size() + sizeof(uint32_t)).unit("byte").run([&] {
+ ankerl::nanobench::doNotOptimizeAway(SipHashUint256Extra(k0, k1, val, extra));
++k0;
++k1;
+ ++extra;
+ ++i;
+ val.data()[i % uint256::size()] ^= i & 0xFF;
+ });
+}
+
+static void Sha256Shani_32b(benchmark::Bench& bench)
+{
+ FastRandomContext rng{/*fDeterministic=*/true};
+ auto k0{rng.rand64()}, k1{rng.rand64()};
+ auto val{rng.rand256()};
+ auto extra{rng.rand32()};
+ auto i{0U};
+ uint8_t hash[CSHA256::OUTPUT_SIZE];
+
+ bench.name(strprintf("%s using the '%s' SHA256 implementation", __func__, SHA256AutoDetect(sha256_implementation::USE_SSE4_AND_SHANI)));
+    bench.batch(uint256::size() + sizeof(uint32_t)).unit("byte").run([&] {
+        CSHA256()
+            .Write(reinterpret_cast<const uint8_t*>(&k0), sizeof(k0))
+            .Write(reinterpret_cast<const uint8_t*>(&k1), sizeof(k1))
+            .Write(val.data(), val.size())
+            .Write(reinterpret_cast<const uint8_t*>(&extra), sizeof(extra))
+            .Finalize(hash);
+
+        ankerl::nanobench::doNotOptimizeAway(hash[0]);
+ ++k0;
+ ++k1;
+ ++extra;
+ ++i;
+ val.data()[i % uint256::size()] ^= i & 0xFF;
+ });
+ SHA256AutoDetect();
+}
+
+uint64_t ModifiedRapidHash(uint64_t k0, uint64_t k1, const uint256& val, uint32_t extra)
+{
+ auto const rapid_mum = [](uint64_t* a, uint64_t* b) {
+#if defined(__SIZEOF_INT128__)
+ __uint128_t r = *a;
+ r *= *b;
+ *a = (uint64_t)r;
+ *b = (uint64_t)(r >> 64);
+#elif defined(_MSC_VER) && (defined(_WIN64) || defined(_M_HYBRID_CHPE_ARM64))
+#if defined(_M_X64)
+ *a = _umul128(*a, *b, b);
+#else
+ uint64_t c = __umulh(*a, *b);
+ *a = *a * *b;
+ *b = c;
+#endif
+#else
+ uint64_t ha = *a >> 32, hb = *b >> 32, la = (uint32_t)*a, lb = (uint32_t)*b, hi, lo;
+ uint64_t rh = ha * hb, rm0 = ha * lb, rm1 = hb * la, rl = la * lb, t = rl + (rm0 << 32), c = t < rl;
+ lo = t + (rm1 << 32);
+ c += lo < t;
+ hi = rh + (rm0 >> 32) + (rm1 >> 32) + c;
+ *a = lo;
+ *b = hi;
+#endif
+ };
+
+ auto const rapid_mix = [&rapid_mum](uint64_t a, uint64_t b) -> uint64_t {
+ rapid_mum(&a, &b);
+ return a ^ b;
+ };
+
+    // This effectively behaves like rapidhash with the input:
+    //   seed: k0, data: [val, k1, extra]
+    // So it hashes 32+8+4 = 44 bytes, plus uses the 8-byte k0 as the seed.
+
+ // Default secret parameters.
+ static constexpr uint64_t secret[3] = {0x2d358dccaa6c78a5ull, 0x8bb84b93962eacc9ull, 0x4b33a62ed433d4a3ull};
+
+ // no need to mix the seed itself, as it is purely random.
+ uint64_t seed = k0;
+ seed = rapid_mix(val.GetUint64(0) ^ secret[2], val.GetUint64(1) ^ seed ^ secret[1]);
+ seed = rapid_mix(val.GetUint64(2) ^ secret[2], val.GetUint64(3) ^ seed);
+ uint64_t a = k1 ^ secret[1];
+ uint64_t b = extra ^ seed;
+ rapid_mum(&a, &b);
+ return rapid_mix(a ^ secret[0], b ^ secret[1]);
+}
+
+static void RapidHash_32b(benchmark::Bench& bench)
+{
+ FastRandomContext rng{/*fDeterministic=*/true};
+ auto k0{rng.rand64()}, k1{rng.rand64()};
+ auto val{rng.rand256()};
+ auto extra{rng.rand32()};
+ auto i{0U};
+    bench.batch(uint256::size() + sizeof(uint32_t)).unit("byte").run([&] {
+ ankerl::nanobench::doNotOptimizeAway(ModifiedRapidHash(k0, k1, val, extra));
+ ++k0;
+ ++k1;
+ ++extra;
++i;
val.data()[i % uint256::size()] ^= i & 0xFF;
});
@@ -276,6 +373,8 @@ BENCHMARK(SHA256_32b_SSE4, benchmark::PriorityLevel::HIGH);
BENCHMARK(SHA256_32b_AVX2, benchmark::PriorityLevel::HIGH);
BENCHMARK(SHA256_32b_SHANI, benchmark::PriorityLevel::HIGH);
BENCHMARK(SipHash_32b, benchmark::PriorityLevel::HIGH);
+BENCHMARK(Sha256Shani_32b, benchmark::PriorityLevel::HIGH);
+BENCHMARK(RapidHash_32b, benchmark::PriorityLevel::HIGH);
BENCHMARK(SHA256D64_1024_STANDARD, benchmark::PriorityLevel::HIGH);
BENCHMARK(SHA256D64_1024_SSE4, benchmark::PriorityLevel::HIGH);
BENCHMARK(SHA256D64_1024_AVX2, benchmark::PriorityLevel::HIGH);
@@ -286,3 +385,4 @@ BENCHMARK(MuHashMul, benchmark::PriorityLevel::HIGH);
BENCHMARK(MuHashDiv, benchmark::PriorityLevel::HIGH);
BENCHMARK(MuHashPrecompute, benchmark::PriorityLevel::HIGH);
BENCHMARK(MuHashFinalize, benchmark::PriorityLevel::HIGH);
+
RapidHash v3 was released only two days ago, so numbers may improve further.
> If the hash isn't of cryptographic quality, I don't know how I'd reason about an attacker's ability to control the buckets

How do we reason about the same currently, given that some sources describe SipHash as not being of cryptographic quality either:

> Although the SipHash algorithm is considered to be generally strong, it is not intended for cryptographic purposes. As such, all cryptographic uses of this implementation are strongly discouraged.
The RapidHash benchmarks and studies also indicate that it has an ideal avalanche score - is there any other test we could do that would increase our confidence? Would a more seasoned hash be more welcome, such as HighwayHash, or others with no known attacks and a SIMD-enabled design, e.g. the ones with "Good" security from https://www.haikel-fazzani.eu.org/blog/post/siphash-alternatives-hash-algorithms?
I would like to understand the exact threat model here, is my understanding correct that:
- we fear collisions because of linear bucket iteration for identical hashes, in case of a brute-force attack, right? If this is just a worst-case scenario (not a likely outcome), we could probably choose a map which stores colliding entries in a balanced tree instead of a linked list;
- we expect natural collisions to be rare, and deliberate ones to require significant effort;
- as long as the attackers cannot pre-compute exactly which coins will end up in the exact same bucket, we would be fine;
- we don't store unconfirmed transactions in the coins cache, only outputs that are already mined;
- the inputs are really difficult to tweak, given that one of them is already a dSHA256, not a value the attacker can cheaply set to any custom value;
- colliding entries will likely depend on the coins-cache size - randomizing the default slightly might help;
- most likely (worst-case?) outcome is a local slowdown;
- once an attack is detected, we will likely have time to mitigate it in the next version.
But zooming out, isn't the main problem here that the dbcache is resized too often, triggering many unnecessary rehashes?
Wouldn't SipHash suffice here if we just pre-sized the coins cache during IBD more precisely and only recreated and shrank it after we've reached the tip (to reclaim memory)?
Also, the SaltedOutpointHasher noexcept change seems to have caused many of these rehashes; I have tried reverting it, and all my measurements show that we'd be faster without that change. See my preliminary measurements: bitcoin-dev-tools#163
Perhaps I shouldn't have used the term "cryptographic"; I see it's referred to as "secure" instead of "cryptographic" in some places.
There are two properties that a (keyed) hash function needs to be cryptographic:
- Be indistinguishable from a random function.
- Have a large enough output to resist practical brute-force preimage or collision attacks.

RapidHash has neither. SipHash has the first (it was designed by cryptographers, and investigated cryptanalytically). HMAC-SHA256 has both.
However, for the purpose of hash table indexing, the second property is irrelevant: the result is taken modulo the hash table size (or otherwise mapped into the range of the table) anyway, so a large output is a lost cause regardless. Moreover, the attacker does not get to observe the key or the hash output, which makes brute-force attacks much less of a concern.
So in a way, I think of SipHash as a hash function that has all properties of a cryptographic one, except for the fact that it's restricted to a 64-bit output.
I have measured these commits separately against a baseline (undoing each commit, to measure them independently).
Reindex-chainstate until block 888888:
1d281daf86 (HEAD, origin/master, origin/HEAD) Merge bitcoin/bitcoin#32095: doc: clarify that testnet min-difficulty is not optional
1112433318 SaltedOutpointHasher uses rapidhash
e00d828362 CCoinsViewCache::BatchWrite lookup optimization
1630e368d1 (l0rinc/detached186) Use boost::unordered_node_map for CCoinsMap
Benchmark 1: COMPILER=gcc COMMIT=1d281daf8613a3cb7a26759c351ffb34dab0f656 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
Time (abs ≡): 25746.852 s [User: 24476.673 s, System: 1096.263 s]
Benchmark 2: COMPILER=gcc COMMIT=1112433318cbb096fec0cc83d2c424db652c126f ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
Time (abs ≡): 24511.624 s [User: 23046.157 s, System: 1074.909 s]
Benchmark 3: COMPILER=gcc COMMIT=e00d8283624f3d937f1222cb2c1540655045b095 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
Time (abs ≡): 25616.379 s [User: 24209.459 s, System: 1088.005 s]
Benchmark 4: COMPILER=gcc COMMIT=1630e368d190a75870399cc38af3c6023faaf2d9 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
Time (abs ≡): 23966.508 s [User: 22294.996 s, System: 1082.040 s]
Summary
COMPILER=gcc COMMIT=1630e368d190a75870399cc38af3c6023faaf2d9 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0 ran
1.02 times faster than COMPILER=gcc COMMIT=1112433318cbb096fec0cc83d2c424db652c126f ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
1.07 times faster than COMPILER=gcc COMMIT=e00d8283624f3d937f1222cb2c1540655045b095 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
1.07 times faster than COMPILER=gcc COMMIT=1d281daf8613a3cb7a26759c351ffb34dab0f656 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=5000 -blocksonly -reindex-chainstate -connect=0 -printtoconsole=0
boost::unordered_node_map is a highly optimized hash map, available since Boost 1.82, that is API-compatible with std::unordered_map for our use case. It can also use the existing PoolAllocator.
🐙 This pull request conflicts with the target branch and needs rebase.
⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?
This is a draft PR to show various possible optimizations for CCoinsMap. In my benchmark, all these changes lead to a statistically significant speed improvement for -reindex-chainstate.