Improve speed, memory efficiency with alternate hashmap (#2) #22640
Conversation
I'm reopening this because I think it's worth some long-term consideration. We're not going to get many opportunities to improve block connection by ~12%, so I think this is worth continuing to look at.

Now that Bitcoin Core is on C++17, I wonder if there aren't some good ways to slim down @martinus's implementation, since it contains a number of provisions to deal with pre-C++17 platforms. At the very least we could work on fuzz testing this to our satisfaction (see the sketch of a differential fuzz harness below the benchmark summary).

I've rerun some benchmarks on an idle workstation and have confirmed that I continue to see ~13% speedups along with a reduction in memory use, which is especially pronounced when using a large dbcache.

Issues to consider
Benchmarks

dbcache=10000

- jamesob/2019-08-robinhood vs. $mergebase (absolute)
- jamesob/2019-08-robinhood vs. $mergebase (relative)

dbcache=300

- 2019-08-robinhood vs. master (absolute)
- 2019-08-robinhood vs. master (relative)
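On the fuzz-testing point above, here is a minimal sketch of what a differential harness could look like: feed the same operation stream to a reference `std::unordered_map` and to the candidate map, and assert they never disagree. To keep the snippet self-contained, the candidate is stubbed with a second `std::unordered_map` and keys are plain `uint64_t` rather than `COutPoint`; a real harness would use `robin_hood::unordered_node_map` and Bitcoin Core's fuzz framework rather than a bare libFuzzer entry point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>

// Stub types so this sketch compiles standalone; the candidate would really be
// robin_hood::unordered_node_map keyed by COutPoint.
using Reference = std::unordered_map<uint64_t, uint64_t>;
using Candidate = std::unordered_map<uint64_t, uint64_t>;

namespace {
void CheckEqual(const Reference& a, const Candidate& b)
{
    assert(a.size() == b.size());
    for (const auto& [key, value] : a) {
        const auto it = b.find(key);
        assert(it != b.end() && it->second == value);
    }
}
} // namespace

// Plain libFuzzer entry point: interpret the input as a stream of (op, key) records
// and apply each operation to both maps, checking that they never diverge.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size)
{
    Reference reference;
    Candidate candidate;
    while (size >= 9) {
        const uint8_t op = data[0];
        uint64_t key;
        std::memcpy(&key, data + 1, sizeof(key));
        data += 9;
        size -= 9;
        switch (op % 3) {
        case 0:
            reference[key] = key;
            candidate[key] = key;
            break;
        case 1: {
            const size_t erased_ref = reference.erase(key);
            const size_t erased_cand = candidate.erase(key);
            assert(erased_ref == erased_cand);
            break;
        }
        case 2:
            assert(reference.count(key) == candidate.count(key));
            break;
        }
    }
    CheckEqual(reference, candidate);
    return 0;
}
```

Built with something like `clang++ -std=c++17 -fsanitize=fuzzer,address`, libFuzzer synthesizes the operation stream itself.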
Cool. I plan on benchmarking this too (at least IBD speed) via the
It's unfortunately really hard to fix martinus/robin-hood-hashing#117 without losing at least some of the performance benefit. Also, I don't really have the time to do it properly... So I'm personally quite a bit wary about using my robin_hood map here.

I think it would be good to try the same benchmarks also with #16801, and also with
I have now rebased my branch https://github.com/martinus/bitcoin/tree/2019-08-bulkpoolallocator and done some rudimentary benchmarks: create the map and fill it with entries. I tried a simple XOR-folding hasher:

```cpp
struct Xor {
    size_t operator()(const COutPoint& id) const noexcept {
        return static_cast<size_t>(id.hash.GetUint64(0) ^ id.hash.GetUint64(1) ^ id.hash.GetUint64(2) ^ id.hash.GetUint64(3));
    }
};
```

I tried a few different hashmap implementations that I thought would be interesting: https://github.com/greg7mdp/parallel-hashmap, my https://github.com/martinus/robin-hood-hashing, and https://github.com/Tessil/sparse-map.
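This is not the actual benchmark behind the numbers discussed here, but a self-contained sketch of what such a rudimentary create-and-fill benchmark looks like. `Outpoint` is a simplified stand-in for `COutPoint`, and the hasher folds the key words together in the same spirit as the `Xor` struct above.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_map>

// Simplified stand-in for COutPoint: a 256-bit txid (four 64-bit words) plus an output index.
struct Outpoint {
    uint64_t w[4];
    uint32_t n;
    bool operator==(const Outpoint& o) const
    {
        return n == o.n && w[0] == o.w[0] && w[1] == o.w[1] && w[2] == o.w[2] && w[3] == o.w[3];
    }
};

// XOR-fold the already uniformly distributed txid words, as in the hasher above.
struct Xor {
    size_t operator()(const Outpoint& id) const noexcept
    {
        return static_cast<size_t>(id.w[0] ^ id.w[1] ^ id.w[2] ^ id.w[3] ^ id.n);
    }
};

int main()
{
    std::mt19937_64 rng{42};
    constexpr size_t kEntries = 5'000'000;

    std::unordered_map<Outpoint, uint64_t, Xor> map; // swap in the map type under test here
    map.reserve(kEntries);

    const auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < kEntries; ++i) {
        map.emplace(Outpoint{{rng(), rng(), rng(), rng()}, static_cast<uint32_t>(i)}, i);
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::printf("inserted %zu entries in %.2f s\n", map.size(), elapsed.count());
}
```

The timing loop includes key generation, so numbers from a sketch like this are only useful for comparing map/hasher combinations against each other, not as absolute figures.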
Well, I actually didn't expect the robin_hood map to be so good :) Interestingly though, at least in that benchmark, So I think it would be worthwhile to properly benchmark the bulk_pool way and a faster hash. Not sure if a hash as strong as SipHash is actually needed here.
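For context on "the bulk_pool way": the alternative to replacing the map wholesale is to keep `std::unordered_map` but hand it a pool allocator, so per-node heap allocations are batched and recycled. The sketch below illustrates that idea with the standard C++17 `std::pmr` facilities (requires a standard library shipping `<memory_resource>`); it is not martinus's `bulk_pool` allocator, and `Key`/`KeyHash` are toy stand-ins.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <unordered_map>

struct Key {
    uint64_t a, b;
    bool operator==(const Key& o) const { return a == o.a && b == o.b; }
};

struct KeyHash {
    size_t operator()(const Key& k) const noexcept { return k.a ^ k.b; }
};

int main()
{
    // All node allocations made by the map are served from this pool resource and
    // recycled on erase, instead of hitting operator new/delete for every node.
    std::pmr::unsynchronized_pool_resource pool;
    std::pmr::unordered_map<Key, uint64_t, KeyHash> map{&pool};

    for (uint64_t i = 0; i < 1'000'000; ++i) {
        map[Key{i, i * 0x9E3779B97F4A7C15ULL}] = i;
    }
    return map.size() == 1'000'000 ? 0 : 1;
}
```

This is roughly the direction of the 2019-08-bulkpoolallocator branch mentioned above, with the allocator tuned to the node size of the coins map rather than being a general-purpose pool.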
Once again, @martinus has convinced me that the relative performance benefits of swapping out the entire map implementation aren't worth the risks inherent in doing so. I'm closing this PR (again), and I think we should focus effort on vetting the allocator in Martin's branch.
Resurrection of #16718
tl;dr
I try swapping out `std::unordered_map` for a faster third-party implementation where it matters, and see something like a 15% speedup for initial block download coincident with a 19.3% reduction in memory usage.

When profiling initial block download, it's evident that a decent chunk of time on the critical path is spent in `CCoinsViewCache`, the data structure responsible for the in-memory representation of the UTXO set.

Profile of the `msghand` thread at height ~300,000.

The essential data member of the cache, `cacheCoins`, is a `std::unordered_map` that holds as many UTXOs as can fit into memory given the `dbcache` setting. While this map implementation is definitely preferable to `std::map` (which retains order and has O(log(n)) lookups instead of O(1)), both stdlib maps are subpar relative to non-std alternatives.

After seeing `cacheCoins` as a bottleneck, I decided to try swapping one of these alternative implementations in to see if there is any benefit. It looks like there is.
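To make the shape of the proposed swap concrete, here is a minimal self-contained sketch (not the actual patch): the cache map sits behind a single type alias, so changing the implementation is conceptually a one-line change. `Outpoint`, `CoinsCacheEntry`, and `OutpointHasher` are simplified stand-ins for the real `COutPoint`, `CCoinsCacheEntry`, and Core's SipHash-based `SaltedOutpointHasher`, and the robin_hood alias is left commented out since the third-party header isn't included here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>

// #include "robin_hood.h"  // the single-header third-party map (not vendored in this sketch)

// Simplified stand-ins for Core's COutPoint and CCoinsCacheEntry.
struct Outpoint {
    std::array<uint8_t, 32> txid{};
    uint32_t n{0};
    bool operator==(const Outpoint& other) const { return n == other.n && txid == other.txid; }
};

struct CoinsCacheEntry {
    uint64_t value_sats{0};
    uint32_t height{0};
    bool dirty{false};
};

// Illustration-only hasher: XOR-fold the txid words. Core actually salts this
// (SaltedOutpointHasher, SipHash-based) to avoid attacker-controlled collisions.
struct OutpointHasher {
    size_t operator()(const Outpoint& out) const noexcept
    {
        uint64_t w[4];
        std::memcpy(w, out.txid.data(), sizeof(w));
        return static_cast<size_t>(w[0] ^ w[1] ^ w[2] ^ w[3] ^ out.n);
    }
};

// The proposal hinges on this alias: callers keep the same map interface,
// only the underlying implementation changes.
using CoinsMap = std::unordered_map<Outpoint, CoinsCacheEntry, OutpointHasher>;
// using CoinsMap = robin_hood::unordered_node_map<Outpoint, CoinsCacheEntry, OutpointHasher>;

int main()
{
    CoinsMap cache;
    Outpoint out;
    out.n = 1;
    cache[out] = CoinsCacheEntry{50'000'000, 300'000, true};
    return cache.count(out) == 1 ? 0 : 1;
}
```

A node-based map preserves reference stability of stored elements, as `std::unordered_map` does, which matters for code that holds references into the coins cache.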
For 20,000 blocks at a representative part of the chain, we see something like a 15% speedup with a 19.3% reduction in memory usage (dbcache was set to 3000).
The `CCoinsCaching` benchmark shows improvement, but not as dramatically as IBD does.

Obviously, adding 2000 lines of new consensus-critical code isn't appealing, but if those numbers hold for the full IBD process, I think there might be a compelling case to be made for the added code.
Implementations considered
Of the few best-looking options, I picked the one with the most succinct implementation and fewest dependencies, `robin_hood::unordered_node_map` (it's just a header file). I also looked at Facebook's `F14NodeMap` and Abseil's `node_hash_map`, but both required too much elbow grease (for now) to integrate here without pulling in a whole lot of code.

Next steps
I'm going to be running more comprehensive benchmarks, including a near-full IBD from the network, to see if these improvements hold.
I'm also interested in benching this along with a similar change to `mapBlockIndex` (currently a `std::unordered_map`).

Questions
If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?