Skip to content

Conversation

jamesob
Copy link
Contributor

@jamesob jamesob commented Aug 5, 2021

Resurrection of #16718


tl;dr

I try swapping out std::unordered_map for a faster third-party implementation where it matters, see something like 15% speedup for initial block download coincident with a 19.3% reduction in memory usage.


When profiling initial block download, it's evident that a decent chunk of time on the critical path is spent in CCoinsViewCache, the data structure responsible for the in-memory representation of the UTXO set.

Selection_147
Profile of the msghand thread at height ~300,000.

The essential data member of the cache, cacheCoins, is a std::unordered_map that holds as many UTXO as can fit into memory given the dbcache setting. While this map implementation is definitely preferable to std::map (which retains order and has O(log(n)) lookups insted of O(1)), both stdlib maps are subpar relative to non-std alternatives.

After seeing cacheCoins as a bottleneck, I decided to try swapping one of these alternative implementations in to see if there is any benefit. It looks like there is.

ibd local range 500000 520000

For 20,000 blocks at a representative part of the chain, we see something like a 15% speedup with a 19.3% reduction in memory usage (dbcache was set to 3000).

The CCoinsCaching benchmark shows improvement but not as dramatically as IBD does:

microbenches

Obviously adding 2000 lines of new consensus-critical code isn't appealing, but if those numbers hold for the full IBD process I think there might be a compelling case to be made for the added code.

Implementations considered

Of the few best-looking options, I picked the one with the most succinct implementation and fewest dependencies, robin_hood::unordered_node_map (it's just a header file). I also looked at Facebook's F14NodeMap and Abseil's node_hash_map but both required too much elbow-grease (for now) to integrate here without pulling in a whole lot of code.

Next steps

I'm going to be running more comprehensive benchmarks, including a near-full IBD from the network, to see if these improvements hold.

I'm also interested in benching this along with a similar change to mapBlockIndex (currently a std::unordered_map).

Questions

If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

@jamesob
Copy link
Contributor Author

jamesob commented Aug 5, 2021

I'm reopening this because I think it's worth some long-term consideration. We're not going to get many opportunities to improve block connection by ~12%, so I think this is worth continuing to look at. Now that bitcoin is on c++17, I wonder if there aren't some good ways to slim down @martinus's implementation, since it contains a number of provisions to deal with <c++17 platforms. At the very least we could work on fuzz testing this to our satisfaction.

I've rerun some benchmarks on an idle workstation and have confirmed that I continue to see ~13% speedups along with reduction in memory use, which is especially pronounced when using a large dbcache.

Issues to consider

  • From the README: "For a really bad hash the performance will not only degrade like in std::unordered_map, the map will simply fail with an std::overflow_error. In practice, when using the standard robin_hood::hash, I have never seen this happening."

Benchmarks

dbcache=10000

ibd local range 500000 540000

commands index

bench name command
ibd.local.range.500000.540000 bitcoind -dbcache=10000 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876

jamesob/2019-08-robinhood vs. $mergebase (absolute)

bench name x jamesob/2019-08-robinhood $mergebase
ibd.local.range.500000.540000.total_secs 3 2812.5091 (± 27.2604) 3190.3358 (± 17.5632)
ibd.local.range.500000.540000.peak_rss_KiB 3 5517118.6667 (± 1750.8682) 6344530.6667 (± 3478.0996)
ibd.local.range.500000.540000.cpu_kernel_secs 3 260.3867 (± 2.5737) 263.0400 (± 2.1568)
ibd.local.range.500000.540000.cpu_user_secs 3 12714.3000 (± 7.6987) 13223.2667 (± 8.7341)

jamesob/2019-08-robinhood vs. $mergebase (relative)

bench name x jamesob/2019-08-robinhood $mergebase
ibd.local.range.500000.540000.total_secs 3 1 1.134
ibd.local.range.500000.540000.peak_rss_KiB 3 1 1.150
ibd.local.range.500000.540000.cpu_kernel_secs 3 1 1.010
ibd.local.range.500000.540000.cpu_user_secs 3 1 1.040

dbcache=300

ibd local range 500000 535000

commands index

bench name command
ibd.local.range.500000.535000 bitcoind -dbcache=300 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876

2019-08-robinhood vs. master (absolute)

bench name x 2019-08-robinhood master
ibd.local.range.500000.535000.total_secs 3 3592.5329 (± 202.2415) 4070.7353 (± 13.1181)
ibd.local.range.500000.535000.peak_rss_KiB 3 2049018.6667 (± 147027.9395) 2110129.3333 (± 36744.0276)
ibd.local.range.500000.535000.cpu_kernel_secs 3 576.7433 (± 4.9359) 576.6567 (± 1.4616)
ibd.local.range.500000.535000.cpu_user_secs 3 10879.8767 (± 20.1319) 11260.5933 (± 8.3835)

2019-08-robinhood vs. master (relative)

bench name x 2019-08-robinhood master
ibd.local.range.500000.535000.total_secs 3 1.00 1.133
ibd.local.range.500000.535000.peak_rss_KiB 3 1.00 1.030
ibd.local.range.500000.535000.cpu_kernel_secs 3 1.00 1.000
ibd.local.range.500000.535000.cpu_user_secs 3 1.00 1.035

@0xB10C
Copy link
Contributor

0xB10C commented Aug 5, 2021

Cool. I plan on benchmarking this too (at least IBD speed) via the validation::connect_block USDT tracepoint.

@martinus
Copy link
Contributor

martinus commented Aug 6, 2021

It's unfortunately really hard to fix martinus/robin-hood-hashing#117 without losing at least some of the performance benefit. Also I don't really have the time to do it properly... So I'm personally quite a bit weary about using my robin_hood map for such a critical place. I mean I've been using it in for a long time in many countless installations and personally have never seen that problem, but other users of the map have definitely seen it. And it definitely is possible for an malicious actor to construct sequence of values that will cause the map to always fail. That's not possible for unordered_map, this would only degrades the performance.

I think it would be good to try the same benchmarks also with #16801 and also with tsl::sparse_map.

  • faster & less memory for sync: bulk pool allocator for node based containers #16801 introduces a relatively simple bulk pool allocator for std::unordered_map. I think most of the performance improvement from robin_hood comes due to the more efficient allocation, and that pool allocator would bring the same allocation behavior to std::unordered_map. The code can be updated with C++17 features which would make it less of a hack. Such a pool allocator could also be beneficial in other places in the code too, wherever node based containers are used.

  • Have a look at memory efficient hashmap implementation: tsl::sparse_map seems to be excellent, header only, and in my benchmarks while slower than robin_hood it is faster than std::unordered_map and more memory efficient than both. I think being able to keep more elements in memory should be a better trade off because it needs less flushes. https://github.com/Tessil/sparse-map

@martinus
Copy link
Contributor

martinus commented Aug 6, 2021

I have now rebased my branch https://github.com/martinus/bitcoin/tree/2019-08-bulkpoolallocator and done some rudimentary benchmarks: create the CCoinsMap, and 5 times insert 1 million entries and then clear it. I measured average time per insertion in ns/op and maximum resident set size in kbyte.

I tried std::unordered_map and std::map with and without the bulk_pool, and also tried a very simple hash:

struct Xor {
    size_t operator()(const COutPoint& id) const noexcept {
        return static_cast<size_t>(id.hash.GetUint64(0) ^ id.hash.GetUint64(1) ^ id.hash.GetUint64(2) ^ id.hash.GetUint64(3));
    }
};

I tried a few different hash implementations that I thought would be interesting, https://github.com/greg7mdp/parallel-hashmap, my https://github.com/martinus/robin-hood-hashing and https://github.com/Tessil/sparse-map.

ns/op RSS kB map Pool Hash?
62.51 311252 robin_hood::unordered_flat_map Xor
99.93 311392 robin_hood::unordered_flat_map SaltedOutpointHasher
106.71 135460 std::unordered_map Yes Xor
130.00 126124 robin_hood::unordered_node_map Xor
158.59 126124 robin_hood::unordered_node_map SaltedOutpointHasher
175.24 149368 std::unordered_map Xor
264.35 311212 phmap::flat_hash_map Xor
317.85 135556 std::unordered_map Yes SaltedOutpointHasher
402.43 149460 std::unordered_map SaltedOutpointHasher
463.53 128336 tsl::sparse_map Xor
586.76 153776 std::map
545.62 112332 tsl::sparse_map SaltedOutpointHasher
515.25 154712 phmap::node_hash_map SimpleHasher
1,113.55 140448 std::map Yes

Well, I actually didn't expect robin_hood map to be so good :)

Interestingly though, at least in that benchmark, std::unordered_map with the bulk_pool and the simple hash performs really well, even faster than robin_hood::unordered_node_map. I didn't expect tsl::sparse_map and phmap::node_hash_map to be so slow. Maybe they fare better in a real world benchmark with lots of lookups.

So I think it would be worthwhile to properly benchmark the bulk_pool way and a faster hash. Not sure if such a strong hash like siphash is actually needed here.

@jamesob
Copy link
Contributor Author

jamesob commented Aug 9, 2021

Once again, @martinus has convinced me that the relative performance benefits of swapping out the entire map implementation aren't worth the risks inherent in robin_hood's more uncertain failure modes. Luckily, bitcoinperf benchmarks (#16801 (comment)) indicate that work on the allocator may (as Martin suspected) yield most of the benefits of the change here with much less risk.

I'm closing this PR (again) and I think we should focus effort on vetting the allocator in Martin's branch.

@jamesob jamesob closed this Aug 9, 2021
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants