Improve speed, memory efficiency with alternate hashmap (#2) #22640

jamesob · 2021-08-05T20:19:01Z

Resurrection of #16718

tl;dr

I tried swapping out std::unordered_map for a faster third-party implementation where it matters, and saw something like a 15% speedup for initial block download along with a 19.3% reduction in memory usage.

When profiling initial block download, it's evident that a decent chunk of time on the critical path is spent in CCoinsViewCache, the data structure responsible for the in-memory representation of the UTXO set.

Profile of the msghand thread at height ~300,000.

The essential data member of the cache, cacheCoins, is a std::unordered_map that holds as many UTXOs as can fit into memory given the dbcache setting. While this map implementation is definitely preferable to std::map (which retains order and has O(log(n)) lookups instead of O(1)), both stdlib maps are subpar relative to non-std alternatives.

After seeing cacheCoins as a bottleneck, I decided to try swapping one of these alternative implementations in to see if there is any benefit. It looks like there is.

For 20,000 blocks at a representative part of the chain, we see something like a 15% speedup with a 19.3% reduction in memory usage (dbcache was set to 3000).

The CCoinsCaching benchmark shows improvement but not as dramatically as IBD does:

Obviously adding 2000 lines of new consensus-critical code isn't appealing, but if those numbers hold for the full IBD process I think there might be a compelling case to be made for the added code.

Implementations considered

Of the few best-looking options, I picked the one with the most succinct implementation and fewest dependencies, robin_hood::unordered_node_map (it's just a header file). I also looked at Facebook's F14NodeMap and Abseil's node_hash_map but both required too much elbow-grease (for now) to integrate here without pulling in a whole lot of code.

Next steps

I'm going to be running more comprehensive benchmarks, including a near-full IBD from the network, to see if these improvements hold.

I'm also interested in benching this along with a similar change to mapBlockIndex (currently a std::unordered_map).

Questions

If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

jamesob · 2021-08-05T20:34:15Z

I'm reopening this because I think it's worth some long-term consideration. We're not going to get many opportunities to improve block connection by ~12%, so I think this is worth continuing to look at. Now that bitcoin is on c++17, I wonder if there aren't some good ways to slim down @martinus's implementation, since it contains a number of provisions to deal with <c++17 platforms. At the very least we could work on fuzz testing this to our satisfaction.

I've rerun some benchmarks on an idle workstation and have confirmed that I continue to see ~13% speedups along with reduction in memory use, which is especially pronounced when using a large dbcache.

Issues to consider

From the README: "For a really bad hash the performance will not only degrade like in std::unordered_map, the map will simply fail with an std::overflow_error. In practice, when using the standard robin_hood::hash, I have never seen this happening."
- Insert fails with std::overflow_error again martinus/robin-hood-hashing#117
- Possible that we can amend this behavior, or at least convince ourselves this is appropriately unlikely given our use of siphash?

Benchmarks

dbcache=10000

commands index

bench name	command
ibd.local.range.500000.540000	`bitcoind -dbcache=10000 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876`

jamesob/2019-08-robinhood vs. $mergebase (absolute)

bench name	x	jamesob/2019-08-robinhood	$mergebase
ibd.local.range.500000.540000.total_secs	3	2812.5091 (± 27.2604)	3190.3358 (± 17.5632)
ibd.local.range.500000.540000.peak_rss_KiB	3	5517118.6667 (± 1750.8682)	6344530.6667 (± 3478.0996)
ibd.local.range.500000.540000.cpu_kernel_secs	3	260.3867 (± 2.5737)	263.0400 (± 2.1568)
ibd.local.range.500000.540000.cpu_user_secs	3	12714.3000 (± 7.6987)	13223.2667 (± 8.7341)

jamesob/2019-08-robinhood vs. $mergebase (relative)

bench name	x	jamesob/2019-08-robinhood	$mergebase
ibd.local.range.500000.540000.total_secs	3	1	1.134
ibd.local.range.500000.540000.peak_rss_KiB	3	1	1.150
ibd.local.range.500000.540000.cpu_kernel_secs	3	1	1.010
ibd.local.range.500000.540000.cpu_user_secs	3	1	1.040
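
For readers checking the arithmetic: the relative table appears to be each baseline mean divided by the branch mean, with the branch normalized to 1. The sketch below recomputes that ratio and the corresponding reduction from the absolute means above. The function names are illustrative and not part of bitcoinperf.

```cpp
#include <cassert>

// Ratio of the baseline mean to the branch mean, matching the layout of the
// "relative" tables (branch column fixed at 1).
double RelativeToBranch(double branch, double baseline) {
    assert(branch > 0.0);
    return baseline / branch;
}

// Fraction of the baseline that the branch saves (0.10 means 10% less).
double ReductionVsBaseline(double branch, double baseline) {
    assert(baseline > 0.0);
    return 1.0 - branch / baseline;
}
```

Calling `RelativeToBranch(2812.5091, 3190.3358)` reproduces the 1.134 figure for total_secs, while `ReductionVsBaseline` on the same inputs gives roughly 11.8% less wall-clock time. The two numbers describe the same measurement from different directions, which may explain why both "~12%" and "~13%" appear in this thread.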

dbcache=300

commands index

bench name	command
ibd.local.range.500000.535000	`bitcoind -dbcache=300 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876`

2019-08-robinhood vs. master (absolute)

bench name	x	2019-08-robinhood	master
ibd.local.range.500000.535000.total_secs	3	3592.5329 (± 202.2415)	4070.7353 (± 13.1181)
ibd.local.range.500000.535000.peak_rss_KiB	3	2049018.6667 (± 147027.9395)	2110129.3333 (± 36744.0276)
ibd.local.range.500000.535000.cpu_kernel_secs	3	576.7433 (± 4.9359)	576.6567 (± 1.4616)
ibd.local.range.500000.535000.cpu_user_secs	3	10879.8767 (± 20.1319)	11260.5933 (± 8.3835)

2019-08-robinhood vs. master (relative)

bench name	x	2019-08-robinhood	master
ibd.local.range.500000.535000.total_secs	3	1.00	1.133
ibd.local.range.500000.535000.peak_rss_KiB	3	1.00	1.030
ibd.local.range.500000.535000.cpu_kernel_secs	3	1.00	1.000
ibd.local.range.500000.535000.cpu_user_secs	3	1.00	1.035

0xB10C · 2021-08-05T20:50:43Z

Cool. I plan on benchmarking this too (at least IBD speed) via the validation::connect_block USDT tracepoint.

martinus · 2021-08-06T07:15:48Z

It's unfortunately really hard to fix martinus/robin-hood-hashing#117 without losing at least some of the performance benefit. Also, I don't really have the time to do it properly... So I'm personally quite wary of using my robin_hood map in such a critical place. I've been using it for a long time in countless installations and have personally never seen that problem, but other users of the map definitely have. And it is definitely possible for a malicious actor to construct a sequence of values that will cause the map to always fail. That's not possible with std::unordered_map, where such a sequence would only degrade performance.
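
To make the std::unordered_map half of that comparison concrete, here is a minimal sketch using the worst possible hash, one that returns the same value for every key. `ConstantHash`, `DegenerateMap` and `LargestBucketAfterInserts` are hypothetical names for illustration. The sketch only shows the standard container's behavior and does not reproduce the robin_hood failure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Deliberately terrible hash: every key hashes to the same value.
struct ConstantHash {
    std::size_t operator()(std::uint64_t) const noexcept { return 0; }
};

using DegenerateMap = std::unordered_map<std::uint64_t, int, ConstantHash>;

// Inserts `count` distinct keys and returns the size of the largest bucket.
std::size_t LargestBucketAfterInserts(DegenerateMap& map, std::uint64_t count) {
    for (std::uint64_t key = 0; key < count; ++key) {
        map.emplace(key, static_cast<int>(key));
    }
    std::size_t largest = 0;
    for (std::size_t b = 0; b < map.bucket_count(); ++b) {
        if (map.bucket_size(b) > largest) largest = map.bucket_size(b);
    }
    return largest;
}
```

Every element ends up chained in a single bucket, so lookups become linear scans, but all insertions succeed and every key remains retrievable.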

I think it would be good to try the same benchmarks also with #16801 and also with tsl::sparse_map.

- #16801 (faster & less memory for sync: bulk pool allocator for node based containers) introduces a relatively simple bulk pool allocator for std::unordered_map. I think most of the performance improvement from robin_hood comes from the more efficient allocation, and that pool allocator would bring the same allocation behavior to std::unordered_map. The code can be updated with C++17 features, which would make it less of a hack. Such a pool allocator could also be beneficial in other places in the code, wherever node-based containers are used.
- tsl::sparse_map, a memory-efficient hashmap implementation, seems to be excellent: it is header-only, and in my benchmarks, while slower than robin_hood, it is faster than std::unordered_map and more memory efficient than both. I think being able to keep more elements in memory should be a better trade-off because it needs fewer flushes. https://github.com/Tessil/sparse-map
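
The general idea of serving map nodes from a pool can be sketched with the C++17 standard library alone. This is not the bulk_pool from #16801. It swaps in std::pmr::unsynchronized_pool_resource as a stand-in and wraps its upstream in a hypothetical `CountingResource` to show how few requests reach the underlying allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <unordered_map>

// Hypothetical helper: forwards to new/delete and counts allocation requests.
class CountingResource : public std::pmr::memory_resource {
public:
    std::size_t allocations = 0;

private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        ++allocations;
        return std::pmr::new_delete_resource()->allocate(bytes, alignment);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, alignment);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Inserts `count` entries into a pooled std::unordered_map and returns how
// many allocation requests the pool passed on to its upstream resource.
std::size_t UpstreamAllocationsForInserts(std::uint64_t count) {
    CountingResource upstream;
    std::pmr::unsynchronized_pool_resource pool(&upstream);
    {
        // The map must be destroyed before the pool it allocates from.
        std::pmr::unordered_map<std::uint64_t, std::uint64_t> map(&pool);
        for (std::uint64_t key = 0; key < count; ++key) {
            map.emplace(key, key);
        }
        assert(map.size() == count);
    }
    return upstream.allocations;
}
```

Because the pool hands out node-sized blocks carved from larger chunks, the upstream count should stay far below the number of inserted elements. Whether this matches the speed or memory profile of the allocator in #16801 is not something the sketch can show.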

martinus · 2021-08-06T20:37:12Z

I have now rebased my branch https://github.com/martinus/bitcoin/tree/2019-08-bulkpoolallocator and done some rudimentary benchmarks: create the CCoinsMap, and 5 times insert 1 million entries and then clear it. I measured average time per insertion in ns/op and maximum resident set size in kbyte.
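
The measurement loop described above could look roughly like the following sketch. It is an assumption-laden stand-in, not the code from the linked branch: the function name is hypothetical, keys are plain 64-bit integers rather than COutPoint, and memory use is not measured at all.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Average nanoseconds per insertion over `rounds` fill-and-clear cycles.
template <typename Map>
double MeasureInsertNsPerOp(Map& map, std::uint64_t entries_per_round, int rounds) {
    assert(entries_per_round > 0 && rounds > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < rounds; ++r) {
        const std::uint64_t base = static_cast<std::uint64_t>(r) * entries_per_round;
        for (std::uint64_t i = 0; i < entries_per_round; ++i) {
            map.emplace(base + i, i);  // distinct key in every round
        }
        map.clear();  // clearing time is included in the total, as a simplification
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    const double total_ns = std::chrono::duration<double, std::nano>(elapsed).count();
    return total_ns / (static_cast<double>(entries_per_round) * rounds);
}
```

Passing a `std::unordered_map<std::uint64_t, std::uint64_t>` with 1,000,000 entries per round and 5 rounds mirrors the shape of the benchmark, though absolute numbers will differ from the table because the key, value and hash types are different.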

I tried std::unordered_map and std::map with and without the bulk_pool, and also tried a very simple hash:

// Unsalted hash: XORs the four 64-bit words of the txid. The output index is not mixed in.
struct Xor {
    size_t operator()(const COutPoint& id) const noexcept {
        return static_cast<size_t>(id.hash.GetUint64(0) ^ id.hash.GetUint64(1) ^ id.hash.GetUint64(2) ^ id.hash.GetUint64(3));
    }
};
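
For anyone who wants to experiment with this hash outside the Bitcoin Core tree, here is a self-contained sketch. `MockUint256`, `MockOutPoint`, `MockXor`, `MockCoinsMap` and `MakeOutPoint` are hypothetical stand-ins that only mimic the members the hasher touches. They are not the real uint256 or COutPoint types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for uint256: four 64-bit words.
struct MockUint256 {
    std::array<std::uint64_t, 4> words{};
    std::uint64_t GetUint64(std::size_t pos) const { return words[pos]; }
};

// Hypothetical stand-in for COutPoint: a txid plus an output index.
struct MockOutPoint {
    MockUint256 hash;
    std::uint32_t n = 0;
    bool operator==(const MockOutPoint& other) const {
        return hash.words == other.hash.words && n == other.n;
    }
};

// Same mixing as the Xor struct above, applied to the mock types.
struct MockXor {
    std::size_t operator()(const MockOutPoint& id) const noexcept {
        return static_cast<std::size_t>(id.hash.GetUint64(0) ^ id.hash.GetUint64(1) ^
                                        id.hash.GetUint64(2) ^ id.hash.GetUint64(3));
    }
};

using MockCoinsMap = std::unordered_map<MockOutPoint, int, MockXor>;

// Builds an outpoint whose txid words are derived from `seed`.
MockOutPoint MakeOutPoint(std::uint64_t seed, std::uint32_t n) {
    MockOutPoint out;
    out.hash.words = {seed, seed + 1, seed + 2, seed + 3};
    out.n = n;
    return out;
}
```

Two outpoints built with the same seed but different `n` hash identically under `MockXor`, since the output index is ignored. The map still keeps them apart through `operator==`. The hash also has no secret salt, so its output is predictable from the txid alone, which is relevant to the question of whether something as strong as siphash is needed.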

I tried a few different hash map implementations that I thought would be interesting: https://github.com/greg7mdp/parallel-hashmap, my https://github.com/martinus/robin-hood-hashing, and https://github.com/Tessil/sparse-map.

ns/op	RSS kB	map	Pool	Hash?
62.51	311252	robin_hood::unordered_flat_map		Xor
99.93	311392	robin_hood::unordered_flat_map		SaltedOutpointHasher
106.71	135460	std::unordered_map	Yes	Xor
130.00	126124	robin_hood::unordered_node_map		Xor
158.59	126124	robin_hood::unordered_node_map		SaltedOutpointHasher
175.24	149368	std::unordered_map		Xor
264.35	311212	phmap::flat_hash_map		Xor
317.85	135556	std::unordered_map	Yes	SaltedOutpointHasher
402.43	149460	std::unordered_map		SaltedOutpointHasher
463.53	128336	tsl::sparse_map		Xor
586.76	153776	std::map
545.62	112332	tsl::sparse_map		SaltedOutpointHasher
515.25	154712	phmap::node_hash_map		SimpleHasher
1,113.55	140448	std::map	Yes

Well, I actually didn't expect robin_hood map to be so good :)

Interestingly though, at least in that benchmark, std::unordered_map with the bulk_pool and the simple hash performs really well, even faster than robin_hood::unordered_node_map. I didn't expect tsl::sparse_map and phmap::node_hash_map to be so slow. Maybe they fare better in a real world benchmark with lots of lookups.

So I think it would be worthwhile to properly benchmark the bulk_pool approach together with a faster hash. I'm not sure a hash as strong as siphash is actually needed here.

jamesob · 2021-08-09T16:10:40Z

Once again, @martinus has convinced me that the relative performance benefits of swapping out the entire map implementation aren't worth the risks inherent in robin_hood's more uncertain failure modes. Luckily, bitcoinperf benchmarks (#16801 (comment)) indicate that work on the allocator may (as Martin suspected) yield most of the benefits of the change here with much less risk.

I'm closing this PR (again) and I think we should focus effort on vetting the allocator in Martin's branch.

jamesob and others added 9 commits July 26, 2021 16:17

Use robin_hood.unordered_node_map for CCoinsViewCache.cacheCoins

df1c967

[robinhood] remove custom hash

dfdd18e

[robinhood] remove inline DataNode (flatmap=true)

960dbb5

[robinhood] remove robin_hood::pair

c046143

[robinhood] remove Cloner and Destroyer for flatmaps

f29a8a4

[robinhood] remove unnecessary constructors, operators

9acd37b

[robinhood] remove all references to IsFlatMap

c0cd241

[robinhood] remove integer sequence stuff

f76b368

[robin_hood] update to 3.11.3

c19d014

DrahtBot added the UTXO Db and Indexes label Aug 5, 2021

jamesob mentioned this pull request Aug 9, 2021

faster & less memory for sync: bulk pool allocator for node based containers #16801

Closed

jamesob closed this Aug 9, 2021

bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022
