
Conversation

@jamesob (Contributor) commented Aug 25, 2019

tl;dr

I try swapping out std::unordered_map for a faster third-party implementation where it matters, and see something like a 15% speedup for initial block download along with a 19.3% reduction in memory usage.


When profiling initial block download, it's evident that a decent chunk of time on the critical path is spent in CCoinsViewCache, the data structure responsible for the in-memory representation of the UTXO set.

[Image: profile of the msghand thread at height ~300,000.]

The essential data member of the cache, cacheCoins, is a std::unordered_map that holds as many UTXOs as can fit into memory given the dbcache setting. While this map implementation is definitely preferable to std::map (which retains order and has O(log(n)) lookups instead of O(1)), both stdlib maps are subpar relative to non-std alternatives.

After seeing cacheCoins as a bottleneck, I decided to try swapping one of these alternative implementations in to see if there is any benefit. It looks like there is.
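For context, the change is essentially one line: cacheCoins is declared through the CCoinsMap typedef in coins.h. Roughly (quoting from memory, so the exact spelling may differ between versions):

    // coins.h, before: the stdlib hash map backing cacheCoins.
    typedef std::unordered_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> CCoinsMap;

    // coins.h, after (this branch): same key, value, and salted hasher,
    // different table implementation.
    typedef robin_hood::unordered_node_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> CCoinsMap;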

[Image: local IBD benchmark over blocks 500,000-520,000.]

For 20,000 blocks at a representative part of the chain, we see something like a 15% speedup with a 19.3% reduction in memory usage (dbcache was set to 3000).

The CCoinsCaching benchmark shows improvement, though not as dramatic as IBD's:

[Image: CCoinsCaching microbenchmark results.]

Obviously adding 2000 lines of new consensus-critical code isn't appealing, but if those numbers hold for the full IBD process I think there might be a compelling case to be made for the added code.

Implementations considered

Of the few best-looking options, I picked the one with the most succinct implementation and fewest dependencies, robin_hood::unordered_node_map (it's just a header file). I also looked at Facebook's F14NodeMap and Abseil's node_hash_map but both required too much elbow-grease (for now) to integrate here without pulling in a whole lot of code.

Next steps

I'm going to be running more comprehensive benchmarks, including a near-full IBD from the network, to see if these improvements hold.

I'm also interested in benching this along with a similar change to mapBlockIndex (currently a std::unordered_map).

Questions

If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

@promag (Contributor) commented Aug 25, 2019

including a near-full IBD from the network, to see if these improvements hold.

Yeah, that would be interesting.

I'm also interested in benching this along with a similar change to mapBlockIndex

I'd be surprised to see such improvements here.

If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?

Either way it should include tests.

@DrahtBot (Contributor) commented Aug 25, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #16910 (wallet: reduce loading time by using unordered maps by achow101)
  • #16801 (faster & less memory for sync: bulk pool allocator for node based containers by martinus)
  • #15606 ([experimental] UTXO snapshots by jamesob)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@practicalswift (Contributor)

@jamesob Interesting! Thanks for providing benchmarks!

Which implementation of std::unordered_map did you benchmark against?

It would be interesting to see this benchmarked against the three major implementations of std::unordered_map (libstdc++, libc++ and Microsoft STL).

@jamesob (Contributor, Author) commented Aug 25, 2019

Which implementation of std::unordered_map did you benchmark against?

I've been benching on Ubuntu and Debian systems with libstdc++ (x86_64, glibc6).

@fanquake (Member)

I try swapping out std::unordered_map for a faster third-party implementation where it matters, and see something like a 15% speedup for initial block download along with a 19.3% reduction in memory usage.

Nice. This is the kind of PR I enjoy reading / reviewing / testing.

I did a basic benchmark on my macOS machine: time src/bitcoind -stopatheight=350000.

#16718 6f9882c - 48m49s
master db67101 - 55m22s

Will run a sync to a higher height later.

@jamesob (Contributor, Author) commented Aug 26, 2019

Some additional context from further benchmarking:

Local tests

I did a local IBD (not real network) comparison up to height 510,000. We see a speedup of roughly 17% coupled with memory savings of roughly 13.5%.

Given the vertical shift in real memory usage curve on the robinhood branch (pictured below), I suspect there's a problem with my implementation of the DynamicUsage() memory estimator for the new robin_hood map. We appear to be over-reporting estimated memory usage of the new map relative to actual, so fixing this might lead to additional speedups for configurations with a limiting dbcache.
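For reference, a sketch of the estimator being approximated here: the std::unordered_map overload in memusage.h looks roughly like the following (from memory; the robin_hood overload on this branch is my own analogue of it):

    // memusage.h (approximate): model the map as one heap allocation per
    // node plus one allocation for the bucket array.
    template <typename X, typename Y, typename Z>
    static inline size_t DynamicUsage(const std::unordered_map<X, Y, Z>& m)
    {
        return MallocUsage(sizeof(unordered_node<std::pair<const X, Y>>)) * m.size() +
               MallocUsage(sizeof(void*) * m.bucket_count());
    }

If the robin_hood analogue over-counts per-node or per-bucket costs relative to what the map actually allocates, the cache flushes earlier than it needs to, which would match the vertical shift above.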

[Image: memory usage over time for local IBD to height 510,000.]

2019-08-robinhood vs. master (absolute)

bench name                     x  2019-08-robinhood        master
ibd.local.510000.total_secs    1  7827.2801 (± 0.0000)     9447.3154 (± 0.0000)
ibd.local.510000.peak_rss_KiB  1  4863324.0000 (± 0.0000)  5620656.0000 (± 0.0000)

2019-08-robinhood vs. master (relative)

bench name                     x  2019-08-robinhood  master
ibd.local.510000.total_secs    1  1                  1.207
ibd.local.510000.peak_rss_KiB  1  1                  1.156

IBD from the real network

I let IBD run overnight up to height 550,000 using peers from the real network for both master and this branch. The differences in speed were not as compelling here (a 6% gain for the robinhood branch) but memory differences were still pronounced: 14.6% savings for the robinhood branch (6502MB vs. 7615MB, both at a dbcache of 5000).

Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Linux 4.4.0-119-generic x86_64

`master` to height 550_000, dbcache=5000:

Command being timed: "./src/bitcoind -datadir=/data/bitcoin-rh -printtoconsole=0 -stopatheight=550000 -dbcache=5000"
User time (seconds): 54822.52
System time (seconds): 929.91
Percent of CPU this job got: 121%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:44:12
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7615768
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1715
Minor (reclaiming a frame) page faults: 72418086
Voluntary context switches: 26934810
Involuntary context switches: 287115
Swaps: 0
File system inputs: 451040
File system outputs: 596633792
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0


On the robinhood branch:

Command being timed: "./src/bitcoind -datadir=/data/bitcoin-rh -printtoconsole=0 -stopatheight=550000 -dbcache=5000"
User time (seconds): 52074.03
System time (seconds): 914.95
Percent of CPU this job got: 123%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:57:01
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6502096
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 14411
Minor (reclaiming a frame) page faults: 72379830
Voluntary context switches: 22815471
Involuntary context switches: 290463
Swaps: 0
File system inputs: 770608
File system outputs: 597949704
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

The "real" gains here dampen my enthusiasm somewhat, but the memory story is compelling. Curious for others to weigh in on whether these differences merit the increased risk of bringing in a non-std map implementation.

@jamesob force-pushed the 2019-08-robinhood branch from 6f9882c to 33b9786 on August 26, 2019 19:35
@jamesob (Contributor, Author) commented Aug 27, 2019

Consider my enthusiasm renewed, because the latest "real" IBD runs came back showing a hard-to-believe 40% speedup, syncing to height 550,000 on different machines with identical hardware. Still a 12% memory savings.

Obviously it's hard to gauge variance when syncing from the real network.

2019-08-robinhood

[Image: /usr/bin/time output for the 2019-08-robinhood sync.]

master

[Image: /usr/bin/time output for the master sync.]

Apologies for the images - tmux over SSH makes copying text a pain.

@JeremyRubin (Contributor)

Concept ACK, NACK this implementation. From the docs:

Depends on good Hashing. For a really bad hash the performance will not only degrade like in std::unordered_map, the map will simply fail with an std::overflow_error. In practice, when using the standard robin_hood::hash, I have never seen this happening.

There are some ways we could maybe get better performance with std::unordered_map; I will take a look and give some feedback later.

@jamesob (Contributor, Author) commented Aug 27, 2019

the map will simply fail with an std::overflow_error

Thanks for the close look @JeremyRubin. This is exactly the kind of change I'd be really worried about introducing. The performance difference will have to be really compelling (and verified by a few people aside from myself) to consider using a non-std implementation for something as consensus-sensitive as cacheCoins.

I wonder if, assuming we did find a non-std implementation that we're comfortable with, we should initially leave its use as an opt-in configure flag that, e.g., installations running on lower-power hardware can make use of.

@jamesob (Contributor, Author) commented Aug 27, 2019

Alternatively, we could look at forking robin-hood-hashing and replacing all instances of throwOverflowError() with something more conservative.

@elichai (Contributor) commented Aug 27, 2019

Concept ACK for replacing the hashmap, std::map is a disaster. (std::unordered_map is a little bit better but still really bad).

@JeremyRubin (Contributor)

@jamesob maybe have a look at the experimental rust fork?

You could just wrap the Rust hashmap (with at least Rust v1.35, or use the hashbrown crate). Rust's hashmap is based on SwissTable.

Replacing sensitive data structures with rust seems like a good use case given long term goals, and can be used for low resource nodes experimentally?

@elichai (Contributor) commented Aug 27, 2019

@JeremyRubin
I just had a talk with him about this,
Personally I'd love to see rust inside of bitcoin, but I'm not sure a hashmap is the best first step as integrating rust and c++ requires a C ABI in between.

I do hope I'll have time to test the C++ SwissTable implementation and compare benchmarks https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal/raw_hash_set.h

@jamesob (Contributor, Author) commented Aug 27, 2019

I agree with @elichai - I think adding Rust as a dependency for one of the most consensus-sensitive data structures is way too big a leap. A few extra thousand lines of heavily-scrutinized cpp is about the biggest risk I'm willing to take personally, and that's assuming a ton of review, testing, and maybe a gradual deployment process.

@elichai (Contributor) commented Aug 27, 2019

  1. Would love a quick explanation on how to run all these benchmarks myself (so I can get both a control before changing and after).

  2. Why did you choose unordered_node_map and not unordered_flat_map?

  3. Do you think it would be interesting in the future to do this for other uses of std::unordered_map in the code? (and potentially even std::map)

@jamesob (Contributor, Author) commented Aug 27, 2019

Would love a quick explanation on how to run all these benchmarks myself (so I can get both a control before changing and after).

You can use bitcoinperf to perform the host-local comparative benchmarks I've posted above. I've committed an example file for this branch here - note that you're going to have to change some of the paths and potentially create some partially synced datadirs to be able to use it.

Why did you choose unordered_node_map and not unordered_flat_map?

In the case of cacheCoins, the values can be fairly large (>55 bytes) since they include scriptPubkeys. The flat map is good for values of small size, but shuffling around medium-to-large objects isn't the intended use. The node map allocates values separately and stores pointers instead. See Folly's docs for more details.

That said, I'm happy to benchmark against a flat_map implementation to see how it compares to node_map.
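To make the distinction concrete, a rough sketch (the comments are my summary of the robin_hood documentation, not quotes from it):

    // Flat map: entries are stored inline in the table's slot array.
    // Great for small, cheaply movable values; every rehash physically
    // moves entries and invalidates references into the map.
    robin_hood::unordered_flat_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> flat;

    // Node map: slots hold pointers to individually allocated entries,
    // so large values (e.g. coins carrying a scriptPubKey) are never
    // moved on rehash and references remain stable.
    robin_hood::unordered_node_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> node;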

Do you think it would be interesting in the future to do this for other uses of std::unordered_map in the code? (and potentially even std::map)

Yeah, I do, though I don't expect any other piece of data in the system will give us such dramatic improvements in performance. If we're going to use this map for arguably the most sensitive data structure in the system, we might as well use it elsewhere. I think a compelling candidate would be mapBlockIndex, but there are probably other good uses as well.

@elichai (Contributor) commented Aug 27, 2019

Sounds good :) Thanks!
I'll also try to benchmark myself; curious about the difference between unordered_node_map and unordered_flat_map over unique pointers.

@JeremyRubin (Contributor)

Ok I've had a chance to more tightly review the code here. I didn't want to say so earlier, because I wasn't sure, but I think it should be possible to modify the cuckoocache code minimally to support cacheCoins (store a key/value and (maybe) get rid of atomics for erasure).

I can put together a PoC of that... there are some other benefits as CuckooCache is specifically a cache and not a map being used as a cache.

This is already a consensus dependency, and has been in since v0.14, so it introduces less new code.

@jamesob (Contributor, Author) commented Aug 28, 2019

Latest round of benchmarks are in from two dedicated bench machines with identical specs (Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz, 8GB memory, SSD).

First I reran the last test I reported above (/usr/bin/time -v ./src/bitcoind -stopatheight=550000 -dbcache=5000) to ensure I could recreate the results. To my surprise, they held (40% speedup, 13% memory savings). To ensure that the wide disparity in performance wasn't due to a difference in the machines, I did a crossover: I swapped branches on each machine and reran the IBD. Same results.

master (a7be1cc):

Elapsed (wall clock) time (h:mm:ss or m:ss): 6:47:37
Maximum resident set size (kbytes): 7425920

this branch:

Elapsed (wall clock) time (h:mm:ss or m:ss): 4:03:51
Maximum resident set size (kbytes): 6436820

This is a remarkable performance improvement that I'm eager to see reproduced by others.

@jamesob (Contributor, Author) commented Aug 28, 2019

I think it should be possible to modify the cuckoocache code minimally to support cacheCoins

@JeremyRubin I'd be very eager to see your PoC for this. In principle I strongly agree that if we can modify existing consensus-supporting data structures and see similar performance benefits, that's the way to go.

CuckooCache is specifically a cache and not a map being used as a cache

I'm curious about what you mean here, and it highlights one of my concerns about repurposing CuckooCache. As you probably know, we can't just evict stuff from cacheCoins - we have to be very careful about persisting cache content (FlushStateToDisk()) before eviction. At the moment it looks like CuckooCache quietly evicts data when it fills up - I'm curious to see how you'll ensure that we can coordinate eviction with a guaranteed flush.

@Sjors (Member) commented Aug 28, 2019

Concept ACK for improving std::unordered_map behavior or swapping it for an alternative if the improvement is >10%. It's nice how this only changes 1 line in coins.h.

I'm able to reproduce the memory gain, but the speed is actually worse for me, on a 2019 MacBook Pro with -dbcache=10000, no indexes and no wallets, syncing over LAN to a single node:

  • IBD before: 3:35 hours plus 11 minute db flush (memory at height 564180: 8.23 GB)
  • IBD after: 5:30 hours plus 20 minute db flush (memory at height 564180: 7.5 GB)

@martinus (Contributor)

Hi all! I am the author of the robin_hood::unordered_map variants, and just saw this PR by accident. Please ask away if you have any questions about my implementation!

I have also done extensive map benchmarks here, which might be interesting: https://github.com/martinus/map_benchmark

I agree with @JeremyRubin that the throwOverflowError() is a potential problem, if there is a way to force collisions in the map. Be aware that new hashmaps (Facebook's F14, Google's SwissTable) also depend on good hashes. They don't throw an exception, but performance can degrade so much that it's not much different from an infinite loop.

@@ -64,7 +64,7 @@ class CCoinsViewTest : public CCoinsView
                map_.erase(it->first);
            }
        }
-       mapCoins.erase(it++);
+       it = mapCoins.erase(it);
Contributor
Be aware that there is a bug in robin_hood map martinus/robin-hood-hashing#42 (as well as in tessil's map Tessil/robin-map#20) where it is possible that entries are iterated over a second time before end() is reached.

@elichai (Contributor) commented Aug 28, 2019

@martinus would love to see a list of projects you know of that depend on and use your implementation :)

I think if people felt this were a stable and vetted implementation, they would feel better about considering its use in consensus-critical code.
Thanks :)

@martinus (Contributor)

@elichai, we have been using the map in several core components of Dynatrace's proprietary monitoring platform for quite a while. That's a very well tested and widely deployed product on multiple platforms (I work at Dynatrace so I might be biased...).

I don't know much about other use cases. Here is a GitHub query, though, to find other uses of the map: https://github.com/search?q=robin_hood_h_included&type=Code It only finds usages that copied the file.

Some usages that I had a look at:

@TheBlueMatt (Contributor)

Right, pulling in Rust for something like this is way too early/aggressive. I'm curious whether it's possible to get similar speedups by taking a hatchet to robinhood so that we can cut it down to something that's more reviewable, since obviously that's the key criterion.

@martinus (Contributor)

It would be very interesting to know where exactly the speedup is coming from. It might be possible to reproduce this speedup while staying with std::unordered_map. E.g. I see two possibilities for improvement:

  • robin_hood map requires less hashing and fewer key comparisons than std::unordered_map. It might be possible to save a lot of hashing & comparisons by simply adding the precalculated hash value to the key (so instead of COutPoint use e.g. std::pair<COutPoint, size_t>; see the sketch after this list). This unfortunately will increase memory usage a bit (but not much, since the key/value pair is already quite large).

  • I use a bulk allocator that never frees any memory, but reuses previously allocated memory. So once the cache has reached its peak size, no allocations will be necessary any more. This can also be achieved with a custom allocator for std::unordered_map.

  • If the speedup comes from the robin_hood style memory layout, that's bad luck - can't be achieved with std::unordered_map.
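A minimal sketch of the first idea, with invented names (HashedOutpoint and PrecomputedHasher are hypothetical, not code from this PR) and without the plumbing a real patch would need:

    // Carry the salted hash with the key so the table reuses it on every
    // probe and rehash instead of recomputing SipHash over the outpoint.
    struct HashedOutpoint {
        COutPoint outpoint;
        size_t cached_hash; // filled once, via a single shared SaltedOutpointHasher

        bool operator==(const HashedOutpoint& other) const
        {
            return outpoint == other.outpoint;
        }
    };

    struct PrecomputedHasher {
        size_t operator()(const HashedOutpoint& key) const noexcept
        {
            return key.cached_hash; // no hashing work at lookup or rehash time
        }
    };

    // Hypothetical drop-in for CCoinsMap, trading ~8 bytes per entry for
    // fewer hash computations.
    using CCoinsMap = std::unordered_map<HashedOutpoint, CCoinsCacheEntry, PrecomputedHasher>;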

@jamesob (Contributor, Author) commented Aug 28, 2019

First -reindex-chainstate bench finished: ~~46.7%~~ 2.14x faster with 1173 MB less memory usage.

master (a7be1cc):

user@bench-ssd-4:~/bitcoin$ /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000; git branch

        Command being timed: "./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000"
        User time (seconds): 7735.97
        System time (seconds): 1081.16
        Percent of CPU this job got: 53%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32:10
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 7563764
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 15004449
        Minor (reclaiming a frame) page faults: 61431800
        Voluntary context switches: 22076432
        Involuntary context switches: 4492039
        Swaps: 0
        File system inputs: 2826327856
        File system outputs: 165988936
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

2019-08-robinhood:

user@bench-ssd-3:~/bitcoin$ /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000; git branch

        Command being timed: "./src/bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000"
        User time (seconds): 5598.92
        System time (seconds): 720.66
        Percent of CPU this job got: 82%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:07:54
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 6361924
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1684815
        Minor (reclaiming a frame) page faults: 73715700
        Voluntary context switches: 7033296
        Involuntary context switches: 1901858
        Swaps: 0
        File system inputs: 502322648
        File system outputs: 177410624
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@laanwj in terms of your concerns, I couldn't agree more. The last thing I want is to be responsible for a consensus failure.

I wonder whether, if these performance gains indeed turn out to be real, we could devote a substantial amount of effort to scrutinizing this map implementation using e.g. fuzz + stress testing and thorough manual review. After some period (months), we could make its use opt-in via configure flag to allow some hardware-constrained installations to use it.

That said, in the grand scheme of things maybe even a 50% speedup in IBD is marginal relative to the risk.

@jamesob (Contributor, Author) commented Aug 28, 2019

My mistake - that's 2.14x faster, not 46.7%.

@martinus (Contributor)

master (a7be1cc):

        User time (seconds): 7735.97
        Major (requiring I/O) page faults: 15004449

2019-08-robinhood:

        User time (seconds): 5598.92
        Major (requiring I/O) page faults: 1684815

I find it a bit fishy that the major page faults with robinhood are 10x less. Could disk cache affect the benchmark? Maybe drop the cache before benchmarking, I think sync; echo 3 > /proc/sys/vm/drop_caches should work

@maflcko (Member) commented Aug 28, 2019

we could make its use opt-in via configure flag to allow some hardware-constrained installations to use it

I disagree about a phased deployment. Either the consensus code is safe to run, correct, and can be deployed to the whole network, or not. There is no in-between. The last thing you want to see happening is that all arm (or whatever "hardware-constrained" means) devices are off consensus.

@martinus (Contributor)

I've benchmarked bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000 on my machine (Intel i7-8700, g++ 9.1, libstdc++), and can confirm a significant speedup with @jamesob's changes. For accurate reproducibility I have disabled frequency scaling and turbo boost for my CPU (with sudo pyperf system tune). Runtime with std::unordered_map is 6471 seconds, and 5339 seconds with the robin_hood map, 21% faster.

I've modified SaltedOutpointHasher and added a custom equals function to count the number of hashes and equal operations performed. To my surprise, on my system std::unordered_map needs less hash calls. It looks like it caches the full hash. (10.7 vs. 13.2 billion hashes). I've benchmarked just the hashing and equals operations, and this should account for about 323 vs. 391 seconds of CPU total.
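The instrumentation described is easy to reproduce; a sketch with invented wrapper names (CountingHasher, CountingEqual), assuming C++17 for the inline statics:

    #include <atomic>
    #include <cstdint>

    // Wrap the hash and key-equality functors with relaxed atomic
    // counters to see how often each container invokes them.
    struct CountingHasher {
        static inline std::atomic<uint64_t> calls{0};
        SaltedOutpointHasher inner;
        size_t operator()(const COutPoint& out) const
        {
            calls.fetch_add(1, std::memory_order_relaxed);
            return inner(out);
        }
    };

    struct CountingEqual {
        static inline std::atomic<uint64_t> calls{0};
        bool operator()(const COutPoint& a, const COutPoint& b) const
        {
            calls.fetch_add(1, std::memory_order_relaxed);
            return a == b;
        }
    };

    // Used as e.g.:
    // std::unordered_map<COutPoint, CCoinsCacheEntry, CountingHasher, CountingEqual>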

Next I'm going to try to use a bulk pool allocator for std::unordered_map and see how that goes.

@martinus (Contributor)

I have tried to use boost::pool_allocator and boost::fast_pool_allocator for the std::unordered_map, but performance was much slower than std::allocator.

Then I adapted my BulkPoolAllocator to be usable for the std::unordered_map, and results are much better.

Benchmark results for bitcoind -reindex-chainstate -stopatheight=550000 -printtoconsole=0 -dbcache=5000 on my i7-8700:

                 unordered_map  robin_hood  unordered_map + BulkPoolAllocator
elapsed h:mm:ss  1:47:51        1:28:59     1:32:56
speedup %        0%             21.2%       16.1%
max RSS kbyte    7553716        6301540     6950432
max RSS %        100%           83.4%       92.0%

So most of the savings can also be achieved by staying with std::unordered_map and adding the BulkPoolAllocator. This should be much easier to integrate since it's less than 200 lines of code. Also, the allocator might make sense for other node-based containers as well.
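For readers unfamiliar with the technique: the core of such an allocator is a free list layered over bulk-allocated blocks, so nodes the map releases are recycled rather than returned to the heap. A simplified, non-production sketch of the idea (not @martinus's actual code; it assumes only one node type ever takes the single-element path, which holds for a single unordered_map, and glosses over alignment corner cases):

    #include <cstddef>
    #include <memory>
    #include <new>
    #include <vector>

    // Shared pool: bulk 1 MiB blocks plus a free list of recycled nodes.
    struct NodePool {
        static constexpr std::size_t BLOCK = 1 << 20;
        std::vector<std::unique_ptr<char[]>> blocks; // freed only on destruction
        std::size_t used = BLOCK;   // forces a first block on first carve
        void* free_list = nullptr;  // linked list threaded through freed nodes

        void* carve(std::size_t sz)
        {
            if (used + sz > BLOCK) {
                blocks.emplace_back(new char[BLOCK]);
                used = 0;
            }
            void* p = blocks.back().get() + used;
            used += sz;
            return p;
        }
    };

    template <typename T>
    class BulkPoolAllocator
    {
    public:
        using value_type = T;

        BulkPoolAllocator() : pool_(std::make_shared<NodePool>()) {}
        template <typename U>
        BulkPoolAllocator(const BulkPoolAllocator<U>& other) noexcept : pool_(other.pool_) {}

        T* allocate(std::size_t n)
        {
            if (n != 1) return static_cast<T*>(::operator new(n * sizeof(T)));
            if (pool_->free_list) { // reuse a previously freed node
                void* p = pool_->free_list;
                pool_->free_list = *static_cast<void**>(p);
                return static_cast<T*>(p);
            }
            return static_cast<T*>(
                pool_->carve(sizeof(T) < sizeof(void*) ? sizeof(void*) : sizeof(T)));
        }

        void deallocate(T* p, std::size_t n) noexcept
        {
            if (n != 1) { ::operator delete(p); return; }
            *reinterpret_cast<void**>(p) = pool_->free_list; // push, never free
            pool_->free_list = p;
        }

        std::shared_ptr<NodePool> pool_;
    };

    template <typename T, typename U>
    bool operator==(const BulkPoolAllocator<T>& a, const BulkPoolAllocator<U>& b) { return a.pool_ == b.pool_; }
    template <typename T, typename U>
    bool operator!=(const BulkPoolAllocator<T>& a, const BulkPoolAllocator<U>& b) { return !(a == b); }

Passed as the allocator parameter of std::unordered_map, the container rebinds it internally to its node type, so single-node allocations hit the pool while the bucket array (an n != 1 allocation) falls through to plain operator new.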

@JeremyRubin (Contributor)

I think we may also be able to eke out some additional performance by making the keys the SHA256 salted hash of the COutPoint rather than the COutPoint itself, and, instead of using SipHash, taking entropy from that salted hash (as done for CuckooCache).

At the expense of one extra sha256 on insert/find (and less siphash), we get to drop 4 of the key bytes and have cheaper hash re-evaluation. Furthermore (this is a little bit platform dependent, but maybe we can detect it or wrap the hash in an is_integral type) we can set __cache_hash_code in the map to false, and save another 8 bytes. Depending on padding issues we may save even more space.

@jamesob (Contributor, Author) commented Sep 3, 2019

Thanks for running those benchmarks, @martinus. I also think we should try calling cacheCoins.reserve(dbcache / avg_coinscacheentry_size) somewhere during startup with both implementations to see how that affects things.
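Something along these lines (the average-entry-size constant is an invented placeholder here, not a measured figure):

    // Pre-size the table so it never rehashes while filling up to the
    // dbcache limit. AVG_ENTRY_BYTES is an assumed rough per-entry cost.
    constexpr size_t AVG_ENTRY_BYTES = 100;
    const size_t dbcache_bytes = gArgs.GetArg("-dbcache", 450) << 20; // MiB -> bytes
    cacheCoins.reserve(dbcache_bytes / AVG_ENTRY_BYTES);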

On the chance that we still want to consider using the robin_hood map, I've stripped about 500 lines out of robinhood.h (mostly relating to unordered_flat_map which I don't think we have much use for). I refer to this stripped down version as robinhood-slim in the benchmarks below (and it's now the tip of this branch).


I've been running benchmarks pretty continuously for the last few days and here's a summary of the SSD-based timings:

[Image: summary table of SSD-based reindex/IBD timings.]

Unsurprisingly it seems that performance improvement is most pronounced with a large dbcache. Presumably this is because as the number of entries in cacheCoins grows, std::unordered_map's performance starts to degrade, robin_hood starts to shine, or both.

I found it interesting that over numerous runs, a large dbcache (7000) ended up being slower across the board than a smaller dbcache (6000 or 3000).

In general, the weird timing disparities in these results give me pause. I'm benching on dedicated, non-virtualized hardware that serves no purpose other than benchmarking, and I'm dropping caches before runs (sudo /sbin/swapoff -a; sudo /sbin/sysctl vm.drop_caches=3;). Yet we see weird outliers like the 500-minute robinhood reindex at dbcache=5000 - which I didn't believe initially but then reproduced. Why would a dbcache=7000 run take longer than a dbcache=3000 run? Is the runtime of -reindex-chainstate somehow weirdly determined by the dbcache value during IBD?

I'm quite confused by some aspects of these results.

Anyway, here's the memory story:

[Image: summary table of peak memory usage across runs.]

We reliably see significant memory savings from the robinhood map.


I also ran -reindex-chainstate to 550,000 on equivalent machines with HDDs. The disparity between master and this branch has been almost unbelievable, but I've reliably reproduced it even after doing machine crossover. The following table is characteristic of the -reindex-chainstates I've done on HDD:

branch     dbcache  time                       memory
robinhood  5000     8:46:59                    6.38 GB
master     5000     21:32:58 (stopped early)   7.47 GB

Similarly, IBD to the same height with the same dbcache took 47 hours on master (!?) and 17 hours on this branch.

I'm just kind of assuming these results are bogus somehow, although I'm very confused by the fact that I can reliably reproduce them even after swapping machines, but maybe someone with a spinning disk can confirm or deny that there's a dramatic difference in performance between this branch and master.

@maflcko (Member) commented Sep 3, 2019

Thanks for doing those additional benchmarks @jamesob. I think that confirms the finding by @Sjors that the sync is slowed down with a high dbcache in #16718 (comment)

@jamesob (Contributor, Author) commented Sep 3, 2019

I think that confirms the finding by @Sjors that the sync is slowed down with a high dbcache in #16718 (comment)

Note though that I differ from Sjors' findings in that this branch was ~2x faster than master for dbcache=7000 (reindex-chainstate), whereas in his results this branch was slower than master for a large dbcache (10000) during IBD.

I lightly suspect that @Sjors' slowness on this branch relative to master was due to some kind of IBD-inherent variability since on every other benchmark run I've seen, this branch performs better than master.

Sjors' total IBD runtime (<5 hours) was also less than my reindex-chainstate runtime (up to 8 hours), which I attribute to his probably-much-better hardware.

@martinus (Contributor) commented Sep 3, 2019

In https://github.com/martinus/bitcoin/tree/2019-08-bulkpoolallocator I'm currently working on a PR where I've extracted (and completely rewritten...) the pool allocator so it is usable in std::unordered_map. Unfortunately the standard allocator API is trickier than I expected, but I think I've got it now. Its performance is still as advertised in #16718 (comment). I still have to clean up the code and add some unit tests before creating a PR, probably in the next few days.

@luke-jr (Member) commented Sep 20, 2019

Surprised nobody's pointed out that using std::unordered_map basically gives us an unknown implementation right now... which probably is best avoided in consensus code, when possible. The fact that we can get performance improvements too really suggests to me that switching to a specific implementation would be a good idea. So concept ACK.

Agree with need for intense care and testing to avoid a surprise protocol change. Bundling like LevelDB seems best if it's a lot of code.

As far as issues with bad hashes - we don't want to enable intentionally-crafted transactions to screw over performance, so might need to re-hash with some seed?
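For what it's worth, the hasher cacheCoins already uses is seeded per process, which is intended to prevent exactly that kind of attacker-crafted collision. Roughly (paraphrasing coins.h from memory, so details may differ):

    // SipHash keyed with two random 64-bit salts drawn at startup, so an
    // attacker cannot predict bucket placement on any given node.
    class SaltedOutpointHasher
    {
    private:
        const uint64_t k0, k1; // per-process random salts

    public:
        SaltedOutpointHasher()
            : k0(GetRand(std::numeric_limits<uint64_t>::max())),
              k1(GetRand(std::numeric_limits<uint64_t>::max())) {}

        size_t operator()(const COutPoint& id) const
        {
            return SipHashUint256Extra(k0, k1, id.hash, id.n);
        }
    };

The open question is whether robin_hood's probing survives worst-case inputs even with a salted SipHash, which is the concern @JeremyRubin raised above.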

@ryanofsky (Contributor)

Status of this?

  • Does using std::unordered_map with a custom allocator (#16718 (comment)) still seem like it could be a good approach?

  • Are the code changes in this PR more or less complete, or is there more robinhood code streamlining to be done?

  • Are there still concerns about performance with smaller dbcaches and more constrained hardware with these implementations?

  • Are there still benchmark reproducibility issues? Are these affecting other setups or just James' benchmarks?

@martinus (Contributor)

Status of this?

I think using a custom allocator from #16801 is a much safer choice than switching to a different map implementation. Writing a correct and fast hashmap is orders of magnitude more difficult than a custom allocator. This PR was valuable, though, in that it highlighted what could be possible. Most of the performance benefit can be achieved with the custom allocator as well.

Concerning benchmark reproducibility, I believe @jamesob has stumbled upon a pretty severe performance bug, but we don't know yet if this is due to the new code or maybe the optimizations trigger an existing bug more often: #17060 (comment)

@jamesob (Contributor, Author) commented Nov 14, 2019

Some new (and stabilized) benchmarks here. Following the same methodology as in similar PRs, we see a 6% difference in runtime with ~400MB less memory usage for a reindex up to 550,000 (bench/robinhood.1 vs. bench/master.1):

host         tag                      time       time% maxmem  cpu%  dbcache
bench-ssd-2  robinhood.1              8:43:36    0.94  5304.37MB 356%  4000MB
bench-ssd-2  robinhood.1              8:44:29    0.94  5309.67MB 355%  4000MB
bench-ssd-2  master.1                 9:15:02    1.00  5776.95MB 344%  4000MB
bench-ssd-2  master.1                 9:15:26    1.00  5762.75MB 343%  4000MB

bench-ssd-3  robinhood.1              8:44:41    0.94  5293.84MB 355%  4000MB
bench-ssd-3  robinhood.1              8:46:29    0.94  5309.91MB 354%  4000MB
bench-ssd-3  master.1                 9:18:07    1.00  5780.84MB 342%  4000MB
bench-ssd-3  master.1                 9:15:53    1.00  5773.25MB 343%  4000MB

bench-ssd-4  robinhood.1              8:46:13    0.95  5279.41MB 354%  4000MB
bench-ssd-4  robinhood.1              8:45:36    0.94  5294.21MB 354%  4000MB
bench-ssd-4  master.1                 9:14:42    1.00  5782.72MB 344%  4000MB
bench-ssd-4  master.1                 9:16:17    1.00  5762.92MB 343%  4000MB

bench-ssd-5  robinhood.1              8:42:02    0.94  5277.88MB 357%  4000MB
bench-ssd-5  robinhood.1              8:41:34    0.94  5302.08MB 357%  4000MB
bench-ssd-5  master.1                 9:12:43    1.00  5773.20MB 345%  4000MB
bench-ssd-5  master.1                 9:12:08    1.00  5769.65MB 345%  4000MB

bench-strong robinhood.1              14:16:20   0.94  5302.65MB 563%  4000MB
bench-strong master.1                 15:07:08   1.00  5804.43MB 543%  4000MB
bench-strong master.1                 15:08:44   1.00  5722.19MB 543%  4000MB

Interesting, but probably not a sufficient benefit to justify adding an extra 1500 lines of consensus-critical code. I'm going to close this one and perhaps someone can revisit this later.

@jamesob closed this Nov 14, 2019
@elichai mentioned this pull request May 5, 2020
@bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021