Improve speed, memory efficiency with alternate hashmap #16718
Conversation
Yep, that would be interesting.
I'd be surprised to see such improvements here.
Either way it should include tests. |
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Conflicts
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
@jamesob Interesting! Thanks for providing benchmarks! Which implementation of It would be interesting to see this benchmarked against the three major implementations of |
I've been benching on Ubuntu and Debian systems with |
Nice. This is the kind of PR I enjoy reading / reviewing / testing. I did a basic benchmark on my macOS machine: #16718 6f9882c - 48m49s Will run something to a more intense height later. |
Some additional context from further benchmarking:
Local tests
I did a local IBD (not real network) comparison up to height 510,000. We see a speedup of roughly 17% coupled with memory savings of roughly 13.5%. Given the vertical shift in the real memory usage curve on the robinhood branch (pictured below), I suspect there's a problem with my implementation of the
2019-08-robinhood vs. master (absolute)
2019-08-robinhood vs. master (relative)
IBD from the real network
I let IBD run overnight up to height 550,000 using peers from the real network for both master and this branch. The differences in speed were not as compelling here (a 6% gain for the robinhood branch), but memory differences were still pronounced: 14.6% savings for the robinhood branch (6502MB vs. 7615MB, both at a dbcache of 5000).
The "real" gains here dampen my enthusiasm somewhat, but the memory story is compelling. Curious for others to weigh in on whether these differences merit the increased risk of bringing in a non- |
Force-pushed from 6f9882c to 33b9786
Consider my enthusiasm renewed because the latest "real" IBD runs came back showing a hard-to-believe 40% speedup. Sync to 550,000 on different machines but identical hardware. Still a 12% memory savings. Obviously it's hard to gauge variance when syncing from the real network.
|
Concept ack, NACK this implementation. From the docs:
There are some ways that we could maybe get some better performance with std::unordered_map; I will take a look and give some feedback later. |
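As general background on what such tweaks can look like while keeping std::unordered_map, here is a minimal, generic sketch (my illustration; not necessarily the specific ideas meant above): pre-reserving the bucket array and raising the load factor are the usual knobs for cutting rehashing work and per-bucket overhead.

```cpp
// Generic std::unordered_map tuning sketch; illustrative only.
#include <unordered_map>

int main()
{
    std::unordered_map<long, long> m;
    m.max_load_factor(0.9f); // fuller buckets: fewer buckets, less memory, slightly longer probes
    m.reserve(1000000);      // size the bucket array up front so bulk inserts never rehash
    for (long i = 0; i < 1000000; ++i) {
        m.emplace(i, i * 2);
    }
    return 0;
}
```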
Thanks for the close look @JeremyRubin. This is exactly the kind of change I'd be really worried about introducing. The performance difference will have to be really compelling (and verified by a few people aside from myself) to consider using a non- I wonder if, assuming we did find a non-std implementation that we're comfortable with, we should initially leave its use as an opt-in configure flag that, e.g., installations running on lower-power hardware can make use of. |
Alternatively, we could look at forking |
Concept ACK for replacing the hashmap, |
@jamesob maybe have a look at the experimental rust fork? You could just wrap the Rust hashmap (with at least Rust v1.35, or use the hashbrown crate). Rust's hashmap is based on SwissTable. Replacing sensitive data structures with Rust seems like a good use case given long term goals, and can be used for low-resource nodes experimentally? |
@JeremyRubin I do hope I'll have time to test the C++ SwissTable implementation and compare benchmarks https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal/raw_hash_set.h |
I agree with @elichai - I think adding Rust as a dependency for one of the most consensus-sensitive data structures is way too big a leap. A few extra thousand lines of heavily-scrutinized cpp is about the biggest risk I'm willing to take personally, and that's assuming a ton of review, testing, and maybe a gradual deployment process. |
|
You can use bitcoinperf to perform the host-local comparative benchmarks I've posted above. I've committed an example file for this branch here - note that you're going to have to change some of the paths and potentially create some partially synced datadirs to be able to use it.
In the case of That said, I'm happy to benchmark against a
Yeah, I do, though I don't expect any other piece of data in the system will give us such dramatic improvements in performance. If we're going to use this map for arguably the most sensitive data structure in the system, we might as well use it elsewhere. I think a compelling candidate would be |
Sounds good :) Thanks! |
Ok I've had a chance to more tightly review the code here. I didn't want to say so earlier, because I wasn't sure, but I think it should be possible to modify the cuckoocache code minimally to support cacheCoins (store a key/value and (maybe) get rid of atomics for erasure). I can put together a PoC of that... there are some other benefits as CuckooCache is specifically a cache and not a map being used as a cache. This is already a consensus dependency, and has been in since v0.14, so it introduces less new code. |
The latest round of benchmarks is in from two dedicated bench machines with identical specs (Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz, 8GB memory, SSD). First I reran the last test I reported above (
master (a7be1cc):
this branch:
This is a remarkable performance improvement that I'm eager to see reproduced by others. |
@JeremyRubin I'd be very eager to see your PoC for this. In principle I strongly agree that if we can modify existing consensus-supporting data structures and see similar performance benefits, that's the way to go.
I'm curious about what you mean here, and it highlights one of my concerns about repurposing CuckooCache. As you probably know, we can't just evict stuff from cacheCoins - we have to be very careful about persisting cache content ( |
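To make the eviction concern concrete, here is a rough, hypothetical write-back-cache sketch (invented names; this does not mirror CCoinsViewCache): entries modified in memory are dirty and must be flushed to the backing store before they can be dropped, which is exactly what a pure lookup cache like CuckooCache never has to worry about.

```cpp
// Hypothetical write-back cache; names and layout are invented for illustration.
#include <string>
#include <unordered_map>

struct Entry {
    std::string value;
    bool dirty = false; // modified in memory, not yet persisted
};

struct BackingStore {
    std::unordered_map<int, std::string> persisted;
    void Write(int key, const std::string& value) { persisted[key] = value; }
};

// Before any entry can be evicted, dirty state has to reach the backing store.
void FlushAndEvict(std::unordered_map<int, Entry>& cache, BackingStore& store)
{
    for (auto it = cache.begin(); it != cache.end();) {
        if (it->second.dirty) store.Write(it->first, it->second.value);
        it = cache.erase(it);
    }
}
```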
Concept ACK for improving
I'm able to reproduce the memory gain, but the speed is actually worse for me, on a 2019 MacBook Pro with
|
Hi all! I am the author of the
I have also done extensive map benchmarks here, which might be interesting: https://github.com/martinus/map_benchmark
I agree with @JeremyRubin that the |
@@ -64,7 +64,7 @@ class CCoinsViewTest : public CCoinsView
                    map_.erase(it->first);
                }
            }
-           mapCoins.erase(it++);
+           it = mapCoins.erase(it);
Be aware that there is a bug in robin_hood map martinus/robin-hood-hashing#42 (as well as in tessil's map Tessil/robin-map#20) where it is possible that entries are iterated over a second time before end()
is reached.
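For reference, a minimal generic sketch of the erase-while-iterating idiom the diff above switches to (my illustration, not Bitcoin Core code): advance the loop with the iterator that erase() returns instead of post-incrementing before the erase.

```cpp
// Erase elements while iterating over an unordered container.
#include <unordered_map>

void ErasePending(std::unordered_map<int, bool>& m)
{
    for (auto it = m.begin(); it != m.end();) {
        if (it->second) {
            it = m.erase(it); // erase() returns the iterator following the erased element
        } else {
            ++it;
        }
    }
}
```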
@martinus would love to see some list of projects you know are depending on and using your implementation :) I think if people would feel this is a stable and vetted implementation they would feel better about considering using it in consensus-critical code. |
@elichai, we have been using the map in several core components in Dynatrace's proprietary monitoring platform for quite a while. That's a very well tested and widely deployed product on multiple platforms (I work at Dynatrace so I might be biased..). I don't know much about other use cases. Here is a GitHub query, though, to find other uses of the map: https://github.com/search?q=robin_hood_h_included&type=Code It only finds usages that copied the file though. Some usages that I had a look at:
|
Right, pulling in Rust for something like this is way too early/aggressive. I'm curious if it's possible to get similar speedups by taking a hatchet to robinhood so that we can cut it down to something that's more reviewable, since obviously that's the key criterion. |
It would be very interesting to know where exactly the speedup is coming from. It might be possible to reproduce this speedup by staying with
|
First master (a7be1cc):
2019-08-robinhood:
@laanwj in terms of your concerns, I couldn't agree more. The last thing I want is to be responsible for a consensus failure. If these performance gains indeed turn out to be real, I wonder whether we could devote a substantial amount of effort to scrutinizing this map implementation using e.g. fuzz + stress testing and thorough manual review. After some period (months), we could make its use opt-in via configure flag to allow some hardware-constrained installations to use it. That said, in the grand scheme of things maybe even a 50% speedup in IBD is marginal relative to the risk. |
My mistake - that's 2.14x faster, not 46.7%. |
I find it a bit fishy that the major page faults with robinhood are 10x less. Could disk cache affect the benchmark? Maybe drop the cache before benchmarking, I think |
I disagree about a phased deployment. Either the consensus code is safe to run, correct, and can be deployed to the whole network, or not. There is no in-between. The last thing you want to see happening is that all |
I've benchmarked
I've modified
Next I'm going to try to use a bulk pool allocator for std::unordered_map and see how that goes. |
I have tried to use
Then I adapted my BulkPoolAllocator to be usable for the
Benchmark results for
So most of the savings can also be achieved by staying with |
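As a rough illustration of the "keep std::unordered_map, change how its nodes are allocated" direction, here is a sketch using the C++17 standard-library pool resource; this is not martinus's BulkPoolAllocator or the allocator from #16801, just the same general idea of carving map nodes out of pooled chunks instead of doing one heap allocation per node.

```cpp
// Pool-allocating unordered_map nodes via <memory_resource>; sketch only.
#include <memory_resource>
#include <unordered_map>

int main()
{
    std::pmr::unsynchronized_pool_resource pool; // recycles node-sized chunks
    std::pmr::unordered_map<long, long> m{&pool};

    m.reserve(100000);
    for (long i = 0; i < 100000; ++i) {
        m.emplace(i, i); // node allocations are served from the pool
    }
    return 0;
}
```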
I think we may also be able to eke out some additional performance by making the keys the sha256 salted hash of the COutpoint rather than the COutpoint itself, and instead of using siphash taking entropy from that salted hash (as done for cuckoocache). At the expense of one extra sha256 on insert/find (and less siphash), we get to drop 4 of the key bytes and have cheaper hash re-evaluation. Furthermore (this is a little bit platform dependent, but maybe we can detect it or wrap the hash in an is_integral type) we can set __cache_hash_code in the map to false, and save another 8 bytes. Depending on padding issues we may save even more space. |
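A rough sketch of the "hash once, then reuse those bytes" idea with hypothetical stand-in types (this is not how SaltedOutpointHasher or CuckooCache is actually implemented): the key carries a salted digest computed once at insertion, and the map's hasher just slices machine-word-sized entropy out of it, so rehashing never re-runs an expensive hash over the outpoint.

```cpp
// Hypothetical key/hasher pair built around a precomputed salted digest.
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>

struct HashedOutpointKey {
    std::array<unsigned char, 32> digest; // e.g. SHA256(salt || txid || index), computed once
    bool operator==(const HashedOutpointKey& o) const { return digest == o.digest; }
};

struct DigestSliceHasher {
    size_t operator()(const HashedOutpointKey& k) const
    {
        size_t h;
        std::memcpy(&h, k.digest.data(), sizeof(h)); // digest bytes are already uniformly distributed
        return h;
    }
};

using HashedCoinsMap = std::unordered_map<HashedOutpointKey, int, DigestSliceHasher>;
```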
Force-pushed from 33b9786 to 8739248
Thanks for running those benchmarks, @martinus. I also think we should try calling
On the chance that we still want to consider using the
I've been running benchmarks pretty continuously for the last few days and here's a summary of the SSD-based timings:
Unsurprisingly it seems that performance improvement is most pronounced with a large dbcache. Presumably this is because as the number of entries in cacheCoins grows,
I found it interesting that over numerous runs, a large dbcache (7000) ended up being slower across the board than a smaller dbcache (6000 or 3000). In general, the weird timing disparities from these results give me pause. I'm benching on dedicated, non-virtualized hardware that is for no purpose other than benchmarking, and I'm dropping caches before runs (
I'm quite confused by some aspects of these results. Anyway, here's the memory story:
We reliably see significant memory savings from the robinhood map.
I also ran -reindex-chainstate to 550,000 on equivalent machines with HDDs. The disparity between master and this branch has been almost unbelievable, but I've reliably reproduced it even after doing machine crossover. The following table is characteristic of the -reindex-chainstates I've done on HDD:
Similarly, IBD to the same height with the same dbcache took 47 hours on master (!?) and 17 hours on this branch. I'm just kind of assuming these results are bogus somehow, although I'm very confused by the fact that I can reliably reproduce them even after swapping machines, but maybe someone with a spinning disk can confirm or deny that there's a dramatic difference in performance between this branch and master. |
Thanks for doing those additional benchmarks @jamesob. I think that confirms the finding by @Sjors that the sync is slowed down with a high dbcache in #16718 (comment) |
Note though that I differ from Sjors' findings in that this branch was ~2x faster than master for dbcache=7000 (reindex-chainstate), whereas in his results this branch was slower than master for a large dbcache (10000) during IBD. I lightly suspect that @Sjors' slowness on this branch relative to master was due to some kind of IBD-inherent variability since on every other benchmark run I've seen, this branch performs better than master. Sjors' total IBD runtime (<5 hours) was also less than my reindex-chainstate runtime (up to 8 hours), which I attribute to his probably-much-better hardware. |
In https://github.com/martinus/bitcoin/tree/2019-08-bulkpoolallocator I'm currently working on a PR where I've extracted (and completely rewritten...) the pool allocator so it is usable in std::unordered_map. Unfortunately the standard allocator API is trickier than I expected, but I think I got it now. Its performance is still as advertised in #16718 (comment). I still have to clean up the code and add some unit tests before creating a PR, probably in the next few days. |
Surprised nobody's pointed out that using
Agree with the need for intense care and testing to avoid a surprise protocol change. Bundling like LevelDB seems best if it's a lot of code. As far as issues with bad hashes: we don't want to enable intentionally-crafted transactions to screw over performance, so we might need to re-hash with some seed? |
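On the "re-hash with some seed" point (the thread above already notes the existing code salts with SipHash), here is a generic illustration of mixing per-instance randomness into the hash so that colliding keys cannot be precomputed offline; the mixing constants below are the splitmix64 finalizer, and none of this is Bitcoin Core's actual hasher.

```cpp
// Seeded hasher sketch; not Bitcoin Core's SaltedOutpointHasher.
#include <cstddef>
#include <cstdint>
#include <random>

struct SeededHasher {
    uint64_t seed = std::random_device{}(); // drawn once when the hasher (i.e. the map) is created

    size_t operator()(uint64_t key) const
    {
        uint64_t x = key ^ seed;
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return static_cast<size_t>(x);
    }
};
```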
Status of this?
|
I think using a custom allocator from #16801 is a much safer choice than switching to a different map implementation. Writing a correct and fast hashmap is orders of magnitude more difficult than a custom allocator. This PR was good, though, in that it has highlighted what could be possible. Most of the performance benefit can be achieved with the custom allocator as well.
Concerning benchmark reproducibility, I believe @jamesob has stumbled upon a pretty severe performance bug, but we don't know yet if this is due to the new code or maybe the optimizations trigger an existing bug more often: #17060 (comment) |
Some new (and stabilized) benchmarks here. Following the same methodology as in similar PRs, we see a 6% difference in runtime with ~400MB less memory usage for a reindex up to 550,000 (
Interesting, but probably not a sufficient benefit to justify adding an extra 1500 lines of consensus-critical code. I'm going to close this one and perhaps someone can revisit this later. |
tl;dr
I try swapping out std::unordered_map for a faster third-party implementation where it matters, and see something like a 15% speedup for initial block download coincident with a 19.3% reduction in memory usage.
When profiling initial block download, it's evident that a decent chunk of time on the critical path is spent in CCoinsViewCache, the data structure responsible for the in-memory representation of the UTXO set.
Profile of the msghand thread at height ~300,000.
The essential data member of the cache, cacheCoins, is a std::unordered_map that holds as many UTXOs as can fit into memory given the dbcache setting. While this map implementation is definitely preferable to std::map (which retains order and has O(log(n)) lookups instead of O(1)), both stdlib maps are subpar relative to non-std alternatives.
After seeing cacheCoins as a bottleneck, I decided to try swapping one of these alternative implementations in to see if there is any benefit. It looks like there is.
For 20,000 blocks at a representative part of the chain, we see something like a 15% speedup with a 19.3% reduction in memory usage (dbcache was set to 3000).
The CCoinsCaching benchmark shows improvement, but not as dramatically as IBD does:
Obviously adding 2000 lines of new consensus-critical code isn't appealing, but if those numbers hold for the full IBD process I think there might be a compelling case to be made for the added code.
Implementations considered
Of the few best-looking options, I picked the one with the most succinct implementation and fewest dependencies, robin_hood::unordered_node_map (it's just a header file). I also looked at Facebook's F14NodeMap and Abseil's node_hash_map, but both required too much elbow grease (for now) to integrate here without pulling in a whole lot of code. A schematic sketch of the swap follows below.
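Schematically, the change boils down to swapping the map type behind the coins cache. A simplified sketch with stand-in key/value/hasher types (not the real COutPoint/CCoinsCacheEntry/SaltedOutpointHasher, and assuming the vendored robin_hood.h header) might look like this:

```cpp
// Type-level sketch of the swap; real types live in Bitcoin Core's coins code.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

#include "robin_hood.h" // single-header third-party map, assumed vendored in-tree

struct OutpointKey { // stand-in for COutPoint
    uint64_t txid_prefix;
    uint32_t index;
    bool operator==(const OutpointKey& o) const
    {
        return txid_prefix == o.txid_prefix && index == o.index;
    }
};

struct OutpointHasher { // stand-in for SaltedOutpointHasher
    size_t operator()(const OutpointKey& k) const
    {
        return std::hash<uint64_t>{}(k.txid_prefix ^ (uint64_t{k.index} << 32));
    }
};

struct CacheEntry { int coin_data; }; // stand-in for CCoinsCacheEntry

// Before: node-based stdlib map.
// using CoinsMap = std::unordered_map<OutpointKey, CacheEntry, OutpointHasher>;

// After: third-party node map with the same template parameters.
using CoinsMap = robin_hood::unordered_node_map<OutpointKey, CacheEntry, OutpointHasher>;
```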
Next steps
I'm going to be running more comprehensive benchmarks, including a near-full IBD from the network, to see if these improvements hold.
I'm also interested in benching this along with a similar change to mapBlockIndex (currently a std::unordered_map).
If we decide to include this or another hashmap implementation, should we do a full fork inclusion a la leveldb or should we try to slim down the implementation to a single header file as is done here (but more)?