Add allocator for node based containers #22702
Conversation
Force-pushed from 4a330e5 to cde8a99
Concept ACK. The numbers look very promising! Seems like this could be one of those rare optimization opportunities actually worth pursuing :)
I'm curious what the impact would be on the performance if you allocated N chunks at a time, always, to fill a page. E.g., if you want to allocate one node, always allocate something like Cool properties of doing so: Better cache alignment on items added around the same time, fewer things to keep in the allocated chunks list, if you were to sort the list by pointer (perhaps lazily, when we're not doing anything else) you would get something with good cache alignment again. |
I've run the benchmark with
Do you have specific code for that? The changes sounded a bit more complex than just aligning the allocator. Did you divide the allocation into N free list chunks? Or just over-allocate? |
So for a node size of 104 bytes and a page size of 4096, this should allocate a page-aligned block with 39*104 = 4056 bytes.
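The arithmetic above can be sanity-checked with a tiny sketch (the helper names are illustrative, not from the PR):

```cpp
#include <cstddef>

// How many fixed-size nodes fit into one memory page, and how many bytes
// of the page those nodes occupy (helper names are illustrative).
constexpr std::size_t NodesPerPage(std::size_t node_size, std::size_t page_size) {
    return page_size / node_size;
}

constexpr std::size_t BytesUsed(std::size_t node_size, std::size_t page_size) {
    return NodesPerPage(node_size, page_size) * node_size;
}

// For 104-byte nodes in a 4096-byte page: 39 nodes, 39 * 104 = 4056 bytes,
// leaving 40 bytes of the page unused.
static_assert(NodesPerPage(104, 4096) == 39);
static_assert(BytesUsed(104, 4096) == 4056);
```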
Some fresh benches in (high dbcache). Continuing to see fairly substantial time/memory improvements with this change. commands index
#22702 vs. $mergebase (absolute)
#22702 vs. $mergebase (relative)
Thanks for the benchmark! Good to see it's 8.4% faster. I think the only reason the memory is lower is because the
Hmm, yeah, it's interesting to think through the alignment issues. I think I understand the code a bit better now, so it makes sense that most of your allocations are already aligned. One thing that strikes me is that you have these 104-byte chunks (13 uint64_t s). You might get better properties if you pad them out to the nearest 32/64/128 bytes (memory overhead 24 bytes :( ), because then you'll never have nodes across cache lines. Memory usage is a bit worse, but cache performance may be better this way? https://www.usenix.org/legacy/publications/library/proceedings/als00/2000papers/papers/full_papers/sears/sears_html/index.html for reference.
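The padding idea amounts to rounding the node size up to a multiple of the cache line. A minimal sketch, assuming 64-byte cache lines (constants and names are hypothetical):

```cpp
#include <cstddef>

// Round a node size up to the next multiple of the cache line size so that
// no node straddles two cache lines (assuming the underlying allocation is
// itself cache-line aligned). Constants and names are illustrative.
constexpr std::size_t kCacheLineSize = 64;

constexpr std::size_t PaddedNodeSize(std::size_t node_size) {
    return (node_size + kCacheLineSize - 1) / kCacheLineSize * kCacheLineSize;
}

// A 104-byte node is padded to 128 bytes: 24 bytes of overhead per node.
static_assert(PaddedNodeSize(104) == 128);
static_assert(PaddedNodeSize(64) == 64);
```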
Might also be interesting to bump the CScript prevector size from 28 to 35 (7 bytes). This would help with the above as well, because if we're 104/128-byte aligned we have the capacity, and then we'd also save a prevector allocation (which means the vector costs 25 bytes + 34-35 bytes) for a non-P2SH segwit v0-v1 output, and avoids an extra indirection. This might have very broad system impacts because prevector is a leaky optimization, sadly.
Concept/Approach ACK. Interesting work! Some initial review feedback at jonatack@df35520
I've benchmarked the performance of I've oriented myself on @jamesob's parameters and synced the blocks 500.000 to 540.000 from a localhost peer three times for the PR and the MB. I used the default dbcache (shouldn't matter for my measurements as cache flushes don't happen while inside Total time spent in
My results confirm that less time is spent in To rule out temperature effects on performance, these measurements could be redone with a properly cooled server/workstation. EDIT: I did another three runs of PR and MB with calling
@JeremyRubin I actually tried to bump it up to 128 bytes to see what that does, but I didn't see any major difference. The benchmark got maybe 1% slower, but maybe the benchmark is also not very realistic. I also tried to reduce the size by 8 bytes (which can be done by making use of the

@jonatack Thanks for the review! I'm not sure though how I should best incorporate your changes? Should I copy them and add a

@0xB10C Thanks for the benchmark! You might get a bit more stable result with pyperf: I always use
Thanks! I'd say squash in the changes you want to keep. You can add an empty line followed by |
Bench results with more modest dbcache (800): 6% speedup, 6.7% less memory usage.
Force-pushed from 3574aee to 22832f6
I've now updated the code with correct memory estimation, and benchmarked it. I had a fully synchronized node, and ran `/usr/bin/time -v bitcoind -datadir=/run/media/martinus/big/bitcoin/db -dbcache=5000 -assumevalid=00000000000000000002a23d6df20eecec15b21d32c75833cce28f113de888b7 -reindex-chainstate -printtoconsole=0 -stopatheight=690000`. The PR ran quite a bit faster than master, 20.8%, with practically the same memory usage. Here is some interesting data from
Here are some graphs that I created from parsing the
Due to the different allocation behavior (it allocates one big chunk of memory right as the first Coin is added), I also had to update the checks in |
Force-pushed from 22832f6 to 6fa1a72
Force-pushed from 698dd3f to 25cf765
Unfortunately Microsoft's implementation of unordered_map behaves quite a bit differently from libc++ and libstdc++, so the heuristic that I'm using to detect the node size doesn't work. I've added a WIP to the header until I've fixed this. Most likely I'll ditch the heuristic completely and properly calculate the correct size. I'm already doing that in the tests anyway (except for Windows).
Force-pushed from 25cf765 to c7cc614
This allocator is potentially useful for all node-based containers that are heavily used, like `std::list`, `std::map`, and `std::unordered_map`. The allocator uses pools to be more memory efficient and faster than the default std::allocator. Currently this provides specializations for `std::unordered_map`; other specializations can be added as needed. This updates the `memusage::DynamicUsage` calculation to take the node_allocator into account. Also, the dynamic usage for standard `std::unordered_map` is made more accurate with the new node_size calculation. Rule of 5, drop unused function, spelling fixups and more code cleanup by jonatack; countless suggestions & improvements by ryanofsky. Co-authored-by: Jon Atack <jon@atack.com> Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
In my benchmarks, using this pool allocator for CCoinsMap gives about 20% faster `-reindex-chainstate` with -dbcache=5000, with practically the same memory usage. The change in max RSS was 0.3%. Note that the memory usage behavior is now different, since MemoryResource does not free any memory when the map is cleared. Instead, the memory is held in a freelist. This now requires that `CCoinsViewCache::Flush` call `ReallocateCache` so memory is actually freed. If that didn't happen, memory usage would not decrease and Flush() would be triggered again soon after. Also, the `validation_flush_tests` tests need to be updated because memory allocation is now done in large pools instead of one node at a time, so the limits need to be updated accordingly.
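As a rough illustration of the freelist behavior described above — this is a minimal sketch, not the PR's MemoryResource, and all names are hypothetical — deallocated nodes are threaded onto an intrusive singly linked list and reused by later allocations, so nothing is returned to the OS while the pool lives:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal freelist sketch (not the PR's implementation; names are
// hypothetical). Freed nodes store the next-pointer in their own bytes,
// forming an intrusive singly linked list that later allocations pop from.
class TinyPool {
public:
    explicit TinyPool(std::size_t node_size) : node_size_(node_size) {
        assert(node_size_ >= sizeof(void*));  // need room for the list pointer
    }
    void* Allocate() {
        if (free_list_ != nullptr) {          // reuse a previously freed node
            void* node = free_list_;
            free_list_ = *static_cast<void**>(free_list_);
            return node;
        }
        chunks_.emplace_back(node_size_);     // otherwise carve out fresh memory
        return chunks_.back().data();
    }
    void Deallocate(void* node) {             // push onto the freelist; no OS free
        *static_cast<void**>(node) = free_list_;
        free_list_ = node;
    }
    std::size_t NumChunks() const { return chunks_.size(); }

private:
    std::size_t node_size_;
    void* free_list_ = nullptr;
    std::vector<std::vector<std::byte>> chunks_;
};
```

This also shows why `Flush()` must explicitly reallocate the container: clearing the map only grows the freelist, it never shrinks `chunks_`.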
Removes the call to CoinsTip().ReallocateCache() in ResizeCoinsCaches because this is now done in CCoinsViewCache::Flush(). Also makes ReallocateCache private because it's not needed outside any more.
Force-pushed from 472f5b0 to 95c53bf
Rebased to resolve conflict in
(Will re-review this PR soon.) |
I have been testing this on a busy node (~400 connections) for more than a month. I have noticed no crashes, no memory leaks, and no other unexpected behavior. |
I see a number of unaddressed comments here by @ryanofsky. What is the status of those? |
I've now gone through all unaddressed comments and marked all as resolved that I consider fully resolved. As far as I can say only nits remain |
Still no issues! Still, to be honest, I think this change is a bit scary, and I have a hard time signing off on it. The reason is that it adds quite a lot of complexity, and depends on quite some C++ magic and cleverness (e.g. relying on |
I guess it is; it got a lot more complex than I initially anticipated. I don't think it can be made much simpler. I'm pretty sure it works, but obviously there's no 100% guarantee. So I'm fine with it not being merged; at least then I can't be responsible for a bitcoind crash 😅 Since the code here is not tightly coupled to bitcoind, when I have the time I will try to create a generic C++ library out of it. Thanks to everyone who reviewed & tested or otherwise spent time on this!
@martinus this pull was discussed yesterday at the weekly meeting (see https://www.erisian.com.au/bitcoin-core-dev/log-2022-06-02.html#l-358). I've been running it without issues as well for some time and am concept ACK (but need to re-review carefully). @sipa wrote "I need to look at the latest code changes again, but I'm very concept ACK on that allocator", followed by @laanwj "to be clear i'm not against it, but we need to be really sure it gets enough review". For info :) |
I'm trying to understand what potential problems we fear here. Is it that the node crashes or is it consensus problems (or both)? |
I didn't mean for it to be closed @martinus. Of course, you're free to. Making a generic library sounds like a good idea. We could subtree or import it once it's "proven". I just wanted to voice my concern. If we can drum up enough reviewers here with the appropriate knowledge about C++ internals (and it doesn't all rest on @sipa's shoulders, for example), it'd be good. Something that was proposed in the meeting yesterday was an advanced review club.
Memory corruption resulting in consensus problems would be the worst outcome. Other crashes (especially if hard to debug / diagnose) would be bad too, of course. |
I think the major concern here could be summarized as "we're worried about unknown unknowns." Swapping out an allocator (or, as in the original case of #16718, the whole map implementation) entails displacing a lot of low-level C++ code that is widely deployed and tested by the standard library with code that we (or if not just us, a much smaller subset of the software world) maintain and test. If anything goes wrong with the UTXO set, it's no news to everyone that it certainly spells big issues for bitcoin.

While the performance gains here are very enticing, and indeed @luke-jr earlier made the interesting point that maybe moving this critical code into our domain of control (and out of the black box of std) may be preferable, since we would explicitly understand and maintain a very core dependency of the validation process, this "mountain man" approach obviously has an associated cost: maintaining such an important and sensitive piece of the system represents a liability that we would no longer shoulder alongside the wider C++ usage base.

I originally proposed the hashmap implementation change that (I think?) inspired @martinus to undertake this work. I'm glad he has done this work, even if it remains experimental, because it's important for us to understand this design space better. But that said, I can understand why @laanwj would express an abundance of caution about this kind of thing, and it resonates with me. Everyone here understands that a change like this should not be done lightly, but even given that, it's hard to know when we've "tested enough" to know that this will be safe in perpetuity. Unknown unknowns.

All that said, I'm far from the most qualified systems guy on this project, so it would be interesting to hear from others like @sipa @ryanofsky @theuni @TheBlueMatt. Maybe we can fuzz this to death on a bunch of different platforms and convince ourselves that it's as safe or safer than using std's allocator.
Such a determination is out of my ability. |
I think I've found a way to get almost all of the benefit of the |
@martinus I started working on a fuzz test for the allocator, to show at least its behavior w.r.t. its external interface is correct. If you're going to simplify things: will there still be an allocator class, and a concept of a memory resource? If so, will the memory resource still be parametrized by the size of the pooled objects in it? |
That's awesome!
There will be just a memory resource, which will be passed to a

Also, instead of having to correctly guess the pool's memory size, I simply use a vector with multiple pools of different sizes. Most of them will stay empty, but that overhead is negligible. Usage is like so:

    using CCoinsMap = std::pmr::unordered_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher>;
    auto mr = NodePoolResource<256>(); // large enough to fit node size
    auto map = CCoinsMap{0, SaltedOutpointHasher{}, std::equal_to<COutPoint>{}, &mr};
@martinus Cool, that looks a lot simpler. The runtime indirection from the polymorphic resource does sound like it'd introduce a performance penalty, so I think we'll want benchmarks to see if it doesn't hurt. A few thoughts:
Thanks @sipa for having a close look!
I can do that and it probably makes the resource more widely usable, but at least for the CCoinsMap it won't make a difference because it has the same alignment as the
Yes! I originally wanted to get rid of the template argument, but using std::array is actually a bit faster in my benchmarks, so I'll go that route. I'm working on that in this branch, where I'm now using std::array and rounding up: master...martinus:2022-06-very-not-scary-NodePoolResource. I hope to have something PR-ready in a week or two.
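The "array of pools, round the request up" idea could look roughly like this. A sketch only: the constants and names are assumptions, not the branch's actual code:

```cpp
#include <cstddef>

// Requests are rounded up to the next multiple of the maximum alignment;
// pool i then serves all allocations of size (i + 1) * kAlignment. Most
// pools stay empty, which costs only one unused freelist head each.
// Constants and names are illustrative.
constexpr std::size_t kAlignment = alignof(std::max_align_t);
constexpr std::size_t kMaxBlockSize = 256;  // largest pooled allocation
constexpr std::size_t kNumPools = kMaxBlockSize / kAlignment;

constexpr std::size_t PoolIndex(std::size_t bytes) {
    return (bytes + kAlignment - 1) / kAlignment - 1;  // round up, 0-based
}

static_assert(PoolIndex(1) == 0);                         // smallest request -> first pool
static_assert(PoolIndex(kAlignment + 1) == 1);            // just over one slot -> next pool
static_assert(PoolIndex(kMaxBlockSize) == kNumPools - 1); // largest pooled size -> last pool
```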
@martinus Here is a generic memory resource fuzzer: https://github.com/sipa/bitcoin/commits/202206_pmrallocfuzz It also includes a somewhat modified version of your code, with a few more template parameters/sanity checks, and the rounding as I was thinking of it. Seems to work great. Feel free to try on top of your version of the resource, or combine ideas/whatever. I like making the code generally usable in wider scenarios than just the ones we care about, as it means generic tests can be written for it, which need fewer assumptions. Another idea for something to include in the tests which I haven't done: add a member function to the resource to ask if it still has any outstanding allocations (by summing the size of all allocated chunks, and comparing with the sum of all elements in the freelist). |
Excellent idea! I've now added an even stronger check: make sure that all data in the freelist actually comes from the chunks, doesn't overlap in any way, and that each byte from the chunks can be found in the freelists or is still available.
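The invariant described above can be sketched as a standalone check. This is an illustration, not the PR's test code, and all names are hypothetical: every freelist entry must lie inside some chunk, sit at a node-size boundary within it, and appear at most once.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Illustrative consistency check (names are hypothetical): verify that
// every freelist entry points into one of the pool's chunks, is placed
// at a multiple of the node size within that chunk, and is unique.
struct Chunk {
    std::uintptr_t begin;
    std::size_t size;
};

bool FreelistConsistent(const std::vector<Chunk>& chunks,
                        const std::vector<std::uintptr_t>& freelist,
                        std::size_t node_size) {
    std::set<std::uintptr_t> seen;
    for (std::uintptr_t p : freelist) {
        if (!seen.insert(p).second) return false;  // duplicate entry
        bool inside_chunk = std::any_of(chunks.begin(), chunks.end(),
            [&](const Chunk& c) {
                return p >= c.begin && p + node_size <= c.begin + c.size &&
                       (p - c.begin) % node_size == 0;  // node-size aligned
            });
        if (!inside_chunk) return false;           // entry outside all chunks
    }
    return true;
}
```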
9f947fc Use PoolAllocator for CCoinsMap (Martin Leitner-Ankerl) 5e4ac5a Call ReallocateCache() on each Flush() (Martin Leitner-Ankerl) 1afca6b Add PoolResource fuzzer (Martin Leitner-Ankerl) e19943f Calculate memory usage correctly for unordered_maps that use PoolAllocator (Martin Leitner-Ankerl) b8401c3 Add pool based memory resource & allocator (Martin Leitner-Ankerl)

Pull request description:

A memory resource similar to `std::pmr::unsynchronized_pool_resource`, but optimized for node-based containers. The goal is to be able to cache more coins with the same memory usage, and to allocate/deallocate faster. This is a reimplementation of #22702. The goal was to implement it in a way that is simpler to review & test:

* There is now a generic `PoolResource` for allocating/deallocating memory. This has practically the same API as `std::pmr::memory_resource`. (Unfortunately I cannot use std::pmr because libc++ simply doesn't implement that API.)
* Thanks to sipa there is now a fuzzer for PoolResource! On a fast machine I ran it for ~770 million executions without finding any issue.
* The estimation of the correct node size is now gone; PoolResource now has multiple pools and just needs to be created large enough to have space for the unordered_map nodes.

I ran benchmarks with #22702, mergebase, and this PR. Frequency-locked Intel i7-8700, clang++ 13.0.1, to reindex up to block 690000.

```sh
bitcoind -dbcache=5000 -assumevalid=00000000000000000002a23d6df20eecec15b21d32c75833cce28f113de888b7 -reindex-chainstate -printtoconsole=0 -stopatheight=690000
```

The performance is practically identical with #22702, just 0.4% slower. It's ~21% faster than master. Note that on cache drops, mergebase's memory doesn't go as far down because it does not free the `CCoinsMap` bucket array.
ACKs for top commit: LarryRuane: ACK 9f947fc achow101: re-ACK 9f947fc john-moffett: ACK 9f947fc jonatack: re-ACK 9f947fc Tree-SHA512: 48caf57d1775875a612b54388ef64c53952cd48741cacfe20d89049f2fb35301b5c28e69264b7d659a3ca33d4c714d47bafad6fd547c4075f08b45acc87c0f45
jamesob has benchmarked my rebased branch (see #16801 (comment)) concerning the allocator for node based containers, and has seen a 9% speedup along with an 8% reduction in memory. That's why I had another good look at my previously closed PR #16801 and updated the code. I've updated the code quite a bit, mostly trying to simplify it and improve correctness:

* `std::pmr` allocators (see https://en.cppreference.com/w/cpp/memory/polymorphic_allocator)
* `Allocator`, thanks to being able to use C++17 behavior (`std::allocator_traits`)
* `std::vector`

@jamesob, it would be great if you could run your benchmarks again with the updated code.

edit laanwj: removed @ signs and html comments from message