Add allocator for node based containers #22702
Conversation
Force-pushed from 4a330e5 to cde8a99
Concept ACK. The numbers look very promising! Seems like this could be one of those rare optimization opportunities actually worth pursuing :)
I'm curious what the impact would be on the performance if you allocated N chunks at a time, always, to fill a page. E.g., if you want to allocate one node, always allocate something like Cool properties of doing so: Better cache alignment on items added around the same time, fewer things to keep in the allocated chunks list, if you were to sort the list by pointer (perhaps lazily, when we're not doing anything else) you would get something with good cache alignment again. |
I've run the benchmark with
Do you have specific code for that? The changes sounded a bit more complex than just aligning the allocator. Did you divide the allocation into N free list chunks? Or just over-allocate? |
So for a node size of 104 bytes and a page size of 4096, this should allocate a page-aligned block with 39*104 = 4056 bytes.
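The arithmetic above can be sanity-checked with a tiny sketch (the helper names are illustrative, not from the PR):

```cpp
#include <cstddef>

// How many fixed-size nodes fit into one memory page, and how many bytes
// of the page those nodes occupy (helper names are illustrative).
constexpr std::size_t NodesPerPage(std::size_t node_size, std::size_t page_size) {
    return page_size / node_size;
}

constexpr std::size_t BytesUsed(std::size_t node_size, std::size_t page_size) {
    return NodesPerPage(node_size, page_size) * node_size;
}

// For 104-byte nodes in a 4096-byte page: 39 nodes, 39 * 104 = 4056 bytes,
// leaving 40 bytes of the page unused.
static_assert(NodesPerPage(104, 4096) == 39);
static_assert(BytesUsed(104, 4096) == 4056);
```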
Some fresh benches in (high dbcache). Continuing to see fairly substantial time/memory improvements with this change. commands index
#22702 vs. $mergebase (absolute)
#22702 vs. $mergebase (relative)
Thanks for the benchmark! Good to see it's 8.4% faster. I think the only reason the memory is lower is because the
Hmm, yeah, it's interesting to think through the alignment issues. I think I understand the code a bit better now, so it makes sense that most of your allocations are already aligned. One thing that strikes me is that you have these 104-byte chunks (13 uint64_t s). You might get better properties if you pad them out to the nearest 32/64/128 bytes (memory overhead 24 bytes :( ), because then you'll never have nodes across cache lines. Memory usage is a bit worse, but cache performance may be better this way? https://www.usenix.org/legacy/publications/library/proceedings/als00/2000papers/papers/full_papers/sears/sears_html/index.html for reference.
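The padding idea amounts to rounding the node size up to a multiple of the cache line. A minimal sketch, assuming 64-byte cache lines (constants and names are hypothetical):

```cpp
#include <cstddef>

// Round a node size up to the next multiple of the cache line size so that
// no node straddles two cache lines (assuming the underlying allocation is
// itself cache-line aligned). Constants and names are illustrative.
constexpr std::size_t kCacheLineSize = 64;

constexpr std::size_t PaddedNodeSize(std::size_t node_size) {
    return (node_size + kCacheLineSize - 1) / kCacheLineSize * kCacheLineSize;
}

// A 104-byte node is padded to 128 bytes: 24 bytes of overhead per node.
static_assert(PaddedNodeSize(104) == 128);
static_assert(PaddedNodeSize(64) == 64);
```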
Might also be interesting to bump the CScript prevector size from 28 to 35 (7 bytes). This would help with the above as well, because if we're 104/128-byte aligned we have the capacity, and then we'd also save a prevector allocation (which means the vector costs 25 bytes + 34-35 bytes) for a non-P2SH segwit v0-v1 output, and avoids an extra indirection. This might have very broad system impacts because prevector is a leaky optimization, sadly.
Concept/Approach ACK. Interesting work! Some initial review feedback at jonatack@df35520
I've benchmarked the performance of I've oriented myself on @jamesob's parameters and synced the blocks 500.000 to 540.000 from a localhost peer three times for the PR and the MB. I used the default dbcache (shouldn't matter for my measurements as cache flushes don't happen while inside Total time spent in
My results confirm that less time is spent in To rule out temperature effects on performance, these measurements could be redone with a properly cooled server/workstation. EDIT: I did another three runs of PR and MB with calling
@JeremyRubin I actually tried to bump it up to 128 bytes to see what that does, but I didn't see any major difference. The benchmark got maybe 1% slower, but maybe the benchmark is also not very realistic. I also tried to reduce the size by 8 bytes (which can be done by making use of the

@jonatack Thanks for the review! I'm not sure though how I should best incorporate your changes? Should I copy them and add a

@0xB10C Thanks for the benchmark! You might get a bit more stable result with pyperf: I always use
Thanks! I'd say squash in the changes you want to keep. You can add an empty line followed by |
Bench results with more modest dbcache (800): 6% speedup, 6.7% less memory usage.
Force-pushed from 3574aee to 22832f6
I've now updated the code with correct memory estimation, and benchmarked it. I had a fully synchronized node, and ran `/usr/bin/time -v bitcoind -datadir=/run/media/martinus/big/bitcoin/db -dbcache=5000 -assumevalid=00000000000000000002a23d6df20eecec15b21d32c75833cce28f113de888b7 -reindex-chainstate -printtoconsole=0 -stopatheight=690000`. The PR ran quite a bit faster than master, 20.8%, with practically the same memory usage. Here is some interesting data from
Here are some graphs that I created from parsing the
Due to the different allocation behavior (it allocates one big chunk of memory right as the first Coin is added), I also had to update the checks in |
Force-pushed from 22832f6 to 6fa1a72
Force-pushed from 698dd3f to 25cf765
Unfortunately Microsoft's implementation of unordered_map behaves quite a bit differently from libc++ and libstdc++, so the heuristic that I'm using to detect the node size doesn't work. I've added a WIP to the header until I've fixed this. Most likely I'll ditch the heuristic completely and properly calculate the correct size. I'm already doing that in the tests anyway (except for Windows).
Force-pushed from 25cf765 to c7cc614
This allocator is potentially useful for all node-based containers that are heavily used, like `std::list`, `std::map`, and `std::unordered_map`. The allocator uses pools to be more memory efficient and faster than the default std::allocator. Currently this provides specializations for `std::unordered_map`; other specializations can be added as needed. This updates the `memusage::DynamicUsage` calculation to take the node_allocator into account. Also, the dynamic usage for standard `std::unordered_map` is made more accurate with the new node_size calculation. Rule of 5, drop unused function, spelling fixups and more code cleanup by jonatack; countless suggestions & improvements by ryanofsky. Co-authored-by: Jon Atack <jon@atack.com> Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
In my benchmarks, using this pool allocator for CCoinsMap gives about 20% faster `-reindex-chainstate` with -dbcache=5000, with practically the same memory usage. The change in max RSS was 0.3%. Note that the memory usage behavior is now different, since MemoryResource does not free any memory when the map is cleared. Instead, the memory is held in a freelist. This now requires that `CCoinsViewCache::Flush` call `ReallocateCache` so memory is actually freed. If that didn't happen, memory usage would not decrease and Flush() would be triggered again soon after. Also, the `validation_flush_tests` tests need to be updated because memory allocation is now done in large pools instead of one node at a time, so the limits need to be updated accordingly.
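As a rough illustration of the freelist behavior described above — this is a minimal sketch, not the PR's MemoryResource, and all names are hypothetical — deallocated nodes are threaded onto an intrusive singly linked list and reused by later allocations, so nothing is returned to the OS while the pool lives:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal freelist sketch (not the PR's implementation; names are
// hypothetical). Freed nodes store the next-pointer in their own bytes,
// forming an intrusive singly linked list that later allocations pop from.
class TinyPool {
public:
    explicit TinyPool(std::size_t node_size) : node_size_(node_size) {
        assert(node_size_ >= sizeof(void*));  // need room for the list pointer
    }
    void* Allocate() {
        if (free_list_ != nullptr) {          // reuse a previously freed node
            void* node = free_list_;
            free_list_ = *static_cast<void**>(free_list_);
            return node;
        }
        chunks_.emplace_back(node_size_);     // otherwise carve out fresh memory
        return chunks_.back().data();
    }
    void Deallocate(void* node) {             // push onto the freelist; no OS free
        *static_cast<void**>(node) = free_list_;
        free_list_ = node;
    }
    std::size_t NumChunks() const { return chunks_.size(); }

private:
    std::size_t node_size_;
    void* free_list_ = nullptr;
    std::vector<std::vector<std::byte>> chunks_;
};
```

This also shows why `Flush()` must explicitly reallocate the container: clearing the map only grows the freelist, it never shrinks `chunks_`.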
Removes the call to CoinsTip().ReallocateCache() in ResizeCoinsCaches because this is now done in CCoinsViewCache::Flush(). Also makes ReallocateCache private because it's not needed outside any more.
Force-pushed from 472f5b0 to 95c53bf
Rebased to resolve conflict in
(Will re-review this PR soon.) |
I have been testing this on a busy node (~400 connections) for more than a month. I have noticed no crashes, no memory leaks, and no other unexpected behavior. |
I see a number of unaddressed comments here by @ryanofsky. What is the status of those? |
I've now gone through all unaddressed comments and marked all as resolved that I consider fully resolved. As far as I can say only nits remain |
Still no issues! Still, to be honest, I think this change is a bit scary, and I have a hard time signing off on it. The reason is that it adds quite a lot of complexity, and depends on quite some C++ magic and cleverness (e.g. relying on |
I guess it is; it got a lot more complex than I initially anticipated. I don't think it can be made much simpler. I'm pretty sure it works, but obviously there's no 100% guarantee. So I'm fine with it not being merged; at least then I can't be responsible for a bitcoind crash 😅 Since the code here is not tightly coupled to bitcoind, when I have the time I will try to create a generic C++ library out of it. Thanks to everyone who reviewed & tested or otherwise spent time on this!
@martinus this pull was discussed yesterday at the weekly meeting (see https://www.erisian.com.au/bitcoin-core-dev/log-2022-06-02.html#l-358). I've been running it without issues as well for some time and am concept ACK (but need to re-review carefully). @sipa wrote "I need to look at the latest code changes again, but I'm very concept ACK on that allocator", followed by @laanwj "to be clear i'm not against it, but we need to be really sure it gets enough review". For info :) |
I'm trying to understand what potential problems we fear here. Is it that the node crashes or is it consensus problems (or both)? |
I didn't mean for it to be closed @martinus. Of course, you're free to. Making a generic library sounds like a good idea. We could subtree or import it once it's "proven". I just wanted to voice my concern. If we can drum up enough reviewers here with the appropriate knowledge about C++ internals (and it doesn't all rest on @sipa's shoulders, for example), it'd be good. Something that was proposed in the meeting yesterday was an advanced review club.
Memory corruption resulting in consensus problems would be the worst outcome. Other crashes (especially if hard to debug / diagnose) would be bad too, of course. |
I think the major concern here could be summarized as "we're worried about unknown unknowns." Swapping out an allocator (or, as in the original case of #16718, the whole map implementation) entails displacing a lot of low-level C++ code that is widely deployed and tested by the standard library with code that we (or if not just us, a much smaller subset of the software world) maintain and test. If anything goes wrong with the UTXO set, it's no news to everyone that it certainly spells big issues for bitcoin.

While the performance gains here are very enticing, and indeed @luke-jr earlier made the interesting point that maybe moving this critical code into our domain of control (and out of the black box of std) may be preferable, since we would explicitly understand and maintain a very core dependency of the validation process, this "mountain man" approach obviously has an associated cost: maintaining such an important and sensitive piece of the system represents a liability that we would no longer shoulder alongside the wider C++ usage base.

I originally proposed the hashmap implementation change that (I think?) inspired @martinus to undertake this work. I'm glad he has done this work, even if it remains experimental, because it's important for us to understand this design space better. But that said, I can understand why @laanwj would express an abundance of caution about this kind of thing, and it resonates with me. Everyone here understands that a change like this should not be done lightly, but even given that, it's hard to know when we've "tested enough" to know that this will be safe in perpetuity. Unknown unknowns.

All that said, I'm far from the most qualified systems guy on this project, so it would be interesting to hear from others like @sipa @ryanofsky @theuni @TheBlueMatt. Maybe we can fuzz this to death on a bunch of different platforms and convince ourselves that it's as safe or safer than using std's allocator.
Such a determination is out of my ability. |
I think I've found a way to get almost all of the benefit of the |
@martinus I started working on a fuzz test for the allocator, to show at least its behavior w.r.t. its external interface is correct. If you're going to simplify things: will there still be an allocator class, and a concept of a memory resource? If so, will the memory resource still be parametrized by the size of the pooled objects in it? |
That's awesome!
There will be just a memory resource, which will be passed to a

Also, instead of having to correctly guess the pool's memory size, I simply use a vector with multiple pools of different sizes. Most of them will stay empty, but that overhead is negligible. Usage is like so:

    using CCoinsMap = std::pmr::unordered_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher>;
    auto mr = NodePoolResource<256>(); // large enough to fit node size
    auto map = CCoinsMap{0, SaltedOutpointHasher{}, std::equal_to<COutPoint>{}, &mr};
@martinus Cool, that looks a lot simpler. The runtime indirection from the polymorphic resource does sound like it'd introduce a performance penalty, so I think we'll want benchmarks to see if it doesn't hurt. A few thoughts:
Thanks @sipa for having a close look!
I can do that and it probably makes the resource more widely usable, but at least for the CCoinsMap it won't make a difference because it has the same alignment as the
Yes! I originally wanted to get rid of the template argument, but using std::array is actually a bit faster in my benchmarks, so I'll go that route. I'm working on that in this branch, where I'm now using std::array and rounding up: master...martinus:2022-06-very-not-scary-NodePoolResource. I hope to have something PR-ready in a week or two.
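The "array of pools, round the request up" idea could look roughly like this. A sketch only: the constants and names are assumptions, not the branch's actual code:

```cpp
#include <cstddef>

// Requests are rounded up to the next multiple of the maximum alignment;
// pool i then serves all allocations of size (i + 1) * kAlignment. Most
// pools stay empty, which costs only one unused freelist head each.
// Constants and names are illustrative.
constexpr std::size_t kAlignment = alignof(std::max_align_t);
constexpr std::size_t kMaxBlockSize = 256;  // largest pooled allocation
constexpr std::size_t kNumPools = kMaxBlockSize / kAlignment;

constexpr std::size_t PoolIndex(std::size_t bytes) {
    return (bytes + kAlignment - 1) / kAlignment - 1;  // round up, 0-based
}

static_assert(PoolIndex(1) == 0);                         // smallest request -> first pool
static_assert(PoolIndex(kAlignment + 1) == 1);            // just over one slot -> next pool
static_assert(PoolIndex(kMaxBlockSize) == kNumPools - 1); // largest pooled size -> last pool
```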
@martinus Here is a generic memory resource fuzzer: https://github.com/sipa/bitcoin/commits/202206_pmrallocfuzz It also includes a somewhat modified version of your code, with a few more template parameters/sanity checks, and the rounding as I was thinking of it. Seems to work great. Feel free to try on top of your version of the resource, or combine ideas/whatever. I like making the code generally usable in wider scenarios than just the ones we care about, as it means generic tests can be written for it, which need fewer assumptions. Another idea for something to include in the tests which I haven't done: add a member function to the resource to ask if it still has any outstanding allocations (by summing the size of all allocated chunks, and comparing with the sum of all elements in the freelist). |
Excellent idea! I've now added an even stronger check: make sure that all data in the freelist actually comes from the chunks, doesn't overlap in any way, and that each byte from the chunks can be found in the freelists or is still available.
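The invariant described above can be sketched as a standalone check. This is an illustration, not the PR's test code, and all names are hypothetical: every freelist entry must lie inside some chunk, sit at a node-size boundary within it, and appear at most once.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Illustrative consistency check (names are hypothetical): verify that
// every freelist entry points into one of the pool's chunks, is placed
// at a multiple of the node size within that chunk, and is unique.
struct Chunk {
    std::uintptr_t begin;
    std::size_t size;
};

bool FreelistConsistent(const std::vector<Chunk>& chunks,
                        const std::vector<std::uintptr_t>& freelist,
                        std::size_t node_size) {
    std::set<std::uintptr_t> seen;
    for (std::uintptr_t p : freelist) {
        if (!seen.insert(p).second) return false;  // duplicate entry
        bool inside_chunk = std::any_of(chunks.begin(), chunks.end(),
            [&](const Chunk& c) {
                return p >= c.begin && p + node_size <= c.begin + c.size &&
                       (p - c.begin) % node_size == 0;  // node-size aligned
            });
        if (!inside_chunk) return false;           // entry outside all chunks
    }
    return true;
}
```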
9f947fc Use PoolAllocator for CCoinsMap (Martin Leitner-Ankerl) 5e4ac5a Call ReallocateCache() on each Flush() (Martin Leitner-Ankerl) 1afca6b Add PoolResource fuzzer (Martin Leitner-Ankerl) e19943f Calculate memory usage correctly for unordered_maps that use PoolAllocator (Martin Leitner-Ankerl) b8401c3 Add pool based memory resource & allocator (Martin Leitner-Ankerl)

Pull request description:

A memory resource similar to `std::pmr::unsynchronized_pool_resource`, but optimized for node-based containers. The goal is to be able to cache more coins with the same memory usage, and to allocate/deallocate faster. This is a reimplementation of #22702. The goal was to implement it in a way that is simpler to review & test:

* There is now a generic `PoolResource` for allocating/deallocating memory. This has practically the same API as `std::pmr::memory_resource`. (Unfortunately I cannot use std::pmr because libc++ simply doesn't implement that API.)
* Thanks to sipa there is now a fuzzer for PoolResource! On a fast machine I ran it for ~770 million executions without finding any issue.
* The estimation of the correct node size is now gone; PoolResource now has multiple pools and just needs to be created large enough to have space for the unordered_map nodes.

I ran benchmarks with #22702, mergebase, and this PR. Frequency-locked Intel i7-8700, clang++ 13.0.1, to reindex up to block 690000.

```sh
bitcoind -dbcache=5000 -assumevalid=00000000000000000002a23d6df20eecec15b21d32c75833cce28f113de888b7 -reindex-chainstate -printtoconsole=0 -stopatheight=690000
```

The performance is practically identical with #22702, just 0.4% slower. It's ~21% faster than master. Note that on cache drops, mergebase's memory doesn't go as far down because it does not free the `CCoinsMap` bucket array.
ACKs for top commit: LarryRuane: ACK 9f947fc achow101: re-ACK 9f947fc john-moffett: ACK 9f947fc jonatack: re-ACK 9f947fc Tree-SHA512: 48caf57d1775875a612b54388ef64c53952cd48741cacfe20d89049f2fb35301b5c28e69264b7d659a3ca33d4c714d47bafad6fd547c4075f08b45acc87c0f45
jamesob has benchmarked my rebased branch (see #16801 (comment)) concerning the allocator for node based containers, and has seen a 9% speedup along with an 8% reduction in memory. That's why I had another good look at my previously closed PR #16801 and updated the code. I've updated the code quite a bit, mostly trying to simplify it and improve correctness:

* `std::pmr` allocators (see https://en.cppreference.com/w/cpp/memory/polymorphic_allocator)
* `Allocator`, thanks to being able to use C++17 behavior (`std::allocator_traits`)
* `std::vector`

@jamesob, it would be great if you could run your benchmarks again with the updated code.

edit laanwj: removed @ signs and html comments from message