
Conversation

@martinus
Contributor

@martinus martinus commented Aug 13, 2021

jamesob has benchmarked my rebased branch (see #16801 (comment)) concerning the allocator for node based containers, and has seen a 9% speedup along with 8% reduction in memory. That's why I had another good look at my previously closed PR #16801 and updated the code. I've updated the code quite a bit, mostly trying to simplify it and improve correctness:

  • Renamed classes to be more in line with the naming used for the std::pmr allocators (see https://en.cppreference.com/w/cpp/memory/polymorphic_allocator)
  • Simplified Allocator thanks to being able to use C++17 behavior (std::allocator_traits)
  • Simpler allocation: only allocate blocks of the same size and just keep them in a std::vector.
  • Only access memory when actually needed. E.g. Linux uses optimistic memory allocation, so memory pages from malloc are only actually made available to the program when accessed.
  • Correctly handle alignment for the in-place linked list (see the sketch after this list).
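
For readers who want to picture the approach, here is a minimal sketch of a fixed-size chunk pool with an in-place free list, covering the points above (same-size blocks kept in a std::vector, lazily touched memory, alignment of the free-list node). The class and member names are illustrative, not the PR's actual identifiers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Illustrative sketch only. Freed chunks are reused to store the pointer to
// the next free chunk, so the free list lives "in place" inside freed memory.
class FixedChunkPool
{
    struct FreeChunk { FreeChunk* next; };

public:
    FixedChunkPool(std::size_t chunk_size, std::size_t chunks_per_block)
        // Round the chunk size up so every chunk can hold a FreeChunk and is
        // suitably aligned for one (the alignment concern mentioned above).
        : m_chunk_size{((std::max(chunk_size, sizeof(FreeChunk)) + alignof(FreeChunk) - 1) / alignof(FreeChunk)) * alignof(FreeChunk)},
          m_chunks_per_block{chunks_per_block}
    {
    }

    ~FixedChunkPool()
    {
        for (char* block : m_blocks) std::free(block);
    }

    void* Allocate()
    {
        if (m_free != nullptr) { // reuse a previously freed chunk
            void* chunk = m_free;
            m_free = m_free->next;
            return chunk;
        }
        if (m_untouched == m_block_end) { // need a fresh same-size block
            m_untouched = static_cast<char*>(std::malloc(m_chunk_size * m_chunks_per_block));
            m_block_end = m_untouched + m_chunk_size * m_chunks_per_block;
            m_blocks.push_back(m_untouched);
        }
        // Carve chunks out of the block lazily, so pages of the block are not
        // faulted in before they are actually used.
        void* chunk = m_untouched;
        m_untouched += m_chunk_size;
        return chunk;
    }

    void Deallocate(void* p)
    {
        // Push the chunk onto the in-place free list.
        m_free = new (p) FreeChunk{m_free};
    }

private:
    std::size_t m_chunk_size;
    std::size_t m_chunks_per_block;
    FreeChunk* m_free{nullptr};
    char* m_untouched{nullptr};
    char* m_block_end{nullptr};
    std::vector<char*> m_blocks;
};
```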

jamesob it would be great if you could run your benchmarks again with the updated code.

edit laanwj: removed @ signs and html comments from message

@practicalswift
Contributor

Concept ACK

The numbers look very promising!

Seems like this could be one of those rare optimization opportunities actually worth pursuing :)

@JeremyRubin
Contributor

I'm curious what the impact would be on the performance if you allocated N chunks at a time, always, to fill a page.

E.g., if you want to allocate one node, always allocate something like aligned_alloc(sysconf(_SC_PAGESIZE), size) and divide it up into a bunch of pointers for the free list.

Cool properties of doing so: better cache alignment for items added around the same time; fewer things to keep in the allocated-chunks list; and if you were to sort the list by pointer (perhaps lazily, when we're not doing anything else) you would get something with good cache alignment again.

@martinus
Contributor Author

I'm curious what the impact would be on the performance if you allocated N chunks at a time, always, to fill a page.

I've run the benchmark with ::aligned_alloc(sysconf(_SC_PAGESIZE), ...); and only allocating a single page, and at least in the benchmark there's no difference:

| ns/op  | op/s         | err% | total | benchmark |
| ------ | ------------ | ---- | ----- | --------- |
| 140.95 | 7,094,953.03 | 0.2% | 0.08  | NodeAllocator_StdUnorderedMap |
| 115.57 | 8,652,942.75 | 0.1% | 0.06  | NodeAllocator_StdUnorderedMapWithNodeAllocator (::operator new) |
| 115.33 | 8,670,873.08 | 0.2% | 0.06  | NodeAllocator_StdUnorderedMapWithNodeAllocator (aligned_alloc) |

@JeremyRubin
Contributor

Do you have specific code for that? The changes sounded a bit more complex than just aligning the allocator. Did you divide the allocation into N free list chunks? Or just over-allocate?

@martinus
Contributor Author

Do you have specific code for that? The changes sounded a bit more complex than just aligning the allocator. Did you divide the allocation into N free list chunks? Or just over-allocate?

MemoryResource::AllocateNewBlock is the only place where I actually malloc a new block that's divided into the chunks. Here I've replaced the ::operator new(num_chunks * m_chunk_size_bytes) with ::aligned_alloc(sysconf(_SC_PAGESIZE), num_chunks * m_chunk_size_bytes), and use free in ~MemoryResource.
I also changed the default allocation size m_block_size_bytes from 262144 to sysconf(_SC_PAGESIZE).

So for a node size of 104 bytes and a page size of 4096, this should allocate a page-aligned block with 39*104 = 4056 bytes.
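
For illustration, this is roughly what that experiment could look like; the class and member names mirror the description above, but the body is only a hedged approximation, not the actual PR code. Note that aligned_alloc formally requires the size to be a multiple of the alignment, so the sketch rounds the block up to a whole page.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>
#include <unistd.h>

// Approximation of the experiment: allocate each block page-aligned and
// roughly one page in size, to be carved into fixed-size chunks elsewhere.
class PageAlignedBlocks
{
public:
    explicit PageAlignedBlocks(std::size_t chunk_size_bytes)
        : m_chunk_size_bytes{chunk_size_bytes},
          m_page_size{static_cast<std::size_t>(sysconf(_SC_PAGESIZE))}
    {
    }

    ~PageAlignedBlocks()
    {
        for (void* block : m_blocks) std::free(block);
    }

    void* AllocateNewBlock()
    {
        const std::size_t num_chunks = m_page_size / m_chunk_size_bytes; // e.g. 4096 / 104 = 39
        const std::size_t used_bytes = num_chunks * m_chunk_size_bytes;  // e.g. 39 * 104 = 4056
        // aligned_alloc wants a size that is a multiple of the alignment,
        // so request a full page and leave the last few bytes unused.
        const std::size_t alloc_bytes = ((used_bytes + m_page_size - 1) / m_page_size) * m_page_size;
        void* block = ::aligned_alloc(m_page_size, alloc_bytes);
        m_blocks.push_back(block);
        return block;
    }

private:
    std::size_t m_chunk_size_bytes;
    std::size_t m_page_size;
    std::vector<void*> m_blocks;
};
```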

@jamesob
Contributor

jamesob commented Aug 16, 2021

Some fresh benches in (high dbcache). Continuing to see fairly substantial time/memory improvements with this change.

ibd local range 500000 540000

commands index

| bench name | command |
| --- | --- |
| ibd.local.range.500000.540000 | bitcoind -dbcache=10000 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876 |

#22702 vs. $mergebase (absolute)

| bench name | x | #22702 | $mergebase |
| --- | --- | --- | --- |
| ibd.local.range.500000.540000.total_secs | 3 | 2888.4055 (± 24.7775) | 3131.5871 (± 4.8045) |
| ibd.local.range.500000.540000.peak_rss_KiB | 3 | 5886729.3333 (± 3815.6315) | 6352470.6667 (± 979.1552) |
| ibd.local.range.500000.540000.cpu_kernel_secs | 3 | 263.5433 (± 1.9605) | 269.2233 (± 2.6299) |
| ibd.local.range.500000.540000.cpu_user_secs | 3 | 12852.1300 (± 15.0661) | 13134.5300 (± 2.9587) |

#22702 vs. $mergebase (relative)

| bench name | x | #22702 | $mergebase |
| --- | --- | --- | --- |
| ibd.local.range.500000.540000.total_secs | 3 | 1 | 1.084 |
| ibd.local.range.500000.540000.peak_rss_KiB | 3 | 1 | 1.079 |
| ibd.local.range.500000.540000.cpu_kernel_secs | 3 | 1 | 1.022 |
| ibd.local.range.500000.540000.cpu_user_secs | 3 | 1 | 1.022 |

@martinus
Contributor Author

Thanks for the benchmark! Good to see it's 8.4% faster. I think the only reason the memory usage is lower is that memusage::DynamicUsage now overestimates the memory requirements of the std::unordered_map that uses the node_allocator. I guess I should correct that. Then the memory usage should stay roughly the same (as it should, that's what -dbcache is for), but the number of transactions that can be cached will increase.

@JeremyRubin
Contributor

MemoryResource::AllocateNewBlock is the only place where I actually malloc a new block that's divided into the chunks. Here I've replaced the ::operator new(num_chunks * m_chunk_size_bytes) with ::aligned_alloc(sysconf(_SC_PAGESIZE), num_chunks * m_chunk_size_bytes), and use free in ~MemoryResource.
I also changed the default allocation size m_block_size_bytes from 262144 to sysconf(_SC_PAGESIZE).

hmm yeah it's interesting to think through the alignment issues. I think I understand the code a bit better now, so it makes sense that most of your allocations are already aligned. One thing that strikes me is that you have these 104-byte chunks (13 uint64_ts). You might get better properties if you pad them out to the nearest 32/64/128 bytes (memory overhead 24 bytes :( ), because then you'll never have nodes across cache lines. Memory usage is a bit worse, but cache performance may be better this way? https://www.usenix.org/legacy/publications/library/proceedings/als00/2000papers/papers/full_papers/sears/sears_html/index.html for reference.
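
To make the padding idea concrete, here is a small helper that rounds a node size up to a multiple of the cache-line size (the 64/128-byte values are the usual x86 figures, and the 104-byte node is the one from this thread); note this only helps if the chunks themselves start on cache-line boundaries.

```cpp
#include <cstddef>

// Round a node size up to the next multiple of the cache-line size so that a
// node never straddles two cache lines, at the cost of some padding.
constexpr std::size_t PadToCacheLine(std::size_t node_size, std::size_t cache_line = 64)
{
    return ((node_size + cache_line - 1) / cache_line) * cache_line;
}

static_assert(PadToCacheLine(104) == 128);      // 24 bytes of padding per node
static_assert(PadToCacheLine(104, 128) == 128); // same result for 128-byte lines
```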

@JeremyRubin
Contributor

Might also be interesting to bump the CScript prevector size from 28 to 35 (7 bytes). This would help with the above as well, because if we're 104/128-byte aligned we have the capacity, and then we'd also save a prevector allocation for a non-P2SH segwit v0/v1 output (which means the vector costs 25 bytes + 34-35 bytes, and also means even more indirection).

This might have system impacts very broadly because prevector is a leaky optimization sadly.

Member

@jonatack jonatack left a comment

Concept/Approach ACK. Interesting work! Some initial review feedback at jonatack@df35520

@0xB10C
Contributor

0xB10C commented Aug 17, 2021

I've benchmarked the performance of CChainState::ConnectBlock between cde8a99 (PR) and the mergebase 803ef70 (MB) using the validation::connect_block tracepoint. The tracepoint reports the time it took to connect each block. This means my results don't include the time it took to download the block nor, AFAIK, the time it took to persist it to disk. I didn't look at memory usage.

I followed @jamesob's parameters and synced blocks 500,000 to 540,000 from a localhost peer three times each for the PR and the MB. I used the default dbcache (which shouldn't matter for my measurements, as cache flushes don't happen while inside CChainState::ConnectBlock). I used an idle workstation with an i5-6500 and an NVMe drive. Room and hardware temperature changed between runs, which could have impacted performance. A bpftrace script to hook into the tracepoint can be found here.

Total time spent in CChainState::ConnectBlock per run for PR and MB:

| run | 1     | 2     | 3     |
| --- | ----- | ----- | ----- |
| PR  | 1427s | 1487s | 1486s |
| MB  | 1577s | 1670s | 1661s |

[chart: total time spent in ConnectBlock per run, PR vs. MB]
(PR runs 2 and 3 overlap quite a bit and are nearly indistinguishable here)

My results confirm that less time is spent in CChainState::ConnectBlock with the PR compared to the MB. A performance improvement between 6% and 10% is likely.

To rule out temperature effects on performance, these measurements could be redone with a properly cooled server/workstation.


EDIT: I did another three runs of PR and MB with calling pyperf system tune beforehand as @martinus suggested. PR is still faster.

| run | 4     | 5     | 6     |
| --- | ----- | ----- | ----- |
| PR  | 1585s | 1537s | 1546s |
| MB  | 1647s | 1688s | 1644s |

[chart: total time spent in ConnectBlock per run, runs 4-6]

@martinus
Contributor Author

@JeremyRubin I actually tried to bump it up to 128 bytes to see what that does, but I didn't see any major difference. The benchmark got maybe 1% slower, but maybe the benchmark is also not very realistic. I also tried to reduce the size by 8 bytes (which can be done by making use of the Coin's padding in CCoinsCacheEntry; I might open another PR for that), and it got about ~1% faster. I think though that even with no performance benefit the node_allocator is worthwhile because it has lower memory overhead, so we can cache more entries with the same -dbcache setting.

@jonatack Thanks for the review! I'm not sure though how I should best incorporate your changes? Should I copy them and add a Co-authored-by in the commit message? (I've never done that)

@0xB10C Thanks for the benchmark! You might get a bit more stable results with pyperf: I always run sudo pyperf system tune before I run a benchmark, which does a few things to make benchmarks much more stable (e.g. locks the CPU to a fixed frequency, disables turbo boost).

@jonatack
Member

@jonatack Thanks for the review! I'm not sure though how I should best incorporate your changes? Should I copy them and add a Co-authored-by in the commit message? (I've never done that)

Thanks! I'd say squash in the changes you want to keep. You can add an empty line followed by Co-authored-by: Jon Atack <jon@atack.com> to the bottom of a commit message if you like, but it was just a regular average review :)

@jamesob
Contributor

jamesob commented Aug 19, 2021

Bench results with more modest dbcache (800): 6% speedup, 6.7% less memory usage.

| bench name | command |
| --- | --- |
| ibd.local.range.500000.540000 | bitcoind -dbcache=800 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876 |

| bench name | x | #22702 | $mergebase |
| --- | --- | --- | --- |
| ibd.local.range.500000.540000.total_secs | 3 | 3519.2667 (± 58.0440) | 3752.9732 (± 54.8626) |
| ibd.local.range.500000.540000.peak_rss_KiB | 3 | 2282000.0000 (± 45539.0428) | 2443188.0000 (± 22211.2803) |
| ibd.local.range.500000.540000.cpu_kernel_secs | 3 | 491.2067 (± 4.5352) | 493.7567 (± 13.1047) |
| ibd.local.range.500000.540000.cpu_user_secs | 3 | 13717.5467 (± 15.0752) | 13957.7867 (± 18.7773) |

| bench name | x | #22702 | $mergebase |
| --- | --- | --- | --- |
| ibd.local.range.500000.540000.total_secs | 3 | 1 | 1.066 |
| ibd.local.range.500000.540000.peak_rss_KiB | 3 | 1 | 1.071 |
| ibd.local.range.500000.540000.cpu_kernel_secs | 3 | 1 | 1.005 |
| ibd.local.range.500000.540000.cpu_user_secs | 3 | 1 | 1.018 |

@martinus martinus force-pushed the 2019-08-bulkpoolallocator branch from 3574aee to 22832f6 Compare August 20, 2021 17:01
@martinus
Contributor Author

martinus commented Aug 20, 2021

I've now updated the code with correct memory estimation, and benchmarked it. I had a fully synchronized node, and ran -reindex-chainstate from block 0 to 690000. All done on an Intel i7-8700, locked to 3200 MHz. I used this command:

/usr/bin/time -v bitcoind -datadir=/run/media/martinus/big/bitcoin/db -dbcache=5000 -assumevalid=00000000000000000002a23d6df20eecec15b21d32c75833cce28f113de888b7 -reindex-chainstate -printtoconsole=0 -stopatheight=690000

The PR ran quite a bit faster than master, 20.8%, with practically the same memory usage. Here some interesting data from /usr/bin/time:

| what | mergebase | #22702 | change |
| --- | --- | --- | --- |
| Elapsed (wall clock) time (h:mm:ss or m:ss) | 4:06:05 | 3:14:49 | 20.8% lower |
| Maximum resident set size (kbytes) | 6837496 | 6856104 | 0.27% more |
| Minor (reclaiming a frame) page faults | 101689566 | 70194069 | 31% fewer |
| File system inputs | 693891464 | 687479120 | 0.9% fewer |
| File system outputs | 239830024 | 188574064 | 21.4% fewer |

Here are some graphs that I created from parsing the debug.log file:

[chart: Progress in Million Transactions over Time]

[chart: Size of Cache in MiB over Time]

CCoinsMap currently requires 104 bytes for one node, and node_allocator uses basically exactly that (a bit more for housekeeping of the allocated blocks, but that's practically negligible), while std::unordered_map's default allocator requires 128 bytes (16 bytes of housekeeping, and 8 more bytes for 16-byte alignment). So we can cache quite a bit more transactions in the same amount of memory:

[chart: Size of Cache in Million tx over Time]
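
As a rough sanity check of those per-node figures (the dbcache value below is purely illustrative, and the bucket array and the Coin's own heap data are ignored):

```cpp
#include <cstdio>

int main()
{
    const double dbcache_bytes = 5000.0 * (1 << 20); // illustrative -dbcache=5000
    const double bytes_default = 128.0;              // std::unordered_map, default allocator
    const double bytes_pool = 104.0;                 // with node_allocator
    // 128 / 104 ~= 1.23, i.e. roughly 23% more nodes fit into the same budget.
    std::printf("default: %.1fM nodes, pool: %.1fM nodes\n",
                dbcache_bytes / bytes_default / 1e6,
                dbcache_bytes / bytes_pool / 1e6);
    return 0;
}
```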

Due to the different allocation behavior (it allocates one big chunk of memory right as the first Coin is added), I also had to update the checks in validation_flush_tests.cpp

@martinus martinus force-pushed the 2019-08-bulkpoolallocator branch from 22832f6 to 6fa1a72 Compare August 20, 2021 17:11
@martinus martinus force-pushed the 2019-08-bulkpoolallocator branch from 698dd3f to 25cf765 Compare August 22, 2021 06:13
@martinus martinus changed the title Add allocator for node based containers [WIP] Add allocator for node based containers Aug 22, 2021
@martinus
Contributor Author

Unfortunately Microsoft's implementation of unordered_map behaves quite a bit differently from libc++ and libstdc++, so the heuristic that I'm using to detect the node size doesn't work. I've added a WIP to the header until I've fixed this. Most likely I'll ditch the heuristic completely and properly calculate the correct size. I'm already doing that in the tests anyway (except for Windows).

@martinus martinus force-pushed the 2019-08-bulkpoolallocator branch from 25cf765 to c7cc614 Compare August 23, 2021 13:56
martinus and others added 3 commits April 26, 2022 10:31
This allocator is potentially useful for all node-based containers like
`std::list`, `std::map`, `std::unordered_map` that are heavily used.

The allocator uses pools to be more memory efficient and faster than
the default std::allocator. Currently this provides specializations for
`std::unordered_map`, and other specializations can be added as needed.

This updates the `memusage::DynamicUsage` calculation to take the
node_allocator allocator into account.

Also, the dynamic usage for standard `std::unordered_map` is made more
accurate with the new node_size calculation.

Rule of 5, drop unused function, spelling fixups and more code cleanup
by jonatack, countless suggestions & improvements by ryanofsky.

Co-authored-by: Jon Atack <jon@atack.com>
Co-authored-by: Russell Yanofsky <russ@yanofsky.org>

In my benchmarks, using this pool allocator for CCoinsMap gives about
20% faster `-reindex-chainstate` with -dbcache=5000 with practically the
same memory usage. The change in max RSS was 0.3%.

Note that the memory usage behavior is now different, since MemoryResource
does not free any memory when the map is cleared. Instead, the memory is
held in a freelist. This now requires `CCoinsViewCache::Flush` to call
`ReallocateCache` so memory is actually freed. If that didn't happen,
memory usage would not decrease and Flush() would be triggered again
soon after.

Also, the `validation_flush_tests` tests need to be updated because
memory allocation is now done in large pools instead of one node at a
time, so the limits need to be updated accordingly.

Removes the call to CoinsTip().ReallocateCache() in ResizeCoinsCaches because this is now done in CCoinsViewCache::Flush().

Also makes ReallocateCache private because it's not needed outside any more.
@martinus martinus force-pushed the 2019-08-bulkpoolallocator branch from 472f5b0 to 95c53bf Compare April 26, 2022 08:38
@martinus
Contributor Author

martinus commented Apr 26, 2022

Rebased to resolve conflict in Makefile.test.include
By the way, is there an easy way to see what has actually changed in a rebase? I just moved 2 lines.

@jonatack
Member

jonatack commented May 6, 2022

Rebased to resolve conflict in Makefile.test.include. By the way, is there an easy way to see what has actually changed in a rebase? I just moved 2 lines.

git range-diff a19f641 472f5b0 95c53bf

(Will re-review this PR soon.)

@laanwj
Member

laanwj commented May 10, 2022

I have been testing this on a busy node (~400 connections) for more than a month. I have noticed no crashes, no memory leaks, and no other unexpected behavior.

@sipa
Member

sipa commented May 10, 2022

I see a number of unaddressed comments here by @ryanofsky. What is the status of those?

@martinus
Contributor Author

I see a number of unaddressed comments here by @ryanofsky. What is the status of those?

I've now gone through all unaddressed comments and marked as resolved those that I consider fully resolved. As far as I can tell, only nits remain.

@laanwj
Member

laanwj commented Jun 2, 2022

I have been testing this on a busy node (~400 connections) for more than a month. I have noticed no crashes, no memory leaks, and no other unexpected behavior.

Still no issues!

Still, to be honest, I think this change is a bit scary, and I have a hard time signing off on it. The reason is that it adds quite a lot of complexity, and depends on quite some C++ magic and cleverness (e.g. relying on unordered_map internals). C++ being not exactly a memory-safe language, this increases the risk of accidental foot-guns. I appreciate the work; it's great to make things faster, and I really don't want to discourage it, but, especially as long as it's used for consensus code, we do have to err on the side of safety and correctness.

@martinus
Contributor Author

martinus commented Jun 3, 2022

Still, to be honest, I think this change is a bit scary

I guess it is; it got a lot more complex than I initially anticipated. I don't think it can be made much simpler. I'm pretty sure it works, but obviously there's no 100% guarantee. So I'm fine with it not being merged, at least then I can't be responsible for a bitcoind crash 😅

Since the code here is not tightly coupled to bitcoind, when I have the time I will try to create a generic C++ library out of it.

Thanks everyone who reviewed & tested or otherwise spent time on this!

@martinus martinus closed this Jun 3, 2022
@jonatack
Member

jonatack commented Jun 3, 2022

@martinus this pull was discussed yesterday at the weekly meeting (see https://www.erisian.com.au/bitcoin-core-dev/log-2022-06-02.html#l-358). I've been running it without issues as well for some time and am concept ACK (but need to re-review carefully). @sipa wrote "I need to look at the latest code changes again, but I'm very concept ACK on that allocator", followed by @laanwj "to be clear i'm not against it, but we need to be really sure it gets enough review". For info :)

@0xB10C
Contributor

0xB10C commented Jun 3, 2022

I'm trying to understand what potential problems we fear here. Is it that the node crashes or is it consensus problems (or both)?

@laanwj
Member

laanwj commented Jun 3, 2022

I didn't mean for it to be closed @martinus. Of course, you're free to. Making a generic library sounds like a good idea. We could subtree or import it once it's "proven".

I just wanted to voice my concern. If we can drum up enough reviewers here with the appropriate knowledge about C++ internals (and it doesn't all rest on @sipa's shoulders, for example), it'd be good. Something that was proposed in the meeting yesterday was an advanced review club.

I'm trying to understand what potential problems we fear here. Is it that the node crashes or is it consensus problems (or both)?

Memory corruption resulting in consensus problems would be the worst outcome. Other crashes (especially if hard to debug / diagnose) would be bad too, of course.

@jamesob
Contributor

jamesob commented Jun 3, 2022

I'm trying to understand what potential problems we fear here. Is it that the node crashes or is it consensus problems (or both)?

I think the major concern here could be summarized as "we're worried about unknown unknowns." Swapping out an allocator (or, as in the original case of #16718, the whole map implementation) means displacing a lot of widely deployed and tested low-level C++ code from the standard library with code that we (or if not just us, a much smaller subset of the software world) maintain and test.

If anything goes wrong with the UTXO set, it's no news to anyone that it certainly spells big issues for bitcoin. The performance gains here are very enticing, and indeed @luke-jr earlier made the interesting point that moving this critical code into our domain of control (and out of the black box of std) may be preferable, since we would explicitly understand and maintain a very core dependency of the validation process. But this "mountain man" approach obviously has an associated cost: maintaining such an important and sensitive piece of the system represents a liability that we would no longer shoulder alongside the wider C++ usage base.

I originally proposed the hashmap implementation change that (I think?) inspired @martinus to undertake this work. I'm glad he has done this work, even if it remains experimental, because it's important for us to understand this design space better. But that said I can understand why @laanwj would express an abundance of caution about this kind of thing, and it resonates with me. Everyone here understands that a change like this should not be done lightly, but even given that, it's hard to know when we've "tested enough" to know that this will be safe for perpetuity. Unknown unknowns.

All that said, I'm far from the most qualified systems guy on this project, so it would be interesting to hear from others like @sipa @ryanofsky @theuni @TheBlueMatt. Maybe we can fuzz this to death on a bunch of different platforms and convince ourselves that it's as safe or safer than using std's allocator. Such a determination is out of my ability.

@martinus
Contributor Author

martinus commented Jun 5, 2022

I think I've found a way to get almost all of the benefit of the node_allocator with a significantly less scary implementation. Same memory advantage, but a bit slower. I'll prepare a new PR 👍

@sipa
Member

sipa commented Jun 5, 2022

@martinus I started working on a fuzz test for the allocator, to show at least its behavior w.r.t. its external interface is correct.

If you're going to simplify things: will there still be an allocator class, and a concept of a memory resource? If so, will the memory resource still be parametrized by the size of the pooled objects in it?

@martinus
Contributor Author

martinus commented Jun 5, 2022

@martinus I started working on a fuzz test for the allocator, to show at least its behavior w.r.t. its external interface is correct.

That's awesome!

If you're going to simplify things: will there still be an allocator class, and a concept of a memory resource? If so, will the memory resource still be parametrized by the size of the pooled objects in it?

There will be just a memory resource, which will be passed to a std::pmr::polymorphic_allocator. I have a prototype of the memory resource here: https://gist.github.com/martinus/0cee4dd177beda78bb1e2e3d128c2d70
That's basically all that is needed, in <100 lines of code. It implements the abstract std::pmr::memory_resource interface, which is much simpler than implementing a full allocator: no rebinding, type casting, or reinterpret_cast needed.

Also, instead of having to correctly guess the pool's memory size I simply use a vector with multiple pools of different size. Most of them will stay empty, but that overhead is negligible.

Usage is like so:

```cpp
using CCoinsMap = std::pmr::unordered_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher>;
auto mr = NodePoolResource<256>(); // large enough to fit the node size
auto map = CCoinsMap{0, SaltedOutpointHasher{}, std::equal_to<COutPoint>{}, &mr};
```
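
For readers who haven't worked with std::pmr, a stripped-down approximation of what such a memory resource can look like is below. It is not the linked prototype; the pool selection, block carving and fallback handling are simplified, and the identifiers are made up. (As noted in the eventually merged PR, libc++ does not ship this API, which is one reason the final code avoids std::pmr.)

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <memory_resource>
#include <new>
#include <vector>

// Approximation of a node-pool resource: one in-place free list per size
// class up to MAX_BYTES; anything larger or over-aligned falls back to new.
template <std::size_t MAX_BYTES = 256>
class NodePoolResourceSketch final : public std::pmr::memory_resource
{
    struct FreeChunk { FreeChunk* next; };
    static constexpr std::size_t ALIGN = alignof(FreeChunk);

    static std::size_t PoolIndex(std::size_t bytes)
    {
        // Round up to a multiple of ALIGN and to at least sizeof(FreeChunk).
        return (std::max(bytes, sizeof(FreeChunk)) + ALIGN - 1) / ALIGN - 1;
    }

    void* do_allocate(std::size_t bytes, std::size_t align) override
    {
        if (bytes > MAX_BYTES || align > ALIGN) return ::operator new(bytes, std::align_val_t{align});
        const std::size_t idx = PoolIndex(bytes);
        if (FreeChunk* chunk = m_pools[idx]) { // reuse a freed node of this size class
            m_pools[idx] = chunk->next;
            return chunk;
        }
        // A real implementation carves chunks out of large blocks; one
        // allocation per node keeps this sketch short.
        void* p = ::operator new((idx + 1) * ALIGN);
        m_allocated.push_back(p);
        return p;
    }

    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override
    {
        if (bytes > MAX_BYTES || align > ALIGN) {
            ::operator delete(p, std::align_val_t{align});
            return;
        }
        // Put the node back onto the free list of its size class.
        const std::size_t idx = PoolIndex(bytes);
        m_pools[idx] = new (p) FreeChunk{m_pools[idx]};
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override
    {
        return this == &other;
    }

    std::array<FreeChunk*, MAX_BYTES / ALIGN> m_pools{};
    std::vector<void*> m_allocated; // so the destructor can release pooled memory

public:
    ~NodePoolResourceSketch()
    {
        for (void* p : m_allocated) ::operator delete(p);
    }
};
```

With this shape, the usage snippet above works unchanged, since a pointer to a memory_resource converts implicitly to a std::pmr::polymorphic_allocator.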

@sipa
Member

sipa commented Jun 6, 2022

@martinus Cool, that looks a lot simpler. The runtime indirection from the polymorphic resource does sound like it'd introduce a performance penalty, so I think we'll want benchmarks to see if it doesn't hurt.

A few thoughts:

  • I like the idea of giving the resource a pool for every allocation size up to a limit; that seems far less fragile than something that only works for one size.
  • It seems it would be possible to actually (possibly later, even) introduce a class like std::pmr::polymorphic_allocator, but which is parametrized by the concrete type of the resource. If the NodePoolResource is marked final (perhaps a good idea to do in general), then the compiler would be able to optimize away the dynamic calls even. This would allow getting the same benefits as the earlier approach in a more incremental way (and perhaps may be useful for benchmarking the overhead of the polymorphism exactly).
  • I believe it may be desirable to round up size at the start of do_allocate / do_deallocate instead of using 0 for small sizes. While that is not optimally efficient, it's still better in practice than deferring to the default allocator (as at least on common platforms, that has an overhead of at least two pointers). The requirements would be (a) the rounded size has to be a multiple of alignment_of(FreeList*) and of alignment, and (b) it has to be at least sizeof(FreeList*) and at least the requested size. Because sizes are always a multiple of alignment, and alignments are always powers of two, I believe these two requirements can be met as std::max(S, (size + A - 1) & ~size_t{A - 1}), where S = sizeof(FreeList*) and A = alignment_of(FreeList*) (a small demonstration follows this list).
  • When the above is done, I think the code can otherwise actually ignore the input alignment value (making it work for any alignment up to the alignment of the pools themselves, which ought to be sufficient for any object).
  • Would it make sense to use an array of pools, rather than a vector? That avoids one memory indirection at runtime, and for MAX_BLOCK_SIZE_BYTES=256, an array of 63 elements suffices (even with A=S=4).
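
A small check of that rounding expression, using illustrative 64-bit values for S and A:

```cpp
#include <algorithm>
#include <cstddef>

// Round a requested size up to a multiple of A and to at least S, as
// suggested above; assumes A is a power of two.
constexpr std::size_t RoundUp(std::size_t size, std::size_t S = 8, std::size_t A = 8)
{
    return std::max(S, (size + A - 1) & ~std::size_t{A - 1});
}

static_assert(RoundUp(1) == 8);     // small sizes still fit a FreeList*
static_assert(RoundUp(12) == 16);   // rounded to the next multiple of A
static_assert(RoundUp(104) == 104); // already a multiple of A, unchanged
```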

@martinus
Contributor Author

martinus commented Jun 7, 2022

Thanks @sipa for having a close look!

I believe it may be desirable to round up size at the start of do_allocate / do_deallocate instead of using 0 for small sizes.

I can do that and it probably makes the resource more widely usable, but at least for the CCoinsMap it won't make a difference because it has the same alignment as the FreeList*.

Would it make sense to use an array of pools, rather than a vector?

Yes! I originally wanted to get rid of the template argument, but using std::array is actually a bit faster in my benchmarks so I'll go that route.

I'm working on that in this branch, where I'm now using std::array and rounding up: master...martinus:2022-06-very-not-scary-NodePoolResource

I hope to have something PR-ready in a week or two.

@sipa
Member

sipa commented Jun 7, 2022

@martinus Here is a generic memory resource fuzzer: https://github.com/sipa/bitcoin/commits/202206_pmrallocfuzz

It also includes a somewhat modified version of your code, with a few more template parameters/sanity checks, and the rounding as I was thinking of it. Seems to work great. Feel free to try on top of your version of the resource, or combine ideas/whatever.

I like making the code generally usable in wider scenarios than just the ones we care about, as it means generic tests can be written for it, which need fewer assumptions.

Another idea for something to include in the tests which I haven't done: add a member function to the resource to ask if it still has any outstanding allocations (by summing the size of all allocated chunks, and comparing with the sum of all elements in the freelist).
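
A sketch of what such a member could look like, assuming (hypothetically) that the resource keeps its blocks in m_allocated_chunks with a size_bytes field and its per-size free lists in m_pools; a real check would also need to count any not-yet-carved tail of the current block:

```cpp
// Hypothetical member of the pool resource: true if every byte obtained from
// the system is currently sitting in a free list, i.e. the container using
// the resource has released all of its nodes.
bool NoOutstandingAllocations() const
{
    std::size_t total = 0;
    for (const auto& chunk : m_allocated_chunks) total += chunk.size_bytes;

    std::size_t in_freelists = 0;
    for (std::size_t i = 0; i < m_pools.size(); ++i) {
        const std::size_t chunk_size = (i + 1) * ALIGN; // size class of pool i
        for (const auto* p = m_pools[i]; p != nullptr; p = p->next) in_freelists += chunk_size;
    }
    return total == in_freelists;
}
```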

@martinus
Contributor Author

Another idea for something to include in the tests which I haven't done: add a member function to the resource to ask if it still has any outstanding allocations (by summing the size of all allocated chunks, and comparing with the sum of all elements in the freelist).

Excellent idea! I've now added an even stronger check: make sure that all data in the freelist actually comes from the chunks, doesn't overlap in any way, and that each byte from the chunks can be found in the freelists or is still available.

@fanquake fanquake removed this from the 24.0 milestone Sep 13, 2022
achow101 added a commit that referenced this pull request Apr 20, 2023
9f947fc Use PoolAllocator for CCoinsMap (Martin Leitner-Ankerl)
5e4ac5a Call ReallocateCache() on each Flush() (Martin Leitner-Ankerl)
1afca6b Add PoolResource fuzzer (Martin Leitner-Ankerl)
e19943f Calculate memory usage correctly for unordered_maps that use PoolAllocator (Martin Leitner-Ankerl)
b8401c3 Add pool based memory resource & allocator (Martin Leitner-Ankerl)

Pull request description:

  A memory resource similar to `std::pmr::unsynchronized_pool_resource`, but optimized for node-based containers. The goal is to be able to cache more coins with the same memory usage, and allocate/deallocate faster.

  This is a reimplementation of #22702. The goal was to implement it in a way that is simpler to review & test.

  * There is now a generic `PoolResource` for allocating/deallocating memory. This has practically the same API as `std::pmr::memory_resource`. (Unfortunately I cannot use std::pmr because libc++ simply doesn't implement that API).
  * Thanks to sipa there is now a fuzzer for PoolResource! On a fast machine I ran it for ~770 million executions without finding any issue.

  * The estimation of the correct node size is now gone, PoolResource now has multiple pools and just needs to be created large enough to have space for the unordered_map nodes.

  I ran benchmarks with #22702, mergebase, and this PR on a frequency-locked Intel i7-8700 with clang++ 13.0.1, reindexing up to block 690000.

  ```sh
  bitcoind -dbcache=5000 -assumevalid=00000000000000000002a23d6df20eecec15b21d32c75833cce28f113de888b7 -reindex-chainstate -printtoconsole=0 -stopatheight=690000
  ```

  The performance is practically identical with #22702, just 0.4% slower. It's ~21% faster than master:

  ![Progress in Million Transactions over Time(2)](https://user-images.githubusercontent.com/14386/173288685-91952ade-f304-4825-8bfb-0725a71ca17b.png)

  ![Size of Cache in MiB over Time](https://user-images.githubusercontent.com/14386/173291421-e6b410be-ac77-479b-ad24-5fafcebf81eb.png)
  Note that on cache drops mergebase's memory doesn't go down as far because it does not free the `CCoinsMap` bucket array.

  ![Size of Cache in Million tx over Time(1)](https://user-images.githubusercontent.com/14386/173288703-a80c9c9e-93c8-4a16-9df8-610c89c61cc4.png)

ACKs for top commit:
  LarryRuane:
    ACK 9f947fc
  achow101:
    re-ACK 9f947fc
  john-moffett:
    ACK 9f947fc
  jonatack:
    re-ACK 9f947fc

Tree-SHA512: 48caf57d1775875a612b54388ef64c53952cd48741cacfe20d89049f2fb35301b5c28e69264b7d659a3ca33d4c714d47bafad6fd547c4075f08b45acc87c0f45
@bitcoin bitcoin locked and limited conversation to collaborators Sep 13, 2023