Conversation

martinus
Contributor

@martinus martinus commented Sep 4, 2019

This is a custom allocator for node based containers, as described in #16718 (comment). Hopefully, this is a less controversial change and easier to review than #16718.

The allocator retains the memory of freed nodes and allocates nodes in bulk. In my benchmarks, using this pool allocator for CCoinsMap gives about 16% faster -reindex-chainstate, and decreases memory usage by ~8%. The allocator is potentially useful for all node-based containers like std::list, std::map, std::unordered_map, etc. I believe it is especially useful on systems where malloc() and free() are relatively expensive operations, since this allocator dramatically reduces the number of those calls.

Some noteworthy properties about this allocator:

  • It relies on the fact that node-based containers allocate nodes one at a time. If a container doesn't behave like this, the allocator will still work, but it won't be any faster.

  • The allocator determines the node size, which is then used for the whole pool, from the first call to allocate(1). So if the container's first allocate(1) call is not for a single node, the pool will be set up with the wrong size and won't be usable for the nodes. With C++17 it would be possible to use the container's node_type to correctly initialize the pool's size, but for now we have to rely on this heuristic.

  • Copying / moving / swapping a map propagates the allocator. So if two different pools are used for two containers a and b, then after a = b the container a will use b's pool.

  • The pool is not thread-safe. Two containers that use the same pool from different threads won't work.

  • A pool's memory is only actually destroyed in the pool's destructor.
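
To make these properties concrete, here is a minimal sketch of the idea (my illustration only, not the PR's actual bulk_pool.h; NodePool and all names are hypothetical): a free list threaded through fixed-size slots carved out of bulk-allocated blocks, with the node size fixed by the first allocation, memory retained on deallocate, and blocks only released in the destructor.

    #include <cassert>
    #include <cstddef>
    #include <new>
    #include <vector>

    // Hypothetical sketch of a bulk pool; not the PR's bulk_pool.h.
    class NodePool {
    public:
        NodePool() = default;
        NodePool(const NodePool&) = delete; // not thread-safe, not shareable
        ~NodePool() {
            for (void* block : m_blocks) ::operator delete(block); // memory dies with the pool
        }

        void* allocate(std::size_t bytes) {
            if (m_node_size == 0) {
                m_node_size = bytes; // first call fixes the node size (the heuristic above)
                assert(m_node_size >= sizeof(void*)); // map/list nodes contain pointers, so this holds
            }
            if (bytes != m_node_size) return ::operator new(bytes); // not a node: fall back
            if (m_free == nullptr) refill();
            void* node = m_free;
            m_free = *static_cast<void**>(node); // pop the free list
            return node;
        }

        void deallocate(void* p, std::size_t bytes) noexcept {
            if (bytes != m_node_size) {
                ::operator delete(p);
                return;
            }
            *static_cast<void**>(p) = m_free; // push onto the free list; memory is retained
            m_free = p;
        }

    private:
        void refill() {
            constexpr std::size_t kNodesPerBlock = 1024; // allocate nodes in bulk
            char* block = static_cast<char*>(::operator new(m_node_size * kNodesPerBlock));
            m_blocks.push_back(block);
            for (std::size_t i = 0; i < kNodesPerBlock; ++i) {
                deallocate(block + i * m_node_size, m_node_size); // thread slots onto the free list
            }
        }

        std::size_t m_node_size = 0; // 0 until the first allocate() call
        void* m_free = nullptr;      // singly linked free list of returned nodes
        std::vector<void*> m_blocks; // bulk blocks, only released in ~NodePool()
    };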

I've added a benchmark that compares std::map and std::unordered_map, with and without the pool allocator, for the CCoinsMap.
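
And a sketch of how such a pool plugs into a standard container through the allocator interface (again illustrative only; PoolAllocator is a hypothetical name, and the propagation typedefs mirror the copy/move/swap behavior described above):

    #include <functional>
    #include <unordered_map>
    #include <utility>

    template <typename T>
    struct PoolAllocator {
        using value_type = T;

        // propagate the pool on copy / move / swap, as described above
        using propagate_on_container_copy_assignment = std::true_type;
        using propagate_on_container_move_assignment = std::true_type;
        using propagate_on_container_swap = std::true_type;

        NodePool* pool; // NodePool from the sketch above

        explicit PoolAllocator(NodePool* p) noexcept : pool(p) {}
        template <typename U>
        PoolAllocator(const PoolAllocator<U>& other) noexcept : pool(other.pool) {}

        T* allocate(std::size_t n) {
            // node-based containers call allocate(1); anything else falls
            // through to operator new inside the pool
            return static_cast<T*>(pool->allocate(n * sizeof(T)));
        }
        void deallocate(T* p, std::size_t n) noexcept { pool->deallocate(p, n * sizeof(T)); }

        friend bool operator==(const PoolAllocator& a, const PoolAllocator& b) { return a.pool == b.pool; }
        friend bool operator!=(const PoolAllocator& a, const PoolAllocator& b) { return a.pool != b.pool; }
    };

    // usage sketch: a map whose nodes come out of the pool
    using PooledMap = std::unordered_map<int, int, std::hash<int>, std::equal_to<int>,
                                         PoolAllocator<std::pair<const int, int>>>;
    // NodePool pool;
    // PooledMap map{0, std::hash<int>{}, std::equal_to<int>{},
    //               PoolAllocator<std::pair<const int, int>>{&pool}};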

@DrahtBot
Contributor

DrahtBot commented Sep 4, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #16910 (wallet: reduce loading time by using unordered maps by achow101)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@maflcko
Member

maflcko commented Sep 4, 2019

re-run ci

@maflcko maflcko closed this Sep 4, 2019
@maflcko maflcko reopened this Sep 4, 2019
@jamesob
Contributor

jamesob commented Sep 4, 2019

Concept ACK. Thanks for working on this. I did a cursory skim over the code and it looks very readable. The tests are nice too.

I'll do a more thorough review and some benchmark runs in the next few days.

@martinus martinus changed the title from "bulk pool allocator for node based containers" to "faster & less memory for sync: bulk pool allocator for node based containers" Sep 5, 2019
@martinus martinus force-pushed the 2019-08-bulkpoolallocator branch from 9e0f222 to ad55a26 on September 5, 2019 16:49
@martinus martinus closed this Sep 5, 2019
@martinus martinus reopened this Sep 5, 2019
@martinus
Contributor Author

martinus commented Sep 11, 2019

Here is a graph of another comparison run of the indexing progress over time that I get on my machine (Intel i7-8700 @ 3.20GHz, 32GB of RAM)

[graph: -reindex-chainstate progress over time, 2019-08-bulkpoolallocator vs. master]

  • 3432 sec for 2019-08-bulkpoolallocator, 4135 sec for master.
  • 6,973,488 kB max RSS for 2019-08-bulkpoolallocator, 7,643,956 kB for master.

The straight lines where no progress is made are when the cache is full and then dumped into the leveldb database.

@ktprime

ktprime commented Sep 12, 2019

I want to know how to migrate your bulk pool to my project.

@martinus
Contributor Author

martinus commented Sep 12, 2019

I want to know how to migrate your bulk pool to my project.

Feel free to copy bulk_pool.h to your project. It's under the MIT license. I might extract the file into a standalone project with more benchmarks, but that's not high on my priority list.

I am interested though in any improvements you see with this allocator!

@promag
Contributor

promag commented Sep 15, 2019

Looks like the improvement grows over time/progress? Have you checked with a greater -stopatheight?

@martinus
Contributor Author

I've also made a graph with -stopatheight 594000: [graph: indexing progress over time]

@jamesob
Contributor

jamesob commented Sep 27, 2019

After running benchmarks across a few different machines, I'm pretty confused.

The raw results are below, but the headline is that I'm consistently seeing branches using this allocator outperform master but also vice versa depending upon the host. It's about half and half, and when master wins it's by a pretty dramatic margin.

E.g. master is reliably faster than this branch and #16718 on bench-ssd-2, bench-ssd-5, and bench-hdd-5 (except for one weird run where master took 41 hours??). On all other hosts (ssd-3, ssd-4, hdd-4), these branches beat master consistently.

This is very confusing to me because AFAICT -reindex-chainstate should be a deterministic process - i.e. I'd expect the same logical execution for a given height on mainnet regardless of host. I guess maybe blockfile layouts can differ, but would that really be responsible for so dramatic a difference?

Before anyone asks: these are dedicated benchmarking machines (specs listed below). Before each run I'm dropping all caches (sudo /sbin/swapoff -a; sudo /sbin/sysctl vm.drop_caches=3;), tuning with pyperf (sudo /usr/local/bin/pyperf system tune;), and I've ensured that the CPU governors are set to performance.

Hostname:              bench-ssd-2
Kernel:                Linux 4.9.0-8-amd64
OS:                    Debian GNU/Linux 9
RAM (GB):              7.71
Architecture:          x86_64
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    4
Model name:            Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
-reindex-chainstate to 550,000 (dbcache 4000,4000,5000)


hostname     branch                               runtime   peak mem  cpu%
--------     ------                               -------   --------  ----

bench-ssd-2  jamesob/2019-08-robinhood            8:44:15   5268.90MB 356%
                                                  8:44:08   5279.89MB 356%
                                                  8:54:16   6209.94MB 350%

             master                               3:56:18   6257.37MB 211%
                                                  3:52:25   6322.87MB 215%
                                                  6:09:23   7184.23MB 138%

             martinus/2019-08-bulkpoolallocator   8:57:44   5680.03MB 349%
                                                  8:57:50   5710.53MB 349%
                                                  9:19:26   6634.91MB 336%

bench-ssd-3  jamesob/2019-08-robinhood            1:53:11   5288.62MB 88%
                                                  1:56:02   5300.98MB 87%
                                                  2:28:13   6289.06MB 71%

             martinus/2019-08-bulkpoolallocator   2:14:50   5703.27MB 83%
                                                  2:11:09   5709.48MB 85%
                                                  3:28:15   6747.18MB 56%

             master                               2:35:11   6272.42MB 88%
                                                  2:32:48   6341.90MB 89%
                                                  5:31:51   7212.80MB 45%

bench-ssd-4  master                               2:54:59   6268.15MB 79%
                                                  2:35:52   6277.92MB 88%
                                                  5:14:52   7155.87MB 47%

             jamesob/2019-08-robinhood            1:57:07   5269.11MB 85%
                                                  1:53:59   5351.89MB 88%
                                                  2:28:55   6312.53MB 71%

             martinus/2019-08-bulkpoolallocator   2:12:36   5715.95MB 84%
                                                  2:09:30   5668.43MB 86%
                                                  3:18:38   6707.09MB 59%

bench-ssd-5  master                               3:58:37   6263.33MB 210%
                                                  3:40:44   6214.23MB 226%
                                                  5:37:39   7109.82MB 151%

             martinus/2019-08-bulkpoolallocator   8:53:29   5710.38MB 352%
                                                  8:47:41   5692.83MB 356%
                                                  9:25:06   6553.30MB 333%

             jamesob/2019-08-robinhood            8:41:53   5294.47MB 357%
                                                  8:36:57   5331.02MB 360%
                                                  8:53:55   6176.95MB 350%

bench-hdd-4  martinus/2019-08-bulkpoolallocator   3:50:01   5729.15MB 49%
                                                  3:53:44   5793.15MB 47%
                                                  17:20:06  6741.35MB 11%

             jamesob/2019-08-robinhood            2:55:47   5310.89MB 57%
                                                  2:54:19   5306.11MB 57%
                                                  9:04:22   6220.20MB 19%

             master                               10:11:25  6257.50MB 23%
                                                  5:25:15   6256.57MB 42%
                                                  40:32:45  7123.75MB 6%

bench-hdd-5  martinus/2019-08-bulkpoolallocator   10:04:43  5703.16MB 310%
                                                  10:03:57  5775.29MB 311%
                                                  24:04:41  6728.74MB 130%

             master                               8:05:00   6248.24MB 103%
                                                  6:08:48   6250.89MB 135%
                                                  41:39:38  7152.32MB 20%

             jamesob/2019-08-robinhood            9:25:51   5295.75MB 329%
                                                  9:24:18   5297.03MB 330%
                                                  13:15:56  6134.24MB 235%

I'm wondering if this allocator is behaving pathologically based upon some different initial condition across the hosts, but for now I'm pretty stumped. In the coming weeks I'll work on tooling that will allow me to get a more precise idea of where time is being spent and will roll all that into bitcoinperf.

@martinus
Contributor Author

It's really strange that the difference is so dramatic. Do you have the debug.log files, or some graphs of the progress? It would be interesting to see where it's slowing down: e.g. if it's in the DB sync, or continuously, or right before a sync.

I think since your host only has 7.7GB of memory, the slowdown might come from the host swapping. Maybe the problem is the way bitcoin handles the cache: as transactions are added to the cache, it slowly fills up.
With some bad luck / bad configuration, the hashmap can reach its maximum load factor just as memory use gets close to the actual dbcache limit. If the map has to double its size right at that time, it will temporarily need a lot more memory than the dbcache allows and has to rehash; once this costly operation is done, it will be over the dbcache limit, so the cache is dumped and thrown away.

So in that case it would be much better logic to flush the cache once the load factor has reached its maximum and another insert would bump it over the limit.
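
A minimal sketch of that logic, assuming a standard unordered_map-style interface (WouldRehashOnInsert is a hypothetical helper, not code from this PR):

    #include <unordered_map>

    // true if inserting one more element would exceed
    // bucket_count() * max_load_factor() and therefore trigger a rehash
    template <typename Map>
    bool WouldRehashOnInsert(const Map& m)
    {
        return m.size() + 1 > m.bucket_count() * static_cast<double>(m.max_load_factor());
    }

    // sketch of the cache logic: flush before the insert that would double
    // the table while usage is already near the dbcache limit
    // (NearDbcacheLimit and FlushCache are placeholders)
    //
    // if (NearDbcacheLimit() && WouldRehashOnInsert(cacheCoins)) FlushCache();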

@martinus martinus force-pushed the 2019-08-bulkpoolallocator branch from 52ee573 to 329754d on October 5, 2019 16:55
@martinus
Contributor Author

martinus commented Oct 5, 2019

rebased & squashed in 329754d

Contributor

@ryanofsky ryanofsky left a comment


Skimmed code and looks pretty good. Just had two questions below.

In my benchmarks, using this pool allocator for CCoinsMap gives about
16% faster `-reindex-chainstate`, and decreases memory usage by 8%.
The allocator is potentially useful for all node-based containers like
`std::list`, `std::map`, `std::unordered_map`.
@martinus martinus force-pushed the 2019-08-bulkpoolallocator branch from 329754d to 54ada87 on November 1, 2019 07:20
@@ -215,10 +216,11 @@ class CCoinsViewCache : public CCoinsViewBacked
* declared as "const".
*/
mutable uint256 hashBlock;
mutable CCoinsMap cacheCoins;
mutable bulk_pool::Pool cacheCoinsPool{};
Contributor


Given that the pool only grows and uses increasing memory over time, and that its lifetime is effectively the lifetime of the entire node, would it make sense to add some kind of garbage collection method for CCoinsViewCache that would free unneeded memory (maybe after finishing syncing, or after flushing, or on an interval, or from an RPC a user could call)?

Contributor


I have the following code for something I'm working on. It should be amortized cheap to call after every erase; you could also give it different parameters during sync and after (e.g., always compact after sync, but during sync only compact if there's more than 50MB to free).

    static void resize_if_savings(CTxMemPoolEntry::relatives& relatives) {
        assert(relatives.max_load_factor() == 1);
        // If we're at 0, clear out the map
        if (relatives.size() == 0) {
            CTxMemPoolEntry::relatives tmp;
            std::swap(tmp, relatives);
            return;
        }
        // don't bother saving for small enough sets
        // 19 buckets isn't very large, and fits in with the usual
        // prime rehash policies
        if (relatives.bucket_count() <= 19) return;
        // don't bother rehashing if we're more than half full
        const size_t full_size = relatives.bucket_count();
        if (relatives.size() > full_size / 2) return;

        // rebuild into a right-sized map and swap it in
        CTxMemPoolEntry::relatives tmp{std::make_move_iterator(relatives.begin()),
                                       std::make_move_iterator(relatives.end()),
                                       relatives.size()};
        std::swap(tmp, relatives);
    }

Contributor Author


@JeremyRubin that looks interesting. I think the if (relatives.size() > full_size/2) return; is a bit too aggressive though. Consider a hashmap that uses doubling as its resize strategy: if it has e.g. 256 elements and you add one, it will double its size, and when you remove one element again it would already trigger the rehashing to decrease its size.

Contributor


On my system the rehash policy is the prime less than double policy, which ensures that we have a growth factor < 2 for all resizes. So in the case of 256, we would first "double" to 503. Then we would be at 257/503 "fullness", where half would be 251. So we can remove up to 6 elements before triggering a re-hash. So the pathological cases are a bit more precise, see below:

If you're at exactly 251, then you add one, you double to 503, and then you can remove two before triggering a rehash (strict inequality). Let's say you choose to remove a third. Then you'll rehash to either 467 or 257 (implementation depending -- I believe it's usually 257, because we go to the smallest prime larger than N). That means you will need to remove down to 233 (~20 more elements) before triggering a rehash, OR more likely ~125 elements. If instead you then add elements back, you won't trigger a rehash until you hit 467, so you can add back another ~215 elements before having to go again, OR more likely 7 elements.

So there is a case to be made that it's less than ideal, but it's not quite as bad given the primes.

There's also a case to be made (at least for the use cases I'm looking at) where you're usually in a streak of erases, followed by a streak of adds. So while it is possible to trigger the bad behavior, it's atypical.

Therefore I don't actually think it's that bad to use half as the shrink test, given that it quickly gets us shrunk to a more than half full table.

But if we do a resize at, say, n <= (buckets - (buckets/100)*54), and resize to (n + (n/100)*36), then we will always be able to add at least 36% after a resize and remove 1/(1 + 0.36*1.08) - 0.36 ≈ 36% more before resizing again. (The ~1.08 on top of the 36% is, I think, half the average overshoot from prime to prime. This is not a great estimate, but it can be improved.)
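
To check which growth policy a given standard library actually uses, a small probe like this (illustrative only, not part of any patch) prints the bucket count at each rehash:

    #include <cstdio>
    #include <unordered_map>

    // Print bucket_count() each time inserting one more element triggers a
    // rehash, revealing the implementation's growth sequence (primes on
    // libstdc++, powers of two on some other standard libraries).
    int main()
    {
        std::unordered_map<int, int> m;
        std::size_t last = m.bucket_count();
        for (int i = 0; i < 600; ++i) {
            m.emplace(i, i);
            if (m.bucket_count() != last) {
                std::printf("size=%d: buckets %zu -> %zu\n", i + 1, last, m.bucket_count());
                last = m.bucket_count();
            }
        }
    }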

@ryanofsky
Contributor

ryanofsky commented Nov 1, 2019

James walked me through the benchmarks, and I guess a summary is that the benchmarks do generally look good and show expected decreases in runtime and memory usage at varying dbcache sizes.

The problem is that in some benchmark setups, reindexing reliably but for no explicable reason takes a lot longer (9 hours vs 2 hours) on the bulk allocator and robinhood branches than it does on the master branch. That is, the robinhood and bulk allocator branches generally perform better in the ways we would expect, but on some particular benchmarking machines with particular datadirs, reindexing takes a lot longer when using bulk allocator and robinhood branches instead of the master branch.

Also, very strangely, copying the datadir from a benchmarking machine where reindexing with the bulk allocator branch was fast, to a benchmarking machine where reindexing with the bulk allocator branch was slow, somehow makes reindexing on that same machine fast as well. This is suspicious and probably indicates either a mistake made in benchmark setup, or the presence of a pre-existing pathological performance bug in bitcoin somehow triggered by changing the allocator. (But even that doesn't make much sense, because while it's possible to imagine the layout of block data in the datadir impacting reindex time, the actual block reads done during reindexing should be identical between the master and bulk allocator branches.)

It does seem like we need to debug the case where switching to the bulk allocator appears to cause reindexing slowdowns with particular datadirs, even though using the bulk allocator could only be causing these slowdowns very indirectly.

@jamesob
Contributor

jamesob commented Nov 1, 2019

Just as an update, I've got a plan for diagnosing the benching weirdness I've seen thus far. With any luck I'll be posting an analysis (or at least additional information) in the next few days.

@martinus
Contributor Author

martinus commented Nov 1, 2019

Is it possible that this weirdness can be explained by different assumed-valid points? Either in the branches, or on the disk. It was a surprise to me: #17060 (comment)

As @MarcoFalke wrote in #17060 (comment), it is probably best to either use -noassumevalid or use the same block hash, and make sure that block is actually available on all disks.

@jamesob
Contributor

jamesob commented Nov 8, 2019

Okay, I'm back with good news and bad(ish) news. The good news is that the bench weirdness has been resolved by comparing master (6a7c40b, bench/master.1) to a version of this branch rebased on top of that same commit (bench/alloc.1). So I'm guessing the weirdness before may have been related to an assumevalid bump, but I'm not definitively sure.

The bad news is that the now-very-consistent results don't show as much of an improvement as I expected.

I did two separate runs over the course of a few days - for each machine, the branches were tested twice back-to-back in random order (results are ordered the same). The command tested was bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=$DBCACHE -connect=0.

host         branch                 time              %     maxmem  cpu% dbcache
bench-ssd-2  bench/master.1             9:13:57    1.00  5777.93MB 344% dbcache=4000MB
bench-ssd-2  bench/master.1             9:16:30    1.00  5760.55MB 343% dbcache=4000MB
bench-ssd-2  bench/alloc.1              9:01:19    0.97  5466.12MB 348% dbcache=4000MB
bench-ssd-2  bench/alloc.1              9:01:06    0.97  5453.22MB 349% dbcache=4000MB

bench-ssd-3  bench/alloc.1              9:03:00    0.97  5451.97MB 348% dbcache=4000MB
bench-ssd-3  bench/alloc.1              9:01:58    0.97  5474.70MB 348% dbcache=4000MB
bench-ssd-3  bench/master.1             9:16:33    1.00  5772.59MB 343% dbcache=4000MB
bench-ssd-3  bench/master.1             9:18:04    1.00  5781.47MB 342% dbcache=4000MB

bench-ssd-4  bench/master.1             9:15:28    1.00  5781.36MB 343% dbcache=4000MB
bench-ssd-4  bench/master.1             9:15:41    1.00  5765.45MB 343% dbcache=4000MB
bench-ssd-4  bench/alloc.1              9:02:24    0.98  5451.96MB 348% dbcache=4000MB
bench-ssd-4  bench/alloc.1              9:01:49    0.98  5443.95MB 348% dbcache=4000MB

bench-ssd-5  bench/alloc.1              8:57:42    0.97  5456.22MB 351% dbcache=4000MB
bench-ssd-5  bench/alloc.1              8:57:48    0.97  5464.43MB 351% dbcache=4000MB
bench-ssd-5  bench/master.1             9:12:06    1.00  5787.68MB 345% dbcache=4000MB
bench-ssd-5  bench/master.1             9:11:49    1.00  5777.87MB 346% dbcache=4000MB

bench-strong bench/alloc.1              14:58:34   0.99  5425.66MB 548% dbcache=4000MB
bench-strong bench/alloc.1              14:57:55   0.99  5463.18MB 548% dbcache=4000MB
bench-strong bench/master.1             15:04:45   1.00  5751.64MB 545% dbcache=4000MB
bench-strong bench/master.1             15:04:06   1.00  5734.97MB 545% dbcache=4000MB

So we see very consistent times across branches and machines, but the gain associated with this branch is pretty mild (~2-3%).

All of the bench-ssd-* machines resemble each other pretty closely. bench-strong has much more CPU and RAM (which led me to be initially surprised by the results) but then I realized it has a spinning disk. The specs for the machines are below:

(representative of the bench-ssd-* machines)

key value
hostname bench-ssd-2
cpu_model_name Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
ram_gb 7.712375640869141
os ['Debian GNU/Linux', '9', 'stretch']
arch x86_64
kernel 4.9.0-8-amd64
read_iops (/home/ccl/bitcoinperf) 2032
write_iops (/home/ccl/bitcoinperf) 672

bench-strong

key value
hostname bench-strong
cpu_model_name Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ram_gb 31.353431701660156
os ['Debian GNU/Linux', '10', 'buster']
arch x86_64
kernel 4.19.0-5-amd64
read_iops (/home/james/.bitcoin) 332
write_iops (/home/james/.bitcoin) 113

In order to make certain that these runtime characteristics aren't unstable based on dbcache, I reran this with dbcache=5000 (from dbcache=4000) - in previous benchmarking sessions I hadn't seen the weird "bi-modal" behavior until dbcache hit 5000.

host         branch                 time              %     maxmem  cpu% dbcache
bench-ssd-2  bench/alloc.1              9:12:44    0.98  6397.99MB 341% dbcache=5000MB
bench-ssd-2  bench/alloc.1              9:11:06    0.98  6397.02MB 342% dbcache=5000MB
bench-ssd-2  bench/master.1             9:22:27    1.00  6697.26MB 340% dbcache=5000MB
bench-ssd-2  bench/master.1             9:22:28    1.00  6609.71MB 339% dbcache=5000MB

bench-ssd-3  bench/master.1             9:27:27    1.00  6738.98MB 336% dbcache=5000MB
bench-ssd-3  bench/master.1             9:24:56    1.00  6770.75MB 338% dbcache=5000MB
bench-ssd-3  bench/alloc.1              9:25:38    1.00  6454.57MB 334% dbcache=5000MB
bench-ssd-3  bench/alloc.1              9:12:47    0.97  6418.94MB 341% dbcache=5000MB

bench-ssd-4  bench/alloc.1              9:12:12    0.98  6543.72MB 341% dbcache=5000MB
bench-ssd-4  bench/alloc.1              9:26:04    1.00  6394.11MB 333% dbcache=5000MB
bench-ssd-4  bench/master.1             9:23:44    1.00  6749.35MB 338% dbcache=5000MB
bench-ssd-4  bench/master.1             9:22:57    0.99  6673.99MB 339% dbcache=5000MB

bench-ssd-5  bench/alloc.1              9:04:42    0.97  6380.43MB 346% dbcache=5000MB
bench-ssd-5  bench/alloc.1              9:06:23    0.97  6380.80MB 345% dbcache=5000MB
bench-ssd-5  bench/master.1             9:23:53    1.00  6666.49MB 338% dbcache=5000MB
bench-ssd-5  bench/master.1             9:20:08    0.99  6604.82MB 341% dbcache=5000MB

bench-strong bench/master.1             15:00:44   1.00  6869.63MB 545% dbcache=5000MB
bench-strong bench/master.1             14:58:43   1.00  6706.32MB 546% dbcache=5000MB
bench-strong bench/alloc.1              15:02:07   1.00  6511.17MB 545% dbcache=5000MB
bench-strong bench/alloc.1              15:02:24   1.00  6415.48MB 545% dbcache=5000MB

As you can see, the results are very similar. In fact I was a little surprised that an extra gig of dbcache doesn't seem to make any real difference in terms of runtime - if nothing else it marginally hurt the runtimes.


So anyway, take this all for what it is. I wouldn't be surprised if, after the wide variance seen in my earlier benchmarking, this is all taken with a grain of salt, but I've done what I can to rule out spurious measurements. That said, anyone should be able to reproduce these results by making superficial modifications to this script.

I would say that if these benchmarks are indeed characteristic of the performance gain associated with this PR, I am not in favor of merging it as such, because I think the risk associated with swapping out an allocator isn't compensated for by a 3% reduction in IBD runtime. But that said, I am (hopefully obviously) a big fan of this sort of work, and will continue to cheer @martinus on in his attempts to optimize what is certainly one of the biggest bottlenecks in bitcoind.

@martinus
Contributor Author

martinus commented Nov 8, 2019

Thanks for getting to the bottom of this @jamesob! It's good to see that the allocator actually seems to do its job as it's supposed to, even when the benefit is small. It's probably bigger when validation doesn't have to be redone, but the additional code complexity is probably not worth it.

@ryanofsky
Contributor

It's probably bigger when validation doesn't have to be redone, but the additional code complexity is probably not worth it.

It seems like a pool allocator could be a basic building block for a lot of future optimizations. It might also be useful for measuring the overhead of allocations, even in places where we wouldn't want to enable it in released code. So I could see the effort here being useful even if we aren't looking to merge the PR now. But I would suggest closing the PR or marking it work in progress to update the status.

One thing I'm not clear on is the difference between 3% performance improvement last measured and 16% initially measured. If there are theories about this, I'd be curious.

@jamesob jamesob mentioned this pull request Nov 18, 2019
@martinus
Contributor Author

One thing I'm not clear on is the difference between 3% performance improvement last measured and 16% initially measured. If there are theories about this, I'd be curious.

I'm pretty sure the difference is due to the additional validation caused by -noassumevalid. With that extra validation, validation is the bottleneck, so the improvement is not as pronounced. When no validation needs to happen, you see much bigger gains from this PR.

@martinus martinus closed this Nov 19, 2019
@maflcko
Member

maflcko commented Nov 19, 2019

Oh, I didn't want to imply in #17060 (comment) that -noassumevalid is the only setting that makes sense for benchmarking. Every release has a different assumed-valid block set by default. So when looking at total IBD time, it makes sense to consider the setting where no validation happens, because that is what is skipped by default. If there is a more than 10% speed-up, it should not be neglected.

@jamesob
Contributor

jamesob commented Aug 9, 2021

I've benchmarked @martinus' rebased version of this branch (per #22640 (comment)) and I'm happy to report that I'm seeing a roughly 9% speedup along with an 8% reduction in memory usage (with a very high dbcache setting). A pull request for this branch should probably be reopened, though due to the rebase I think @martinus may have to open a separate pull request.


ibd local range 500000 540000

commands index

bench name command
ibd.local.range.500000.540000 bitcoind -dbcache=10000 -debug=coindb -debug=bench -listen=0 -connect=0 -addnode=127.0.0.1:8888 -prune=9999999 -printtoconsole=0 -assumevalid=000000000000000000176c192f42ad13ab159fdb20198b87e7ba3c001e47b876

martinus/2019-08-bulkpoolallocator vs. $mergebase (absolute)

bench name x martinus/2019-08-bulkpoolallocator $mergebase
ibd.local.range.500000.540000.total_secs 3 2850.0195 (± 9.6506) 3134.5484 (± 11.6699)
ibd.local.range.500000.540000.peak_rss_KiB 3 5831204.0000 (± 3492.0817) 6358314.6667 (± 3223.6564)
ibd.local.range.500000.540000.cpu_kernel_secs 3 261.0567 (± 1.3968) 267.7467 (± 0.7879)
ibd.local.range.500000.540000.cpu_user_secs 3 12815.7433 (± 3.5496) 13162.3167 (± 13.4203)

martinus/2019-08-bulkpoolallocator vs. $mergebase (relative)

bench name x martinus/2019-08-bulkpoolallocator $mergebase
ibd.local.range.500000.540000.total_secs 3 1 1.100
ibd.local.range.500000.540000.peak_rss_KiB 3 1 1.090
ibd.local.range.500000.540000.cpu_kernel_secs 3 1 1.026
ibd.local.range.500000.540000.cpu_user_secs 3 1 1.027

@martinus
Contributor Author

martinus commented Aug 9, 2021

Thanks for the thorough benchmark! I'll try to clean up the code a bit and then open another PR with the pool allocator.

@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022