faster & less memory for sync: bulk pool allocator for node based containers #16801
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
re-run ci
Concept ACK. Thanks for working on this. I did a cursory skim over the code and it looks very readable. The tests are nice too. I'll do a more thorough review and some benchmark runs in the next few days.
Force-pushed from 9e0f222 to ad55a26
Here is a graph of another comparison run of the indexing progress over time that I get on my machine (Intel i7-8700 @ 3.20GHz, 32GB of RAM).
The straight lines where no progress is made are when the cache is full and then dumped into the leveldb database.
I want to know how to migrate your bulk pool into my project.
Feel free to copy the bulk_pool.h to your project. It's under the MIT license. I might extract the file into a standalone project with more benchmarks, but that's not high on my priority list. I am interested though in any improvements you see with this allocator!
Looks like the improvement grows over time/progress? Have you checked with a greater
After running benchmarks across a few different machines, I'm pretty confused. The raw results are below, but the headline is that I'm consistently seeing branches using this allocator outperform master but also vice versa depending upon the host. It's about half and half, and when master wins it's by a pretty dramatic margin. E.g. master is reliably faster than this branch and #16718 on bench-ssd-2, bench-ssd-5, and bench-hdd-5 (except for one weird run where master took 41 hours??). On all other hosts (ssd-3, ssd-4, hdd-4), these branches beat master consistently. This is very confusing to me because AFAICT -reindex-chainstate should be a deterministic process - i.e. I'd expect the same logical execution for a given height on mainnet regardless of host. I guess maybe blockfile layouts can differ, but would that really be responsible for so dramatic a difference? Before anyone asks: these are dedicated benchmarking machines (specs listed below). Before each run I'm dropping all caches (
I'm wondering if this allocator is behaving pathologically based upon some different initial condition across the hosts, but for now I'm pretty stumped. In the coming weeks I'll work on tooling that will allow me to get a more precise idea of where time is being spent and will roll all that into bitcoinperf.
It's really strange that the difference is so dramatic. Do you have the debug.log files, or some graphs with the progress? It would be interesting to see where it's slowing down: e.g. if it's in the DB sync, or continuously, or right before a sync. Since your host only has 7.7GB of memory, I think the slowdown might come from the host swapping. Maybe the problem is the way bitcoin is handling the cache: as transactions are added to the cache, it slowly gets full. It might be better logic to flush the cache once the load factor reaches its maximum, i.e. when another insert would bump it over the limit.
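To make that flush condition concrete, here is a minimal sketch (illustrative only, not code from this PR or from Bitcoin Core) of how one could detect that the next insert would push an unordered map past its maximum load factor:

```cpp
#include <unordered_map>

// Returns true if inserting one more element would exceed the container's
// maximum load factor and therefore trigger a rehash. A cache flush could be
// scheduled at this point instead of letting the table grow.
template <typename Map>
bool WouldRehashOnNextInsert(const Map& m)
{
    return static_cast<float>(m.size() + 1) >
           m.max_load_factor() * static_cast<float>(m.bucket_count());
}
```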
Force-pushed from 52ee573 to 329754d
rebased & squashed in 329754d
Skimmed code and looks pretty good. Just had two questions below.
In my benchmarks, using this pool allocator for CCoinsMap gives about 16% faster `-reindex-chainstate`, and decreases memory usage by 8%. The allocator is potentially useful for all node-based containers like `std::list`, `std::map`, `std::unordered_map`.
Force-pushed from 329754d to 54ada87
@@ -215,10 +216,11 @@ class CCoinsViewCache : public CCoinsViewBacked
 * declared as "const".
 */
 mutable uint256 hashBlock;
 mutable CCoinsMap cacheCoins;
 mutable bulk_pool::Pool cacheCoinsPool{};
Given that the pool only grows and uses increasing memory over time, and that its lifetime is effectively the lifetime of the entire node, would it make sense to add some kind of garbage collection method for CCoinsViewCache
that would free unneeded memory (maybe after finishing syncing, or after flushing, or on an interval, or from an RPC a user could call)?
I have the following code for something I'm working on. It should be amortized-cheap to call after every erase; you could also make it have different parameters during sync or after (e.g. always compact after sync, only compact if it's above 50MB free during sync).
static void resize_if_savings(CTxMemPoolEntry::relatives& relatives) {
    assert(relatives.max_load_factor() == 1);
    // If we're at 0, clear out the map
    if (relatives.size() == 0) {
        CTxMemPoolEntry::relatives tmp;
        std::swap(tmp, relatives);
        return;
    }
    // don't bother saving for small enough sets
    // 19 buckets isn't very large, and fits in with the usual
    // prime rehash policies
    if (relatives.bucket_count() <= 19) return;
    // don't bother rehashing if we're more than half full
    const size_t full_size = relatives.bucket_count();
    if (relatives.size() > full_size/2) return;
    // rebuild the container with a bucket count sized to the current contents
    CTxMemPoolEntry::relatives tmp{std::make_move_iterator(relatives.begin()), std::make_move_iterator(relatives.end()), relatives.size()};
    std::swap(tmp, relatives);
}
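A hedged usage sketch (hypothetical call site, not from this branch): the helper is intended to be called after erasures so the table shrinks once it is mostly empty.

```cpp
// Hypothetical wrapper: erase an entry, then give the container a chance to
// shrink. Per the comment above, resize_if_savings() is amortized-cheap, so
// calling it after every erase is acceptable.
void RemoveRelative(CTxMemPoolEntry::relatives& relatives,
                    const CTxMemPoolEntry::relatives::key_type& key)
{
    relatives.erase(key);
    resize_if_savings(relatives);
}
```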
@JeremyRubin that looks interesting. I think the `if (relatives.size() > full_size/2) return;` is a bit too aggressive though. For a hashmap that uses doubling as its resize strategy: if it has e.g. 256 elements and you add one, it will double its size, and when you then remove one element again it would already trigger the rehashing to decrease its size.
On my system the rehash policy is the prime less than double policy, which ensures that we have a growth factor < 2 for all resizes. So in the case of 256, we would first "double" to 503. Then we would be at 257/503 "fullness", where half would be 251. So we can remove up to 6 elements before triggering a re-hash. So the pathological cases are a bit more precise, see below:
If you're at exactly 251, then you add one, you double to 503, and then you can remove two before triggering a rehash (strict inequality). Let's say you choose to remove a third. Then you'll rehash to either 467 or 257 (implementations depending -- I believe it's usually 257 because we go to the smallest prime larger than N). That means you will need to remove down to 233 (~20 more elements) before triggering a rehash OR more likely ~125 elements. Let's say instead, if you add back another N, you won't trigger a rehash until you hit 467, so you can add back another ~215 elements before having to go again OR more likely, 7 elements.
So there is a case to be made that it's less than ideal, but it's not quite as bad given the primes.
There's also a case to be made (at least for the use cases I'm looking at) where you're usually in a streak of erases, followed by a streak of adds. So while it is possible to trigger the bad behavior, it's atypical.
Therefore I don't actually think it's that bad to use half as the shrink test, given that it quickly gets us shrunk to a more than half full table.
But if we do a resize at, say, n <= (buckets - (buckets/100)*54), and resize to (n + (n/100)*36), then we will always be able to add at least 36% after a resize and remove 1/(1 + 0.36*1.08) - 0.36 ≈ 36% more before resizing again. (The ~1.08 more than 36%, I think, is half the average overshoot prime to prime. This is not a great estimate but can be improved.)
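To make those thresholds concrete, here is a small sketch (my own illustration of the numbers in the comment above, not code from any branch):

```cpp
#include <cstddef>

// Shrink only once the table has dropped well below half full:
// n <= buckets - 54% of buckets, i.e. roughly 46% occupancy or less.
bool ShouldShrink(size_t size, size_t buckets)
{
    return size <= buckets - (buckets / 100) * 54;
}

// Rehash to roughly 1.36 * n buckets, leaving ~36% headroom before the next
// growth rehash, so alternating inserts and erases can't cause rehash thrashing.
size_t ShrinkTargetBuckets(size_t size)
{
    return size + (size / 100) * 36;
}
```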
James walked me through the benchmarks, and I guess a summary is that the benchmarks do generally look good and show expected decreases in runtime and memory usage at varying dbcache sizes. The problem is that in some benchmark setups, reindexing reliably but for no explicable reason takes a lot longer (9 hours vs 2 hours) on the bulk allocator and robinhood branches than it does on the master branch. That is, the robinhood and bulk allocator branches generally perform better in the ways we would expect, but on some particular benchmarking machines with particular datadirs, reindexing takes a lot longer when using bulk allocator and robinhood branches instead of the master branch.

Also, very strangely, copying the datadir from a benchmarking machine where reindexing with the bulk allocator branch was fast, to a benchmarking machine where reindexing with the bulk allocator branch was slow, somehow makes reindexing on that same machine fast as well. This is suspicious and probably indicates either a mistake made in benchmark setup, or the presence of a pre-existing pathological performance bug in bitcoin somehow triggered by changing the allocator. (But even that doesn't make much sense, because while it's possible to imagine the layout of block data in the datadir impacting reindex time, the actual block reads done during reindexing should be identical between the master and bulk allocator branches.)

It does seem like we need to debug the case where switching to the bulk allocator appears to cause reindexing slowdowns with particular datadirs, even though using the bulk allocator could only be causing these slowdowns very indirectly.
Just as an update, I've got a plan for diagnosing the benching weirdness I've seen thus far. With any luck I'll be posting an analysis (or at least additional information) in the next few days.
Is it possible that this weirdness can be explained by different assumevalid points? Either in the branches, or on the disk. It was a surprise to me: #17060 (comment). As @MarcoFalke wrote in #17060 (comment), it is probably best to either use
Okay, I'm back with good news and bad(ish) news. The good news is that the bench weirdness has been resolved by comparing master (6a7c40b,

The bad news is that the now-very-consistent results don't show as much of an improvement as I expected. I did two separate runs over the course of a few days - for each machine, the branches were tested twice back-to-back in random order (results are ordered the same). The command tested was
So we see very consistent times across branches and machines, but the gain associated with this branch is pretty mild (~2-3%). All of the bench-ssd-* machines resemble each other pretty closely.

[chart: representative of the bench-ssd-* machines]

[chart: bench-strong]
In order to make certain that these runtime characteristics aren't unstable based on dbcache, I reran this with
As you can see, the results are very similar. In fact I was a little surprised that an extra gig of dbcache doesn't seem to make any real difference in terms of runtime - if anything it marginally hurt the runtimes. So anyway, take this all for what it is. I wouldn't be surprised if, after the wide variance seen in my benchmarking, this is all taken with a grain of salt, but I've done what I can to rule out spurious measures. That said, anyone should be able to reproduce these results by making superficial modifications to this script. I would say that if these benchmarks are indeed characteristic of the performance gain associated with this PR, I am not in favor of merging it as such, because I think the risk associated with swapping out an allocator isn't compensated for by a 3% reduction in IBD runtime. But that said I am (hopefully obviously) a big fan of this sort of work, and will continue to cheer @martinus on in his attempts to optimize what is certainly one of the biggest bottlenecks in bitcoind.
Thanks for getting to the bottom of this @jamesob! It's good to see that the allocator actually seems to do its job as it's supposed to, even when the benefit is small. It's probably bigger when validation doesn't have to be redone, but the additional code complexity is probably not worth it.
It seems like a pool allocator could be a basic building block for a lot of future optimizations. It might also be useful for testing overhead of allocations even in places where we wouldn't want to enable it in released code. So I could see the effort here being useful even if we aren't looking to merge the PR now. But I would suggest closing the PR or marking it work in progress to update the status. One thing I'm not clear on is the difference between the 3% performance improvement last measured and the 16% initially measured. If there are theories about this, I'd be curious.
I'm pretty sure the difference is due to additional validation due to
Oh, I didn't want to imply in #17060 (comment) that
I've benchmarked @martinus' rebased version of this branch (per #22640 (comment)) and I'm happy to report that I'm seeing a roughly 9% speedup along with an 8% reduction in memory usage (with a very high dbcache setting). A pull request for this branch should probably be reopened, though due to the rebase I think @martinus may have to open a separate pull request.

[benchmark tables: commands index; martinus/2019-08-bulkpoolallocator vs. $mergebase (absolute); martinus/2019-08-bulkpoolallocator vs. $mergebase (relative)]
Thanks for the thorough benchmark! I'll try to clean up the code a bit and then open another PR with the pool allocator.
This is a custom allocator for node based containers, as described in #16718 (comment). Hopefully, this is a less controversial change and easier to review than #16718.
The allocator retains memory of nodes, and allocates them in bulk. In my benchmarks, using this pool allocator for `CCoinsMap` gives about 16% faster `-reindex-chainstate`, and decreases memory usage by ~8%. The allocator is potentially useful for all node-based containers like `std::list`, `std::map`, `std::unordered_map`, etc. I believe this is especially useful on systems where malloc() and free() are relatively expensive operations, since this allocator will dramatically reduce these calls.

Some noteworthy properties about this allocator:
- It relies on the fact that node-based containers allocate nodes one at a time. If a container doesn't behave like this, the allocator will still work but won't be any faster.
- The allocator determines the size of the nodes, which will then be used throughout the pool, by the first call to allocate(1). So if the container's first call to allocate(1) is not for something of node size, the pool won't be usable for nodes since it was set up with a different size. With C++17 it would be possible to use a container's `node_type` to correctly initialize the pool's size, but for now we have to rely on this heuristic.
- Copying / moving / swapping a map propagates the allocator. So if two different pools are used for two containers a and b, calling `a = b` will from now on use b's pool in a.
- The pool is not threadsafe. Two containers which use the same pool in different threads won't work.
- A pool's memory is only actually destroyed in the pool's destructor.
I've added a benchmark to compare std::map and std::unordered_map with and without the pool allocator for the `CCoinsMap`.
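To illustrate the general idea (this is my own minimal sketch, not the PR's bulk_pool.h; all names below are made up), a pool for fixed-size nodes hands out memory from large bulk allocations and keeps freed nodes on a free list instead of returning them to the system:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Minimal fixed-size node pool (illustrative only). The node size is learned
// from the first allocation, matching the allocate(1) heuristic described above.
class NodePool
{
    struct FreeNode { FreeNode* next; };

    static constexpr size_t NODES_PER_CHUNK = 8192;

    size_t m_node_size{0};                          // fixed node size, set by the first allocation
    std::vector<std::unique_ptr<char[]>> m_chunks;  // bulk allocations, freed only in the destructor
    FreeNode* m_free{nullptr};                      // singly linked list of returned nodes
    char* m_cursor{nullptr};                        // next unused byte in the current chunk
    char* m_end{nullptr};

public:
    void* Allocate(size_t bytes)
    {
        if (m_node_size == 0 && bytes >= sizeof(FreeNode)) m_node_size = bytes;
        if (bytes != m_node_size) return ::operator new(bytes); // not a node: fall back to the heap
        if (m_free != nullptr) {                                 // reuse a previously freed node
            void* p = m_free;
            m_free = m_free->next;
            return p;
        }
        if (m_cursor == nullptr || m_cursor + m_node_size > m_end) { // current chunk exhausted
            m_chunks.emplace_back(new char[m_node_size * NODES_PER_CHUNK]);
            m_cursor = m_chunks.back().get();
            m_end = m_cursor + m_node_size * NODES_PER_CHUNK;
        }
        void* p = m_cursor;
        m_cursor += m_node_size;
        return p;
    }

    void Deallocate(void* p, size_t bytes) noexcept
    {
        if (bytes != m_node_size) { ::operator delete(p); return; }
        auto* node = static_cast<FreeNode*>(p);                  // retain the node for reuse
        node->next = m_free;
        m_free = node;
    }
};
```

An allocator type satisfying the standard Allocator requirements would then forward its allocate/deallocate calls (scaled by `sizeof(T)`) to a shared pool like this; that wrapper, allocator propagation on copy/move/swap, and alignment handling are the parts the real bulk_pool.h has to get right and this sketch glosses over.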