Better SigCache Implementation #8895
Conversation
Compiling on OS X, with Xcode 8
@fanquake should be fixed, wasn't included for some reason.

Concept ACK and code review ACK. I still need to read up on the security/performance guarantees to understand better, but the implementation looks sane.

Concept ACK, if nothing else the increase in entries-per-byte should be a big win for memory usage.
I just pushed up 4 commits which further improve the cuckoo cache. These commits add epochs/generations for the cache entries, which improves the hit rate. Below is a brief description of the changes; for more detail, see the code & code documentation.

The first 2 commits are changes to the tests. The first commit simplifies what is checked in the existing tests (it was over-specific to that cache's design). The second adds a new test which fails under the existing code because the code does not prioritize newer entries.

The third patch adds generations to the cache, which work as follows: a bit vector is used to track whether an element belongs to the current or prior generation. Once the number of the current generation's non-deleted elements exceeds ~45%, the generation is retired. The previously retired generation is marked deleted. Because the cache deletes lazily, these entries are still present but are more likely to be overwritten than elements in the currently retired generation. Generations are scanned to see if they should be retired using a heuristic of how many inserts would need to occur since the last scan to exceed the generation size.

The fourth commit increases the number of hashes used per element from 2 to 8. The benefit of this change is that it permits using a larger generation size (e.g., 45%) compared to 2 hashes (30%). A larger generation size is the "effective size" of the cache, i.e., new entries stay in the current generation for longer. Performance is markedly better, especially under attack scenarios. @morcos has run several simulations which back up this claim.

There is a slight (1 bit per entry) memory overhead to this approach. This does not negatively impact this cache because the trade-off is worth it for intelligently deleting older entries. Only one bit is used for simplicity, as we get three represented states (deleted, old, current). Using more generations (let's say, 1 more bit) would allow more granular freeing of only the oldest quarter, rather than half, per scan. This has advantages, but also an additional memory cost, and two generations was sufficient for this use case.

On a slightly separate note, another side benefit of this cache that I failed to mention when I opened the PR (before these added commits) is that inserts are guaranteed to terminate (in the sigcache currently in master, insert may run forever if GetRand is "unlucky").
Force-pushed from 1e86d21 to 1cfb7d1 (compare)
I've squashed all the commits (the unsquashed version is still available here if you've already looked at this: master...JeremyRubin:cuckoocache-pull-request-not-squashed). I've also edited the PR message. The current Travis failed build seems to be related to the ongoing Dyn DDoS attack, as Travis is telling me variants of
Force-pushed from 1cfb7d1 to 3ba2bf4 (compare)
*/
inline std::array<uint32_t, 8> compute_hashes(const Element& e) const
{
    return {hash_function.template operator()<0>(e) % size,
Instead of size, store the log2 of the size, and use a bitshift or mask here. Modulus operations are very slow.

Yes, this is a rather slow operation, but I don't quite fully understand your suggestion... doesn't that only work if you're at a power-of-two size?
*/
inline std::array<uint32_t, 8> compute_hashes(const Element& e) const
{
    return {hash_function.template operator()<0>(e) % size,
Yes.
Unfortunately the size currently isn't restricted to be a power of two, and by default is not a power of two. Are you suggesting we limit it to be a power of two?
*/
inline std::array<uint32_t, 8> compute_hashes(const Element& e) const
{
    return {hash_function.template operator()<0>(e) % size,
Yes :) Sorry if that wasn't obvious. We could benchmark whether it matters compared to other operations of course, but restricting to a power of two means you at most get something that's a factor sqrt(2) off of your desired size, which seems acceptable for a cache.

Apologies, long rambly response below:

Overall, I think it's a reasonable idea to do this. Modulo is slow. It's not the bottleneck, but it's a low-cost way to make this code faster.

There are a couple of weird things, though. You currently have about 2 bits of bookkeeping overhead per entry (one bit for erasure, one bit for epoch). If you're restricting to a power-of-two size for the main cache memory, you may as well also do some of the following if the user specifies more space: more generations for finer-grained generations; additional fee tracking to preferentially evict low-fee items; some kind of mempool pointer/index map to evict signatures on mempool eviction. (Preface: I don't like this idea, but for the sake of discussion) you could also keep a large and a small cache and look up from both, letting you target size = 2^m + 2^n.

Also, power of two works reasonably well low in the series (1, 2, 4, 8, 16, 32, 64, 128), but it seems kind of uncomfortable that at, say, a 1GB cache you have to choose between 1 and 2 GB if you want to increase a little bit. Yes, sqrt(2), but people have fixed memory sizes so it does kind of matter.

Maybe a better in-between would be to do as (roughly) follows (should work on GCC, clang, and MSVC):
DEFAULT_MAX_SIGCACHE_SIZE = 32;
#ifdef _MSC_VER
#include <intrin.h>
#define __builtin_popcount __popcnt
#endif
bool fast_mod = __builtin_popcount(size) == 1;
compute_hash(E e) {
    if (fast_mod)
        // compute hash with masks
    else
        // compute hash with mod
}
and by default not count the bookkeeping bits in the size computation. That way, if you pass in a power of two megabytes as the parameter, you get this magic boost in performance, and if you pass in a non-power-of-two you get what you asked for. The branch shouldn't hurt because it should be easily predicted.
Tested ACK either 3ba2bf4 or 121dfcd. I gave some nits about the comments offline. I have extensively tested the performance of this patch. Lock contention on the sig cache is a serious bottleneck to ConnectBlock performance with 8 or more cores. This patch appears to all but eliminate that contention and leads to a 40% improvement in ConnectBlock time for 16 cores. It also allows for further performance improvements that would not see any benefit until this bottleneck was removed. I have also tested the hit-rate performance and it is a significant improvement over the existing sigcache in the event of a spam attack, for a long-running node, or in the event of a reorg. In addition to reviewing the included tests, I've also run my own tests and done code review to be sure that there are no false positives.

ACK 453aef4

Testing this (OSX 10.12, XCode 8.1):

Errors trying to compile the tests:

@fanquake I think I can fix the warnings, but I'm curious as to how much that's a goal given that that flag should be deprecated & there are lots of build errors. Happy to push a squashme if you think so. The build errors should be fixable, must be a different include order or something, I need to add a

@JeremyRubin The warning isn't so much of an issue, was just making a note of it.

@fanquake please confirm fix when you have a moment.

@JeremyRubin Everything is compiling fine now. Can you suggest tests/benchmarks that reviewers can run to test the performance increase here, or is that going to be hard for anyone without a lot of cores?
* - The name "allow_erase" is used because the real discard happens later.
* - The name "please_keep" is used because elements may be erased anyways on insert.
*
* @tparam Element should be a POD type that is 32-alignable
Where does the requirement for POD come from?

I suppose the way it is currently written (using swap and move) it should be safe to remove both requirements (POD and 32-alignable). Prior versions of the code, I think, were unsafe for that use case.

(I can do a squashme with just that unless you have other small feedback.)
*
* User Must Guarantee:
*
* 1) Write Requires synchronized access (e.g., a lock)
Any reason why these locks aren't integrated into the cache class? It isn't obvious to me how to correctly synchronize the erase method with the rest.

Yes, there is a reason.

The locks aren't actually needed at all in how the cache is currently used. All that needs to happen, per thread, is: at the beginning of block processing, a memory acquire; at the end of block processing, a memory release. Additionally, one more lock must be acquired before block processing by the master to ensure that there are no concurrent writers (from addtomempool), and that master must not release until all slaves release as well. This is much better than the repeated acquire/release of locking/unlocking on every single signature operation.

I have written a version of the code which has these semantics, but some feedback from others was that it felt fragile & prone to someone else breaking that behavior down the line (bad for maintainability). In the interest of keeping that optimization available and not introducing a maintenance hazard, I kept the locking where it was, in the layer up.
@fanquake yeah, it'll be tough; there's not much speed improvement from losing the erase locks until ~8 cores. What you could test is better memory efficiency, by allocating smaller caches (2, 4, 8, 16...) and comparing performance to a node running with that size and the old cache. I'd recommend running one node and peering your test nodes exclusively through it to ensure nodes are getting the same messages for a given time. But if you're hardware limited that will be hard.
A few minor nits on the code, but I'd be fine with it as-is... more of my comments were about your comments 👍.
    std::swap(mem, d.mem);
}

/** bit_set sets an entry as discardable.
Nit: You have some trailing whitespace on 3 lines in this file, including this one.

👍
* that is only thread unsafe on calls to setup. This class bit-packs collection
* flags for memory efficiency.
*
* All operations are std::memory_order_relaxed so external mechanisms must
Why not use release/acquire - it's identical instructions on x86 and won't require a full flush anywhere else to be correct (indeed, relaxed is maybe better the way it's implemented now, but if we want the read/write locks in the sigcache to be more efficient we might not want to have a full flush there).

You're right that it's the same on x86, but it isn't for ARM. We want the operation to be relaxed, so release or acquire are over-constrained.

If later changes want to change the memory model, they should do so then. It's hard to say if a full flush is better or worse than lots of release/acquires; that would require benchmarking.

Atomic instruction mappings: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
* table, the entry attempted to be inserted is evicted.
*
*/
inline void insert(Element e)
It seems to be a strange API to require that a copy be made as the parameter in the function, and then rely on std::move to make insertion efficient (which we can't do for uint256, since it doesn't have any non-POD memory). It seems just as good (and much more common) to pass in a const reference and then just let operator=() handle the copy.

We also do a std::swap on e, so that's why we have a mutable copy.

std::swap is just as slow for uint256, though. Since there is no dynamically-allocated memory we can't really speed it up.
// case behavior (no intermittent erases) would exceed epoch size,
// with a reasonable minimum scan size.
epoch_heuristic_counter = std::max(1u, std::max(epoch_size / 16,
        epoch_size - std::min(epoch_size, epoch_unused_count)));
The std::min should be a no-op; we're in an else branch where epoch_size > epoch_unused_count.

👍
/** cache implements a cache with properties similar to a cuckoo-set
*
* The cache is able to hold up to (~(uint32_t) 1) elements.
nit: technically it can hold up to (~(uint32_t) 1) - 1, because invalid() uses ~(uint32_t) 1, though I think you meant to use 0.

I thought there was a reason I used ~(uint32_t) 1 rather than ~(uint32_t) 0, but for the life of me I can't recall it. I will change it to be that, but would appreciate your reviewing that there was no reason.
* 1) Write Requires synchronized access (e.g., a lock)
* 2) Read Requires no concurrent Write, synchronized with the last insert.
* 3) Erase requires no concurrent Write, synchronized with last insert.
* 4) An Eraser must release all Erases before allowing a new Writer.
I believe this also refers to an old/different version of this patch.

I think it's correct? I'll try to clarify.

Oh, indeed, I believe I was mistaken.
* @tparam Element should be a POD type that is 32-alignable
* @tparam Hash should be a function/callable which takes a template parameter
*   hash_select and an Element and extracts a hash from it. Should return
*   high-entropy hashes for `Hash h; h<0>(e) and h<1>(e)`.
Also now out-of-date: you're using 8 hashes.

👍
/** epoch_heuristic_counter is used to determine when a epoch
* might be aged & an expensive scan should be done.
* epoch_heuristic_counter is incremented on insert and reset to the
s/incremented/decremented/

👍
*/
void epoch_check()
{
    if ((--epoch_heuristic_counter) != 0)
nit: technically this is an off-by-one: because you check this prior to actually doing the insert, you'll always do an "expensive" scan twice for the first (and all) epochs. Practically, you might just want to set epoch_heuristic_counter to something like epoch_size + 10 by default, though I suppose it doesn't matter all that much.

👍 -- I fixed it, I think, by decrementing inside the if block (I didn't want to do a postfix decrement, to stop the counter from underflowing).
/** insert loops at most depth_limit times trying to insert a hash
* at various locations in the table via a variant of the Cuckoo Algorithm
* with two hash locations.
s/two/eight/

👍
@JeremyRubin Care to have a look at @TheBlueMatt's comments above?

Should be all addressed.

LGTM at 88b58d3

reACK 88b58d3
Code review ACK, will test. A few comment nits/questions below.

Also, it would be nice if we had a unit test for the sigcache itself...
*
* 1) On first iter, always false so defaults to locs[0]
* 2) Second iter, last_loc == locs[0] so will go to locs[1]
*
I think this comment needs to be updated now that we have more than 2 hashes, right? If I understand right, the algorithm now is: swap with the element one past the last one looked at. Example:

1) On first iter, always false so defaults to locs[0].
2) Second iter, last_loc == locs[k] for some k, so will go to locs[k+1] (wrapping back to 0 if necessary).
*/
inline bool contains(const Element& e, const bool erase) const
{

nit: random newline
*
* This may exhibit platform endian dependent behavior but because these are
* nonced hashes (random) and this state is only ever used locally it is safe.
* All that matter is local consistency.
nit: "matter" -> "matters"
}
};

// Initialized outside of VerifySignature to avoid atomic operation per call
nit: Can you clarify this comment -- what is the atomic operation you're referring to?

Also, it seems like this comment is here to explain why you moved the code from its old place to the new place (which is helpful for me to understand!), but it's perhaps not very helpful in its current form to future code readers who haven't seen the previous implementation.

I believe that a function-local static uses an atomic (perhaps even a locking!) operation under the hood to initialize, for the case where concurrent callers enter the function initially (C++11 guarantees this to be safe).

I also believe that this is not the case for a global static (initialized before concurrency is allowed).

I'm not 100% certain the spec forces conformity on this point, but I think it is at least the case in most implementations.
@sdaftuar, d4294ad should fix your documentation concerns. RE: sigcache unit tests, I could whip something up, but I think it's a little bit hard to test meaningfully (i.e., beyond what other tests using the

Tested ACK 88b58d3, and the comments in d4294ad look good too, thanks. Re: sigcache tests, I don't know the best way to address this. If I could come up with a test suite that exercised the logic sufficiently to make changes like this safer, then I'd happily suggest that, but it seems like a hard problem. Still, after reviewing, I'm comfortable enough with this PR to proceed without additional sigcache tests (and it's great that we have tests for the cuckoocache at least).

utACK. Please squash? :)
SQUASHME: Change cuckoocache to only work for powers of two, to avoid mod operator
SQUASHME: Update Documentation and simplify logarithm logic
SQUASHME: OSX Build Errors
SQUASHME: minor Feedback from sipa + bluematt
SQUASHME: DOCONLY: Clarify a few comments.
SQUASHME: Update Tests for other SQUASHMEs
Force-pushed from d4294ad to 67dac4e (compare)
@sipa Squashed down to just cache and testing commits. Unsquashed version preserved at https://github.com/JeremyRubin/bitcoin/tree/cuckoocache-pull-request-unsquashed
cuckoocache reference: bitcoin/bitcoin#8895 Signed-off-by: k155 <216k155@luxcore.io>
2749d0a build: Include cuckoocache header in Makefile (furszy)
791a51f Deduplicate SignatureCacheHasher (Jeremy Rubin)
7a5918a Decrease testcase sizes in cuckoocache tests (Jeremy Rubin)
8ef3cc9 Trivial: fix comments referencing AppInit2 (Marko Bencun)
f21b122 Ensure `-maxsigcachesize` is in valid range (John Newbery)
464b922 Add unit tests for the CuckooCache (Jeremy Rubin)
77e3b1e Add CuckooCache implementation and replace the sigcache map_type with it (Jeremy Rubin)

Pull request description:

Built on top of #1668. Part of the back port series to get us closer to upstream's sigcache current state. The main ones are:
* bitcoin#8895
* bitcoin#9480
* bitcoin#9393

ACKs for top commit:
random-zebra: Really nice backport. Tested ACK 2749d0a
Fuzzbawls: utACK 2749d0a

Tree-SHA512: 22f760edc928790f29a7bbd6e907a561b7ad266b6cc5e85f207206e0f39ae3e4328ec57c0be772c96036a15559b13f6956bea5762cace10f3a560090aeb6758b
edited: This PR has been squashed with a few revisions from the initial PR. The unsquashed version is still available here master...JeremyRubin:cuckoocache-pull-request-not-squashed. The PR message has also been updated to reflect the current squashed version more closely.
Summary
This PR replaces the boost::unordered_set in sigcache with CuckooCache::cache.
Benefits
Design
The CuckooCache uses a design similar to a cuckoo hash table. Each inserted key has eight candidate locations, given by hash functions 0 through 7 (h0..h7).
More information about the synchronization specification is in the code documentation.
Generation Heuristic
A bit vector is used to track if the element belongs to the current or prior generation. Once the number of the current generation's non-deleted elements exceeds ~45%, the generation is retired. The previously retired generation is marked deleted. Because the cache deletes lazily, these entries are still present but are more likely to be overwritten than elements in the currently retired generation. Generations are scanned to see if they should be retired using a heuristic of how many inserts would need to occur since the last scan to exceed the generation size.
There is a slight (1 bit per entry) memory overhead to this approach. This does not negatively impact this cache because the trade-off is worth it for intelligently deleting older entries. Only one bit is used for simplicity, as we get three represented states (deleted, old, current). Using more generations (let's say, 1 more bit) would allow more granular freeing of only the oldest quarter, rather than half, per scan. This has advantages, but also an additional memory cost, and two generations was sufficient for this use case.
Simulation
These simulation results are slightly outdated; @morcos will chime in below with some updated figures. They should be the same as or better than below.
This can overall be seen as a 40% improvement in validation times over master on a 16-core machine, simulated over a month. The simulation methodology sends relayed transactions and blocks to the node in the order they were received historically. On 4- and 1-core simulations, the improvement is not significant (but no worse). Our tests didn't really hit hard saturation on the cache, which suggests that the megabytes allocated to the cache could be reduced.

Testing
The PR includes a test suite which checks a few key properties:
Future Work
There are future optimizations which may be able to bring some of the advantages seen with 16 cores to 4 and 1 cores, but this PR is focused on a minimal complexity patch with a demonstrable improvement over main as a base for future work in this direction.
There are also a few cleanup related items (such as cache initialization) which are not done in this patch to make it a minimal code change. These can be fixed in a later PR if this is accepted.
Acknowledgements
Thanks to @morcos, @sdaftuar, and @TheBlueMatt for their assistance and review on this patch.