-
Notifications
You must be signed in to change notification settings - Fork 37.8k
Improve CRollingBloomFilter performance: replace modulus with FastMod #13176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Replaces the slow modulo operation with a much faster 32bit multiplication & shift. This works because the hash should be uniformly distributed between 0 and 2^32-1. This speeds up the benchmark by a factor of about 1.3: RollingBloom, 5, 1500000, 3.73733, 4.97569e-07, 4.99002e-07, 4.98372e-07 # before RollingBloom, 5, 1500000, 2.86842, 3.81630e-07, 3.83730e-07, 3.82473e-07 # FastMod Be aware that this changes the position of the bits that are toggled, so this should probably not be used for CBloomFilter which is serialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK 9aac9f9
Using FastMod instead of '%' looks correct here, and it indeed can't be used for normal bloom filters (which are normative for the BIP37 protocol).
Changed the tag: as the result of It looks like a feasible alternative in this case (h should be evenly distributed 0..2^32-1 by MurmurHash), and ~25% speed-up is nice. utACK 9aac9f9 |
// A replacement for x % n. This assumes that x and n are 32bit integers, and x is a uniformly random distributed 32bit value | ||
// which should be the case for a good hash. | ||
// See https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ | ||
static inline uint32_t FastMod(uint32_t x, size_t n) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could replace inline
with constexpr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that make a difference in the generated code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's very unlikely - the sort of optimizations that apply to constexpr things also apply to things which the compiler can infer being constant expressions (which is obvious here).
The advantage to using constexpr
is making sure those optimizations aren't made impossible for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed it's unlikely. My inclination is to minimize surface area and maximally restrain the code to reduce the space of errors that might occur. It's not always meaningful, but it can be a helpful precaution and can pay incidental dividends, e.g. by surfacing changes to the use of variables over time that were not intended. But yeah, it's a nit rather than an objection.
utACK 9aac9f9 |
Perhaps share code and or description with cuckoocache.h compute_hashes? |
@gmaxwell oh right, compute_hashes does the same. What would the preferred way to remove this duplication, refactor cuckoocache.h and to use a static method fast_mod, include & use that in bloom.cpp? |
Normally sharing common code is a good thing but I'm not sure that is worth it here, this is only a few lines, and would involve sharing code between otherwise unrelated units. Would be different if e.g. |
…s with FastMod 9aac9f9 replace modulus with FastMod (Martin Ankerl) Pull request description: Not sure if this is optimization is necessary, but anyway I have some spare time so here it is. This replaces the slow modulo operation with a much faster 64bit multiplication & shift. This works when the hash is uniformly distributed between 0 and 2^32-1. This speeds up the benchmark by a factor of about 1.3: ``` RollingBloom, 5, 1500000, 3.73733, 4.97569e-07, 4.99002e-07, 4.98372e-07 # before RollingBloom, 5, 1500000, 2.86842, 3.81630e-07, 3.83730e-07, 3.82473e-07 # FastMod ``` Be aware that this changes the internal data of the filter, so this should probably not be used for CBloomFilter because of interoperability problems. Tree-SHA512: 04104f3fb09f56c9d14458a6aad919aeb0a5af944e8ee6a31f00e93c753e22004648c1cd65bf36752b6addec528d19fb665c27b955ce1666a85a928e17afa47a
… modulus with FastMod 9aac9f9 replace modulus with FastMod (Martin Ankerl) Pull request description: Not sure if this is optimization is necessary, but anyway I have some spare time so here it is. This replaces the slow modulo operation with a much faster 64bit multiplication & shift. This works when the hash is uniformly distributed between 0 and 2^32-1. This speeds up the benchmark by a factor of about 1.3: ``` RollingBloom, 5, 1500000, 3.73733, 4.97569e-07, 4.99002e-07, 4.98372e-07 # before RollingBloom, 5, 1500000, 2.86842, 3.81630e-07, 3.83730e-07, 3.82473e-07 # FastMod ``` Be aware that this changes the internal data of the filter, so this should probably not be used for CBloomFilter because of interoperability problems. Tree-SHA512: 04104f3fb09f56c9d14458a6aad919aeb0a5af944e8ee6a31f00e93c753e22004648c1cd65bf36752b6addec528d19fb665c27b955ce1666a85a928e17afa47a
* Merge bitcoin#13176: Improve CRollingBloomFilter performance: replace modulus with FastMod 9aac9f9 replace modulus with FastMod (Martin Ankerl) Pull request description: Not sure if this is optimization is necessary, but anyway I have some spare time so here it is. This replaces the slow modulo operation with a much faster 64bit multiplication & shift. This works when the hash is uniformly distributed between 0 and 2^32-1. This speeds up the benchmark by a factor of about 1.3: ``` RollingBloom, 5, 1500000, 3.73733, 4.97569e-07, 4.99002e-07, 4.98372e-07 # before RollingBloom, 5, 1500000, 2.86842, 3.81630e-07, 3.83730e-07, 3.82473e-07 # FastMod ``` Be aware that this changes the internal data of the filter, so this should probably not be used for CBloomFilter because of interoperability problems. Tree-SHA512: 04104f3fb09f56c9d14458a6aad919aeb0a5af944e8ee6a31f00e93c753e22004648c1cd65bf36752b6addec528d19fb665c27b955ce1666a85a928e17afa47a * Use unordered_map in CSporkManager In one of my profiling sessions with many InstantSend transactions happening, calls into CSporkManager added up to about 1% of total CPU time. This is easily avoidable by using unordered maps. * Use std::unordered_map instead of std::map in limitedmap * Use unordered_set for CNode::setAskFor * Add serialization support for unordered maps and sets * Use unordered_map for mapArgs and mapMultiArgs * Let limitedmap prune in batches and use unordered_multimap Due to the batched pruning, there is no need to maintain an ordered map of values anymore. Only when nPruneAfterSize, there is a need to create a temporary ordered vector of values to figure out what can be removed. * Instead of using a multimap for mapAskFor, use a vector which we sort on demand CNode::AskFor will now push entries into an initially unordered vector instead of an ordered multimap. Only when we later want to use vecAskFor in SendMessages, we sort the vector. The vector will actually be mostly sorted in most cases as insertion order usually mimics the desired ordering. Only the last few entries might need some shuffling around. Doing the sort on-demand should be less wasteful then trying to maintain correct order all the time. * Fix compilation of tests * Fix limitedmap tests * Rename limitedmap to unordered_limitedmap to ensure backports conflict This ensures that future backports that depends on limitedmap's ordering conflict so that we are made aware of needed action. * Fix compilation error on Travis
…s with FastMod Summary: 9aac9f9 replace modulus with FastMod (Martin Ankerl) Pull request description: Not sure if this is optimization is necessary, but anyway I have some spare time so here it is. This replaces the slow modulo operation with a much faster 64bit multiplication & shift. This works when the hash is uniformly distributed between 0 and 2^32-1. This speeds up the benchmark by a factor of about 1.3: ``` RollingBloom, 5, 1500000, 3.73733, 4.97569e-07, 4.99002e-07, 4.98372e-07 # before RollingBloom, 5, 1500000, 2.86842, 3.81630e-07, 3.83730e-07, 3.82473e-07 # FastMod ``` Be aware that this changes the internal data of the filter, so this should probably not be used for CBloomFilter because of interoperability problems. Tree-SHA512: 04104f3fb09f56c9d14458a6aad919aeb0a5af944e8ee6a31f00e93c753e22004648c1cd65bf36752b6addec528d19fb665c27b955ce1666a85a928e17afa47a Backport of Core PR13176 bitcoin/bitcoin#13176 Test Plan: make check src/bench/bench_bitcoin -filter=RollingBloom Repeat above for master and compare. Before change: Benchmark, evals, iterations, total, min, max, median RollingBloom, 5, 1500000, 5.16318, 6.61227e-07, 7.3991e-07, 6.70607e-07 After change: Benchmark, evals, iterations, total, min, max, median RollingBloom, 5, 1500000, 3.73982, 4.92548e-07, 5.10237e-07, 4.95271e-07 Reviewers: deadalnix, Fabien, jasonbcox, O1 Bitcoin ABC, #bitcoin_abc Reviewed By: deadalnix, O1 Bitcoin ABC, #bitcoin_abc Differential Revision: https://reviews.bitcoinabc.org/D4160
…s with FastMod Summary: 9aac9f90d5e56752cc6cbfac48063ad29a01143c replace modulus with FastMod (Martin Ankerl) Pull request description: Not sure if this is optimization is necessary, but anyway I have some spare time so here it is. This replaces the slow modulo operation with a much faster 64bit multiplication & shift. This works when the hash is uniformly distributed between 0 and 2^32-1. This speeds up the benchmark by a factor of about 1.3: ``` RollingBloom, 5, 1500000, 3.73733, 4.97569e-07, 4.99002e-07, 4.98372e-07 # before RollingBloom, 5, 1500000, 2.86842, 3.81630e-07, 3.83730e-07, 3.82473e-07 # FastMod ``` Be aware that this changes the internal data of the filter, so this should probably not be used for CBloomFilter because of interoperability problems. Tree-SHA512: 04104f3fb09f56c9d14458a6aad919aeb0a5af944e8ee6a31f00e93c753e22004648c1cd65bf36752b6addec528d19fb665c27b955ce1666a85a928e17afa47a Backport of Core PR13176 bitcoin/bitcoin#13176 Test Plan: make check src/bench/bench_bitcoin -filter=RollingBloom Repeat above for master and compare. Before change: Benchmark, evals, iterations, total, min, max, median RollingBloom, 5, 1500000, 5.16318, 6.61227e-07, 7.3991e-07, 6.70607e-07 After change: Benchmark, evals, iterations, total, min, max, median RollingBloom, 5, 1500000, 3.73982, 4.92548e-07, 5.10237e-07, 4.95271e-07 Reviewers: deadalnix, Fabien, jasonbcox, O1 Bitcoin ABC, #bitcoin_abc Reviewed By: deadalnix, O1 Bitcoin ABC, #bitcoin_abc Differential Revision: https://reviews.bitcoinabc.org/D4160
…s with FastMod Summary: 9aac9f90d5e56752cc6cbfac48063ad29a01143c replace modulus with FastMod (Martin Ankerl) Pull request description: Not sure if this is optimization is necessary, but anyway I have some spare time so here it is. This replaces the slow modulo operation with a much faster 64bit multiplication & shift. This works when the hash is uniformly distributed between 0 and 2^32-1. This speeds up the benchmark by a factor of about 1.3: ``` RollingBloom, 5, 1500000, 3.73733, 4.97569e-07, 4.99002e-07, 4.98372e-07 # before RollingBloom, 5, 1500000, 2.86842, 3.81630e-07, 3.83730e-07, 3.82473e-07 # FastMod ``` Be aware that this changes the internal data of the filter, so this should probably not be used for CBloomFilter because of interoperability problems. Tree-SHA512: 04104f3fb09f56c9d14458a6aad919aeb0a5af944e8ee6a31f00e93c753e22004648c1cd65bf36752b6addec528d19fb665c27b955ce1666a85a928e17afa47a Backport of Core PR13176 bitcoin/bitcoin#13176 Test Plan: make check src/bench/bench_bitcoin -filter=RollingBloom Repeat above for master and compare. Before change: Benchmark, evals, iterations, total, min, max, median RollingBloom, 5, 1500000, 5.16318, 6.61227e-07, 7.3991e-07, 6.70607e-07 After change: Benchmark, evals, iterations, total, min, max, median RollingBloom, 5, 1500000, 3.73982, 4.92548e-07, 5.10237e-07, 4.95271e-07 Reviewers: deadalnix, Fabien, jasonbcox, O1 Bitcoin ABC, #bitcoin_abc Reviewed By: deadalnix, O1 Bitcoin ABC, #bitcoin_abc Differential Revision: https://reviews.bitcoinabc.org/D4160
Backport bloom filter improvements Cherry-picked from the following upstream PRs: - bitcoin/bitcoin#7113 - bitcoin/bitcoin#7818 - Only the second commit (to resolve conflicts). - bitcoin/bitcoin#7934 - bitcoin/bitcoin#8655 - Partial backport to help resolve conflicts. - bitcoin/bitcoin#9060 - bitcoin/bitcoin#9223 - bitcoin/bitcoin#9644 - Partial backport to help resolve conflicts. - bitcoin/bitcoin#9916 - bitcoin/bitcoin#9750 - bitcoin/bitcoin#13176 - bitcoin/bitcoin#13948 - bitcoin/bitcoin#16073 - bitcoin/bitcoin#18670 - bitcoin/bitcoin#18806 - Reveals upstream's covert fix for CVE-2013-5700. - bitcoin/bitcoin#19968
Backport bloom filter improvements Cherry-picked from the following upstream PRs: - bitcoin/bitcoin#7113 - bitcoin/bitcoin#7818 - Only the second commit (to resolve conflicts). - bitcoin/bitcoin#7934 - bitcoin/bitcoin#8655 - Partial backport to help resolve conflicts. - bitcoin/bitcoin#9060 - bitcoin/bitcoin#9223 - bitcoin/bitcoin#9644 - Partial backport to help resolve conflicts. - bitcoin/bitcoin#9916 - bitcoin/bitcoin#9750 - bitcoin/bitcoin#13176 - bitcoin/bitcoin#13948 - bitcoin/bitcoin#16073 - bitcoin/bitcoin#18670 - bitcoin/bitcoin#18806 - Reveals upstream's covert fix for CVE-2013-5700. - bitcoin/bitcoin#19968
Not sure if this is optimization is necessary, but anyway I have some spare time so here it is. This replaces the slow modulo operation with a much faster 64bit multiplication & shift. This works when the hash is uniformly distributed between 0 and 2^32-1. This speeds up the benchmark by a factor of about 1.3:
Be aware that this changes the internal data of the filter, so this should probably
not be used for CBloomFilter because of interoperability problems.