-
Notifications
You must be signed in to change notification settings - Fork 37.8k
Lock-Free CheckQueue #9938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lock-Free CheckQueue #9938
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good. I'm curious if you've benchmarked the lock-free version vs just dropping the last commit (possibly after removing the batching, if that was actually a loss). Given the contention on fAllOk (which could be trivially removed since you dont quit early on !fAllOk, btw) and check_mem_top/check_mem_bot, I'm surprised there is much of a win to be had on the last commit.
src/test/checkqueue_tests.cpp
Outdated
static std::atomic<size_t> n_calls; | ||
bool operator()() | ||
{ | ||
++n_calls; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the test expose more parallellism, should probably use release/acquire here, no?
src/test/checkqueue_tests.cpp
Outdated
}; | ||
~MemoryCheck(){ | ||
fake_allocated_memory -= b; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: whitespace at end of line (and stray line, it looks like).
src/test/checkqueue_tests.cpp
Outdated
CCheckQueueControl<FailingCheck> control(fail_queue.get()); | ||
size_t remaining = i; | ||
while (remaining) { | ||
size_t r = GetRand(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: something > 10 might test parallellism a bit more, no? (and in a few other places)
src/test/checkqueue_tests.cpp
Outdated
control.Add(vChecks); | ||
} | ||
bool r =control.Wait(); | ||
BOOST_REQUIRE(r || end_fails); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you meant to have an XOR here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
src/test/checkqueue_tests.cpp
Outdated
{ | ||
std::unique_lock<std::mutex> l(FrozenCleanupCheck::m); | ||
// Wait until the queue has finished all jobs and frozen | ||
FrozenCleanupCheck::cv.wait(l, [](){return FrozenCleanupCheck::nFrozen == 1;}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe close the "l" scope after this line so that the try_lock()s happen without FrozenCleanupCheck::m?
src/checkqueue.h
Outdated
check_mem.emplace_back(std::forward<Args>(args)...); | ||
} | ||
} | ||
void Flush(size_t s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some documentation for Flush() (and how it interacts with the two versions of Add()) here?
src/checkqueue.h
Outdated
typename std::vector<T>::iterator check_mem; | ||
typename std::vector<T>::iterator check_mem_top; | ||
typename std::vector<T>::iterator check_mem_bottom; | ||
void Setup(typename std::vector<T>::iterator check_mem_in) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: whitespace at EOL here.
/** The begin and end offsets into check_mem. 128 bytes of padding is | ||
* inserted before and after check_mem_top to eliminate false sharing*/ | ||
std::atomic<uint32_t> check_mem_bot {0}; | ||
unsigned char _padding[128]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe ifdef hardware_destructive_interference_size use it?
src/checkqueue.h
Outdated
checks_iterator->swap(t); | ||
std::advance(checks_iterator, 1); | ||
pT->swap(t); | ||
// Can't reveal result until after swapped, otherwise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh? Sure we can, we just cant decrement nAwake until we've swap()ed. I assume you had a previous version with an early termination for !fAllOk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; you're correct. Earlier version was like this.
Also termination does still occur "early", all the actual checks are skipped (they are just stilled dequeued). I suppose I could make it abort (may actually be nice to check if it aborted before calling checkinputs... but maybe that's best for another PR).
I'll refactor to something which converges more quickly on abort (e.g., setting !fMasterPresent && !fAllOk)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its probably better to not have quick abort and have less between-thread contention, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the solution i came up with is pretty trivial (I'll push it later today).
fOk = fAllOk.load(std::memory_order_relaxed);
if (fOk) {
T t();
pT->swap(t);
} else {
fAllOk.store(false, std::memory_order_relaxed);
fMasterPresent = false;
// try to consume all values as quickly as possible
while (top_cache > bottom_cache &&
!check_mem_bot.compare_exchange_weak( bottom_cache, top_cache)) {
top_cache = check_mem_top.load();
}
}
Especially since now, fAllOk is never written really now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like too just complexity? Just dont bother exiting early, we can take a performance hit of very little on an invalid block, I think. Just dont bother writing to fAllOk if fOk?
std::atomic<unsigned int> nAwake; | ||
|
||
/** If there is presently a master process either in the queue or adding jobs */ | ||
std::atomic<bool> fMasterPresent; | ||
|
||
//! Whether we're shutting down. | ||
bool fQuit; | ||
|
||
//! The maximum number of elements to be processed in one batch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like, in the current version, you can remove nBatchSize as well (though I'm surprised its not a performance gain to batch operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, absolutely. I didn't change it to minimize changeset, but it no longer has a use.
Update code to pass MAX_SCRIPTCHECKS_PER_BLOCK to CCheckQueueControl
… commit, checks were stored in three separate vectors. First, a vector was made locally to pass to a call to CCheckQueueControl::Add. In that call, checks were added to the main CCheckQueue memory, then each thread had to locally (per batch) put the check into a new, local vector. This means the location of the Check moved 2 times. In this patch, the check is created in a buffer associated with the CCheckQueueControl. Instead of moving the check, a counter is incremented to keep track of which threads have claimed which checks. Also updates benchmark to use new in-place CheckQueue API
…free" style to improve the concurrency on multicore machines. In effect, it eliminates the use of any "critical section" during block validation.
6e24aa8
to
39397bb
Compare
src/checkqueue.h
Outdated
for (unsigned int i = 0; i < nNow && fOk; i++) { | ||
fOk = (*checks_iterator)(); | ||
// execute work | ||
fOk &= fOk && (*pT)(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not:
fOk &= fOk && T(std::move(*pT))()
and do away with all of the swap business?
…). Fix a test to check the exclusive or of two properties rather than just or.
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in #9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to #10026 and some feedback on #9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ #10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see #9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
Going to close this for now; it is a good target for future work if someone wants to take another crack at it. I have a number of further relaxations/tweaks (which need more review) to this PR available here JeremyRubin#1 if anyone is taking it over, they should review those too. |
Should be tagged up for grabs |
…). Fix a test to check the exclusive or of two properties rather than just or.
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
Thanks @JeremyRubin for the really nice exposition of the issues in this PR description, found it helpful for reviewing #18710. |
bb6fcc7 refactor: Drop boost::thread stuff in CCheckQueue (Hennadii Stepanov) 6784ac4 bench: Use CCheckQueue local thread pool (Hennadii Stepanov) dba3069 test: Use CCheckQueue local thread pool (Hennadii Stepanov) 0151177 Add local thread pool to CCheckQueue (Hennadii Stepanov) 0ef9386 refactor: Use member initializers in CCheckQueue (Hennadii Stepanov) Pull request description: This PR: - gets rid of `boost::thread_group` in the `CCheckQueue` class - allows thread safety annotation usage in the `CCheckQueue` class - is alternative to #14464 (#18710 (comment), #18710 (comment)) Also, with this PR (I hope) it could be easier to resurrect a bunch of brilliant ideas from #9938. Related: #17307 ACKs for top commit: laanwj: Code review ACK bb6fcc7 LarryRuane: ACK bb6fcc7 jonatack: Code review ACK bb6fcc7 and verified rebase to master builds cleanly with unit/functional tests green Tree-SHA512: fddeb720d5a391b48bb4c6fa58ed34ccc3f57862fdb8e641745c021841c8340e35c5126338271446cbd98f40bd5484f27926aa6c3e76fa478ba1efafe72e73c1
bb6fcc7 refactor: Drop boost::thread stuff in CCheckQueue (Hennadii Stepanov) 6784ac4 bench: Use CCheckQueue local thread pool (Hennadii Stepanov) dba3069 test: Use CCheckQueue local thread pool (Hennadii Stepanov) 0151177 Add local thread pool to CCheckQueue (Hennadii Stepanov) 0ef9386 refactor: Use member initializers in CCheckQueue (Hennadii Stepanov) Pull request description: This PR: - gets rid of `boost::thread_group` in the `CCheckQueue` class - allows thread safety annotation usage in the `CCheckQueue` class - is alternative to bitcoin#14464 (bitcoin#18710 (comment), bitcoin#18710 (comment)) Also, with this PR (I hope) it could be easier to resurrect a bunch of brilliant ideas from bitcoin#9938. Related: bitcoin#17307 ACKs for top commit: laanwj: Code review ACK bb6fcc7 LarryRuane: ACK bb6fcc7 jonatack: Code review ACK bb6fcc7 and verified rebase to master builds cleanly with unit/functional tests green Tree-SHA512: fddeb720d5a391b48bb4c6fa58ed34ccc3f57862fdb8e641745c021841c8340e35c5126338271446cbd98f40bd5484f27926aa6c3e76fa478ba1efafe72e73c1
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
TL;DR: This PR introduces a new hopefully easy-to-review Lock-Free CheckQueue algorithm. It's really fast!
Summary & Problem Statement
In Bitcoin-Qt 0.8, parallel script validation was introduced to improve the speed at which a block could be validated. The basic way in which this worked is that the master thread creates a task for every script of each input in a transaction being checked, and enqueues it to a work queue. After adding all tasks, the master also becomes a worker. Each worker thread claims a number of tasks from the queue (based on how many tasks are available and how many workers are available) and then runs each task, reporting if any errors were encountered. The core approach has not changed significantly since 0.8. The current approach is deficient for four main reasons which these changes address over 2 major commits (and 1 minor one introducing API change only, and 2 others covered in previous PRs).
Deficiencies in Current CheckQueue & Solutions
1. Memory Instability:
Each task must be moved to three different memory locations during its lifetime:
This also makes it difficult to write multithreaded code, because memory is modified more than once during validation.
Solution:
The new algorithm uses stable memory: during block validation, enough memory for the worst case number of checks is allocated, and then tasks are directly placed and read out of that memory. Instead, each worker thread claims a pointer to the task. See 89a1f93
2. Lock-Heavy algorithm:
In the original algorithm, each worker thread and the master contend for a single lock for enqueuing and dequeuing tasks. This is highly inefficient, each worker spends significant amount of time acquiring the lock to be able to dequeue a task, and the master spends time waiting to enqueue tasks.
Solution:
The new algorithm is Lock-Free; during block validation no locks are taken for enqueuing or dequeuing tasks, only atomic memory is read or written. The use of a different piece of atomic memory for enqueuing and most dequeuing operations means that the master and worker threads infrequently interact during validation, meaning low contention. See 6e24aa8
3. Sleeping during validation:
In the original algorithm, while waiting for a task to be enqueued by the master, each thread sleeps. This means that the OS’s scheduler might choose to pre-empt the thread, thrashing it’s cache with another process’s memory. Furthermore, the latency for a wake up operation is high, as it requires the operating system to awaken a thread.
Solution:
the new algorithm only sleeps before and after block validation, and busy-spins waiting for tasks until the master joins and there are no more tasks available (then sleeps). See 6e24aa8
4. Arbitrary batch size selection:
The original batch size algorithm was selected as an experimental tuning for reasonable performance. If the batch size is too small, workers spend too much time getting a new task and not enough time running the tasks. The problem with any form of batching is that you may run into an imbalanced case where one worker has a batch (say of size five) and another worker has no tasks. In order to improve that, a batch size of one is optimal.
Solution:
The new algorithm uses a batch size of one, so that all threads finish their work at as close to the same time as possible. See 6e24aa8.
Reviewing this PR
Code Organization
In writing this PR, I used many small-step iterations to get to the final design. While I am fairly confident that the commits in this PR should be easily reviewable, I have left the unsquashed version available here for anyone to reference
https://github.com/JeremyRubin/bitcoin/tree/PR-lockfree-checkqueue-unsquashed
Testing
#9497 introduces a fairly comprehensive set of tests, which these changes pass, as well as the earlier CheckQueue. See #9497 for details on the conditions tested.
Performance
Test results are from a 2011 MacBook Pro 13", 2.7 GHz Intel Core i7, 8 GB 1333 MHz DDR3.
MicroBenchmark Performance (2.9 to 5.7X faster)
#9498 introduced two new microbenchmarks for the CheckQueue. One which tests tasks which are empty, and another which tests tasks that do a little bit of work. These are microbenchmarks, so they need to be taken with a grain of salt.
So we see for trivial jobs, it's about 5.7X faster, and for more involved jobs, about 2.9X faster.
Test Performance (10X faster)
I have observed very nice performance improvements when it comes to how long it takes to run the checkqueue_tests.
So we see about a 10x performance improvement here.
Cross Platform
It would be good to have reviewers comment with cross platform performance testing, as some of the lock free instructions may be slower on some platforms than others, and I don't have access to a large enough swath of machines. If you're reviewing performance, you may find the benchdiff.py tool I wrote useful JeremyRubin@14aa19a.
Real World/Simulated Numbers
I don't have great real-world-node numbers yet for these changes, but I'll wait for others (e.g., @morcos) to report back on those.
edit: I have run a simulation on a month of block validation on a 4-core machine and seen no difference in aggregate performance. I'll poke around to see if the design can be tweaked a little bit for some advantage with fewer cores, but the more important simulations are for machines with >7 cores or multiple cpus, where the contention relieved by this PR becomes more significant.
Notes
This builds on #9497 (and #9495). There were a couple minor nits on those PR's outstanding, but I figure they can be addressed here later with a squashme (rather than having to squash those, get them re-reviewed, and then issue this PR). An earlier (and worse) attempt at a similar design can be seen here for comparison #8464.
Acknowledgements
Thanks to @morcos and @sdaftuar for supporting and helping in the development of this work, and to various others (@TheBlueMatt, @theuni, and others) for review in feedback at various stages of this project.