Lock-Free CheckQueue #9938

JeremyRubin · 2017-03-07T12:20:29Z

TL;DR: This PR introduces a new hopefully easy-to-review Lock-Free CheckQueue algorithm. It's really fast!

Summary & Problem Statement

In Bitcoin-Qt 0.8, parallel script validation was introduced to improve the speed at which a block could be validated. The basic way in which this worked is that the master thread creates a task for every script of each input in a transaction being checked, and enqueues it to a work queue. After adding all tasks, the master also becomes a worker. Each worker thread claims a number of tasks from the queue (based on how many tasks are available and how many workers are available) and then runs each task, reporting if any errors were encountered. The core approach has not changed significantly since 0.8. The current approach is deficient for four main reasons which these changes address over 2 major commits (and 1 minor one introducing API change only, and 2 others covered in previous PRs).

Deficiencies in Current CheckQueue & Solutions

1. Memory Instability:

Each task must be moved to three different memory locations during its lifetime:

the place where the master creates the task.
the shared work queue.
the worker-local work queue.
This also makes it difficult to write multithreaded code, because memory is modified more than once during validation.

Solution:

The new algorithm uses stable memory: during block validation, enough memory for the worst case number of checks is allocated, and then tasks are directly placed and read out of that memory. Instead, each worker thread claims a pointer to the task. See 89a1f93

2. Lock-Heavy algorithm:

In the original algorithm, each worker thread and the master contend for a single lock for enqueuing and dequeuing tasks. This is highly inefficient, each worker spends significant amount of time acquiring the lock to be able to dequeue a task, and the master spends time waiting to enqueue tasks.

Solution:

The new algorithm is Lock-Free; during block validation no locks are taken for enqueuing or dequeuing tasks, only atomic memory is read or written. The use of a different piece of atomic memory for enqueuing and most dequeuing operations means that the master and worker threads infrequently interact during validation, meaning low contention. See 6e24aa8

3. Sleeping during validation:

In the original algorithm, while waiting for a task to be enqueued by the master, each thread sleeps. This means that the OS’s scheduler might choose to pre-empt the thread, thrashing it’s cache with another process’s memory. Furthermore, the latency for a wake up operation is high, as it requires the operating system to awaken a thread.

Solution:

the new algorithm only sleeps before and after block validation, and busy-spins waiting for tasks until the master joins and there are no more tasks available (then sleeps). See 6e24aa8

4. Arbitrary batch size selection:

The original batch size algorithm was selected as an experimental tuning for reasonable performance. If the batch size is too small, workers spend too much time getting a new task and not enough time running the tasks. The problem with any form of batching is that you may run into an imbalanced case where one worker has a batch (say of size five) and another worker has no tasks. In order to improve that, a batch size of one is optimal.

Solution:

The new algorithm uses a batch size of one, so that all threads finish their work at as close to the same time as possible. See 6e24aa8.

Reviewing this PR

Code Organization

In writing this PR, I used many small-step iterations to get to the final design. While I am fairly confident that the commits in this PR should be easily reviewable, I have left the unsquashed version available here for anyone to reference
https://github.com/JeremyRubin/bitcoin/tree/PR-lockfree-checkqueue-unsquashed

Testing

#9497 introduces a fairly comprehensive set of tests, which these changes pass, as well as the earlier CheckQueue. See #9497 for details on the conditions tested.

Performance

Test results are from a 2011 MacBook Pro 13", 2.7 GHz Intel Core i7, 8 GB 1333 MHz DDR3.

MicroBenchmark Performance (2.9 to 5.7X faster)

#9498 introduced two new microbenchmarks for the CheckQueue. One which tests tasks which are empty, and another which tests tasks that do a little bit of work. These are microbenchmarks, so they need to be taken with a grain of salt.

#Before (7ff4a538a8682cdf02a4bcd6f15499c841001b73)
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
CCheckQueueSpeed,896,0.001128047704697,0.001315370202065,0.001167648338846,3038807,3543440,3145483
CCheckQueueSpeedPrevectorJob,320,0.002763845026493,0.003494121134281,0.003255434334278,7445473,9415834,8770003

#After (6e24aa818be4b494fc1809a7ca3ee568e253deb6)
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
CCheckQueueSpeed,5120,0.000198634807020,0.000226773321629,0.000202708039433,535092,610897,546067
CCheckQueueSpeedPrevectorJob,896,0.000990316271782,0.001982234418392,0.001142452071820,2667680,5339862,3077721

So we see for trivial jobs, it's about 5.7X faster, and for more involved jobs, about 2.9X faster.

Test Performance (10X faster)

I have observed very nice performance improvements when it comes to how long it takes to run the checkqueue_tests.

#before (08e4e1ea89427a2594415d0b37011692a5109c39)
$ time ./test_bitcoin -t checkqueue_tests
./test/test_bitcoin -t checkqueue_tests  12.39s user 58.12s system 116% cpu 1:00.31 total
#after (6e24aa818be4b494fc1809a7ca3ee568e253deb6)
$ time ./test_bitcoin -t checkqueue_tests
./test/test_bitcoin -t checkqueue_tests  3.43s user 1.40s system 78% cpu 6.180 total

So we see about a 10x performance improvement here.

Cross Platform

It would be good to have reviewers comment with cross platform performance testing, as some of the lock free instructions may be slower on some platforms than others, and I don't have access to a large enough swath of machines. If you're reviewing performance, you may find the benchdiff.py tool I wrote useful JeremyRubin@14aa19a.

Real World/Simulated Numbers

I don't have great real-world-node numbers yet for these changes, but I'll wait for others (e.g., @morcos) to report back on those.

edit: I have run a simulation on a month of block validation on a 4-core machine and seen no difference in aggregate performance. I'll poke around to see if the design can be tweaked a little bit for some advantage with fewer cores, but the more important simulations are for machines with >7 cores or multiple cpus, where the contention relieved by this PR becomes more significant.

Notes

This builds on #9497 (and #9495). There were a couple minor nits on those PR's outstanding, but I figure they can be addressed here later with a squashme (rather than having to squash those, get them re-reviewed, and then issue this PR). An earlier (and worse) attempt at a similar design can be seen here for comparison #8464.

Acknowledgements

Thanks to @morcos and @sdaftuar for supporting and helping in the development of this work, and to various others (@TheBlueMatt, @theuni, and others) for review in feedback at various stages of this project.

TheBlueMatt

Generally looks good. I'm curious if you've benchmarked the lock-free version vs just dropping the last commit (possibly after removing the batching, if that was actually a loss). Given the contention on fAllOk (which could be trivially removed since you dont quit early on !fAllOk, btw) and check_mem_top/check_mem_bot, I'm surprised there is much of a win to be had on the last commit.

TheBlueMatt · 2017-03-08T21:39:33Z

src/test/checkqueue_tests.cpp

+    static std::atomic<size_t> n_calls;
+    bool operator()()
+    {
+        ++n_calls;


To make the test expose more parallellism, should probably use release/acquire here, no?

TheBlueMatt · 2017-03-08T21:42:00Z

src/test/checkqueue_tests.cpp

+    };
+    ~MemoryCheck(){
+        fake_allocated_memory -= b;
+


nit: whitespace at end of line (and stray line, it looks like).

TheBlueMatt · 2017-03-08T21:45:26Z

src/test/checkqueue_tests.cpp

+        CCheckQueueControl<FailingCheck> control(fail_queue.get());
+        size_t remaining = i;
+        while (remaining) {
+            size_t r = GetRand(10);


nit: something > 10 might test parallellism a bit more, no? (and in a few other places)

TheBlueMatt · 2017-03-08T21:49:30Z

src/test/checkqueue_tests.cpp

+                control.Add(vChecks);
+            }
+            bool r =control.Wait();
+            BOOST_REQUIRE(r || end_fails);


I think you meant to have an XOR here.

TheBlueMatt · 2017-03-08T22:13:51Z

src/test/checkqueue_tests.cpp

+    {
+        std::unique_lock<std::mutex> l(FrozenCleanupCheck::m);
+        // Wait until the queue has finished all jobs and frozen
+        FrozenCleanupCheck::cv.wait(l, [](){return FrozenCleanupCheck::nFrozen == 1;});


nit: maybe close the "l" scope after this line so that the try_lock()s happen without FrozenCleanupCheck::m?

TheBlueMatt · 2017-03-08T22:38:42Z

src/checkqueue.h

+            check_mem.emplace_back(std::forward<Args>(args)...);
+        }
+    }
+    void Flush(size_t s)


Can you add some documentation for Flush() (and how it interacts with the two versions of Add()) here?

TheBlueMatt · 2017-03-08T22:39:00Z

src/checkqueue.h

+    typename std::vector<T>::iterator check_mem;
+    typename std::vector<T>::iterator check_mem_top;
+    typename std::vector<T>::iterator check_mem_bottom;
+    void Setup(typename std::vector<T>::iterator check_mem_in) 


nit: whitespace at EOL here.

TheBlueMatt · 2017-03-08T22:46:17Z

src/checkqueue.h

+    /** The begin and end offsets into check_mem. 128 bytes of padding is
+     * inserted before and after check_mem_top to eliminate false sharing*/
+    std::atomic<uint32_t> check_mem_bot {0};
+    unsigned char _padding[128];


Maybe ifdef hardware_destructive_interference_size use it?

TheBlueMatt · 2017-03-08T22:57:40Z

src/checkqueue.h

-                checks_iterator->swap(t);
-                std::advance(checks_iterator, 1);
+                pT->swap(t);
+                // Can't reveal result until after swapped, otherwise


Huh? Sure we can, we just cant decrement nAwake until we've swap()ed. I assume you had a previous version with an early termination for !fAllOk?

Yes; you're correct. Earlier version was like this.

Also termination does still occur "early", all the actual checks are skipped (they are just stilled dequeued). I suppose I could make it abort (may actually be nice to check if it aborted before calling checkinputs... but maybe that's best for another PR).

I'll refactor to something which converges more quickly on abort (e.g., setting !fMasterPresent && !fAllOk)

Its probably better to not have quick abort and have less between-thread contention, no?

I think the solution i came up with is pretty trivial (I'll push it later today).

fOk = fAllOk.load(std::memory_order_relaxed); if (fOk) { T t(); pT->swap(t); } else { fAllOk.store(false, std::memory_order_relaxed); fMasterPresent = false; // try to consume all values as quickly as possible while (top_cache > bottom_cache && !check_mem_bot.compare_exchange_weak( bottom_cache, top_cache)) { top_cache = check_mem_top.load(); } }

Especially since now, fAllOk is never written really now.

Looks like too just complexity? Just dont bother exiting early, we can take a performance hit of very little on an invalid block, I think. Just dont bother writing to fAllOk if fOk?

TheBlueMatt · 2017-03-08T23:05:01Z

src/checkqueue.h

+    std::atomic<unsigned int> nAwake;
+
+    /** If there is presently a master process either in the queue or adding jobs */
+    std::atomic<bool> fMasterPresent;

    //! Whether we're shutting down.
    bool fQuit;

    //! The maximum number of elements to be processed in one batch


Looks like, in the current version, you can remove nBatchSize as well (though I'm surprised its not a performance gain to batch operations?

Yes, absolutely. I didn't change it to minimize changeset, but it no longer has a use.

Update code to pass MAX_SCRIPTCHECKS_PER_BLOCK to CCheckQueueControl

… commit, checks were stored in three separate vectors. First, a vector was made locally to pass to a call to CCheckQueueControl::Add. In that call, checks were added to the main CCheckQueue memory, then each thread had to locally (per batch) put the check into a new, local vector. This means the location of the Check moved 2 times. In this patch, the check is created in a buffer associated with the CCheckQueueControl. Instead of moving the check, a counter is incremented to keep track of which threads have claimed which checks. Also updates benchmark to use new in-place CheckQueue API

…free" style to improve the concurrency on multicore machines. In effect, it eliminates the use of any "critical section" during block validation.

theuni · 2017-03-27T20:54:16Z

src/checkqueue.h

-            for (unsigned int i = 0; i < nNow && fOk; i++) {
-                fOk = (*checks_iterator)();
+                // execute work
+                fOk &= fOk && (*pT)();


Why not:

fOk &= fOk && T(std::move(*pT))()

and do away with all of the swap business?

…). Fix a test to check the exclusive or of two properties rather than just or.

8c2f4b8 Expose more parallelism with relaxed atomics (suggested in #9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to #10026 and some feedback on #9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ #10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see #9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00

JeremyRubin · 2018-03-06T18:07:09Z

Going to close this for now; it is a good target for future work if someone wants to take another crack at it.

I have a number of further relaxations/tweaks (which need more review) to this PR available here JeremyRubin#1 if anyone is taking it over, they should review those too.

ryanofsky · 2018-03-06T18:34:39Z

Should be tagged up for grabs

…). Fix a test to check the exclusive or of two properties rather than just or.

8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00

jonatack · 2021-01-11T08:55:11Z

Thanks @JeremyRubin for the really nice exposition of the issues in this PR description, found it helpful for reviewing #18710.

bb6fcc7 refactor: Drop boost::thread stuff in CCheckQueue (Hennadii Stepanov) 6784ac4 bench: Use CCheckQueue local thread pool (Hennadii Stepanov) dba3069 test: Use CCheckQueue local thread pool (Hennadii Stepanov) 0151177 Add local thread pool to CCheckQueue (Hennadii Stepanov) 0ef9386 refactor: Use member initializers in CCheckQueue (Hennadii Stepanov) Pull request description: This PR: - gets rid of `boost::thread_group` in the `CCheckQueue` class - allows thread safety annotation usage in the `CCheckQueue` class - is alternative to #14464 (#18710 (comment), #18710 (comment)) Also, with this PR (I hope) it could be easier to resurrect a bunch of brilliant ideas from #9938. Related: #17307 ACKs for top commit: laanwj: Code review ACK bb6fcc7 LarryRuane: ACK bb6fcc7 jonatack: Code review ACK bb6fcc7 and verified rebase to master builds cleanly with unit/functional tests green Tree-SHA512: fddeb720d5a391b48bb4c6fa58ed34ccc3f57862fdb8e641745c021841c8340e35c5126338271446cbd98f40bd5484f27926aa6c3e76fa478ba1efafe72e73c1

bb6fcc7 refactor: Drop boost::thread stuff in CCheckQueue (Hennadii Stepanov) 6784ac4 bench: Use CCheckQueue local thread pool (Hennadii Stepanov) dba3069 test: Use CCheckQueue local thread pool (Hennadii Stepanov) 0151177 Add local thread pool to CCheckQueue (Hennadii Stepanov) 0ef9386 refactor: Use member initializers in CCheckQueue (Hennadii Stepanov) Pull request description: This PR: - gets rid of `boost::thread_group` in the `CCheckQueue` class - allows thread safety annotation usage in the `CCheckQueue` class - is alternative to bitcoin#14464 (bitcoin#18710 (comment), bitcoin#18710 (comment)) Also, with this PR (I hope) it could be easier to resurrect a bunch of brilliant ideas from bitcoin#9938. Related: bitcoin#17307 ACKs for top commit: laanwj: Code review ACK bb6fcc7 LarryRuane: ACK bb6fcc7 jonatack: Code review ACK bb6fcc7 and verified rebase to master builds cleanly with unit/functional tests green Tree-SHA512: fddeb720d5a391b48bb4c6fa58ed34ccc3f57862fdb8e641745c021841c8340e35c5126338271446cbd98f40bd5484f27926aa6c3e76fa478ba1efafe72e73c1

8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin) Pull request description: This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938. ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~ 1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this 1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine. 1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938). Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00

laanwj added the Validation label Mar 7, 2017

TheBlueMatt reviewed Mar 8, 2017

View reviewed changes

JeremyRubin mentioned this pull request Mar 23, 2017

Overview of slow unit tests #10026

Closed

JeremyRubin added 3 commits March 27, 2017 13:34

CheckQueue: Add size constants to validation.{h,cpp}.

0232c07

Update code to pass MAX_SCRIPTCHECKS_PER_BLOCK to CCheckQueueControl

CheckQueue: This commit refactors the CheckQueue to be written "lock …

39397bb

…free" style to improve the concurrency on multicore machines. In effect, it eliminates the use of any "critical section" during block validation.

JeremyRubin force-pushed the PR-lockfree-checkqueue branch from 6e24aa8 to 39397bb Compare March 27, 2017 17:41

JeremyRubin mentioned this pull request Mar 27, 2017

Slightly Improve Unit Tests for Checkqueue #10099

Merged

theuni reviewed Mar 27, 2017

View reviewed changes

JeremyRubin added a commit to JeremyRubin/bitcoin that referenced this pull request Aug 9, 2017

Expose more parallelism with relaxed atomics (suggested in bitcoin#9938…

8c2f4b8

…). Fix a test to check the exclusive or of two properties rather than just or.

JeremyRubin closed this Mar 6, 2018

fanquake added the Up for grabs label Mar 6, 2018

HashUnlimited pushed a commit to chaincoin/chaincoin that referenced this pull request Mar 12, 2018

Expose more parallelism with relaxed atomics (suggested in bitcoin#9938…

d83b93f

…). Fix a test to check the exclusive or of two properties rather than just or.

hebasto mentioned this pull request Apr 20, 2020

Add local thread pool to CCheckQueue #18710

Merged

adamjonas removed the Up for grabs label Feb 26, 2021

bitcoin locked as resolved and limited conversation to collaborators Aug 18, 2022

Lock-Free CheckQueue #9938

Lock-Free CheckQueue #9938

Uh oh!

Conversation

JeremyRubin commented Mar 7, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary & Problem Statement

Deficiencies in Current CheckQueue & Solutions

1. Memory Instability:

Solution:

2. Lock-Heavy algorithm:

Solution:

3. Sleeping during validation:

Solution:

4. Arbitrary batch size selection:

Solution:

Reviewing this PR

Code Organization

Testing

Performance

MicroBenchmark Performance (2.9 to 5.7X faster)

Test Performance (10X faster)

Cross Platform

Real World/Simulated Numbers

Notes

Acknowledgements

Uh oh!

TheBlueMatt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JeremyRubin Mar 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JeremyRubin commented Mar 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ryanofsky commented Mar 6, 2018

Uh oh!

jonatack commented Jan 11, 2021

Uh oh!

Uh oh!

JeremyRubin commented Mar 7, 2017 •

edited

Loading

JeremyRubin Mar 9, 2017 •

edited

Loading

JeremyRubin commented Mar 6, 2018 •

edited

Loading