Skip to content

Conversation

JeremyRubin
Copy link
Contributor

@JeremyRubin JeremyRubin commented Mar 7, 2017

TL;DR: This PR introduces a new hopefully easy-to-review Lock-Free CheckQueue algorithm. It's really fast!

Summary & Problem Statement

In Bitcoin-Qt 0.8, parallel script validation was introduced to improve the speed at which a block could be validated. The basic way in which this worked is that the master thread creates a task for every script of each input in a transaction being checked, and enqueues it to a work queue. After adding all tasks, the master also becomes a worker. Each worker thread claims a number of tasks from the queue (based on how many tasks are available and how many workers are available) and then runs each task, reporting if any errors were encountered. The core approach has not changed significantly since 0.8. The current approach is deficient for four main reasons which these changes address over 2 major commits (and 1 minor one introducing API change only, and 2 others covered in previous PRs).

Deficiencies in Current CheckQueue & Solutions

1. Memory Instability:

Each task must be moved to three different memory locations during its lifetime:

  1. the place where the master creates the task.
  2. the shared work queue.
  3. the worker-local work queue.
    This also makes it difficult to write multithreaded code, because memory is modified more than once during validation.

Solution:

The new algorithm uses stable memory: during block validation, enough memory for the worst case number of checks is allocated, and then tasks are directly placed and read out of that memory. Instead, each worker thread claims a pointer to the task. See 89a1f93

2. Lock-Heavy algorithm:

In the original algorithm, each worker thread and the master contend for a single lock for enqueuing and dequeuing tasks. This is highly inefficient, each worker spends significant amount of time acquiring the lock to be able to dequeue a task, and the master spends time waiting to enqueue tasks.

Solution:

The new algorithm is Lock-Free; during block validation no locks are taken for enqueuing or dequeuing tasks, only atomic memory is read or written. The use of a different piece of atomic memory for enqueuing and most dequeuing operations means that the master and worker threads infrequently interact during validation, meaning low contention. See 6e24aa8

3. Sleeping during validation:

In the original algorithm, while waiting for a task to be enqueued by the master, each thread sleeps. This means that the OS’s scheduler might choose to pre-empt the thread, thrashing it’s cache with another process’s memory. Furthermore, the latency for a wake up operation is high, as it requires the operating system to awaken a thread.

Solution:

the new algorithm only sleeps before and after block validation, and busy-spins waiting for tasks until the master joins and there are no more tasks available (then sleeps). See 6e24aa8

4. Arbitrary batch size selection:

The original batch size algorithm was selected as an experimental tuning for reasonable performance. If the batch size is too small, workers spend too much time getting a new task and not enough time running the tasks. The problem with any form of batching is that you may run into an imbalanced case where one worker has a batch (say of size five) and another worker has no tasks. In order to improve that, a batch size of one is optimal.

Solution:

The new algorithm uses a batch size of one, so that all threads finish their work at as close to the same time as possible. See 6e24aa8.

Reviewing this PR

Code Organization

In writing this PR, I used many small-step iterations to get to the final design. While I am fairly confident that the commits in this PR should be easily reviewable, I have left the unsquashed version available here for anyone to reference
https://github.com/JeremyRubin/bitcoin/tree/PR-lockfree-checkqueue-unsquashed

Testing

#9497 introduces a fairly comprehensive set of tests, which these changes pass, as well as the earlier CheckQueue. See #9497 for details on the conditions tested.

Performance

Test results are from a 2011 MacBook Pro 13", 2.7 GHz Intel Core i7, 8 GB 1333 MHz DDR3.

MicroBenchmark Performance (2.9 to 5.7X faster)

#9498 introduced two new microbenchmarks for the CheckQueue. One which tests tasks which are empty, and another which tests tasks that do a little bit of work. These are microbenchmarks, so they need to be taken with a grain of salt.

#Before (7ff4a538a8682cdf02a4bcd6f15499c841001b73)
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
CCheckQueueSpeed,896,0.001128047704697,0.001315370202065,0.001167648338846,3038807,3543440,3145483
CCheckQueueSpeedPrevectorJob,320,0.002763845026493,0.003494121134281,0.003255434334278,7445473,9415834,8770003

#After (6e24aa818be4b494fc1809a7ca3ee568e253deb6)
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
CCheckQueueSpeed,5120,0.000198634807020,0.000226773321629,0.000202708039433,535092,610897,546067
CCheckQueueSpeedPrevectorJob,896,0.000990316271782,0.001982234418392,0.001142452071820,2667680,5339862,3077721

So we see for trivial jobs, it's about 5.7X faster, and for more involved jobs, about 2.9X faster.

Test Performance (10X faster)

I have observed very nice performance improvements when it comes to how long it takes to run the checkqueue_tests.

#before (08e4e1ea89427a2594415d0b37011692a5109c39)
$ time ./test_bitcoin -t checkqueue_tests
./test/test_bitcoin -t checkqueue_tests  12.39s user 58.12s system 116% cpu 1:00.31 total
#after (6e24aa818be4b494fc1809a7ca3ee568e253deb6)
$ time ./test_bitcoin -t checkqueue_tests
./test/test_bitcoin -t checkqueue_tests  3.43s user 1.40s system 78% cpu 6.180 total

So we see about a 10x performance improvement here.

Cross Platform

It would be good to have reviewers comment with cross platform performance testing, as some of the lock free instructions may be slower on some platforms than others, and I don't have access to a large enough swath of machines. If you're reviewing performance, you may find the benchdiff.py tool I wrote useful JeremyRubin@14aa19a.

Real World/Simulated Numbers

I don't have great real-world-node numbers yet for these changes, but I'll wait for others (e.g., @morcos) to report back on those.

edit: I have run a simulation on a month of block validation on a 4-core machine and seen no difference in aggregate performance. I'll poke around to see if the design can be tweaked a little bit for some advantage with fewer cores, but the more important simulations are for machines with >7 cores or multiple cpus, where the contention relieved by this PR becomes more significant.

Notes

This builds on #9497 (and #9495). There were a couple minor nits on those PR's outstanding, but I figure they can be addressed here later with a squashme (rather than having to squash those, get them re-reviewed, and then issue this PR). An earlier (and worse) attempt at a similar design can be seen here for comparison #8464.

Acknowledgements

Thanks to @morcos and @sdaftuar for supporting and helping in the development of this work, and to various others (@TheBlueMatt, @theuni, and others) for review in feedback at various stages of this project.

Copy link
Contributor

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good. I'm curious if you've benchmarked the lock-free version vs just dropping the last commit (possibly after removing the batching, if that was actually a loss). Given the contention on fAllOk (which could be trivially removed since you dont quit early on !fAllOk, btw) and check_mem_top/check_mem_bot, I'm surprised there is much of a win to be had on the last commit.

static std::atomic<size_t> n_calls;
bool operator()()
{
++n_calls;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the test expose more parallellism, should probably use release/acquire here, no?

};
~MemoryCheck(){
fake_allocated_memory -= b;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: whitespace at end of line (and stray line, it looks like).

CCheckQueueControl<FailingCheck> control(fail_queue.get());
size_t remaining = i;
while (remaining) {
size_t r = GetRand(10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: something > 10 might test parallellism a bit more, no? (and in a few other places)

control.Add(vChecks);
}
bool r =control.Wait();
BOOST_REQUIRE(r || end_fails);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant to have an XOR here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

{
std::unique_lock<std::mutex> l(FrozenCleanupCheck::m);
// Wait until the queue has finished all jobs and frozen
FrozenCleanupCheck::cv.wait(l, [](){return FrozenCleanupCheck::nFrozen == 1;});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe close the "l" scope after this line so that the try_lock()s happen without FrozenCleanupCheck::m?

src/checkqueue.h Outdated
check_mem.emplace_back(std::forward<Args>(args)...);
}
}
void Flush(size_t s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some documentation for Flush() (and how it interacts with the two versions of Add()) here?

src/checkqueue.h Outdated
typename std::vector<T>::iterator check_mem;
typename std::vector<T>::iterator check_mem_top;
typename std::vector<T>::iterator check_mem_bottom;
void Setup(typename std::vector<T>::iterator check_mem_in)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: whitespace at EOL here.

/** The begin and end offsets into check_mem. 128 bytes of padding is
* inserted before and after check_mem_top to eliminate false sharing*/
std::atomic<uint32_t> check_mem_bot {0};
unsigned char _padding[128];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ifdef hardware_destructive_interference_size use it?

src/checkqueue.h Outdated
checks_iterator->swap(t);
std::advance(checks_iterator, 1);
pT->swap(t);
// Can't reveal result until after swapped, otherwise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh? Sure we can, we just cant decrement nAwake until we've swap()ed. I assume you had a previous version with an early termination for !fAllOk?

Copy link
Contributor Author

@JeremyRubin JeremyRubin Mar 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; you're correct. Earlier version was like this.

Also termination does still occur "early", all the actual checks are skipped (they are just stilled dequeued). I suppose I could make it abort (may actually be nice to check if it aborted before calling checkinputs... but maybe that's best for another PR).

I'll refactor to something which converges more quickly on abort (e.g., setting !fMasterPresent && !fAllOk)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its probably better to not have quick abort and have less between-thread contention, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the solution i came up with is pretty trivial (I'll push it later today).

fOk = fAllOk.load(std::memory_order_relaxed);
if (fOk) {
    T t();
    pT->swap(t);
} else {
    fAllOk.store(false, std::memory_order_relaxed);
    fMasterPresent = false;
    // try to consume all values as quickly as possible
    while (top_cache > bottom_cache &&
                    !check_mem_bot.compare_exchange_weak( bottom_cache, top_cache)) {
            top_cache = check_mem_top.load();
    }
}

Especially since now, fAllOk is never written really now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like too just complexity? Just dont bother exiting early, we can take a performance hit of very little on an invalid block, I think. Just dont bother writing to fAllOk if fOk?

std::atomic<unsigned int> nAwake;

/** If there is presently a master process either in the queue or adding jobs */
std::atomic<bool> fMasterPresent;

//! Whether we're shutting down.
bool fQuit;

//! The maximum number of elements to be processed in one batch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like, in the current version, you can remove nBatchSize as well (though I'm surprised its not a performance gain to batch operations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely. I didn't change it to minimize changeset, but it no longer has a use.

Update code to pass MAX_SCRIPTCHECKS_PER_BLOCK to CCheckQueueControl
… commit,

checks were stored in three separate vectors. First, a vector was made locally
to pass to a call to CCheckQueueControl::Add. In that call, checks were added to
the main CCheckQueue memory, then each thread had to locally (per batch) put the
check into a new, local vector. This means the location of the Check moved 2
times. In this patch, the check is created in a buffer associated with the
CCheckQueueControl. Instead of moving the check, a counter is incremented to
keep track of which threads have claimed which checks.

Also updates benchmark to use new in-place CheckQueue API
…free" style

to improve the concurrency on multicore machines.  In effect, it eliminates the
use of any "critical section" during block validation.
src/checkqueue.h Outdated
for (unsigned int i = 0; i < nNow && fOk; i++) {
fOk = (*checks_iterator)();
// execute work
fOk &= fOk && (*pT)();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not:

fOk &= fOk && T(std::move(*pT))()

and do away with all of the swap business?

JeremyRubin added a commit to JeremyRubin/bitcoin that referenced this pull request Aug 9, 2017
…). Fix a test to check the exclusive or of two properties rather than just or.
sipa added a commit that referenced this pull request Oct 12, 2017
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in #9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to #10026 and some feedback on #9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ #10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see #9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
@JeremyRubin
Copy link
Contributor Author

JeremyRubin commented Mar 6, 2018

Going to close this for now; it is a good target for future work if someone wants to take another crack at it.

I have a number of further relaxations/tweaks (which need more review) to this PR available here JeremyRubin#1 if anyone is taking it over, they should review those too.

@JeremyRubin JeremyRubin closed this Mar 6, 2018
@ryanofsky
Copy link
Contributor

Should be tagged up for grabs

HashUnlimited pushed a commit to chaincoin/chaincoin that referenced this pull request Mar 12, 2018
…). Fix a test to check the exclusive or of two properties rather than just or.
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Dec 22, 2019
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jan 2, 2020
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jan 4, 2020
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jan 12, 2020
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jan 12, 2020
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jan 12, 2020
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jan 12, 2020
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jan 12, 2020
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jan 12, 2020
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jan 16, 2020
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
@jonatack
Copy link
Member

Thanks @JeremyRubin for the really nice exposition of the issues in this PR description, found it helpful for reviewing #18710.

laanwj added a commit that referenced this pull request Jan 25, 2021
bb6fcc7 refactor: Drop boost::thread stuff in CCheckQueue (Hennadii Stepanov)
6784ac4 bench: Use CCheckQueue local thread pool (Hennadii Stepanov)
dba3069 test: Use CCheckQueue local thread pool (Hennadii Stepanov)
0151177 Add local thread pool to CCheckQueue (Hennadii Stepanov)
0ef9386 refactor: Use member initializers in CCheckQueue (Hennadii Stepanov)

Pull request description:

  This PR:
  - gets rid of `boost::thread_group` in the `CCheckQueue` class
  - allows thread safety annotation usage in the `CCheckQueue` class
  - is alternative to #14464 (#18710 (comment), #18710 (comment))

  Also, with this PR (I hope) it could be easier to resurrect a bunch of brilliant ideas from #9938.

  Related: #17307

ACKs for top commit:
  laanwj:
    Code review ACK bb6fcc7
  LarryRuane:
    ACK bb6fcc7
  jonatack:
    Code review ACK bb6fcc7 and verified rebase to master builds cleanly with unit/functional tests green

Tree-SHA512: fddeb720d5a391b48bb4c6fa58ed34ccc3f57862fdb8e641745c021841c8340e35c5126338271446cbd98f40bd5484f27926aa6c3e76fa478ba1efafe72e73c1
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Jan 25, 2021
bb6fcc7 refactor: Drop boost::thread stuff in CCheckQueue (Hennadii Stepanov)
6784ac4 bench: Use CCheckQueue local thread pool (Hennadii Stepanov)
dba3069 test: Use CCheckQueue local thread pool (Hennadii Stepanov)
0151177 Add local thread pool to CCheckQueue (Hennadii Stepanov)
0ef9386 refactor: Use member initializers in CCheckQueue (Hennadii Stepanov)

Pull request description:

  This PR:
  - gets rid of `boost::thread_group` in the `CCheckQueue` class
  - allows thread safety annotation usage in the `CCheckQueue` class
  - is alternative to bitcoin#14464 (bitcoin#18710 (comment), bitcoin#18710 (comment))

  Also, with this PR (I hope) it could be easier to resurrect a bunch of brilliant ideas from bitcoin#9938.

  Related: bitcoin#17307

ACKs for top commit:
  laanwj:
    Code review ACK bb6fcc7
  LarryRuane:
    ACK bb6fcc7
  jonatack:
    Code review ACK bb6fcc7 and verified rebase to master builds cleanly with unit/functional tests green

Tree-SHA512: fddeb720d5a391b48bb4c6fa58ed34ccc3f57862fdb8e641745c021841c8340e35c5126338271446cbd98f40bd5484f27926aa6c3e76fa478ba1efafe72e73c1
gades pushed a commit to cosanta/cosanta-core that referenced this pull request May 1, 2022
8c2f4b8 Expose more parallelism with relaxed atomics (suggested in bitcoin#9938). Fix a test to check the exclusive or of two properties rather than just or. (Jeremy Rubin)

Pull request description:

  This PR is in response to bitcoin#10026 and some feedback on bitcoin#9938.

  ~Locally, all the checkqueue tests ran 3.2X faster on my machine. The worst offender, `test_CheckQueue_Correct_Random` ran 3.4X faster.~

  1. ~Removes `GetRand()` and replaces it with a single deterministic FastRandomContext instance.~ bitcoin#10321 replicated this

  1. Exposes more parallelism with relaxed atomics, increasing chance of catching a bug. This does not change performance on my machine.

  1. Makes one test case more restrictive (xor instead of or, see bitcoin#9938).

Tree-SHA512: a59dfbee0273c713525a130dfedc1c7ff26f50c2aaca1e94ef5d759b1d6ea6338ffbd97f863b9f6209750d8a788a15fa8ae1bf26774ed2473c520811337e6b00
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants