Skip to content

Conversation

fanquake
Copy link
Member

@fanquake fanquake commented Apr 25, 2017

Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion.

The current method for detecting available cores was introduced in #6361.

Recap of the IRC chat:

21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
21:16:43 BlueMatt: fair
21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
21:20:03 sipa: BlueMatt: don't worry, there is no hurry
21:59:10 morcos: wumpus: i don't think that is correct
21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
22:01:40 wumpus: hyperthreads are basically just a stored register state right?
22:02:23 sipa: wumpus: yes but it helps the scheduler
22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
22:02:37 morcos: well this is where i get out of my depth
22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree
22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
22:04:33 gmaxwell: (to use the HT core count)
22:04:44 wumpus: why was this changed at all then?
22:04:47 wumpus: I'm confused
22:05:04 sipa: good question!
22:05:06 gmaxwell: I had no idea we changed it.
22:05:25 wumpus: sigh 
22:05:54 gmaxwell: What PR changed it?
22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
22:07:22 wumpus: https://github.com/bitcoin/bitcoin/pull/6361
22:07:28 gmaxwell: PR 6461 btw.
22:07:37 gmaxwell: er lol at least you got it right.
22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
22:07:51 wumpus: so at least I got one thing right, woohoo
22:07:55 sipa: seems i even acked it!
22:07:57 BlueMatt: wumpus: there are more alus
22:08:38 BlueMatt: but we need to improve lock contention first
22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
22:09:01 BlueMatt: or we can just merge #10192, thats fee
22:09:04 gribble: https://github.com/bitcoin/bitcoin/issues/10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request #10192 · bitcoin/bitcoin · GitHub
22:09:11 BlueMatt: s/fee/free/
22:09:21 morcos: no, we do not need to improve lock contention first.   but we should probably do that before we increase the max beyond 16
22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something?  but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
22:10:51 sipa: *flies to US*
22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
22:11:30 BlueMatt: (because it means our thread contention issues are less)
22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it 
22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores 
22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
22:18:21 BlueMatt: sha2 speed is big
22:18:27 morcos: yeah lots of things to do actually... 
22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.  
22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
22:21:50 BlueMatt: 1 sha round per tx
22:22:25 BlueMatt: and sigcache is obviously a ton

@laanwj
Copy link
Member

laanwj commented Apr 25, 2017

I think defaulting to the number of physical cores is reasonable, not sure if we should change this default. #6361 was done to make issues like #6358 better, where so many threads were created the whole system is overloaded (that was reported a lot at the time!).

So any benchmarking here should not just take speed into account but also user experience.

@jtimon
Copy link
Contributor

jtimon commented Jun 22, 2017

utACK e011c3b

I'm not sure I understand why using number of hyperthreads instead of number of cores causes issues. In my experience hyperthreads have gotten very good in the last years (not sure about the timeframes, just from my previous laptop to my current one and my desktop). For all purposes I tend to treat the number of hyprethreads as if that was the number of cores I have.

@laanwj
Copy link
Member

laanwj commented Jun 25, 2017

I'm not sure I understand why using number of hyperthreads instead of number of cores causes issues.

It might not cause issues for you personally on your specific hardware, but it is clear it caused issues for many people if you read the PRs/issues I linked above.

@fanquake fanquake force-pushed the replace-boost-getnumcores branch from e011c3b to b4f017a Compare October 12, 2017 00:51
@fanquake fanquake force-pushed the replace-boost-getnumcores branch from b4f017a to 937bf43 Compare December 15, 2017 06:48
Copy link
Contributor

@kallewoof kallewoof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK 937bf43

@TheBlueMatt
Copy link
Contributor

Concept ACK.

@eklitzke
Copy link
Contributor

eklitzke commented Mar 6, 2018

utACK 937bf43

With respect to issues like #6358, the right way to handle this on Linux is to use SCHED_BATCH on the loadblk thread. I don't know what scheduler options are available on other platforms. I will put up a PR for this, as it can be done separately anyway.

@laanwj laanwj merged commit 937bf43 into bitcoin:master Mar 6, 2018
laanwj added a commit that referenced this pull request Mar 6, 2018
…, to determine available cores

937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake)

Pull request description:

  Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion.

  The current method for detecting available cores was introduced in #6361.

  Recap of the IRC chat:
  ```
  21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
  21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
  21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
  21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
  21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
  21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
  21:16:43 BlueMatt: fair
  21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
  21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
  21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
  21:20:03 sipa: BlueMatt: don't worry, there is no hurry
  21:59:10 morcos: wumpus: i don't think that is correct
  21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
  21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
  21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
  21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
  22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
  22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
  22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
  22:01:40 wumpus: hyperthreads are basically just a stored register state right?
  22:02:23 sipa: wumpus: yes but it helps the scheduler
  22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
  22:02:37 morcos: well this is where i get out of my depth
  22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
  22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
  22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
  22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
  22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
  22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree
  22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
  22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
  22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
  22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
  22:04:33 gmaxwell: (to use the HT core count)
  22:04:44 wumpus: why was this changed at all then?
  22:04:47 wumpus: I'm confused
  22:05:04 sipa: good question!
  22:05:06 gmaxwell: I had no idea we changed it.
  22:05:25 wumpus: sigh 
  22:05:54 gmaxwell: What PR changed it?
  22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
  22:07:22 wumpus: #6361
  22:07:28 gmaxwell: PR 6461 btw.
  22:07:37 gmaxwell: er lol at least you got it right.
  22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
  22:07:51 wumpus: so at least I got one thing right, woohoo
  22:07:55 sipa: seems i even acked it!
  22:07:57 BlueMatt: wumpus: there are more alus
  22:08:38 BlueMatt: but we need to improve lock contention first
  22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
  22:09:01 BlueMatt: or we can just merge #10192, thats fee
  22:09:04 gribble: #10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request #10192 · bitcoin/bitcoin · GitHub
  22:09:11 BlueMatt: s/fee/free/
  22:09:21 morcos: no, we do not need to improve lock contention first.   but we should probably do that before we increase the max beyond 16
  22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
  22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something?  but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
  22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
  22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
  22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
  22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
  22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
  22:10:51 sipa: *flies to US*
  22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
  22:11:30 BlueMatt: (because it means our thread contention issues are less)
  22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
  22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it 
  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
  22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
  22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores 
  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
  22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
  22:18:21 BlueMatt: sha2 speed is big
  22:18:27 morcos: yeah lots of things to do actually...
  22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block. 
  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
  22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
  22:21:50 BlueMatt: 1 sha round per tx
  22:22:25 BlueMatt: and sigcache is obviously a ton
  ```

Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
laanwj added a commit that referenced this pull request Apr 7, 2018
d54874d Set SCHED_BATCH priority on the loadblk thread. (Evan Klitzke)

Pull request description:

  Today I came across #10271, and while reading the discussion #6358 was linked to. Linux systems have a `SCHED_BATCH` scheduler priority that is useful for threads like loadblk. You can find the full details at [sched(7)](http://man7.org/linux/man-pages/man7/sched.7.html), but I'll quote the relevant part of the man page below:

  > ...this policy will cause the scheduler to always assume that the thread is
  CPU-intensive. Consequently, the scheduler will apply a small scheduling penalty
  with respect to wakeup behavior, so that this thread is mildly disfavored in
  scheduling decisions.
  >
  > This policy is useful for workloads that are noninteractive, but do not want to
  lower their nice value, and for workloads that want a deterministic scheduling
  policy without interactivity causing extra preemptions (between the workload's
  tasks).

  I think this change is useful independently of #10271 and irrespective of whether that change is merged. Under normal operation the loadblk thread will just import `mempool.dat`. However, if Bitcoin is started with `-reindex` or `-reindex-chainstate` this thread will use a great deal of CPU while it rebuilds the chainstate database (and the block database in the case of `-reindex`). By setting `SCHED_BATCH` this thread is less likely to interfere with interactive tasks (e.g. the user's web browser, text editor, etc.).

  I'm leaving the nice value unchanged (which also affects scheduling decisions) because I think that's better set by the user. Likewise I'm not using [ioprio_set(2)](http://man7.org/linux/man-pages/man2/ioprio_set.2.html) because it can cause the thread to become completely I/O starved (and knowledgeable users can use `ionice(1)` anyway).

Tree-SHA512: ea8f7d3921ed5708948809da771345cdc33efd7ba3323e9dfec07a25bc21e8612e2676f9c178e2710c7bc437e8c9cafc5e0463613688fea5699b6e8e2fec6cff
@fanquake fanquake deleted the replace-boost-getnumcores branch January 22, 2020 11:29
deadalnix pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Apr 26, 2020
… available cores

Summary: This is a backport of Core [[bitcoin/bitcoin#10271 | PR10271]]

Test Plan:
  make check

Reviewers: #bitcoin_abc, majcosta

Reviewed By: majcosta

Differential Revision: https://reviews.bitcoinabc.org/D5843
furszy added a commit to PIVX-Project/PIVX that referenced this pull request Jun 8, 2020
3f3edde [Bench] Use PIVX address in Base58Decode test (random-zebra)
5a1be90 [Travis] Disable benchmark framework for trusty test (random-zebra)
1bd89ac Initialize recently introduced non-static class member lastCycles to zero in constructor (random-zebra)
ec60671 Require a steady clock for bench with at least micro precision (random-zebra)
84069ce bench: prefer a steady clock if the resolution is no worse (random-zebra)
38367b1 bench: switch to std::chrono for time measurements (random-zebra)
a24633a Remove countMaskInv caching in bench framework (random-zebra)
9e9bc22 Restore default format state of cout after printing with std::fixed/setprecision (random-zebra)
3dd559d Avoid static analyzer warnings regarding uninitialized arguments (random-zebra)
e85f224 Replace boost::function with std::function (C++11) (random-zebra)
98c0857 Prevent warning: variable 'x' is uninitialized (random-zebra)
7f0d4b3 FastRandom benchmark (random-zebra)
d9fa0c6 Add prevector destructor benchmark (random-zebra)
e1527ba Assert that what might look like a possible division by zero is actually unreachable (random-zebra)
e94cf15 bench: Fix initialization order in registration (random-zebra)
151c25f Basic CCheckQueue Benchmarks (random-zebra)
51aedbc Use std:thread:hardware_concurrency, instead of Boost, to determine available cores (random-zebra)
d447613 Use real number of cores for default -par, ignore virtual cores (random-zebra)
9162a56 [Refactoring] Removed using namespace <xxx> from bench/ sources (random-zebra)
5c07f67 bench: Add support for measuring CPU cycles (random-zebra)
41ce1ed bench: Fix subtle counting issue when rescaling iteration count (random-zebra)
68ea794 Avoid integer division in the benchmark inner-most loop. (random-zebra)
3fa4f27 bench: Added base58 encoding/decoding benchmarks (random-zebra)
4442118 bench: Add crypto hash benchmarks (random-zebra)
a5179b6 [Trivial] ensure minimal header conventions (random-zebra)
8607d6b Support very-fast-running benchmarks (random-zebra)
4aebb60 Simple benchmarking framework (random-zebra)

Pull request description:

  Introduces the benchmarking framework, loosely based on google's micro-benchmarking library (https://github.com/google/benchmark), ported from Bitcoin, up to 0.16.
  The benchmark framework is hard-coded to run each benchmark for one wall-clock second,
  and then spits out .csv-format timing information to stdout.

  Backported PR:
  - bitcoin#6733
  - bitcoin#6770
  - bitcoin#6892
  - bitcoin#8039
  - bitcoin#8107
  - bitcoin#8115
  - bitcoin#9200
  - bitcoin#9202
  - bitcoin#9281
  - bitcoin#6361
  - bitcoin#10271
  - bitcoin#9498
  - bitcoin#9712
  - bitcoin#9547
  - bitcoin#9505 (benchmark only. Rest was in #1557)
  - bitcoin#9792 (benchmark only. Rest was in #643)
  - bitcoin#10272
  - bitcoin#10395 (base58 only)
  - bitcoin#10963
  - bitcoin#11303 (first commit)
  - bitcoin#11562
  - bitcoin#11646
  - bitcoin#11654

  Current output of `src/bench/bench_pivx`:
  ```
  #Benchmark,count,min(ns),max(ns),average(ns),min_cycles,max_cycles,average_cycles
  Base58CheckEncode,131072,7697,8065,7785,20015,20971,20242
  Base58Decode,294912,3305,3537,3454,8595,9198,8981
  Base58Encode,180224,5498,6020,5767,14297,15652,14994
  CCheckQueueSpeed,320,3159960,3535173,3352787,8216030,9191602,8717388
  CCheckQueueSpeedPrevectorJob,96,9184484,11410840,10823070,23880046,29668680,28140445
  FastRandom_1bit,320,3143690,4838162,3199156,8173726,12579373,8317941
  FastRandom_32bit,60,17097612,17923669,17367440,44454504,46602306,45156079
  PrevectorClear,3072,334741,366618,346731,870340,953224,901516
  PrevectorDestructor,2816,344233,368912,357281,895022,959187,928948
  RIPEMD160,288,3404503,3693917,3577774,8851850,9604334,9302363
  SHA1,384,2718128,2891558,2802513,7067238,7518184,7286652
  SHA256,176,6133760,6580005,6239866,15948035,17108376,16223916
  SHA512,240,4251468,4358706,4313463,11054006,11332826,11215186
  Sleep100ms,10,100221470,100302411,100239073,260580075,260790726,260625870
  ```

  NOTE: Not all the tests have been pulled yet (as we might not have the code being tested, or it  would require rewrites to work with our different code base), but the framework is updated to December 2017.

ACKs for top commit:
  Fuzzbawls:
    ACK 3f3edde

Tree-SHA512: c283311a9accf6d2feeb93b185afa08589ebef3f18b6e86980dbc3647b9845f75ac9ecce2f1b08738d25ceac36596a2c89d41e4dbf3b463502aa695611aa1f8e
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jun 10, 2020
…f Boost, to determine available cores

937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake)

Pull request description:

  Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion.

  The current method for detecting available cores was introduced in dashpay#6361.

  Recap of the IRC chat:
  ```
  21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
  21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
  21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
  21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
  21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
  21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
  21:16:43 BlueMatt: fair
  21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
  21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
  21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
  21:20:03 sipa: BlueMatt: don't worry, there is no hurry
  21:59:10 morcos: wumpus: i don't think that is correct
  21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
  21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
  21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
  21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
  22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
  22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
  22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
  22:01:40 wumpus: hyperthreads are basically just a stored register state right?
  22:02:23 sipa: wumpus: yes but it helps the scheduler
  22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
  22:02:37 morcos: well this is where i get out of my depth
  22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
  22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
  22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
  22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
  22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
  22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree
  22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
  22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
  22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
  22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
  22:04:33 gmaxwell: (to use the HT core count)
  22:04:44 wumpus: why was this changed at all then?
  22:04:47 wumpus: I'm confused
  22:05:04 sipa: good question!
  22:05:06 gmaxwell: I had no idea we changed it.
  22:05:25 wumpus: sigh 
  22:05:54 gmaxwell: What PR changed it?
  22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
  22:07:22 wumpus: bitcoin#6361
  22:07:28 gmaxwell: PR 6461 btw.
  22:07:37 gmaxwell: er lol at least you got it right.
  22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
  22:07:51 wumpus: so at least I got one thing right, woohoo
  22:07:55 sipa: seems i even acked it!
  22:07:57 BlueMatt: wumpus: there are more alus
  22:08:38 BlueMatt: but we need to improve lock contention first
  22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
  22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee
  22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub
  22:09:11 BlueMatt: s/fee/free/
  22:09:21 morcos: no, we do not need to improve lock contention first.   but we should probably do that before we increase the max beyond 16
  22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
  22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something?  but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
  22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
  22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
  22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
  22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
  22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
  22:10:51 sipa: *flies to US*
  22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
  22:11:30 BlueMatt: (because it means our thread contention issues are less)
  22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
  22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it 
  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
  22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
  22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores 
  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
  22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
  22:18:21 BlueMatt: sha2 speed is big
  22:18:27 morcos: yeah lots of things to do actually...
  22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block. 
  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
  22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
  22:21:50 BlueMatt: 1 sha round per tx
  22:22:25 BlueMatt: and sigcache is obviously a ton
  ```

Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jun 12, 2020
…f Boost, to determine available cores

937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake)

Pull request description:

  Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion.

  The current method for detecting available cores was introduced in dashpay#6361.

  Recap of the IRC chat:
  ```
  21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
  21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
  21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
  21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
  21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
  21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
  21:16:43 BlueMatt: fair
  21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
  21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
  21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
  21:20:03 sipa: BlueMatt: don't worry, there is no hurry
  21:59:10 morcos: wumpus: i don't think that is correct
  21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
  21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
  21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
  21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
  22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
  22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
  22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
  22:01:40 wumpus: hyperthreads are basically just a stored register state right?
  22:02:23 sipa: wumpus: yes but it helps the scheduler
  22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
  22:02:37 morcos: well this is where i get out of my depth
  22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
  22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
  22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
  22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
  22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
  22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree
  22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
  22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
  22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
  22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
  22:04:33 gmaxwell: (to use the HT core count)
  22:04:44 wumpus: why was this changed at all then?
  22:04:47 wumpus: I'm confused
  22:05:04 sipa: good question!
  22:05:06 gmaxwell: I had no idea we changed it.
  22:05:25 wumpus: sigh 
  22:05:54 gmaxwell: What PR changed it?
  22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
  22:07:22 wumpus: bitcoin#6361
  22:07:28 gmaxwell: PR 6461 btw.
  22:07:37 gmaxwell: er lol at least you got it right.
  22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
  22:07:51 wumpus: so at least I got one thing right, woohoo
  22:07:55 sipa: seems i even acked it!
  22:07:57 BlueMatt: wumpus: there are more alus
  22:08:38 BlueMatt: but we need to improve lock contention first
  22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
  22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee
  22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub
  22:09:11 BlueMatt: s/fee/free/
  22:09:21 morcos: no, we do not need to improve lock contention first.   but we should probably do that before we increase the max beyond 16
  22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
  22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something?  but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
  22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
  22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
  22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
  22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
  22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
  22:10:51 sipa: *flies to US*
  22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
  22:11:30 BlueMatt: (because it means our thread contention issues are less)
  22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
  22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it 
  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
  22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
  22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores 
  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
  22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
  22:18:21 BlueMatt: sha2 speed is big
  22:18:27 morcos: yeah lots of things to do actually...
  22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block. 
  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
  22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
  22:21:50 BlueMatt: 1 sha round per tx
  22:22:25 BlueMatt: and sigcache is obviously a ton
  ```

Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jun 13, 2020
…f Boost, to determine available cores

937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake)

Pull request description:

  Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion.

  The current method for detecting available cores was introduced in dashpay#6361.

  Recap of the IRC chat:
  ```
  21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
  21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
  21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
  21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
  21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
  21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
  21:16:43 BlueMatt: fair
  21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
  21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
  21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
  21:20:03 sipa: BlueMatt: don't worry, there is no hurry
  21:59:10 morcos: wumpus: i don't think that is correct
  21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
  21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
  21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
  21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
  22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
  22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
  22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
  22:01:40 wumpus: hyperthreads are basically just a stored register state right?
  22:02:23 sipa: wumpus: yes but it helps the scheduler
  22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
  22:02:37 morcos: well this is where i get out of my depth
  22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
  22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
  22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
  22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
  22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
  22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree
  22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
  22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
  22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
  22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
  22:04:33 gmaxwell: (to use the HT core count)
  22:04:44 wumpus: why was this changed at all then?
  22:04:47 wumpus: I'm confused
  22:05:04 sipa: good question!
  22:05:06 gmaxwell: I had no idea we changed it.
  22:05:25 wumpus: sigh 
  22:05:54 gmaxwell: What PR changed it?
  22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
  22:07:22 wumpus: bitcoin#6361
  22:07:28 gmaxwell: PR 6461 btw.
  22:07:37 gmaxwell: er lol at least you got it right.
  22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
  22:07:51 wumpus: so at least I got one thing right, woohoo
  22:07:55 sipa: seems i even acked it!
  22:07:57 BlueMatt: wumpus: there are more alus
  22:08:38 BlueMatt: but we need to improve lock contention first
  22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
  22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee
  22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub
  22:09:11 BlueMatt: s/fee/free/
  22:09:21 morcos: no, we do not need to improve lock contention first.   but we should probably do that before we increase the max beyond 16
  22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
  22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something?  but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
  22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
  22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
  22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
  22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
  22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
  22:10:51 sipa: *flies to US*
  22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
  22:11:30 BlueMatt: (because it means our thread contention issues are less)
  22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
  22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it 
  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
  22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
  22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores 
  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
  22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
  22:18:21 BlueMatt: sha2 speed is big
  22:18:27 morcos: yeah lots of things to do actually...
  22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block. 
  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
  22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
  22:21:50 BlueMatt: 1 sha round per tx
  22:22:25 BlueMatt: and sigcache is obviously a ton
  ```

Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jun 14, 2020
…f Boost, to determine available cores

937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake)

Pull request description:

  Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion.

  The current method for detecting available cores was introduced in dashpay#6361.

  Recap of the IRC chat:
  ```
  21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
  21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
  21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
  21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
  21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
  21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
  21:16:43 BlueMatt: fair
  21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
  21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
  21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
  21:20:03 sipa: BlueMatt: don't worry, there is no hurry
  21:59:10 morcos: wumpus: i don't think that is correct
  21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
  21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
  21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
  21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
  22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
  22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
  22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
  22:01:40 wumpus: hyperthreads are basically just a stored register state right?
  22:02:23 sipa: wumpus: yes but it helps the scheduler
  22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
  22:02:37 morcos: well this is where i get out of my depth
  22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
  22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
  22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
  22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
  22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
  22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree
  22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
  22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
  22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
  22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
  22:04:33 gmaxwell: (to use the HT core count)
  22:04:44 wumpus: why was this changed at all then?
  22:04:47 wumpus: I'm confused
  22:05:04 sipa: good question!
  22:05:06 gmaxwell: I had no idea we changed it.
  22:05:25 wumpus: sigh 
  22:05:54 gmaxwell: What PR changed it?
  22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
  22:07:22 wumpus: bitcoin#6361
  22:07:28 gmaxwell: PR 6461 btw.
  22:07:37 gmaxwell: er lol at least you got it right.
  22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
  22:07:51 wumpus: so at least I got one thing right, woohoo
  22:07:55 sipa: seems i even acked it!
  22:07:57 BlueMatt: wumpus: there are more alus
  22:08:38 BlueMatt: but we need to improve lock contention first
  22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
  22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee
  22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub
  22:09:11 BlueMatt: s/fee/free/
  22:09:21 morcos: no, we do not need to improve lock contention first.   but we should probably do that before we increase the max beyond 16
  22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
  22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something?  but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
  22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
  22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
  22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
  22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
  22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
  22:10:51 sipa: *flies to US*
  22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
  22:11:30 BlueMatt: (because it means our thread contention issues are less)
  22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
  22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it 
  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
  22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
  22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores 
  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
  22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
  22:18:21 BlueMatt: sha2 speed is big
  22:18:27 morcos: yeah lots of things to do actually...
  22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block. 
  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
  22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
  22:21:50 BlueMatt: 1 sha round per tx
  22:22:25 BlueMatt: and sigcache is obviously a ton
  ```

Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jun 14, 2020
…f Boost, to determine available cores

937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake)

Pull request description:

  Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion.

  The current method for detecting available cores was introduced in dashpay#6361.

  Recap of the IRC chat:
  ```
  21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
  21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
  21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
  21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
  21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
  21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
  21:16:43 BlueMatt: fair
  21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
  21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
  21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
  21:20:03 sipa: BlueMatt: don't worry, there is no hurry
  21:59:10 morcos: wumpus: i don't think that is correct
  21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
  21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
  21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
  21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
  22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
  22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
  22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
  22:01:40 wumpus: hyperthreads are basically just a stored register state right?
  22:02:23 sipa: wumpus: yes but it helps the scheduler
  22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
  22:02:37 morcos: well this is where i get out of my depth
  22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
  22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
  22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
  22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
  22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
  22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree
  22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
  22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
  22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
  22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
  22:04:33 gmaxwell: (to use the HT core count)
  22:04:44 wumpus: why was this changed at all then?
  22:04:47 wumpus: I'm confused
  22:05:04 sipa: good question!
  22:05:06 gmaxwell: I had no idea we changed it.
  22:05:25 wumpus: sigh 
  22:05:54 gmaxwell: What PR changed it?
  22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
  22:07:22 wumpus: bitcoin#6361
  22:07:28 gmaxwell: PR 6461 btw.
  22:07:37 gmaxwell: er lol at least you got it right.
  22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
  22:07:51 wumpus: so at least I got one thing right, woohoo
  22:07:55 sipa: seems i even acked it!
  22:07:57 BlueMatt: wumpus: there are more alus
  22:08:38 BlueMatt: but we need to improve lock contention first
  22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
  22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee
  22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub
  22:09:11 BlueMatt: s/fee/free/
  22:09:21 morcos: no, we do not need to improve lock contention first.   but we should probably do that before we increase the max beyond 16
  22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
  22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something?  but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
  22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
  22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
  22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
  22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
  22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
  22:10:51 sipa: *flies to US*
  22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
  22:11:30 BlueMatt: (because it means our thread contention issues are less)
  22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
  22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it 
  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
  22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
  22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores 
  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
  22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
  22:18:21 BlueMatt: sha2 speed is big
  22:18:27 morcos: yeah lots of things to do actually...
  22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block. 
  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
  22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
  22:21:50 BlueMatt: 1 sha round per tx
  22:22:25 BlueMatt: and sigcache is obviously a ton
  ```

Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
ftrader added a commit to bitcoin-cash-node/bitcoin-cash-node that referenced this pull request Aug 17, 2020
… available cores

Summary: This is a backport of Core [[bitcoin/bitcoin#10271 | PR10271]]

Test Plan:
  make check

Reviewers: #bitcoin_abc, majcosta

Reviewed By: majcosta

Differential Revision: https://reviews.bitcoinabc.org/D5843
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Dec 16, 2020
d54874d Set SCHED_BATCH priority on the loadblk thread. (Evan Klitzke)

Pull request description:

  Today I came across bitcoin#10271, and while reading the discussion dashpay#6358 was linked to. Linux systems have a `SCHED_BATCH` scheduler priority that is useful for threads like loadblk. You can find the full details at [sched(7)](http://man7.org/linux/man-pages/man7/sched.7.html), but I'll quote the relevant part of the man page below:

  > ...this policy will cause the scheduler to always assume that the thread is
  CPU-intensive. Consequently, the scheduler will apply a small scheduling penalty
  with respect to wakeup behavior, so that this thread is mildly disfavored in
  scheduling decisions.
  >
  > This policy is useful for workloads that are noninteractive, but do not want to
  lower their nice value, and for workloads that want a deterministic scheduling
  policy without interactivity causing extra preemptions (between the workload's
  tasks).

  I think this change is useful independently of bitcoin#10271 and irrespective of whether that change is merged. Under normal operation the loadblk thread will just import `mempool.dat`. However, if Bitcoin is started with `-reindex` or `-reindex-chainstate` this thread will use a great deal of CPU while it rebuilds the chainstate database (and the block database in the case of `-reindex`). By setting `SCHED_BATCH` this thread is less likely to interfere with interactive tasks (e.g. the user's web browser, text editor, etc.).

  I'm leaving the nice value unchanged (which also affects scheduling decisions) because I think that's better set by the user. Likewise I'm not using [ioprio_set(2)](http://man7.org/linux/man-pages/man2/ioprio_set.2.html) because it can cause the thread to become completely I/O starved (and knowledgeable users can use `ionice(1)` anyway).

Tree-SHA512: ea8f7d3921ed5708948809da771345cdc33efd7ba3323e9dfec07a25bc21e8612e2676f9c178e2710c7bc437e8c9cafc5e0463613688fea5699b6e8e2fec6cff
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Dec 18, 2020
d54874d Set SCHED_BATCH priority on the loadblk thread. (Evan Klitzke)

Pull request description:

  Today I came across bitcoin#10271, and while reading the discussion dashpay#6358 was linked to. Linux systems have a `SCHED_BATCH` scheduler priority that is useful for threads like loadblk. You can find the full details at [sched(7)](http://man7.org/linux/man-pages/man7/sched.7.html), but I'll quote the relevant part of the man page below:

  > ...this policy will cause the scheduler to always assume that the thread is
  CPU-intensive. Consequently, the scheduler will apply a small scheduling penalty
  with respect to wakeup behavior, so that this thread is mildly disfavored in
  scheduling decisions.
  >
  > This policy is useful for workloads that are noninteractive, but do not want to
  lower their nice value, and for workloads that want a deterministic scheduling
  policy without interactivity causing extra preemptions (between the workload's
  tasks).

  I think this change is useful independently of bitcoin#10271 and irrespective of whether that change is merged. Under normal operation the loadblk thread will just import `mempool.dat`. However, if Bitcoin is started with `-reindex` or `-reindex-chainstate` this thread will use a great deal of CPU while it rebuilds the chainstate database (and the block database in the case of `-reindex`). By setting `SCHED_BATCH` this thread is less likely to interfere with interactive tasks (e.g. the user's web browser, text editor, etc.).

  I'm leaving the nice value unchanged (which also affects scheduling decisions) because I think that's better set by the user. Likewise I'm not using [ioprio_set(2)](http://man7.org/linux/man-pages/man2/ioprio_set.2.html) because it can cause the thread to become completely I/O starved (and knowledgeable users can use `ionice(1)` anyway).

Tree-SHA512: ea8f7d3921ed5708948809da771345cdc33efd7ba3323e9dfec07a25bc21e8612e2676f9c178e2710c7bc437e8c9cafc5e0463613688fea5699b6e8e2fec6cff
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants