Conversation

@jonatack jonatack commented Sep 6, 2021

Follow-up to #22736.

The first commit addresses the issue identified and reported by Martin Ankerl in #22736 (comment) to fix the lock contention duration reporting.

The next three commits make improvements to the timer code in BCLog::LogMsg() and add util/types.h with an ALWAYS_FALSE template, that springboard from #22736 (comment) by Marco Falke.
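As a rough illustration of the approach (a sketch only, with assumed names rather than the actual Bitcoin Core code): a dependent-false trait lets an `if constexpr` chain turn an unsupported duration type into a compile-time error instead of a runtime log message, because the `static_assert` only fires when the fallback branch is actually instantiated.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <type_traits>

// Dependent-false trait: always false, but dependent on T, so the
// static_assert below only fires for the instantiation that reaches it.
template <typename T>
struct ALWAYS_FALSE : std::false_type {};

// Hypothetical stand-in for the duration formatting in BCLog::LogMsg():
// each supported std::chrono type gets its unit suffix, and any other
// time type fails to compile instead of producing a runtime error.
template <typename TimeType>
std::string FormatDuration(TimeType duration)
{
    if constexpr (std::is_same_v<TimeType, std::chrono::microseconds>) {
        return std::to_string(duration.count()) + "us";
    } else if constexpr (std::is_same_v<TimeType, std::chrono::milliseconds>) {
        return std::to_string(duration.count()) + "ms";
    } else if constexpr (std::is_same_v<TimeType, std::chrono::seconds>) {
        return std::to_string(duration.count()) + "s";
    } else {
        static_assert(ALWAYS_FALSE<TimeType>::value, "Unexpected time type");
    }
}
```

Instantiating such a function with, say, `std::chrono::minutes` would then be rejected by the compiler rather than logging "Error: unexpected time type" at runtime.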

jonatack and others added 2 commits September 6, 2021 23:43
Co-authored-by: Martin Ankerl <martin.ankerl@gmail.com>
- make timer code more homogeneous
- replace division with multiplication
- log if the time type is unexpected
@martinus martinus left a comment

ACK 498b323, code review & tested. Found two potential improvements.

@jonatack jonatack force-pushed the logging-and-lock-contention-fixups branch from e9af7c2 to f530202 on September 7, 2021 17:28
@theStack theStack left a comment

ACK f530202

Thanks for catching the timing flaw @martinus (and also sorry to all for not noticing it in my review of #22736)! Indeed the numbers look way more plausible now -- on a signet non-debug build (src/bitcoind -signet -debug=lock), lock contention times are now in the range of hundreds to thousands of microseconds on my machine.

SetMockTime(1);
auto minute_timer = BCLog::Timer<std::chrono::minutes>("tests", "end_msg");
SetMockTime(2);
BOOST_CHECK_EQUAL(minute_timer.LogMsg("test minutes"), "Error: unexpected time type");
Member

Seems a bit odd to add a new feature in one commit and then remove it in the next commit.

If you squash the two commits and keep the ordering, the diff will be a lot smaller, if not empty.

Member Author (jonatack)
Kept the last three commits separate as they do three different things and were contributed by three different people.

Member
Fair enough

Member
3 different commits by 3 different people is irrelevant if we're just adding code to immediately delete it. As long as bddae7e came first it didn't matter. So I don't understand why the remaining two commits, 498b323 and f530202, couldn't have been squashed together and just have the two authors.


martinus commented Sep 8, 2021

re-ACK f530202. I ran a fully synced node for about a day. My node was mostly idle though so not much was going on. I wrote a little script to parse the debug.log and summarize the output to see if anything interesting was going on, here is the result:

total[ms] average[ms] count thread where
869.151 17.383 50 msghand cs_main, net_processing.cpp:4051
434.890 27.181 16 opencon cs, ./addrman.h:172
382.321 7.802 49 scheduler cs_main, net_processing.cpp:1476
153.231 19.154 8 opencon cs_main, net_processing.cpp:1152
98.148 0.008 11960 msghand cs_vNodes, net.cpp:2263
87.235 0.022 3983 msghand m_mutex, ./checkqueue.h:168
68.884 0.004 18252 msghand cs_vNodes, net.cpp:2228
66.666 1.449 46 msghand m_recent_confirmed_transactions_mutex, net_processing.cpp:1649
62.938 0.054 1167 scriptch.8 m_mutex, ./checkqueue.h:77
62.812 0.054 1173 scriptch.6 m_mutex, ./checkqueue.h:77
62.467 0.053 1178 scriptch.5 m_mutex, ./checkqueue.h:77
61.860 0.050 1241 scriptch.7 m_mutex, ./checkqueue.h:77
61.779 0.052 1192 scriptch.2 m_mutex, ./checkqueue.h:77
61.284 0.051 1200 scriptch.3 m_mutex, ./checkqueue.h:77
60.051 0.051 1181 scriptch.10 m_mutex, ./checkqueue.h:77
59.659 0.049 1207 scriptch.4 m_mutex, ./checkqueue.h:77
58.570 0.049 1207 scriptch.1 m_mutex, ./checkqueue.h:77
58.133 0.049 1175 scriptch.0 m_mutex, ./checkqueue.h:77
56.667 0.046 1245 scriptch.9 m_mutex, ./checkqueue.h:77
47.258 47.258 1 opencon cs, ./addrman.h:206
31.575 31.575 1 msghand cs, ./addrman.h:180
23.652 0.004 5278 net pnode->cs_hSocket, net.cpp:1539
23.523 3.920 6 msghand cs_main, net_processing.cpp:4522
17.490 0.008 2266 net pnode->cs_vSend, net.cpp:1359
15.514 0.287 54 msghand g_cs_orphans, net_processing.cpp:4051
9.338 0.004 2346 net cs_vNodes, net.cpp:1611
8.315 0.004 2080 net cs_vNodes, net.cpp:1522
6.490 0.004 1770 net cs_vNodes, net.cpp:1342
3.453 0.003 1033 net cs_vNodes, net.cpp:1236
2.838 0.218 13 msghand ::cs_main, validation.cpp:4950
2.663 2.663 1 msghand cs_main, net_processing.cpp:3130
1.745 0.291 6 scheduler cs_main, net_processing.cpp:4278
1.443 0.005 299 msghand newTaskMutex, scheduler.cpp:76
1.064 0.003 317 net cs_vNodes, net.cpp:1287
0.799 0.266 3 msghand cs_main, net_processing.cpp:3411
0.639 0.004 145 msghand cs_vNodes, ./net.h:828
0.615 0.308 2 msghand cs_main, net_processing.cpp:2921
0.487 0.061 8 msghand m_mutex, ./checkqueue.h:77
0.424 0.071 6 msghand g_cs_orphans, net_processing.cpp:3451
0.409 0.004 100 opencon cs_vNodes, net.cpp:1906
0.262 0.262 1 msghand cs_main, net_processing.cpp:3204
0.256 0.008 32 loadblk cs_main, validation.cpp:4498
0.242 0.004 58 msghand node.cs_hSocket, net.cpp:781
0.236 0.118 2 msghand cs_main, net_processing.cpp:2600
0.174 0.004 44 net mutexMsgProc, net.cpp:1630
0.110 0.110 1 scheduler cs_main, index/base.cpp:282
0.108 0.108 1 msghand g_cs_orphans, txorphanage.cpp:156
0.098 0.005 18 scheduler m_cs_callbacks_pending, scheduler.cpp:151
0.079 0.004 19 msghand m_cs_callbacks_pending, scheduler.cpp:183
0.078 0.004 20 addcon cs_vNodes, net.cpp:2120
0.051 0.003 15 msghand pnode->cs_vSend, net.cpp:3016
0.028 0.006 5 scheduler instance->m_cs_callbacks_pending, scheduler.cpp:168
0.016 0.005 3 scheduler m_cs_callbacks_pending, scheduler.cpp:137
0.014 0.003 5 net pnode->cs_vProcessMsg, net.cpp:1572
0.009 0.003 3 scheduler newTaskMutex, scheduler.cpp:76
0.005 0.002 2 msghand pfrom->cs_vProcessMsg, net_processing.cpp:4077
0.005 0.002 2 msghand m_cs_callbacks_pending, scheduler.cpp:137
0.004 0.002 2 opencon cs_vNodes, net.cpp:332
0.002 0.002 1 scheduler g_cs_orphans, txorphanage.cpp:175
0.001 0.001 2 loadblk m_mutex, random.cpp:419
0.001 0.001 2 msghand cs_args, util/system.cpp:1024
0.001 0.001 1 loadblk newTaskMutex, scheduler.cpp:76
0.001 0.001 1 loadblk cs_args, util/system.cpp:1024
0.001 0.001 1 net cs_hSocket, net.cpp:499
0.001 0.001 1 msghand cs_main, net_processing.cpp:2044

Almost all of the contention in cs_main, net_processing.cpp:4051 happened at the startup of the node.
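For readers who want to reproduce this kind of summary, a minimal summarizer in the same spirit can be sketched as follows (a C++ sketch, not the script used above; the `lock contention <mutex>, <file:line> completed (Nμs)` line format is assumed from the debug-log excerpts quoted in this thread, and the per-thread breakdown is omitted for brevity):

```cpp
#include <cassert>
#include <cmath>
#include <iostream>
#include <map>
#include <regex>
#include <sstream>
#include <string>

struct Stats {
    double total_ms = 0.0;  // summed contention duration in milliseconds
    long count = 0;         // number of contention events at this site
};

// Group "lock contention <mutex>, <file:line> completed (<N>μs)" lines
// by contention site and accumulate totals, roughly like the total[ms]
// and count columns above. "started" lines carry no duration and are
// skipped.
std::map<std::string, Stats> SummarizeContention(std::istream& log)
{
    static const std::regex re{R"(lock contention (\S+ \S+:\d+) completed \((\d+))"};
    std::map<std::string, Stats> by_site;
    std::string line;
    std::smatch m;
    while (std::getline(log, line)) {
        if (std::regex_search(line, m, re)) {
            Stats& s = by_site[m[1].str()];
            s.total_ms += std::stod(m[2].str()) / 1000.0;  // μs -> ms
            ++s.count;
        }
    }
    return by_site;
}
```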


maflcko commented Sep 9, 2021

I expect lock contention to be more pronounced in the GUI (or when there is an active RPC poll)

@maflcko maflcko merged commit d2dd169 into bitcoin:master Sep 9, 2021
@jonatack jonatack deleted the logging-and-lock-contention-fixups branch September 9, 2021 14:05
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Sep 11, 2021
…to fix duration, improve BCLog::LogMsg()

f530202 Make unexpected time type in BCLog::LogMsg() a compile-time error (Martin Ankerl)
bddae7e Add util/types.h with ALWAYS_FALSE template (MarcoFalke)
498b323 log, timer: improve BCLog::LogMsg() (Jon Atack)
8d2f847 sync: inline lock contention logging macro to fix time duration (Jon Atack)

Pull request description:

  Follow-up to bitcoin#22736.

  The first commit addresses the issue identified and reported by Martin Ankerl in bitcoin#22736 (comment) to fix the lock contention duration reporting.

  The next three commits make improvements to the timer code in `BCLog::LogMsg()` and add `util/types.h` with an `ALWAYS_FALSE` template, that springboard from bitcoin#22736 (comment) by Marco Falke.

ACKs for top commit:
  martinus:
    re-ACK f530202. I ran a fully synced node for about a day. My node was mostly idle though so not much was going on. I [wrote a little script](https://github.com/martinus/bitcoin-stuff/blob/main/scripts/parse-debuglog-contention-single.rb) to parse the `debug.log` and summarize the output to see if anything interesting was going on, here is the result:
  theStack:
    ACK f530202

Tree-SHA512: 37d093eac5590e1b5846ab5994d0950d71e131177d1afe4a5f7fcd614270f977e0ea117e7af788e9a74ddcccab35b42ec8fa4db3a3378940d4988df7d21cdaaa

ajtowns commented Sep 29, 2021

This seems to introduce a busy loop when running the checkqueue tests for me:

$ make -j5 check &
$ tail -f test/*.log
2021-09-29T18:35:19.275154Z [test] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:168 started
2021-09-29T18:35:19.275213Z [scriptch.2] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (306μs)
2021-09-29T18:35:19.275244Z [scriptch.0] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (222μs)
2021-09-29T18:35:19.275275Z [scriptch.1] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (127μs)
2021-09-29T18:35:19.275293Z [scriptch.0] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 started
2021-09-29T18:35:19.275319Z [scriptch.2] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 started
2021-09-29T18:35:19.275347Z [test] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:168 completed (109μs)
2021-09-29T18:35:19.275366Z [scriptch.1] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 started
2021-09-29T18:35:19.275387Z [scriptch.0] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (62μs)
2021-09-29T18:35:19.275404Z [test] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:168 started
2021-09-29T18:35:19.275425Z [scriptch.2] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (69μs)

It does finish eventually with a >1GB log file:

$ ls -l test/*.log --sort=s -h -r | tail -n2
-rw-r--r-- 1 aj aj   20M Sep 30 04:30 test/validationinterface_tests.cpp.log
-rw-r--r-- 1 aj aj  1.3G Sep 30 04:35 test/checkqueue_tests.cpp.log
$ cat test/checkqueue_tests.cpp.log | grep 'Enter: lock contention' | cut -d\  -f8,9 | sort | uniq -c
      2 instance->m_cs_callbacks_pending, scheduler.cpp:168
      6 m_cs_callbacks_pending, scheduler.cpp:183
2599034 m_mutex, ./checkqueue.h:168
7399288 m_mutex, ./checkqueue.h:77
 161286 m, test/checkqueue_tests.cpp:66
      2 newTaskMutex, scheduler.cpp:76

I was thinking that the fact that logging takes its own lock might have been a problem (leading to contention on that lock? or just taking more time, leading to more contention on other locks?) but it seems like there really are that many instances of contention for the checkqueue mutex -- reducing the logging by a factor of 1000 only reduces the logging proportionally. But gigabyte log files and many minutes to run a simple test don't seem great...


theuni commented Sep 29, 2021

I suspect you have a bunch of spurious failures. Regardless of whether that's the case or not, relying on unspecified behavior (spurious mutex failures) from libc/stdlib is a bad idea, especially if it leads to possible remote triggers (3 of the 4 most contended locks are in net_processing).

IMO this is far too dangerous to have turned on by default.


theuni commented Sep 29, 2021

Following up: if this were to be reverted, here is an alternative solution to the header macro problem for lock contention and lock ordering. As a caveat, I haven't benchmarked to see the impact of the lost inlining.


martinus commented Sep 30, 2021

I suspect you have a bunch of spurious failures.

I think the contentions are real and not spurious. For me the file checkqueue_tests.cpp.log is by far the largest, and it looks like most of the contentions come from the test test_CheckQueue_Correct_Random. When I run

./src/test/test_bitcoin --run_test=checkqueue_tests/test_CheckQueue_Correct_Random -- DEBUG_LOG_OUT >out.txt

I get a 1.1GB log file. Parsing this with this script I get the following output:

total[ms] average[ms] count thread where
44386.247 0.039 1133991 test m_mutex, ./checkqueue.h:168
41293.635 0.039 1065538 scriptch.0 m_mutex, ./checkqueue.h:77
41167.037 0.039 1068648 scriptch.1 m_mutex, ./checkqueue.h:77
41140.515 0.039 1068460 scriptch.2 m_mutex, ./checkqueue.h:77
17.652 0.041 434 test m_mutex, ./checkqueue.h:77

The first 4 threads have almost the same number of contentions, so it seems to me that they are blocking each other. Especially the 3 scriptch threads have practically the same number of contentions. I suspect that these 3 threads are always started simultaneously but only one of them gets the lock, blocking the other 2.


fanquake commented Oct 4, 2021

Clearly this needs following up on. I've opened an issue, #23167, for further discussion / so this doesn't get lost.


jonatack commented Oct 8, 2021

I think the contentions are real and not spurious.

Good call @martinus. Testing this further, it was not only the case on every run but also pre-existing with DEBUG_LOCKCONTENTION defined, as some of these tests are heavily contended by design. Logging contentions by default in the tests served to bring the issue to light, which was the goal.


Opened #23223 to resolve.

fanquake added a commit to bitcoin-core/gui that referenced this pull request Apr 8, 2022
…TION preprocessor directive

4394733 Add DEBUG_LOCKCONTENTION documentation to the developer notes (Jon Atack)
39a34b6 Put lock logging behind DEBUG_LOCKCONTENTION preprocessor directive (Jon Atack)

Pull request description:

  This is a more minimal, no-frills version of #24734 for backport. The other fixes and improvements in that pull can be done after.

  *Copy of the PR 24734 description:*

  PRs #22736, #22904 and #23223 changed lock contention logging from a `DEBUG_LOCKCONTENTION` compile-time preprocessor directive to a runtime `lock` log category and improved the logging output. This changed the locking from using `lock()` to `try_lock()`:

  - `void Mutex::UniqueLock::lock()` acquires the mutex and blocks until it gains access to it

  - `bool Mutex::UniqueLock::try_lock()` doesn't block but instead immediately returns whether it acquired the mutex; it may be used by `lock()` internally as part of the deadlock-avoidance algorithm
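The difference matters because the runtime-logging change rewrote the hot path roughly along these lines (a simplified sketch with assumed names, not the actual `sync.h` code): try the non-blocking acquire first, and only on failure record the contention and fall back to a blocking `lock()`.

```cpp
#include <cassert>
#include <mutex>

// Simplified sketch of try_lock-based contention detection: the happy
// (uncontended) path costs one try_lock(); the contended path pays an
// extra bookkeeping step before blocking on lock(). Returns whether the
// acquisition was contended; the caller owns the mutex on return.
template <typename MutexType>
bool LockAndCountContention(MutexType& mutex, long& contention_count)
{
    if (mutex.try_lock()) {
        return false;           // acquired immediately: no contention
    }
    ++contention_count;         // stand-in for the "lock contention" log line
    mutex.lock();               // block until the current holder releases it
    return true;                // acquired after contention
}
```

Gating this pattern behind `#ifdef DEBUG_LOCKCONTENTION`, as this PR does, restores a plain blocking `lock()` in default builds while keeping the runtime on/off switch for builds that opt in.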

  In theory the cost of `try_lock` might be essentially the [same](https://www.erisian.com.au/bitcoin-core-dev/log-2022-03-31.html#l-697) relative to `lock`. The test-and-set logic of these calls is purported to be ~ constant time, optimised and light/quick if used carefully (i.e. no mutex convoying), compared to system calls, memory/cache coherency and fences, wait queues, and (particularly) lock contentions. See the discussion around bitcoin/bitcoin#22736 (comment) and after with respect to performance/cost aspects.  However, there are reasonable concerns (see [here](bitcoin/bitcoin#22736 (comment)) and [here](https://www.erisian.com.au/bitcoin-core-dev/log-2022-03-31.html#l-620)) that `Base::try_lock()` may be potentially [costly](https://www.erisian.com.au/bitcoin-core-dev/log-2022-03-31.html#l-700) or [risky](bitcoin/bitcoin#22904 (comment)) compared to `Base::lock()` in this very frequently called code.

  One alternative to keep the run-time lock logging would be to gate the `try_lock` call behind the logging conditional, for example as proposed in bitcoin/bitcoin@ccd73de and ACKed [here](bitcoin/bitcoin#22736 (comment)). However, this would add the [cost](bitcoin/bitcoin#22736 (comment)) of `if (LogAcceptCategory(BCLog::LOCK))` to the hotspot, instead of replacing `lock` with `try_lock`, for the most frequent happy path (non-contention).

  It turns out we can keep the advantages of the runtime lock contention logging (the ability to turn it on/off at runtime) while out of prudence putting the `try_lock()` call and `lock` logging category behind a  `DEBUG_LOCKCONTENTION` compile-time preprocessor directive, and also still retain the lock logging enhancements of the mentioned PRs, as suggested in bitcoin/bitcoin#24734 (comment) by W. J. van der Laan, in bitcoin/bitcoin#22736 (comment), and in the linked IRC discussion.

  Proposed here and for backport to v23.

ACKs for top commit:
  laanwj:
    Code review ACK 4394733

Tree-SHA512: 89b1271cae1dca0eb251914b1a60fc5b68320aab4a3939c57eec3a33a3c8f01688f05d95dfc31f91d71a6ed80cfe2d67b77ff14742611cc206175e47b2e5d3b1
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Sep 28, 2022
Summary:
This diff:
- adds a LOCK logging category if DEBUG_LOCKCONTENTION is defined at compilation.
- adds a timing macro in microseconds, `LOG_TIME_MICROS_WITH_CATEGORY`
- updates `BCLog::LogMsg()` to omit irrelevant decimals for microseconds and skip unneeded code and math.
- improves the lock contention logging, drops the all-caps, and displays the duration in microseconds
- makes unexpected time type in `BCLog::LogMsg()` a compile-time error
- adds a paragraph of documentation to developer-notes.md

Lock contention logging in checkqueue_tests is disabled, as some of these tests are designed to be heavily contended to trigger race conditions or other issues. This created very large log files when run with DEBUG_LOCKCONTENTION defined.

This is a backport of [[bitcoin/bitcoin#22736 | core#22736]], [[bitcoin/bitcoin#22904 | core#22904]], [[bitcoin/bitcoin#23223 | core#23223]] and [[bitcoin/bitcoin#24770 | core#24770]]

The first pull request implements the lock category, the other pull requests are important follow-ups (mixed with minor refactorings) fixing various regressions (bugs and suspected performance issues).

Co-authored-by: Hennadii Stepanov <32963518+hebasto@users.noreply.github.com>
Co-authored-by: practicalswift <practicalswift@users.noreply.github.com>
Co-authored-by: Martin Ankerl <martin.ankerl@gmail.com>

Test Plan:
`ninja all check-all`

Follow the instructions in the new documentation and check that the logging works:

```
cmake .. -GNinja -DCMAKE_CXX_FLAGS="-DDEBUG_LOCKCONTENTION"
ninja
src/bitcoind -h | grep -A8 "debug=<category>"
src/bitcoind -debug=lock
```

Reviewers: #bitcoin_abc, sdulfari

Reviewed By: #bitcoin_abc, sdulfari

Subscribers: sdulfari

Differential Revision: https://reviews.bitcoinabc.org/D12049
@bitcoin bitcoin locked and limited conversation to collaborators Oct 30, 2022
8 participants