-
Notifications
You must be signed in to change notification settings - Fork 37.8k
Rework logging timer #24923
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework logging timer #24923
The head ref may contain hidden characters: "2204-log-time-\u{1F4EC}"
Conversation
For testing |
Interesting, I'm curious what the lock logging will look like with variable units. |
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach ACK, tested with lock logging and the unit test.
src/logging/timer.h
Outdated
} | ||
return strprintf("%s: %s (%.2fμs)", m_prefix, msg, CountMicrosecondsDouble(dur)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the second commit, maybe inverse the order of the duration checks from the smallest unit to the largest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In higher frequency timing logs like lock, the smaller units seem to be more frequent. Debug build on a slow system:
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 started
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 completed (253.02μs)
2022-04-19T17:59:03Z [msghand] Enter: lock contention cs_main, net_processing.cpp:4196 started
2022-04-19T17:59:03Z [msghand] Enter: lock contention cs_main, net_processing.cpp:4196 completed (14.28ms)
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 started
2022-04-19T17:59:03Z [scheduler] Enter: lock contention ::cs_main, validation.cpp:5151 started
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 completed (157.08ms)
2022-04-19T17:59:03Z [msghand] Enter: lock contention ::cs_main, validation.cpp:5151 started
2022-04-19T17:59:03Z [scheduler] Enter: lock contention ::cs_main, validation.cpp:5151 completed (24.14ms)
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 started
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 completed (10.48μs)
2022-04-19T17:59:03Z [msghand] Enter: lock contention ::cs_main, validation.cpp:5151 completed (7.92ms)
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 started
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 completed (5.57μs)
2022-04-19T17:59:03Z [msghand] Enter: lock contention ::cs_main, validation.cpp:5151 started
2022-04-19T17:59:03Z [msghand] Enter: lock contention ::cs_main, validation.cpp:5151 completed (85.71μs)
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 started
2022-04-19T17:59:03Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 completed (4.68μs)
2022-04-19T17:59:03Z [msghand] Enter: lock contention m_mutex, random.cpp:416 started
2022-04-19T17:59:03Z [msghand] Enter: lock contention m_mutex, random.cpp:416 completed (62.23μs)
2022-04-19T17:59:03Z [msghand] Enter: lock contention cs_main, net_processing.cpp:4668 started
2022-04-19T17:59:03Z [msghand] Enter: lock contention cs_main, net_processing.cpp:4668 completed (911.14μs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the order is intentional and it can not be changed without changing the behaviour of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, you could invert both the logic and the order?
if (dur <= 1ms) { /* microseconds */ }
else if (dur <= 1s) { /* milliseconds */ }
else { /* seconds */ }
Doing one integer comparison instead of two prior to a printf seems like an amazing level of microoptimisation though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing
strprintf("%s: %s (%.2fμs)", m_prefix, msg, CountMicrosecondsDouble(dur))
to
strprintf("%s: %s (%.5fms)", m_prefix, msg, CountMillisecondsDouble(dur))
? Then you'd see "10.23s", "10.23ms" and "0.01023ms" which might be easier to follow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There aren't any in the sample above, but it's the big ones in seconds (s) that I find harder to pick out when eyeballing the mix (maybe because they are now smaller). The μs ones stand out pretty well to my eye. Sample with seconds:
2022-04-19T17:58:03Z [net] Enter: lock contention m_nodes_mutex, net.cpp:1294 started
2022-04-19T17:58:03Z [net] Enter: lock contention m_nodes_mutex, net.cpp:1294 completed (4.16μs)
2022-04-19T17:58:03Z [opencon] Enter: lock contention m_nodes_mutex, net.cpp:366 started
2022-04-19T17:58:03Z [opencon] Enter: lock contention m_nodes_mutex, net.cpp:366 completed (4.43μs)
2022-04-19T17:58:03Z [net] Enter: lock contention m_nodes_mutex, net.cpp:1294 started
2022-04-19T17:58:03Z [net] Enter: lock contention m_nodes_mutex, net.cpp:1294 completed (4.28μs)
2022-04-19T17:58:03Z [opencon] Enter: lock contention cs_main, net_processing.cpp:1229 started
2022-04-19T17:58:05Z [msghand] Enter: lock contention ::cs_main, validation.cpp:5151 started
2022-04-19T17:58:05Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 completed (12.17s)
2022-04-19T17:58:05Z [opencon] Enter: lock contention cs_main, net_processing.cpp:1229 completed (2.23s)
2022-04-19T17:58:05Z [msghand] Enter: lock contention ::cs_main, validation.cpp:5151 completed (4.40ms)
2022-04-19T17:58:05Z [scheduler] Enter: lock contention ::cs_main, validation.cpp:5151 completed (11.81s)
2022-04-19T17:58:05Z [msghand] Enter: lock contention cs_main, validation.cpp:2906 started
2022-04-19T17:58:05Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 started
2022-04-19T17:58:05Z [msghand] Enter: lock contention cs_main, validation.cpp:2906 completed (558.78μs)
2022-04-19T17:58:05Z [loadblk] Enter: lock contention cs_main, validation.cpp:4672 completed (735.17μs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, tangential, but: wouldn't it be better to log "foo.cpp:123 held the cs_main lock for 20 seconds" instead of "bar.cpp:456 was waiting for the cs_main lock for 20 seconds" ? In theory, you could then only report when some bit of code held a lock for an "unusually" long amount of time, and could do so even if no other thread happened to also want it... I just don't really see the point of a log entry for "m_nodes_mutex net.cpp:1294 (4.28μs)" -- that doesn't really seem like actionable information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's valuable information to show who waited, because just holding a lock when nobody else needs it isn't an issue. But what I'd really like to see is who actually blocked someone else.
I'd personally prefer exact timestamps from e.g. std::chrono::steady_clock
in the best precision possible in the logs, then one could extract a lot of useful information from it. E.g. who actually blocked whom and for how long. It would then also be possible to generate input for e.g. chrome tracing which is a simple JSON format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. If something took a short time, why would anyone care if it took 900 nanoseconds or 1700 nanoseconds? After all this isn't a microbenchmark and real world fluctuations are larger than nanosecond precision. Also, nanosecond precision didn't exist before, so introducing it can be done in a separate pull or a follow-up.
Changing the content of the lock contention logging or chrome tracing also appear largely unrelated to a change to the formatting of durations in logs.
I think for them it will actually be an improvement, as contention might happen for a short blink (a few microseconds) or, during a block connection, a bit longer (a few seconds). |
Yes, the lock logging has a high variance in timing. See #24923 (comment) for a test with this change. One downside is that it's more difficult to pick out the higher times by eyeballing the log with this change (similar to your point that "There is one argument for a fixed precision, as it makes debug log parsers slightly easier.") but still seems to be an improvement for the reasons in the OP. |
Yeah, I am happy to drop the second commit (or postpone it). Another thought would be to print the same value twice. The second time as |
Concept ACK. Would be nice if there's some way to make it easier to notice the difference between millis and micros when scanning logs. |
Thanks, I've reworked to make it easier for humans and scripts to parse the format. For example, |
First commit fa1d0f1 build error: Sample of last push ddddf68 with lock logging:
|
Good catch. I'll fix the dirty build and probably remove the nanosecond precision again. Previously the best precision we had was microseconds, so it is probably best to keep it. The debug log would then look a bit like:
|
Yes, I agree nanosecond precision is overkill here. For controlled benchmark situations it can come in handy but not for logging in a application running on (usually) a busy system. |
to clarify, my suggestion is already implemented |
fa83da6
to
ffff28a
Compare
Concept ACK |
ffff28a
to
fa35fe8
Compare
Would be good to make progress here 🥲 Currently the logs may look like this: Which requires the dev/user to spend a minute to count the digits to figure out if it means 1.6 seconds, or 16 seconds, or 160 seconds ... |
I wonder if the second commit is holding up the first one. The first function of this logging for me is to initially see where the high values are and then what is causing them, and the second commit makes that first step of seeing the higher values more difficult IMO with the multiple units displayed. Edit: I see that in the current version one unit (seconds) is now used, with two different precisions. Perhaps settle on one precision? |
Does it? It should be trivial to find small values, as they all start with |
I think so, as small values "look" large (more decimal places) and large values "look" small. A fixed precision may help? unfiltered debug log excerpt (mainnet) with this branch at fa35fe8
|
Ok, closing for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the Anyway, the first commit is now in #27005 |
The logging timer has many issues:
Fix all issues in this patch.