Skip to content

Conversation

maflcko
Copy link
Member

@maflcko maflcko commented May 22, 2025

A failing CHECK_NONFATAL will throw an exception. This is fine and even desired in production builds, because the program may catch the exception and give the user a way to easily report the bug upstream.

However, in debug development builds, exceptions for internal bugs are problematic:

  • The exception could accidentally be caught and silently ignored
  • The exception does not include a full stacktrace, possibly making debugging harder

Fix all issues by turning the exception into an abort in debug builds.

This can be tested by reverting the hunks to src/rpc/node.cpp and test/functional/rpc_misc.py and then running the functional or fuzz tests.

@DrahtBot
Copy link
Contributor

DrahtBot commented May 22, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32588.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK ryanofsky, achow101

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

@DrahtBot DrahtBot changed the title util: Abort on failing CHECK_NONFATAL in debug builds util: Abort on failing CHECK_NONFATAL in debug builds May 22, 2025
@maflcko maflcko force-pushed the 2505-abort-debug-check-nonfatal branch 2 times, most recently from fadb1c8 to fa033fb Compare May 22, 2025 14:34
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Task multiprocess, i686, DEBUG: https://github.com/bitcoin/bitcoin/runs/42714583822
LLM reason (✨ experimental): The CI failure is due to the "rpc_tests" subprocess aborting during the test execution.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@ryanofsky
Copy link
Contributor

Concept ACK. Nice idea, and it does seem useful to have a macro checking for unexpected but not very serious conditions by throwing an exception that gets reported in release builds but is a fatal error in debug builds. And current uses of the macro seem like good candidates for that behavior.

The only possible issues I see are:

(1) The name CHECK_NONFATAL doesn't make a lot of sense anymore, now triggering fatal errors when it literally says "nonfatal" in the name.
(2) It is now more cumbersome to write unit tests checking for these conditions since they require a release build to run.

Both could be addressed in followups. Issue (2) could be addressed by having a g_abort_hook or similar hook allowing specific unit tests to write custom code to check for these errors if they want. (This could also be used to replace the g_debug_lockorder_abort variable which does something similar.)

IMO, issue (1) would be nice to address by coming up with a better designed set of checking macros and starting to use them. I think it could be good to have a:

  • CHECK to check conditions and abort if false in all builds
  • DCHECK to do the same but be compiled out of release builds,
  • CHECK_LOG to log an "internal bug detected please report" type log message in release builds, and abort in debug builds
  • CHECK_THROW to throw an exception in release builds, and abort in log builds.

Then, current assert uses could become CHECK, current Assume uses in hotspots could become DCHECK, majority of other Assume uses could become CHECK_LOG, and current CHECK_NONFATAL uses could become CHECK_THROW.

Just a thought though. Maybe current names are not a real problem, and naming shouldn't block this PR in any case.

@maflcko maflcko force-pushed the 2505-abort-debug-check-nonfatal branch from fa033fb to faae9a2 Compare May 23, 2025 06:32
@maflcko
Copy link
Member Author

maflcko commented May 23, 2025

CHECK_THROW

I don't think this solves issue (1). Instead of NONFATAL being inaccurately named in debug builds, it will now be THROW, because the check is neither nonfatal nor throwing in debug builds.

Then, current assert uses could become CHECK, current Assume uses in hotspots could become DCHECK, majority of other Assume uses could become CHECK_LOG, and current CHECK_NONFATAL uses could become CHECK_THROW.

No objection, just mentioning that this would be a larger diff (including link-time changes, which are for some reason more involved in this area (#26688 (comment), #32543 (comment))), so a separate discussion/issue/pull seems better.

cumbersome to write unit tests

Thx, pushed a commit to fix this.

@achow101
Copy link
Member

achow101 commented Jun 2, 2025

Concept ACK

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK faae9a2. I think this is a good change. It makes sense conceptually to have check macros that always abort in debug builds, but do different things depending on cost of the check & severity of the error in release builds.

re: #32588 (comment)

I don't think this solves issue (1). Instead of NONFATAL being inaccurately named in debug builds, it will now be THROW, because the check is neither nonfatal nor throwing in debug builds.

IMO it does solve it, because the current issue is that the macro is literally doing the thing its name says it will not do (trigger a fatal error). By contrast, I don't think it is a problem for function name to just describe its primary purpose and not everything else it may do. No need to solve everything here though. Current PR seems like a step forward.

@DrahtBot DrahtBot requested a review from achow101 June 2, 2025 20:44
@maflcko maflcko force-pushed the 2505-abort-debug-check-nonfatal branch from faae9a2 to fa68bda Compare July 22, 2025 14:22
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK fa68bda. Looks good! Since last review added some more test coverage and reverted some unneeded changes.

This does not change behavior, but documents that
G_ABORT_ON_FAILED_ASSUME is set when G_FUZZING_BUILD.
@maflcko maflcko force-pushed the 2505-abort-debug-check-nonfatal branch 2 times, most recently from fae1423 to faabcc0 Compare July 24, 2025 11:34
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Task lint: https://github.com/bitcoin/bitcoin/runs/46638610833
LLM reason (✨ experimental): The CI failure is caused by a lint error due to Python code issues identified by ruff.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@maflcko maflcko force-pushed the 2505-abort-debug-check-nonfatal branch 2 times, most recently from fa54573 to fac50ee Compare July 24, 2025 12:17
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK fac50ee, just rebased, added back functional test, and tweaked fuzz test since last review.

Overall this looks good and conceptually I like this change because it makes all the checking macros do the exact same thing in debug builds and abort, only having varying behavior in release builds.

This allows specific tests to mock the check behavior to consistently
use exceptions instead of aborts for intentionally failing checks in all
build configurations.
@maflcko maflcko force-pushed the 2505-abort-debug-check-nonfatal branch 2 times, most recently from fa3f413 to fa27b23 Compare July 25, 2025 06:39
This requires adjusting some tests to force exceptions over aborts, or
accept either exceptions or aborts.

Also, remove a fuzz test in integer.cpp that is mostly redundant with
the unit test added in the prior commit.
@maflcko maflcko force-pushed the 2505-abort-debug-check-nonfatal branch from fa27b23 to fa37153 Compare July 25, 2025 06:44
Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK fa37153, just catching subprocess.CalledProcessError in test fixing up a comment since last review

@achow101
Copy link
Member

ACK fa37153

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants