Conversation

dergoegge
Member

dergoegge commented Apr 5, 2021

This picks up #19995 in an attempt to solve #21559.

The goal of this PR is to mitigate disk filling attacks by temporarily rate limiting LogPrintf(…).

A disk fill attack is an attack in which an untrusted party (such as a peer) is able to cheaply make your node log to disk excessively. The excessive logging may fill your disk and thus make your node crash, either cleanly (best case: the disk fills relatively slowly) or uncleanly (worst case: the disk fills relatively fast).

Approach

The hourly logging quota is set per source location. Each source location (say net_processing.cpp:418) gets a quota of 1 MB of logging per hour. Logging is only dropped from source locations that exceed the quota. (A sketch of the bookkeeping this implies follows the list below.)

  • Only logging to disk is rate limited. Logging to console is not rate limited.
  • Only LogPrintf(…) is rate limited. LogPrint(category, …) (-debug) is not rate limited.
  • UpdateTip: new best=[…] is logged using LogPrintfWithoutRateLimiting(…) to avoid rate limiting. High log volume is expected for that source location during IBD.
  • When logging is restarted a tally of how many messages/bytes were dropped is printed.
  • Log rate limiting can be disabled with the -ratelimitlogging config option.
  • All logs will be prefixed with [*] if there is at least one source location that is currently being suppressed.
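
To illustrate the bookkeeping, here is a minimal sketch of a per-source-location, fixed-window limiter. All names here (LogRateLimiter, SourceQuota, Consume) are hypothetical and for illustration only; the actual implementation in this PR is BCLog::LogRatelimiter.

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-location tally for the current one-hour window.
struct SourceQuota {
    std::chrono::steady_clock::time_point window_start{};
    uint64_t bytes_logged{0};
    uint64_t messages_dropped{0};
};

class LogRateLimiter {
    static constexpr uint64_t MAX_BYTES_PER_HOUR{1024 * 1024}; // 1 MiB per source location
    static constexpr std::chrono::hours WINDOW{1};
    std::map<std::string, SourceQuota> m_quotas; // keyed by "file:line"

public:
    // Returns true if a message of the given size from this source location
    // may still be written to disk in the current window.
    bool Consume(const std::string& source_location, size_t message_bytes)
    {
        const auto now{std::chrono::steady_clock::now()};
        SourceQuota& quota{m_quotas[source_location]};
        if (now - quota.window_start >= WINDOW) {
            // New window: reset the tally. (The PR prints how many
            // messages/bytes were dropped when logging resumes.)
            quota = SourceQuota{};
            quota.window_start = now;
        }
        quota.bytes_logged += message_bytes;
        if (quota.bytes_logged > MAX_BYTES_PER_HOUR) {
            ++quota.messages_dropped;
            return false; // suppress: this location exceeded its quota
        }
        return true;
    }
};

A LogPrintf wrapper would call Consume() with its __FILE__/__LINE__ pair and skip the disk write whenever it returns false.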

Alternatives

In a recent PR Review Club meeting we discussed this PR and evaluated alternative approaches.

Global rate limiting

There was some discussion around an alternative approach, which would be to globally rate limit LogPrintf(…). This approach was implemented in a competing PR #21706. The main point [1] in favor of global rate limiting is that it could be confusing to have incomplete logs when logs from one source location are dropped but not from others. The main point against global rate limiting is that it opens another attack vector: an attacker could trigger rate limiting and then execute a second attack, which would then not be documented in the logs at all. Given that attack vector on the global approach, and the overall reviews the two approaches have received, I have chosen to continue with the approach in this PR.
(To address the concern of [1] I have chosen to prefix all logs with [*] if there is at least one source location that is currently being suppressed.)

logrotate

LarryRuane brought up logrotate, which can be used on Linux to monitor log files and perform actions on them (delete, compress, send somewhere, etc.) when a certain size or point in time is reached. logrotate could be used to rate limit logs, but since we would like log rate limiting to be enabled by default it is not the best solution. Also, on Windows an alternative would have to be used.
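
For reference, a minimal logrotate stanza along these lines might look as follows (the path is illustrative):

/home/user/.bitcoin/debug.log {
    size 100M
    rotate 5
    compress
    # truncate in place so bitcoind can keep writing to the same open handle
    copytruncate
}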

Testing

I have written a unit test which is contained in this PR. (Not the greatest test; it currently does not work on Windows thanks to \r\n. If someone has an idea of how to improve the test, I would appreciate it.)

Furthermore, I made an RPC here that can log excessive amounts of "a"s from different locations, which I have found useful during my own testing.
bitcoin-cli excessivelog <location (1-5)> <num_bytes>

⚠️ One thing to note with that RPC is that the rate limiting logic still prints the last message that triggered the limiting to disk, so something like bitcoin-cli excessivelog 1 536870912 would still log ~512 MiB to disk. Logging to console is also never suppressed (unless -printtoconsole=0). ⚠️

A simple example to use the rpc:

bitcoin-cli -regtest setmocktime 1
bitcoin-cli -regtest excessivelog 1 1048500 # log just under 1MiB
bitcoin-cli -regtest excessivelog 1 100 # this should get the total amount logged above 1MiB
                                        # and the rate limiting logic should kick in
bitcoin-cli -regtest setmocktime 3602 # let 1 hour pass (only works in regtest)
bitcoin-cli -regtest excessivelog 1 100 # this should trigger logging to resume

@dergoegge
Member Author

In response to @jnewbery's comment

I think I'd prefer log-shedding to be done globally, with a special log at the end of a log suppression period that tallies up which log locations have been suppressed.

I disagree on the global shedding; I think the source-location-based shedding is overall the better approach.
I agree with your point that it could be confusing to drop some logs in the context of user issues, but shedding should only kick in if there is an attack or something is horribly wrong, so I would think that almost all logs provided by users would be complete. We can also always tell if rate limiting is taking place, since we log when it starts/stops, and in those cases a partial log is better than no log.

(This is obviously mostly an opinion and I am totally willing to be convinced of global log-shedding if enough reviewers favor it.)

@DrahtBot
Contributor

DrahtBot commented Apr 5, 2021

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #25306 (logging: add LogPrintfCategory to log unconditionally with category by jonatack)
  • #25287 (logging: threshold log level by klementtan)
  • #25203 (logging: update to severity-based logging by jonatack)
  • #24950 (Add config option to set max debug log size by tehelsper)
  • #23443 (p2p: Erlay support signaling by naumenkogs)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@jnewbery
Contributor

jnewbery commented Apr 6, 2021

shedding should only kick in if there is an attack or something is horribly wrong, so I would think that almost all logs provided by users would be complete.

Right, and those cases are exactly where we need the best possible logs.

We can also always tell if rate limiting is taking place, since we log when it starts/stops, and in those cases a partial log is better than no log.

I guarantee you that not everyone will know this and most people will miss a single log line saying "rate-limiting log x".

I can only speak from my own experience of supporting network equipment for several years - partial logs are often worse than no logs at all since they're so misleading.

@practicalswift
Contributor

Strongest possible Concept ACK: this mitigation will kill an entire bug class (the "disk fill via logging" bug class).

Thanks for taking up this work @dergoegge.

Will test and review.

@practicalswift
Contributor

practicalswift commented Apr 6, 2021

@jnewbery

I think I'd prefer log-shedding to be done globally, with a special log at the end of a log suppression period that tallies up which log locations have been suppressed.

Clarifying questions to fully understand what is suggested:

Let's assume that log location A is a misplaced LogPrintf in a code path that the attacker can make us take.

In the case of an attacker using log location A to fill our disk by making us log from there repeatedly, is the suggestion then that all log locations should be suppressed during the suppression period (instead of suppressing only the "attacker controlled" log location A)?

To make your suggestion clear: could you exemplify what the special log entry at the end of the log suppression period would look like? (A patch or even a separate PR would be even better, but an example log entry would probably clarify enough!)

@practicalswift
Contributor

@dergoegge

Thanks for your work on this PR!

Would you be willing to also implement @jnewbery's suggestion as a separate PR?

I tried to summarise his suggestion in #21603 (comment). I'm not certain I got it right though - @jnewbery, feel free to chime in :)

I think it makes sense to do it as a separate PR since it deviates in important ways from the original suggestion. By having two separate PRs we would see which of the two approaches to addressing this type of attack has consensus support.

Personally I think I could live with both approaches: as long as we kill the disk-fill-via-logging bug class sooner rather than later I'm happy :)

@dergoegge
Member Author

Would you be willing to also implement jnewbery's suggestion as a separate PR?

@practicalswift Sure, that's a good idea :)

@rebroad
Contributor

rebroad commented Apr 17, 2021

Concept ACK (not that I'm sure my opinion is relevant), and I think this makes more sense than the global alternative.

@DrahtBot
Contributor

DrahtBot commented May 3, 2021

🕵️ @sipa has been requested to review this pull request as specified in the REVIEWERS file.

@practicalswift
Contributor

@dergoegge Thanks for working on this. Would you mind rebasing? I would like to review the updated version :)

dergoegge added 6 commits May 30, 2022 15:17
To mitigate disk filling attacks caused by unsafe usages of LogPrintf,
we rate limit LogPrintf by using the fixed window rate limiter
(BCLog::LogRatelimiter) introduced in an earlier commit.

The rate limiting logic is applied per source location instead of
globally. A source location is allowed to log up to 1 MiB per hour.
Source locations that violate the limit will have their logs suppressed
for up to one hour.

We mark ~DebugLogHelper as noexcept(false) to be able to catch the
exception it throws. This lets us use it in tests in combination with
BOOST_CHECK_THROW and BOOST_CHECK_NO_THROW to check that certain log
messages are (not) logged.
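
As an illustration of why noexcept(false) is needed, here is a simplified stand-in for DebugLogHelper that only mirrors the destructor-throwing pattern (the real helper lives in Bitcoin Core's test utilities and differs in detail):

#include <stdexcept>
#include <string>

class DebugLogHelper {
    std::string m_message;
    bool m_found{false};

public:
    explicit DebugLogHelper(std::string message) : m_message{std::move(message)} {}
    void MarkFound() { m_found = true; } // would be invoked by a log callback on a match

    // Destructors are implicitly noexcept; without noexcept(false), throwing
    // here would call std::terminate instead of reaching the test macro.
    ~DebugLogHelper() noexcept(false)
    {
        if (!m_found) throw std::runtime_error(m_message + " not found in debug log");
    }
};

// A test can then assert that a message was suppressed or logged, e.g.:
//   BOOST_CHECK_THROW({ DebugLogHelper h{"over-quota msg"}; /* log it */ }, std::runtime_error);
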
@dergoegge
Member Author

It looks like many of the LogPrintf messages will be going away with the update to severity-based logging and the added complexity here may be worth reevaluating (and potentially avoiding) after that.

@jonatack IMO removing some of the LogPrintf locations (or assigning them a severity level which is not logged by default) does not really address the issue at its root, as this PR is trying to do. Any remaining default logging could still cause issues, as could any logging that is introduced in the future.

@jonatack
Member

jonatack commented May 30, 2022

What I am working on is ensuring that no logging that is on by default would be externally provokable (and potentially removing LogPrintf). That, along with reducing default logging in general, has appeared to be the desired direction for some time now AFAICS. Of course, this could still be merged in the interim and then removed if no longer needed, but I'm targeting the next release.

@jnewbery
Contributor

reACK c86d5ff

@mzumsande
Contributor

What I am working on is ensuring that no logging that is on by default would be externally provokable (and potentially removing LogPrintf). That, along with reducing default logging in general, has appeared to be the desired direction for some time now AFAICS. Of course, this could still be merged in the interim and then removed if no longer needed, but I'm targeting the next release.

With bd971bf, category-based logging with a severity >= BCLog::Level::Warning is equivalent to LogPrintf from the viewpoint of disk filling attacks.
I think that it would be really good to have this general countermeasure in place, especially with probable future PRs that convert more existing log messages to the new format, because we'll now have to be careful about not adding category-based warnings or errors that could be triggered externally. The old rule, which is no longer valid but probably still in many contributors' heads, was that one only needs to be careful about LogPrintf, while category-based logging is not so critical.

@jonatack
Member

jonatack commented May 31, 2022

convert more existing log messages to the new format, because we'll now have to be careful about not adding category-based warnings or errors that could be triggered externally

Yes, doing precisely this right now for all of them. Though, assuming people are in favor of that, reviewers might prefer extending this rate limiting to non-default logging; otherwise it can always still be removed.

@maflcko
Member

maflcko commented May 31, 2022

Is it possible to convert all of them? I think we currently don't have a way to express a log category with an unconditional log severity, unless you want to use Warning/Error for "unconditional".

@jonatack
Member

jonatack commented May 31, 2022

@MarcoFalke added that in #25203.

@mzumsande
Contributor

Though, assuming people are in favor of that, reviewers might prefer extending this rate limiting to non-default logging.

Good point, I think that this would definitely be useful (here or as a follow-up) - having a rate limit only for LogPrintf but not for unconditionally logged category-based warnings and errors would be confusing.

@jonatack
Member

jonatack commented May 31, 2022

Good point, I think that this would definitely be useful (here or as a follow-up) - having a rate limit only for LogPrintf but not for unconditionally logged category-based warnings and errors would be confusing.

I agree; though as an example I just updated the (copious) net processing logging in #25203, and there are only a few remaining unconditional messages (user-controlled or unusual):

src/net_processing.cpp:2305:                LogPrintLevel(BCLog::NET, BCLog::Level::Warning, "Large reorg, won't direct fetch to %s (%d)\n",
src/net_processing.cpp:2702:            LogPrintLevel(BCLog::NET, BCLog::Level::Warning, "Connected to self at %s, disconnecting\n", pfrom.addr.ToString());
src/net_processing.cpp:3425:                    LogPrintLevel(BCLog::NET, BCLog::Level::Warning, "Not relaying non-mempool transaction %s from forcerelay peer=%d\n", tx.GetHash().ToString(), pfrom.GetId());
src/net_processing.cpp:4190:        LogPrintLevel(BCLog::NET, BCLog::Level::Warning, "Not punishing misbehaving peer %d because it has noban permission set!\n", peer.m_id);
src/net_processing.cpp:4196:        LogPrintLevel(BCLog::NET, BCLog::Level::Warning, "Not punishing misbehaving peer %d because it was manually connected!\n", peer.m_id);
src/net_processing.cpp:4873:                        LogPrintLevel(BCLog::NET, BCLog::Level::Warning, "Announcing block %s not on main chain (tip=%s)\n",

@dergoegge
Member Author

having a rate limit only for LogPrintf but not for unconditionally logged category-based warnings and errors would be confusing.

I also agree with this and want to add that to this PR.

The two new categories I introduce here (UNCONDITIONAL_ALWAYS and UNCONDITIONAL_RATE_LIMITED) don't really make sense anymore, given that we now also have category-based unconditional logging. Whether a log location should be rate limited could be based purely on the severity level.
I was thinking of the following: adding an extra severity Always which is logged unconditionally without rate limiting (e.g. for this location) and rate limiting logging to disk if severity == None || (severity >= Warning && severity < Always). Additionally, categories specified by -debug=cat should still be logged unconditionally without rate limiting. @mzumsande @jonatack @MarcoFalke what do y'all think of that?
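
For concreteness, a rough sketch of that predicate (the enum layout and ShouldRateLimit are hypothetical, not code from this PR):

// Illustrative severity ordering; None models calls that carry no level.
enum class Level { Trace, Debug, Info, Warning, Error, Always, None };

// Proposed rule: rate limit unconditional disk logging unless the location
// is explicitly exempt (Always) or its category was enabled via -debug=<cat>.
bool ShouldRateLimit(Level severity, bool category_enabled_via_debug)
{
    if (category_enabled_via_debug) return false; // -debug=cat: never rate limited
    return severity == Level::None ||
           (severity >= Level::Warning && severity < Level::Always);
}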

@DrahtBot
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a "draft".

@mzumsande
Contributor

mzumsande commented Jun 22, 2022

I was thinking of the following: adding an extra severity Always which is logged unconditionally without rate limiting (e.g. for this location) and rate limiting logging to disk if severity == None || (severity >= Warning && severity < Always). Additionally, categories specified by -debug=cat should still be logged unconditionally without rate limiting. @mzumsande @jonatack @MarcoFalke what do y'all think of that?

So, it seems to me that with the recent changes, we currently have

  1. Unconditional logging with no level and no category (LogPrintf)
  2. Conditional logging with category but no level (LogPrint) - not relevant to this PR
  3. Unconditional logging with category but no level (LogPrintfCategory)
  4. Logging with category and level, which may be unconditional depending on the level (LogPrintLevel)

The "missing" combination of logging by level but without a category is currently not possible as far as I can see.

If there were a use case for "Always"-logged messages also having a severity like "Warning" or "Error", we'd need another dimension; but I think that at least for the current uses this is not the case, so your suggestion makes sense to me.

However, I'm really not sure how stable the current state of the changes to the logging framework is; it seems very much in flux, and I'm a little bit confused about what direction it is heading in.

@maflcko
Member

maflcko commented Jun 22, 2022

If there is a log category with Info, which is also the default level, you wouldn't need an Always level?

@mzumsande
Contributor

If there is a log category with Info, which is also the default level, you wouldn't need an Always level?

I don't see how - the "Always" level is meant for noisy unconditional logs that can hit the rate limit, but are so important that they should be exempt from it - basically only UpdateTipLog during IBD.

@jonatack
Member

jonatack commented Jun 23, 2022

I'm about to push an update to #25203 to incorporate #25287 and the changes discussed in #25306, and the current approach is that the None severity level is only used internally, i.e. in src/logging.{h,cpp}, and moved to the end of the enum to permit iterating through it, similar to GetNetworkNames() / NET_MAX.

@jonatack
Member

jonatack commented Jun 23, 2022

The logging you might be worried about limiting would be, with the current direction, IIRC:

  • LogPrintf
  • LogPrintfCategory
  • LogPrintLevel with level Info/Warning/Error (these levels are planned to be logged unconditionally, while Debug and Trace would not)

@dergoegge
Member Author

Haven't had the time to maintain this. I still think some general mitigation for disk-filling would be nice, so I might pick this up again later unless someone else beats me to it. Closing for now.

dergoegge closed this Oct 7, 2022
bitcoin locked and limited conversation to collaborators Oct 7, 2023