
Conversation

practicalswift
Contributor

@practicalswift practicalswift commented Sep 22, 2020

Mitigate disk filling attacks by temporarily rate limiting LogPrintf(…).

A disk fill attack is an attack where an untrusted party (such as a peer) is able to cheaply make your node log to disk excessively. The excessive logging may fill your disk and thus make your node crash either cleanly (best case: if disk fill rate is relatively slow) or uncleanly (worst case: if disk fill rate is relatively fast).

The hourly logging quota is set per source location. Every single source location (say net_processing.cpp:418) gets a quota of 1 MB of logging per hour.

Notes:

  • Only logging to disk is rate limited. Logging to console is not rate limited.
  • Only LogPrintf(…) is rate limited. LogPrint(category, …) (-debug) is not rate limited.
  • UpdateTip: new best=[…] is logged using LogPrintfWithoutRateLimiting(…) to avoid rate limiting. High log volume is expected for that source location during IBD.
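
To make the mechanism concrete, here is a minimal sketch (hypothetical names, not the code in this PR) of how per-source-location accounting could work; the snippets quoted later in this thread show the actual patch keeps equivalent bookkeeping (m_bytes_logged_per_source_location) inside the logger itself:

// Minimal sketch of per-source-location rate limiting. All identifiers are
// illustrative; the real patch integrates this into the existing logger.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

constexpr uint64_t HOURLY_LOG_QUOTA_BYTES{1024 * 1024}; // 1 MiB per source location per hour

class RateLimitSketch
{
    using SourceLocation = std::pair<std::string, int>; // {file, line}
    std::map<SourceLocation, uint64_t> m_bytes_logged;
    std::chrono::steady_clock::time_point m_window_start{std::chrono::steady_clock::now()};

public:
    //! Return true if a message of msg_size bytes from loc may still be written
    //! to disk in the current one-hour window.
    bool ShouldLogToDisk(const SourceLocation& loc, size_t msg_size)
    {
        const auto now{std::chrono::steady_clock::now()};
        if (now - m_window_start > std::chrono::hours{1}) {
            m_bytes_logged.clear(); // start a fresh one-hour window
            m_window_start = now;
        }
        return (m_bytes_logged[loc] += msg_size) <= HOURLY_LOG_QUOTA_BYTES;
    }
};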

Live demo:

$ git diff
diff --git a/src/init.cpp b/src/init.cpp
index 023aa9aba..56a250876 100644
--- a/src/init.cpp
+++ b/src/init.cpp
@@ -1232,6 +1232,13 @@ bool AppInitInterfaces(NodeContext& node)
     return true;
 }
 
+void SimulateDiskFillingAttack() {
+    uint64_t n = 0;
+    while (true) {
+        LogPrintf("Chancellor on brink of %d:th bailout for banks. LogPrintf on brink of first disk fill for node.\n", ++n);
+    }
+}
+
 bool AppInitMain(const util::Ref& context, NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
 {
     const ArgsManager& args = *Assert(node.args);
@@ -1845,6 +1852,8 @@ bool AppInitMain(const util::Ref& context, NodeContext& node, interfaces::BlockA
     }
 #endif
 
+    std::thread{SimulateDiskFillingAttack}.detach();
+
     std::vector<fs::path> vImportFiles;
     for (const std::string& strFile : args.GetArgs("-loadblock")) {
         vImportFiles.push_back(strFile);
$ make -C src/ bitcoind
$ src/bitcoind
$ tail debug.log
2020-09-22T16:11:36Z Chancellor on brink of 8813:th bailout for banks. LogPrintf on brink of first disk fill for node.
2020-09-22T16:11:36Z Chancellor on brink of 8814:th bailout for banks. LogPrintf on brink of first disk fill for node.
2020-09-22T16:11:36Z Chancellor on brink of 8815:th bailout for banks. LogPrintf on brink of first disk fill for node.
2020-09-22T16:11:36Z Chancellor on brink of 8816:th bailout for banks. LogPrintf on brink of first disk fill for node.
2020-09-22T16:11:36Z Chancellor on brink of 8817:th bailout for banks. LogPrintf on brink of first disk fill for node.
2020-09-22T16:11:36Z Chancellor on brink of 8818:th bailout for banks. LogPrintf on brink of first disk fill for node.
2020-09-22T16:11:36Z Chancellor on brink of 8819:th bailout for banks. LogPrintf on brink of first disk fill for node.
2020-09-22T16:11:36Z Chancellor on brink of 8820:th bailout for banks. LogPrintf on brink of first disk fill for node.
2020-09-22T16:11:36Z Excessive logging detected from init.cpp:1238 (SimulateDiskFillingAttack): >1 MiB logged during the last hour. Suppressing logging to disk from this source location for up to one hour. Console logging unaffected. Last log entry: 2020-09-22T16:11:36Z Chancellor on brink of 8821:th bailout for banks. LogPrintf on brink of first disk fill for node.
2020-09-22T16:12:37Z Adding fixed seed nodes as DNS doesn't seem to be available.
$ ls -hl debug.log
-rw------- 1     1.1M Sep 22 16:12 debug.log

@DrahtBot
Contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@naumenkogs
Member

Concept ACK. I've wondered about this threat several times over the last year, but never managed to look into it for real.

@laanwj
Member

laanwj commented Sep 23, 2020

I'm not sure about this. This kind of "drop everything" rate limiting can be used for a different kind of DoS that might be even more serious than just shutting down: preventing important things from being logged.
Edit: I misunderstood the code here, it tracks per source location. Concept ACK.

@practicalswift practicalswift changed the title log: Mitigate disk filling attacks by rate limiting LogPrintf(…) log: Mitigate disk filling attacks by temporarily rate limiting LogPrintf(…) Sep 23, 2020
@practicalswift
Contributor Author

practicalswift commented Sep 23, 2020

@laanwj

Have you looked at the implementation? It certainly doesn't drop everything :)

The implementation was carefully written to minimise the risk of suppressing something important from being logged while at the same time killing the disk-fill-via-peer-triggered-logging bug class for good.

Do you have any ideas on how to improve the implementation in a way to reduce the "prevent important things from being logged" risk?

Perhaps this implementation isn't the way to go but I really think we need some kind of general mitigation for disk fill attacks. This is a non-theoretical DoS vector which has been problematic for Bitcoin Core historically.

@naumenkogs
Member

naumenkogs commented Sep 23, 2020

I don't know how hard a task that would be, but it would be great to see how easy it is to fill disk space depending on the protocol piece (the message type, but also the context).
For example,
tx INV: X bytes per second
ADDR: Y bytes per second
GETADDR: Z bytes per second

From there we could make sure our rate-limiting is effective?

@practicalswift
Contributor Author

practicalswift commented Sep 23, 2020

@naumenkogs

When testing this patch, I reverted the fix for a historic disk fill bug to see whether this mitigation would have been successful.

In that case an untrusted peer could achieve a disk fill rate of 15 GB/hour without this mitigation, and the expected 1 MB/hour with this mitigation. (The 15 GB/hour was achieved on a really old machine with mechanical disks so YMMV.)

In other words: the mitigation slowed that specific attack by a factor of 15 000 :)

Note that the quota is set per source location. So every single source location (say net_processing.cpp:418) gets a quota of 1 MB of logging per hour. I think we could bump that to 10 MB/hour or even 100 MB/hour and still largely kill this bug class.

Perhaps I misunderstood what you meant by the INV/ADDR/GETADDR bytes per second thing, but we only log per such message via LogPrint(CATEGORY, …) (-debug). LogPrint(CATEGORY, …) is intentionally not covered by the rate limiting, since end-users opting in to -debug are assumed to know that doing so may cause excessive disk usage :)

The goal of this PR is to kill this bug class for a bitcoind node running with default settings (no -debug).

if (print_quota_state_change_message) {
    str_prefixed = LogTimestampStr(strprintf("Excessive logging detected from %s (%s): >%d MiB logged during the last hour. Suppressing logging to disk from this source location for up to one hour. Console logging unaffected. Last log entry: %s", source_location, logging_function, HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION / (1024 * 1024), str_prefixed));
} else {
    skip_writing_to_disk_due_to_rate_limiting = true;
Member


Oh, sorry, I thought this was a global flag.

@practicalswift practicalswift force-pushed the mitigate-log-disk-filling-attacks branch 3 times, most recently from 9dfad47 to 9177339 on September 24, 2020 at 20:02
@practicalswift
Contributor Author

@laanwj Thanks for the Concept ACK! :)

Feedback addressed.

@practicalswift practicalswift force-pushed the mitigate-log-disk-filling-attacks branch 3 times, most recently from 9603200 to 0b402ae on September 25, 2020 at 09:26
@practicalswift practicalswift force-pushed the mitigate-log-disk-filling-attacks branch from 0b402ae to a86ab57 on September 25, 2020 at 14:24
// MUST NOT be rate limited no matter how often they are logged. That requirement
// should be extremely rare, so please use with care. Prefer LogPrintf(...) if
// possible.
#define LogPrintfWithoutRateLimiting(...) LogPrintf_(__func__, __FILE__, __LINE__, /* skip_disk_usage_rate_limiting */ true, __VA_ARGS__)
Member


This name is really awkward. I would prefer a more compact name, but don't really have any good suggestions.

Contributor Author


Heh, I agree :) Naming is hard! I'm happy to change the name of course.

I expect LogPrintfWithoutRateLimiting to be used once: for UpdateTip which is the only non-category log message which we ever expect to legitimately log at a rate of >1 MB/hour (or whatever hourly source-location threshold we choose).

One alternative would be to call LogPrintf_ directly from UpdateTip as it is a one-off. That would solve the naming problem too :)
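
For illustration, given the macro quoted above, such a one-off direct call from UpdateTip might look roughly like this (the variable names and format string are illustrative, not the PR's actual code):

// Hypothetical one-off direct call, bypassing rate limiting for the expected
// high-volume UpdateTip log line only:
LogPrintf_(__func__, __FILE__, __LINE__, /* skip_disk_usage_rate_limiting */ true,
           "UpdateTip: new best=%s height=%d\n", pindexNew->GetBlockHash().ToString(), pindexNew->nHeight);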

@laanwj
Member

laanwj commented Sep 30, 2020

$ size -G src/bitcoind*
      text       data        bss      total filename
   6454125    1929533      38929    8422587 src/bitcoind
   6564925    1942029      38929    8545883 src/bitcoind.19995

FWIW, this results in a 120 kB increase in stripped binary size for bitcoind on x86_64. In contrast to what I first thought, this is not because of the __FILE__. There's only 12 kB more in strings. The rest is extra instructions in .text to pass the line number to every logging call.

(still concept ACK though, and thanks for addressing my comments, code changes LGTM now)

Contributor

@jnewbery jnewbery left a comment


Concept ACK.

Log shedding needs to be done very carefully, since logs are usually the best data we have when diagnosing user issues. I'm not convinced that shedding logs per location is a good idea. Partial logs are often worse than no logs at all and can be very misleading ("it logged this and then it didn't log this, so it can't have taken this code branch"). Even with a special "Suppressing logging to disk from this source location" log, it's very easy to be misled unless you know exactly what you're looking for.

I think I'd prefer log-shedding to be done globally, with a special log at the end of a log suppression period that tallies up which log locations have been suppressed.
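
For comparison, a rough sketch of that globally-shed alternative (hypothetical names, not code from this PR): one global hourly budget, per-location drop counters, and a single summary line when the suppression window ends.

// Hypothetical global log shedding with an end-of-window tally. Identifiers
// are invented for illustration only.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

class GlobalLogShedder
{
    static constexpr uint64_t GLOBAL_HOURLY_QUOTA_BYTES{10 * 1024 * 1024};
    uint64_t m_bytes_logged{0};
    std::map<std::string, uint64_t> m_dropped_per_location;
    std::chrono::steady_clock::time_point m_window_start{std::chrono::steady_clock::now()};

public:
    //! Return true if the message may be written to disk. When a window in
    //! which messages were dropped ends, `summary` is filled with the tally.
    bool ShouldLogToDisk(const std::string& source_location, size_t msg_size, std::string& summary)
    {
        const auto now{std::chrono::steady_clock::now()};
        if (now - m_window_start > std::chrono::hours{1}) {
            if (!m_dropped_per_location.empty()) {
                summary = "Logging restarted. Messages dropped during the last hour:";
                for (const auto& [loc, n] : m_dropped_per_location) {
                    summary += " " + loc + "=" + std::to_string(n);
                }
            }
            m_dropped_per_location.clear();
            m_bytes_logged = 0;
            m_window_start = now;
        }
        m_bytes_logged += msg_size;
        if (m_bytes_logged > GLOBAL_HOURLY_QUOTA_BYTES) {
            ++m_dropped_per_location[source_location];
            return false;
        }
        return true;
    }
};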

const std::chrono::seconds now = GetTime<std::chrono::seconds>();
if ((now - m_last_reset_of_bytes_logged_per_source_location) > std::chrono::hours{1}) {
    m_bytes_logged_per_source_location.clear();
    m_last_reset_of_bytes_logged_per_source_location = now;
Contributor


It'd be good to check here whether logging was suppressed for this log location, and if so, log that logging is restarting for it (e.g. "Restarting logging from %s ...")

Contributor Author


That's a good idea. I'll implement!

if (print_quota_state_change_message) {
    str_prefixed = LogTimestampStr(strprintf("Excessive logging detected from %s:%d (%s): >%d MiB logged during the last hour. Suppressing logging to disk from this source location for up to one hour. Console logging unaffected. Last log entry: %s", source_location.first, source_location.second, logging_function, HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION / (1024 * 1024), str_prefixed));
} else {
    skip_writing_to_disk_due_to_rate_limiting = true;
Contributor


Consider keeping a tally of how many log messages were dropped, and then logging that number when you restart logging (e.g. "Restarting logging from <location>. <number> log messages dropped since <time>")

Contributor Author


Makes sense. I'll implement.
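
A possible sketch of both suggestions folded into the hourly reset quoted above (m_messages_dropped_per_source_location and restart_messages are assumed helpers, not present in the PR as quoted):

// Hypothetical extension of the hourly reset: announce, per previously
// suppressed location, that logging restarts and how many messages were dropped.
const std::chrono::seconds now = GetTime<std::chrono::seconds>();
if ((now - m_last_reset_of_bytes_logged_per_source_location) > std::chrono::hours{1}) {
    std::string restart_messages; // accumulated and written ahead of the current log entry
    for (const auto& [location, dropped] : m_messages_dropped_per_source_location) {
        // e.g. "Restarting logging from net_processing.cpp:418. 1234 log messages dropped during the last hour."
        restart_messages += LogTimestampStr(strprintf("Restarting logging from %s:%d. %d log messages dropped during the last hour.\n", location.first, location.second, dropped));
    }
    m_messages_dropped_per_source_location.clear();
    m_bytes_logged_per_source_location.clear();
    m_last_reset_of_bytes_logged_per_source_location = now;
}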

@maflcko
Member

maflcko commented Oct 13, 2020

Conceptually, I am not sure if it is good to fight the symptoms. I'd prefer if unconditional logs were only used for the init/shutdown sequence and local system errors, such as data corruption. Anything else the average user probably doesn't care about, and if they did, they could enable the corresponding debug category and provide enough disk space for the debug log file.

@practicalswift
Contributor Author

practicalswift commented Oct 13, 2020

@MarcoFalke

Conceptually, I am not sure if it is good to fight the symptoms.

I fail to see how introducing a mitigation would be to "fight the symptoms".

The reason we use ASLR for example isn't that we've given up on the ambition to write non-vulnerable code :)

The reason we use mitigations is to make exploitation harder in case we fail to write vulnerability-free code, and thus hopefully turn a vulnerability into a normal bug.

Have you looked at the implementation in this PR? :)

The mitigation logic would only kick in under extreme circumstances, meaning that the only observable difference from a user's perspective is that their node would be vulnerable to disk fill attacks without the mitigation in place, whereas they would be safe with it.

Note that this is not a theoretical issue: we've had exploitable disk filling vulnerabilities in the past which this mitigation would have successfully turned into just another non-exploitable non-security bug.

I'd prefer if unconditional logs were only used for the init/shutdown sequence and local system errors, such as data corruption. Anything else the average user probably doesn't care about, and if they did, they could enable the corresponding debug category and provide enough disk space for the debug log file.

I agree that we should log more conservatively and ideally leave no room for disk filling attacks, but that doesn't preclude also protecting our users in case we fail to do so :)

@luke-jr
Member

luke-jr commented Oct 24, 2020

Can we get a way to disable this via config file?

@practicalswift
Contributor Author

Could a maintainer mark this PR as "Up for grabs"? I'm afraid I won't have time to incorporate the suggested changes any time soon.

FWIW I still think this mitigation is very much worth doing: the "up for grabs" state is purely due to lack of time on my part. I'd be glad to review any work being done to reduce the risk of disk fill attacks.

@fanquake
Member

fanquake commented Apr 7, 2021

Picked up in #21603.

@practicalswift practicalswift deleted the mitigate-log-disk-filling-attacks branch April 10, 2021 19:42
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 18, 2022