log: Mitigate disk filling attacks by temporarily rate limiting LogPrintf(…) #19995
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts. Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Concept ACK. I was wondering about this threat several times over the last year, but never managed to look into it for real.
Have you looked at the implementation? It certainly doesn't drop everything :) The implementation was carefully written to minimise the risk of suppressing something important from being logged while at the same time killing the disk-fill-via-peer-triggered-logging bug class for good. Do you have any ideas on how to improve the implementation in a way that reduces the "prevent important things from being logged" risk? Perhaps this implementation isn't the way to go, but I really think we need some kind of general mitigation for disk fill attacks. This is a non-theoretical DoS vector which has been problematic for Bitcoin Core historically.
I don't know how hard a task that would be, but it would be great to see how easy it is to fill disk space depending on the protocol piece (the message, but also its context). From there we could make sure our rate limiting is effective?
When testing this patch I reverted the fix for a historic disk fill bug to see if this mitigation would have been successful. In that case an untrusted peer could achieve a disk fill rate of 15 GB/hour without this mitigation, and the expected 1 MB/hour with this mitigation. (The 15 GB/hour was achieved on a really old machine with mechanical disks, so YMMV.) In other words: the mitigation slowed that specific attack by a factor of 15 000 :)

Note that the quota is set per source location. So every single source location (say `net_processing.cpp:418`) gets its own hourly quota.

Perhaps I misunderstood what you meant by the […]. The goal of this PR is to kill this bug class for good.
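To make the per-source-location quota idea concrete, here is a minimal, hypothetical sketch of the bookkeeping described above: each `(file, line)` pair gets its own hourly byte budget, and the whole table is cleared when the window resets. The class and method names are illustrative, not the actual Bitcoin Core implementation; only the 1 MiB/hour figure mirrors the PR discussion.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Illustrative constant matching the 1 MiB/hour quota discussed in the PR.
constexpr uint64_t HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION = 1024 * 1024;

class LogRateLimiter
{
    // Bytes logged per (file, line) source location in the current window.
    std::map<std::pair<std::string, int>, uint64_t> m_bytes_logged;

public:
    // Returns true if the message may be written to disk; false if this
    // source location has already exhausted its hourly quota.
    bool Consume(const std::string& file, int line, uint64_t msg_bytes)
    {
        uint64_t& used = m_bytes_logged[{file, line}];
        if (used >= HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION) return false;
        used += msg_bytes;
        return true;
    }

    // Called once the hourly window elapses: all locations start fresh.
    void ResetWindow() { m_bytes_logged.clear(); }
};
```

Keying on the source location rather than a global counter is what lets one noisy call site be suppressed without silencing the rest of the log, which is exactly the trade-off debated in this thread.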
```cpp
if (print_quota_state_change_message) {
    str_prefixed = LogTimestampStr(strprintf(
        "Excessive logging detected from %s (%s): >%d MiB logged during the last hour. "
        "Suppressing logging to disk from this source location for up to one hour. "
        "Console logging unaffected. Last log entry: %s",
        source_location, logging_function,
        HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION / (1024 * 1024), str_prefixed));
} else {
    skip_writing_to_disk_due_to_rate_limiting = true;
```
Oh, sorry, I thought this was a global flag.
@laanwj Thanks for the Concept ACK! :) Feedback addressed.
```cpp
// MUST NOT be rate limited no matter how often they are logged. That requirement
// should be extremely rare, so please use with care. Prefer LogPrintf(...) if
// possible.
#define LogPrintfWithoutRateLimiting(...) LogPrintf_(__func__, __FILE__, __LINE__, /* skip_disk_usage_rate_limiting */ true, __VA_ARGS__)
```
This name is really awkward. I would prefer a more compact name, but don't really have any good suggestions.
Heh, I agree :) Naming is hard! I'm happy to change the name of course.
I expect `LogPrintfWithoutRateLimiting` to be used once: for `UpdateTip`, which is the only non-category log message which we ever expect to legitimately log at a rate of >1 MB/hour (or whatever hourly source-location threshold we choose).

One alternative would be to call `LogPrintf_` directly from `UpdateTip` as it is a one-off. That would solve the naming problem too :)
FWIW, this results in a 120 kB increase in stripped binary size for […] (still concept ACK though, and thanks for addressing my comments, code changes LGTM now)
Concept ACK.
Log shedding needs to be done very carefully, since logs are usually the best data we have when diagnosing user issues. I'm not convinced that shedding logs per location is a good idea. Partial logs are often worse than no logs at all and can be very misleading ("it logged this and then it didn't log this, so it can't have taken this code branch"). Even with a special "Suppressing logging to disk from this source location" log, it's very easy to be misled unless you know exactly what you're looking for.
I think I'd prefer log-shedding to be done globally, with a special log at the end of a log suppression period that tallies up which log locations have been suppressed.
```cpp
const std::chrono::seconds now = GetTime<std::chrono::seconds>();
if ((now - m_last_reset_of_bytes_logged_per_source_location) > std::chrono::hours{1}) {
    m_bytes_logged_per_source_location.clear();
    m_last_reset_of_bytes_logged_per_source_location = now;
```
It'd be good to check here whether logging was suppressed for this log and, if so, log that logging is restarting for this log (e.g. "Restarting logging from %s ...").
That's a good idea. I'll implement!
```cpp
if (print_quota_state_change_message) {
    str_prefixed = LogTimestampStr(strprintf(
        "Excessive logging detected from %s:%d (%s): >%d MiB logged during the last hour. "
        "Suppressing logging to disk from this source location for up to one hour. "
        "Console logging unaffected. Last log entry: %s",
        source_location.first, source_location.second, logging_function,
        HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION / (1024 * 1024), str_prefixed));
} else {
    skip_writing_to_disk_due_to_rate_limiting = true;
```
Consider keeping a tally of how many log messages were dropped, and then logging that number when you restart logging (e.g. "Restarting logging from <location>. <number> log messages dropped since <time>").
Makes sense. I'll implement.
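As a rough illustration of the reviewer's suggestion above, the sketch below counts dropped messages per source location while suppression is active and emits one "Restarting logging" line per location when the window resets. All names here are hypothetical; this is not the code that ended up in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical tally of messages dropped per suppressed source location.
class SuppressionTally
{
    std::map<std::string, uint64_t> m_dropped; // messages dropped per location

public:
    // Called each time a message is rate limited away.
    void RecordDrop(const std::string& location) { ++m_dropped[location]; }

    // On window reset, produce one restart line per suppressed location
    // and clear the tally so the next window starts fresh.
    std::vector<std::string> FlushRestartMessages()
    {
        std::vector<std::string> out;
        for (const auto& [location, count] : m_dropped) {
            out.push_back("Restarting logging from " + location + ". " +
                          std::to_string(count) +
                          " log messages dropped since last reset.");
        }
        m_dropped.clear();
        return out;
    }
};
```

The point of the tally is diagnosability: a reader of the log learns not just that suppression happened, but how much information was lost and where.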
Conceptually, I am not sure if it is good to fight the symptoms. I'd prefer if unconditional logs were only used for the init/shutdown sequence and local system errors, such as data corruption. Anything else the average user probably doesn't care about, and if they did, they could enable the corresponding debug category and provide enough disk space for the debug log file.
I fail to see how introducing a mitigation would be to "fight the symptoms". The reason we use ASLR, for example, isn't that we've given up on the ambition to write non-vulnerable code :) The reason we use mitigations is to make exploitation harder in case we fail to write vulnerability-free code, and thus hopefully turn a vulnerability into a normal bug. Have you looked at the implementation in this PR? :) The mitigation logic would only kick in under extreme circumstances, meaning that the only observable difference from a user perspective is that his/her node would be exploitable to disk fill attacks without the mitigation in place, whereas he/she would be safe with it. Note that this is not a theoretical issue: we've had exploitable disk filling vulnerabilities in the past which this mitigation would have successfully turned into just another non-exploitable, non-security bug.

I agree that we should log more conservatively and ideally leave no room for disk filling attacks, but that doesn't preclude also protecting our users in case we fail to do so :)
Can we get a way to disable this via the config file?
Could a maintainer mark this PR as "Up for grabs"? I'm afraid I won't have time to incorporate the suggested changes any time soon. FWIW I still think this mitigation is very much worth doing: the "up for grabs" state is purely due to lack of time on my part. I'd be glad to review any work being done to reduce the risk of disk fill attacks.
Picked up in #21603.
Mitigate disk filling attacks by temporarily rate limiting `LogPrintf(…)`.

A disk fill attack is an attack where an untrusted party (such as a peer) is able to cheaply make your node log to disk excessively. The excessive logging may fill your disk and thus make your node crash either cleanly (best case: if the disk fill rate is relatively slow) or uncleanly (worst case: if the disk fill rate is relatively fast).

The hourly logging quota is set per source location. Every single source location (say `net_processing.cpp:418`) gets a quota of 1 MB of logging per hour.

Notes:

- `LogPrintf(…)` is rate limited.
- `LogPrint(category, …)` (`-debug`) is not rate limited.
- `UpdateTip: new best=[…]` is logged using `LogPrintfWithoutRateLimiting(…)` to avoid rate limiting. High log volume is expected for that source location during IBD.

Live demo: