[IBD] coins: increase default UTXO flush batch size to 32 MiB #31645
Conversation
Code Coverage & Benchmarks: for details see https://corecheck.dev/bitcoin/bitcoin/pulls/31645. Reviews: see the guideline for information on the review process. Conflicts: no conflicts as of last run.
FWIW, the reason the batch size behavior exists (as opposed to just writing everything at once) is that writing everything at once causes a memory usage spike at flush time. If that spike exceeds the memory the process can allocate, it causes a crash at a particularly bad time (it may require a replay to fix, which may be slower than just reprocessing the blocks). Given that changing this appears to improve performance it's worth considering, of course, but it is essentially a trade-off between speed and a memory usage spike.
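To illustrate the trade-off described above, here is a minimal standalone sketch (hypothetical types and names, not the actual Bitcoin Core code; the real loop lives in `CCoinsViewDB::BatchWrite`, quoted in the instrumentation diff further down): the pending batch only grows to roughly the configured threshold before it is written out and cleared, so the flush-time memory spike scales with the batch size rather than with the full UTXO set.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the database handle; not Bitcoin Core's API.
struct Db {
    void WriteBatch(const std::string& batch) { (void)batch; /* spike peaks while batch is alive */ }
};

void FlushCoins(Db& db, const std::vector<std::string>& entries, std::size_t batch_write_bytes)
{
    std::string batch; // serialized key/value data pending write
    for (const auto& entry : entries) {
        batch += entry;
        if (batch.size() > batch_write_bytes) {
            db.WriteBatch(batch); // extra memory here is roughly batch_write_bytes
            batch.clear();        // buffer is reused for the next partial batch
        }
    }
    db.WriteBatch(batch); // final (possibly partial) batch
}
```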
There is a config option. This is about changing the default.
Thanks for the context, @sipa.

Can we predict the memory usage spike size? Presumably as we flush, that releases memory, which allows for a larger and larger batch size?
Since profilers may not catch these short-lived spikes, I've instrumented the code, loaded the UTXO set (as described in the PR), parsed the logged flushing times and memory usages, and plotted them against each other to see the effect of the batch size increase.

txdb.cpp flush time and memory instrumentation:

```diff
diff --git a/src/txdb.cpp b/src/txdb.cpp
--- a/src/txdb.cpp (revision d249a353be58868d41d2a7c57357038ffd779eba)
+++ b/src/txdb.cpp (revision bae884969d35469320ed9967736eb15b5d87edff)
@@ -90,7 +90,81 @@
return vhashHeadBlocks;
}
+/*
+ * Author: David Robert Nadeau
+ * Site: http://NadeauSoftware.com/
+ * License: Creative Commons Attribution 3.0 Unported License
+ * http://creativecommons.org/licenses/by/3.0/deed.en_US
+ */
+#if defined(_WIN32)
+#include <windows.h>
+#include <psapi.h>
+
+#elif defined(__unix__) || defined(__unix) || defined(unix) || (defined(__APPLE__) && defined(__MACH__))
+#include <unistd.h>
+#include <sys/resource.h>
+
+#if defined(__APPLE__) && defined(__MACH__)
+#include <mach/mach.h>
+
+#elif (defined(_AIX) || defined(__TOS__AIX__)) || (defined(__sun__) || defined(__sun) || defined(sun) && (defined(__SVR4) || defined(__svr4__)))
+#include <fcntl.h>
+#include <procfs.h>
+
+#elif defined(__linux__) || defined(__linux) || defined(linux) || defined(__gnu_linux__)
+#include <stdio.h>
+
+#endif
+
+#else
+#error "Cannot define getCurrentRSS( ) for an unknown OS."
+#endif
+
+/**
+ * Returns the current resident set size (physical memory use) measured
+ * in bytes, or zero if the value cannot be determined on this OS.
+ */
+size_t getCurrentRSS( )
+{
+#if defined(_WIN32)
+ /* Windows -------------------------------------------------- */
+ PROCESS_MEMORY_COUNTERS info;
+ GetProcessMemoryInfo( GetCurrentProcess( ), &info, sizeof(info) );
+ return (size_t)info.WorkingSetSize;
+
+#elif defined(__APPLE__) && defined(__MACH__)
+ /* OSX ------------------------------------------------------ */
+ struct mach_task_basic_info info;
+ mach_msg_type_number_t infoCount = MACH_TASK_BASIC_INFO_COUNT;
+ if ( task_info( mach_task_self( ), MACH_TASK_BASIC_INFO,
+ (task_info_t)&info, &infoCount ) != KERN_SUCCESS )
+ return (size_t)0L; /* Can't access? */
+ return (size_t)info.resident_size;
+
+#elif defined(__linux__) || defined(__linux) || defined(linux) || defined(__gnu_linux__)
+ /* Linux ---------------------------------------------------- */
+ long rss = 0L;
+ FILE* fp = NULL;
+ if ( (fp = fopen( "/proc/self/statm", "r" )) == NULL )
+ return (size_t)0L; /* Can't open? */
+ if ( fscanf( fp, "%*s%ld", &rss ) != 1 )
+ {
+ fclose( fp );
+ return (size_t)0L; /* Can't read? */
+ }
+ fclose( fp );
+ return (size_t)rss * (size_t)sysconf( _SC_PAGESIZE);
+
+#else
+ /* AIX, BSD, Solaris, and Unknown OS ------------------------ */
+ return (size_t)0L; /* Unsupported. */
+#endif
+}
+
bool CCoinsViewDB::BatchWrite(CoinsViewCacheCursor& cursor, const uint256 &hashBlock) {
+ const auto start = std::chrono::steady_clock::now();
+ size_t max_mem{getCurrentRSS()};
+
CDBBatch batch(*m_db);
size_t count = 0;
size_t changed = 0;
@@ -129,7 +203,11 @@
it = cursor.NextAndMaybeErase(*it);
if (batch.SizeEstimate() > m_options.batch_write_bytes) {
LogDebug(BCLog::COINDB, "Writing partial batch of %.2f MiB\n", batch.SizeEstimate() * (1.0 / 1048576.0));
+
+ max_mem = std::max(max_mem, getCurrentRSS());
m_db->WriteBatch(batch);
+ max_mem = std::max(max_mem, getCurrentRSS());
+
batch.Clear();
if (m_options.simulate_crash_ratio) {
static FastRandomContext rng;
@@ -146,8 +224,16 @@
batch.Write(DB_BEST_BLOCK, hashBlock);
LogDebug(BCLog::COINDB, "Writing final batch of %.2f MiB\n", batch.SizeEstimate() * (1.0 / 1048576.0));
+
+ max_mem = std::max(max_mem, getCurrentRSS());
bool ret = m_db->WriteBatch(batch);
+ max_mem = std::max(max_mem, getCurrentRSS());
+
LogDebug(BCLog::COINDB, "Committed %u changed transaction outputs (out of %u) to coin database...\n", (unsigned int)changed, (unsigned int)count);
+ if (changed > 0) {
+ const auto end{std::chrono::steady_clock::now()};
+ LogInfo("BatchWrite took=%dms, maxMem=%dMiB", duration_cast<std::chrono::milliseconds>(end - start).count(), max_mem >> 20);
+ }
return ret;
}
```

Python script to load the UTXO set, parse the logs, and create the flush and memory plots:

```python
import os
import re
import shutil
import statistics
import subprocess
import time
import datetime
import argparse
import matplotlib.pyplot as plt # python3.12 -m pip install matplotlib --break-system-packages
# Regex to parse logs
BATCHWRITE_REGEX = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) BatchWrite took=(\d+)ms, maxMem=(\d+)MiB")
def parse_log(archive):
"""Parse the log file to extract elapsed times, flush times, and memory usage."""
start_time = None
elapsed, batchwrite_times, usage_snapshots = [], [], []
with open(archive, "r") as f:
for line in f:
if m := BATCHWRITE_REGEX.search(line):
dt = datetime.datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
if start_time is None:
start_time = dt
elapsed.append((dt - start_time).total_seconds())
batchwrite_times.append(int(m.group(2)))
usage_snapshots.append(int(m.group(3)))
return elapsed, batchwrite_times, usage_snapshots
def plot_results(results, output_dir):
"""Create separate plots for flush times and memory usage."""
if len(results) != 2:
print("plot_results() requires exactly 2 runs for comparison.")
return
(dbbatch0, elapsed0, flush0, mem0) = results[0]
(dbbatch1, elapsed1, flush1, mem1) = results[1]
# Compute percentage differences
avg_flush0, avg_flush1 = statistics.mean(flush0), statistics.mean(flush1)
max_mem0, max_mem1 = max(mem0), max(mem1)
flush_improvement = round(((avg_flush0 - avg_flush1) / avg_flush0) * 100, 1)
mem_increase = round(((max_mem1 - max_mem0) / max_mem0) * 100, 1)
# Plot flush times
plt.figure(figsize=(16, 8))
plt.plot(elapsed0, flush0, color="red", linestyle="-", label=f"Flush Times (dbbatch={dbbatch0})")
plt.axhline(y=avg_flush0, color="red", linestyle="--", alpha=0.5, label=f"Mean ({dbbatch0})={avg_flush0:.1f}ms")
plt.plot(elapsed1, flush1, color="orange", linestyle="-", label=f"Flush Times (dbbatch={dbbatch1})")
plt.axhline(y=avg_flush1, color="orange", linestyle="--", alpha=0.5, label=f"Mean ({dbbatch1})={avg_flush1:.1f}ms")
plt.title(f"Flush Times (dbbatch {dbbatch0} vs {dbbatch1}) — {abs(flush_improvement)}% {'faster' if flush_improvement > 0 else 'slower'}")
plt.xlabel("Elapsed Time (seconds)")
plt.ylabel("Flush Times (ms)")
plt.legend()
plt.grid(True)
plt.tight_layout()
flush_out_file = os.path.join(output_dir, "plot_flush_times.png")
plt.savefig(flush_out_file)
print(f"Flush Times plot saved as {flush_out_file}")
plt.close()
# Plot memory usage
plt.figure(figsize=(16, 8))
plt.plot(elapsed0, mem0, color="blue", linestyle="-", label=f"Memory (dbbatch={dbbatch0})")
plt.axhline(y=max_mem0, color="blue", linestyle="--", alpha=0.5, label=f"Max Mem ({dbbatch0})={max_mem0}MiB")
plt.plot(elapsed1, mem1, color="green", linestyle="-", label=f"Memory (dbbatch={dbbatch1})")
plt.axhline(y=max_mem1, color="green", linestyle="--", alpha=0.5, label=f"Max Mem ({dbbatch1})={max_mem1}MiB")
plt.title(f"Memory Usage (dbbatch {dbbatch0} vs {dbbatch1}) — {abs(mem_increase)}% {'higher' if mem_increase > 0 else 'lower'}")
plt.xlabel("Elapsed Time (seconds)")
plt.ylabel("Memory Usage (MiB)")
plt.legend()
plt.grid(True)
plt.tight_layout()
mem_out_file = os.path.join(output_dir, "plot_memory_usage.png")
plt.savefig(mem_out_file)
print(f"Memory Usage plot saved as {mem_out_file}")
plt.close()
def loadtxoutset(dbbatchsize, datadir, bitcoin_cli, bitcoind, utxo_file):
"""Load the UTXO set and run the Bitcoin node."""
archive = os.path.join(datadir, f"results_dbbatch-{dbbatchsize}.log")
# Skip if logs already exist
if os.path.exists(archive):
print(f"Log file {archive} already exists. Skipping loadtxoutset for dbbatchsize={dbbatchsize}.")
return
os.makedirs(datadir, exist_ok=True)
debug_log = os.path.join(datadir, "debug.log")
try:
print("Cleaning up previous run")
for subdir in ["chainstate", "chainstate_snapshot"]:
shutil.rmtree(os.path.join(datadir, subdir), ignore_errors=True)
print("Preparing UTXO load")
subprocess.run([bitcoind, f"-datadir={datadir}", "-stopatheight=1"], cwd=bitcoin_core_path)
os.remove(debug_log)
print(f"Starting bitcoind with dbbatchsize={dbbatchsize}")
subprocess.run([bitcoind, f"-datadir={datadir}", "-daemon", "-blocksonly=1", "-connect=0", f"-dbbatchsize={dbbatchsize}", f"-dbcache={440}"], cwd=bitcoin_core_path)
time.sleep(5)
print("Loading UTXO set")
subprocess.run([bitcoin_cli, f"-datadir={datadir}", "loadtxoutset", utxo_file], cwd=bitcoin_core_path)
except Exception as e:
print(f"Error during loadtxoutset for dbbatchsize={dbbatchsize}: {e}")
raise
finally:
print("Stopping bitcoind...")
subprocess.run([bitcoin_cli, f"-datadir={datadir}", "stop"], cwd=bitcoin_core_path, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
time.sleep(5)
shutil.copy2(debug_log, archive)
print(f"Archived logs to {archive}")
if __name__ == "__main__":
# Parse script arguments
parser = argparse.ArgumentParser(description="Benchmark Bitcoin dbbatchsize configurations.")
parser.add_argument("--utxo-file", required=True, help="Path to the UTXO snapshot file.")
parser.add_argument("--bitcoin-core-path", required=True, help="Path to the Bitcoin Core project directory.")
args = parser.parse_args()
utxo_file = args.utxo_file
bitcoin_core_path = args.bitcoin_core_path
datadir = os.path.join(bitcoin_core_path, "demo")
debug_log = os.path.join(datadir, "debug.log")
bitcoin_cli = os.path.join(bitcoin_core_path, "build/src/bitcoin-cli")
bitcoind = os.path.join(bitcoin_core_path, "build/src/bitcoind")
# Build Bitcoin Core
print("Building Bitcoin Core...")
subprocess.run(["cmake", "-B", "build", "-DCMAKE_BUILD_TYPE=Release"], cwd=bitcoin_core_path, check=True)
subprocess.run(["cmake", "--build", "build", "-j", str(os.cpu_count())], cwd=bitcoin_core_path, check=True)
# Run tests for each dbbatchsize
results = []
for dbbatchsize in [16777216, 67108864]: # Original and proposed
loadtxoutset(dbbatchsize, datadir, bitcoin_cli, bitcoind, utxo_file)
archive = os.path.join(datadir, f"results_dbbatch-{dbbatchsize}.log")
elapsed, batchwrite_times, usage_snapshots = parse_log(archive)
results.append((dbbatchsize, elapsed, batchwrite_times, usage_snapshots))
# Plot results
plot_results(results, bitcoin_core_path)
print("All configurations processed.") For standard dbcache values the results are very close (though the memory measurements aren't as scientific as I'd like them to be (probably because there is still enough memory), some runs even indicate that 16MiB consumes a bit more memory than the 64MiB version), but the trend seems to be clear from the produced plots: the batch writes are faster (and seem more predictable) with bigger batches, while the memory usage is only slightly higher. Is there any other way that you'd like me to test this @sipa, @luke-jr, @1440000bytes? |
Force-pushed from d249a35 to 8684133.
I think those graphs need to be on height rather than seconds. The larger dbbatchsize making it faster means it gets further in the chain, leading to the higher max at the end... I would expect both lines to be essentially overlapping except during flushes.
I was only measuring the memory here during flushes. There is no direct height available there, but if we instrument ...
The UTXO set has grown significantly, and flushing it from memory to LevelDB often takes over 20 minutes after a successful IBD with large dbcache values. The final UTXO set is written to disk in batches, which LevelDB sorts into SST files. By increasing the default batch size, we can reduce overhead from repeated compaction cycles, minimize constant overhead per batch, and achieve more sequential writes.

Experiments with different batch sizes (loaded via assumeutxo at block 840k, then measuring final flush time) show that 64 MiB batches significantly reduce flush time without notably increasing memory usage:

| dbbatchsize | flush_sum (ms) |
|-------------|----------------|
| 8 MiB       | ~240,000       |
| 16 MiB      | ~220,000       |
| 32 MiB      | ~200,000       |
| *64 MiB*    | *~150,000*     |
| 128 MiB     | ~156,000       |
| 256 MiB     | ~166,000       |
| 512 MiB     | ~186,000       |
| 1 GiB       | ~186,000       |

Checking the impact of a `-reindex-chainstate` with `-stopatheight=878000` and `-dbcache=30000` gives:

16 << 20
```
2025-01-12T07:31:05Z Flushed fee estimates to fee_estimates.dat.
2025-01-12T07:31:05Z [warning] Flushing large (26 GiB) UTXO set to disk, it may take several minutes
2025-01-12T07:53:51Z Shutdown: done
```
Flush time: 22 minutes and 46 seconds.

64 >> 20
```
2025-01-12T18:30:00Z Flushed fee estimates to fee_estimates.dat.
2025-01-12T18:30:00Z [warning] Flushing large (26 GiB) UTXO set to disk, it may take several minutes
2025-01-12T18:44:43Z Shutdown: done
```
Flush time: ~14 minutes 43 seconds.

Github-Pull: bitcoin#31645
Rebased-From: d249a35
Code review ACK 8684133

> If that spike exceeds the memory the process can allocate it causes a crash, at a particularly bad time (may require a replay to fix, which may be slower than just reprocessing the blocks).

It is difficult for me to have a sense of how safe this change is, but I'd hope we are not currently pushing systems so close to the edge that using an extra 48 MiB will cause them to start crashing. This does seem like a nice performance improvement if it doesn't cause crashes.

In theory, we could dynamically limit the batch size based on available memory to mitigate the risk of crashes. However, since the batch size is already small and further increases don't provide much additional benefit (per the commit message), that added complexity probably isn't worth it here.
Concept ACK on raising the default from 16 to 67 megabytes (note that `-dbbatchsize` is a hidden `-help-debug` config option). Testing:

```
$ ./build/bin/bitcoind -help-debug | grep -A2 dbbatch
  -dbbatchsize
       Maximum database write batch size in bytes (default: 67108864)
```
Force-pushed from 1ec68cf to 4b4dee0.
🚧 At least one of the CI tasks failed. Hints: try to run the tests locally, according to the documentation. However, a CI failure may still ... Leave a comment here if you need help tracking down a confusing failure.
Force-pushed from 4b4dee0 to 8fd522b.
ACK 8fd522b
A few nits, feel free to pick/choose/ignore.
src/node/coins_view_args.h (outdated):

```cpp
        (dbcache_bytes / DEFAULT_KERNEL_CACHE) * DEFAULT_DB_CACHE_BATCH,
        /*lo=*/DEFAULT_DB_CACHE_BATCH,
        /*hi=*/256_MiB
    );
```
Might be good to add a comment for this magic value, i.e. (from the PR description):

> Capped at 256 MiB, as gains are barely measurable for bigger batches (see PR 31645)

also, clang-format:

```diff
-        /*hi=*/256_MiB
-    );
+        /*hi=*/256_MiB);
```

and unneeded braces line 18:

```diff
-        (dbcache_bytes / DEFAULT_KERNEL_CACHE) * DEFAULT_DB_CACHE_BATCH,
+        dbcache_bytes / DEFAULT_KERNEL_CACHE * DEFAULT_DB_CACHE_BATCH,
```
> Capped at 256 MiB, as gains are barely measurable for bigger batches (see PR 31645)

I don't mind adding, but a simple blame would immediately reveal that.

> unneeded braces line 18

They may be implied, but I want to emphasize that mathematically this isn't associative, i.e. not the same as

```cpp
dbcache_bytes / (DEFAULT_KERNEL_CACHE * DEFAULT_DB_CACHE_BATCH)
```
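Illustrating that grouping point with concrete numbers, here is a standalone sketch (constant values taken from the kernel/caches.h excerpt and the unit tests quoted later in this thread; `std::uint64_t` used to avoid 32-bit overflow). `a / b * c` groups left-to-right, so it equals `(a / b) * c`, not `a / (b * c)`:

```cpp
#include <cstdint>

constexpr std::uint64_t MiB{1024 * 1024};
constexpr std::uint64_t DEFAULT_KERNEL_CACHE{450 * MiB};
constexpr std::uint64_t DEFAULT_DB_CACHE_BATCH{16 * MiB};
constexpr std::uint64_t dbcache_bytes{1000 * MiB};

// Left-to-right grouping: (1000 MiB / 450 MiB) * 16 MiB = 2 * 16 MiB = 32 MiB,
// matching the quoted test value GetDbBatchSize(1000_MiB) == 33'554'432.
static_assert(dbcache_bytes / DEFAULT_KERNEL_CACHE * DEFAULT_DB_CACHE_BATCH == 32 * MiB);

// The other grouping divides by an enormous product and collapses to zero.
static_assert(dbcache_bytes / (DEFAULT_KERNEL_CACHE * DEFAULT_DB_CACHE_BATCH) == 0);
```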
Extracted min/max and added some comments
src/test/coins_tests.cpp (outdated):

```diff
@@ -20,6 +20,7 @@
 #include <vector>

 #include <boost/test/unit_test.hpp>
+#include <node/coins_view_args.h>
```
This seems to be the most frequently seen order:

```diff
 #include <coins.h>
+#include <node/coins_view_args.h>
 #include <streams.h>
@@ -20,7 +21,6 @@
 #include <vector>

 #include <boost/test/unit_test.hpp>
-#include <node/coins_view_args.h>

 using namespace util::hex_literals;
```
You mean we usually ignore the folder when sorting? I found examples for all sorts of includes (pun intended). I'll adjust if I have to push again.
> you mean we usually ignore the folder when sorting?

Think it's a case of keeping our headers separate from external dependencies/libraries.
Makes sense, will do next time I push
Done
src/node/coins_view_args.cpp (outdated):

```diff
-    if (auto value = args.GetIntArg("-dbbatchsize")) options.batch_write_bytes = *value;
     if (auto value = args.GetIntArg("-dbcrashratio")) options.simulate_crash_ratio = *value;
+    if (const auto value = args.GetIntArg("-dbbatchsize")) options.batch_write_bytes = *value;
+    else options.batch_write_bytes = GetDbBatchSize(args.GetIntArg("-dbcache", DEFAULT_KERNEL_CACHE));
```
Feel free to ignore, but the following would be more readable and follow the most frequent convention in this codebase:

```diff
-    if (const auto value = args.GetIntArg("-dbbatchsize")) options.batch_write_bytes = *value;
-    else options.batch_write_bytes = GetDbBatchSize(args.GetIntArg("-dbcache", DEFAULT_KERNEL_CACHE));
+    if (const auto value = args.GetIntArg("-dbbatchsize")) {
+        options.batch_write_bytes = *value;
+    } else {
+        options.batch_write_bytes = GetDbBatchSize(args.GetIntArg("-dbcache", DEFAULT_KERNEL_CACHE));
+    }
```
Will do if I repush.
Added braces
I'm not sure it makes sense to adjust this based on dbcache size. Won't a given batch size use the same amount of memory regardless of the size of the dbcache?
It would; the assumption was that the user should be able to signal how much leftover memory they have: if they start the app with a dbcache of 40 MiB, allocating an extra 16 MiB can be acceptable, but allocating an extra 64 MiB can push the node over the edge. Even though we're not (yet?) preallocating the batch string, doubling the size to accommodate the content would end up with a similar size. Note, however, that this has changed slightly with the merge of #30611: since we're flushing regularly now, there's no significant difference between flushing with a dbcache of 4.5 GiB and one with 45 GiB. I'm open to suggestions for which direction to take from here.
Force-pushed from 8fd522b to af653f3.
Rebased, the PR is ready for review again! The batch size for UTXO set writes is now calculated based on the maximum ...
Force-pushed from af653f3 to 956a6b4.
Pushed the remaining nits that I promised a long time ago :)

Re-rebased, you can re-review with `git range-diff af653f3...956a6b4`.
src/node/coins_view_args.h (outdated):

```cpp
static constexpr size_t GetDbBatchSize(const size_t dbcache_bytes)
{
    const auto target{(dbcache_bytes / DEFAULT_KERNEL_CACHE) * DEFAULT_DB_CACHE_BATCH};
    return std::max<size_t>(MIN_DB_CACHE_BATCH, std::min<size_t>(MAX_DB_CACHE_BATCH, target));
```
Ended up with extracting min/max and adding an assert
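For reference, a self-contained sketch of the clamped scaling being reviewed here, reconstructed from the excerpts and unit tests quoted in this thread (the constant values are assumptions based on those quotes, not the authoritative definitions):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

constexpr std::size_t MiB{1024 * 1024};
constexpr std::size_t DEFAULT_KERNEL_CACHE{450 * MiB};   // default dbcache
constexpr std::size_t DEFAULT_DB_CACHE_BATCH{16 * MiB};  // batch at default dbcache
constexpr std::size_t MIN_DB_CACHE_BATCH{16 * MiB};
constexpr std::size_t MAX_DB_CACHE_BATCH{256 * MiB};     // gains barely measurable above this

constexpr std::size_t GetDbBatchSize(std::size_t dbcache_bytes)
{
    // Scale linearly with dbcache, then clamp into [min, max].
    const std::size_t target{(dbcache_bytes / DEFAULT_KERNEL_CACHE) * DEFAULT_DB_CACHE_BATCH};
    return std::clamp(target, MIN_DB_CACHE_BATCH, MAX_DB_CACHE_BATCH);
}

// Spot checks matching the tests quoted below:
static_assert(GetDbBatchSize(450 * MiB) == 16 * MiB);
static_assert(GetDbBatchSize(1000 * MiB) == 32 * MiB);
#if SIZE_MAX > UINT32_MAX
static_assert(GetDbBatchSize(45000ULL * MiB) == 256 * MiB); // clamped at the max
#endif
```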
ACK 956a6b4
- Note that #31645 (comment) refers to a change of approach from several months ago, prior to my previous review -- in that context, I found the comment confusing
- I would suggest not rebasing if there is no merge conflict; this eases reviewing the diff (and after, I'll usually rebase the PR branch to master locally anyway)
- Happy to re-ack if you take the review suggestions
src/kernel/caches.h (outdated):

```diff
@@ -16,6 +16,9 @@
 static constexpr size_t MAX_BLOCK_DB_CACHE{2_MiB};
 //! Max memory allocated to coin DB specific cache (bytes)
 static constexpr size_t MAX_COINS_DB_CACHE{8_MiB};

+//! The batch size of DEFAULT_KERNEL_CACHE
+static constexpr size_t DEFAULT_DB_CACHE_BATCH{16_MiB};
```
In 6fa584f, seems it would make sense to place these two together:

```diff
 //! Suggested default amount of cache reserved for the kernel (bytes)
 static constexpr size_t DEFAULT_KERNEL_CACHE{450_MiB};
+//! Batch size of DEFAULT_KERNEL_CACHE
+static constexpr size_t DEFAULT_DB_CACHE_BATCH{16_MiB};
 //! Max memory allocated to block tree DB specific cache (bytes)
 static constexpr size_t MAX_BLOCK_DB_CACHE{2_MiB};
 //! Max memory allocated to coin DB specific cache (bytes)
 static constexpr size_t MAX_COINS_DB_CACHE{8_MiB};
-//! The batch size of DEFAULT_KERNEL_CACHE
-static constexpr size_t DEFAULT_DB_CACHE_BATCH{16_MiB};
-
 namespace kernel {
 struct CacheSizes {
```
We could, I'll do it next time I push
Concept ACK 956a6b4
Have not confirmed the improvements in IBD speedup.
src/test/coins_tests.cpp (outdated):

```cpp
    BOOST_REQUIRE_EQUAL(node::GetDbBatchSize(DEFAULT_KERNEL_CACHE), DEFAULT_DB_CACHE_BATCH);
    BOOST_REQUIRE_EQUAL(node::GetDbBatchSize(0_MiB), DEFAULT_DB_CACHE_BATCH);

    BOOST_CHECK_EQUAL(node::GetDbBatchSize(4_MiB), 16'777'216);
```
nit: Could motivate the min value:

```diff
+    static_assert(MIN_DB_CACHE == 4_MiB);
     BOOST_CHECK_EQUAL(node::GetDbBatchSize(4_MiB), 16'777'216);
```
src/test/coins_tests.cpp (outdated):

```cpp
BOOST_AUTO_TEST_CASE(db_batch_sizes)
{
    BOOST_REQUIRE_EQUAL(node::GetDbBatchSize(DEFAULT_KERNEL_CACHE), DEFAULT_DB_CACHE_BATCH);
    BOOST_REQUIRE_EQUAL(node::GetDbBatchSize(0_MiB), DEFAULT_DB_CACHE_BATCH);

    BOOST_CHECK_EQUAL(node::GetDbBatchSize(4_MiB), 16'777'216);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(10_MiB), 16'777'216);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(45_MiB), 16'777'216);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(100_MiB), 16'777'216);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(450_MiB), 16'777'216);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(1000_MiB), 33'554'432);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(2000_MiB), 67'108'864);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(3000_MiB), 100'663'296);

#if SIZE_MAX > UINT32_MAX
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(4500_MiB), 167'772'160);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(7000_MiB), 251'658'240);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(10000_MiB), 268'435'456);
    BOOST_CHECK_EQUAL(node::GetDbBatchSize(45000_MiB), 268'435'456);
#endif
}
```
Might as well make use of `constexpr`? Suggested replacement for the `db_batch_sizes` test case:

```cpp
// Verify DB batch sizes:
static_assert(node::GetDbBatchSize(DEFAULT_KERNEL_CACHE) == DEFAULT_DB_CACHE_BATCH);
static_assert(node::GetDbBatchSize(0_MiB) == DEFAULT_DB_CACHE_BATCH);
static_assert(node::GetDbBatchSize(4_MiB) == 16'777'216);
static_assert(node::GetDbBatchSize(10_MiB) == 16'777'216);
static_assert(node::GetDbBatchSize(45_MiB) == 16'777'216);
static_assert(node::GetDbBatchSize(100_MiB) == 16'777'216);
static_assert(node::GetDbBatchSize(450_MiB) == 16'777'216);
static_assert(node::GetDbBatchSize(1000_MiB) == 33'554'432);
static_assert(node::GetDbBatchSize(2000_MiB) == 67'108'864);
static_assert(node::GetDbBatchSize(3000_MiB) == 100'663'296);
#if SIZE_MAX > UINT32_MAX
static_assert(node::GetDbBatchSize(4500_MiB) == 167'772'160);
static_assert(node::GetDbBatchSize(7000_MiB) == 251'658'240);
static_assert(node::GetDbBatchSize(10000_MiB) == 268'435'456);
static_assert(node::GetDbBatchSize(45000_MiB) == 268'435'456);
#endif
```
I'm not a fan of compile-time tests; they're usually harder to debug when needed.
And these are as fast as they get, so we wouldn't be saving any time. It would be inconsistent with other tests, while being slightly harder to debug and not any faster.
Do you think there's any tangible advantage there?
I would just default to testing at compile time:
- No need to (re)run ctest/test_bitcoin to exercise checks.
- Only run when compilation unit is (re)built. Not re-run when iterating on test code in other compilation units.
- Inconsistency might push other tests towards being converted to compile time, which I would say is a positive effect.
These are extremely fast tests; keeping the usual test format has more advantages in my opinion.
If you have a strong preference or if other reviewers prefer that, I don't mind changing, but I don't think the current one has measurable disadvantages compared to the suggestion, while being more in line with how we usually test and making `GetDbBatchSize` easily debuggable (it often helps with understanding if you can step through it).
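A small illustration of the debuggability trade-off being discussed (my example, not from the PR): a failing `static_assert` fires at compile time and reports only the condition, with nothing to step through, while an equivalent runtime check can be inspected live in a debugger.

```cpp
#include <cassert>

constexpr int Square(int x) { return x * x; }

// Compile-time: a failure aborts the build with "static assertion failed";
// there is no process to attach a debugger to and no values are printed.
static_assert(Square(4) == 16);

int main()
{
    // Runtime: a debugger can break here and step into Square().
    assert(Square(4) == 16);
    return 0;
}
```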
ACK 956a6b4

Change scales the hidden `-dbbatchsize` by the `-dbcache` setting if unset.

Compared IBD until block 910'000 from a local peer in 3 variations (both nodes on SSD, edit: both on NVMe): the PR change for `-dbcache` of 450 vs 45'000, and the base commit of the PR with 45'000.

- Baseline of `-dbcache=450` averaged 267 mins across 3 runs.
- `-dbcache=45000` without this PR averaged 241 mins across 3 runs (these were probably the runs where I was doing the least work on the other node).
- `-dbcache=45000` with this PR averaged 233 mins across 3 runs.

That looks like at least an average 3% improvement in the `-dbcache=45000` case for me. Still nice.

Edit: Did another run with the PR and `-dbcache=45000` and without working on any of the nodes, and got a wall time of 218 min, strengthening the impression that this PR is beneficial.
Run log:

```
With PR changes (956a6b4):

rm -rf ~/.bitcoin && time ./build/bin/bitcoind -connect=workstation.lan -dbcache=45000 -stopatheight=910000
real 238m57.330s
user 527m48.667s
sys 19m49.001s

rm -rf ~/.bitcoin && time ./build/bin/bitcoind -connect=workstation.lan -dbcache=450 -stopatheight=910000
real 266m16.473s
user 636m6.859s
sys 32m59.252s

repeated:
real 273m25.548s
user 640m5.442s
sys 31m58.674s

rm -rf ~/.bitcoin && time ./build/bin/bitcoind -connect=workstation.lan -dbcache=45000 -stopatheight=910000
real 228m12.943s
user 509m52.682s
sys 17m10.000s

rm -rf ~/.bitcoin && time ./build/bin/bitcoind -connect=workstation.lan -dbcache=450 -stopatheight=910000
real 261m47.071s
user 631m42.243s
sys 31m34.241s

PR base (d767503):

rm -rf ~/.bitcoin && time ./build/bin/bitcoind -connect=workstation.lan -dbcache=45000 -stopatheight=910000
real 246m56.108s
user 522m46.642s
sys 20m24.343s

again:
real 232m41.467s
user 508m7.763s
sys 19m16.589s

again:
real 242m17.316s
user 506m25.625s
sys 18m45.029s
=> 241

Again on PR:
rm -rf ~/.bitcoin && time ./build/bin/bitcoind -connect=workstation.lan -dbcache=45000 -stopatheight=910000
real 232m37.784s
user 490m58.944s
sys 18m3.817s

Averages:
450 runs:
(266+273+262) / 3 = 267 mins
45'000 without PR runs:
(247+233+242) / 3 = 241 mins
45'000 with PR runs:
(239+228+233) / 3 = 233 mins
```
The constant for the default batch size is moved to `kernel/caches.h` to consolidate kernel-related cache constants.
The default database write batch size is increased from 16 MiB to 32 MiB to improve I/O efficiency and performance during UTXO flushes, particularly during Initial Block Download and `assumeutxo` loads. On systems with slower I/O, a larger batch size reduces overhead from numerous small writes. Measurements show this change provides a modest performance improvement on most hardware during a critical section, with a minimal peak memory increase (approx. 75 MiB on default settings).
Force-pushed from 956a6b4 to b6f8c48.
Thanks a lot for testing this @hodlinator and @Eunovo. It seems that different configurations behave differently here, so I've retested a few scenarios on a few different platforms to understand what the performance increase depends on exactly. It's not even a linear speedup (i.e. sometimes 32 MiB is faster, sometimes it's 64 MiB).

First, on a Raspberry Pi the difference is obviously a lot better with a bigger batch, even with a tiny memory increase:

- for very small dbcache, a 64 MiB batch size would make `assumeutxo` load and dump ~59% faster on average on a Pi;
- for 500 MB dbcache, 32 MiB would make it 30% faster.
For a relatively performant i7 processor with HDD, the values I'm getting are all over the place, but it's obvious that the current batch size isn't the best. Looks like dynamic sizing isn't as performant as powers of two. The 32 MiB version seems pretty good though. (Detailed numbers omitted.)
Also measured on a very performant i9 with a really fast NVMe, making the difference between memory and background storage more blurry. The measured configurations:

- dbcache=500 MB (dynamic batch size would be 17.8 MB)
- dbcache=1000 MB (dynamic batch size would be 35.6 MB)
- dbcache=1500 MB (dynamic batch size would be 53.3 MB)
- dbcache=2000 MB (dynamic batch size would be 71.1 MB)
- dbcache=2500 MB (dynamic batch size would be 88.9 MB)
- dbcache=3000 MB (dynamic batch size would be 106.7 MB)
- dbcache=3500 MB (dynamic batch size would be 124.4 MB)
- dbcache=4000 MB (dynamic batch size would be 142.2 MB)

Edit: The above ones are the complete AssumeUTXO load & dump. Extracting only the batch saving times (the only change of the PR) on an i7 with HDD reveals a 40% speedup for default memory and a 3% speedup for a bigger one with fewer batch writes. (Measurement plots omitted.)
Edit 2: A similar measurement on a Raspberry Pi for only the dumping part of the UTXOs reveals a 70% speedup (before: ~39 minutes vs ~23 minutes after) for default dbcache (consisting of 58 separate batch writes).

So in summary, the dynamic scaling seems like a needless complication and 64 MiB seems to add too much extra memory. The above tests indicate that a constant 32 MiB is the better default.

Comparing the memory usage of master vs the new proposed fix reveals that we're still a lot below 1 GiB; the peak memory only increased marginally. (Plots omitted.)

What about dbcache=4500?! It's 65 MiB bigger and 3% faster. I was interested in the total memory usage of the PR vs master as well. (Plots omitted.)
This change is part of [IBD] - Tracking PR for speeding up Initial Block Download
Summary
When the in-memory UTXO set is flushed to LevelDB (after IBD or AssumeUTXO load), it does so in batches to manage memory usage during the flush.
A hidden `-dbbatchsize` config option exists to modify this value. This PR only changes the default from 16 MiB to 32 MiB.
Using a larger default reduces the overhead of many small writes and improves I/O efficiency (especially on HDDs). It may also help LevelDB optimize writes more effectively (e.g., via internal ordering).
Context
The UTXO set has grown significantly since 2017, when the original fixed 16 MiB batch size was chosen.

With the current multi-gigabyte UTXO set and the common practice of using larger `-dbcache` values, the fixed 16 MiB batch size leads to several inefficiencies:

- Each `WriteBatch` call incurs internal LevelDB overhead (e.g., MemTable handling, compaction-triggering logic). More frequent, smaller batches amplify this cumulative overhead (see the sketch after this list).
- Flush times of 20-30 minutes are not uncommon, even on capable hardware.
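A back-of-the-envelope sketch (my arithmetic, using the ~26 GiB flush figure from the logs quoted earlier) of how the batch count, and thus the per-batch fixed overhead, shrinks with a larger batch size:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    constexpr std::uint64_t utxo_bytes{26ULL << 30}; // ~26 GiB flushed UTXO set
    constexpr std::uint64_t old_batch{16ULL << 20};  // 16 MiB current default
    constexpr std::uint64_t new_batch{32ULL << 20};  // 32 MiB proposed default

    // ~1664 WriteBatch calls vs ~832: the fixed LevelDB cost per call
    // (MemTable handling, compaction triggering) is paid half as often.
    std::printf("batches at 16 MiB: %llu\n", static_cast<unsigned long long>(utxo_bytes / old_batch));
    std::printf("batches at 32 MiB: %llu\n", static_cast<unsigned long long>(utxo_bytes / new_batch));
    return 0;
}
```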
Considerations
As noted by sipa, flushing involves a temporary memory usage increase as the batch is prepared. A larger batch size naturally leads to a larger peak during this phase. Crashing due to OOM during a flush is highly undesirable, but now that #30611 is merged, the most we'd lose is the first hour of IBD.

Increasing the LevelDB write batch size from 16 to 32 MiB raised the measured peaks by ~70 MiB in my tests during UTXO dump. The option remains hidden, and users can always override it.

The increased peak memory usage (detailed below) is primarily attributed to LevelDB's `leveldb::Arena` (backing MemTables) and the temporary storage of serialized batch data (e.g., `std::string` in `CDBBatch::WriteImpl`).

Performance gains are most pronounced on systems with slower I/O (HDDs), but some SSDs also show measurable improvements.
Measurements:

AssumeUTXO proxy, multiple runs with error bars (flushing time is faster than the measured loading + flushing). (Plot omitted.)

Reproducer: (script omitted)
This PR originally proposed 64 MiB, then a dynamic size, but both were dropped: 64 MiB increased peaks more than desired on low-RAM systems, and the dynamic variant underperformed across mixed hardware. 32 MiB is a simpler default that captures most of the gains with a modest peak increase.
For more details see: #31645 (comment)
While the PR isn't about IBD in general but rather about a critical section of it, I have measured a `-reindex-chainstate` until 900k blocks, showing a 1% overall speedup. (Details omitted.)