[IBD] Tracking PR for speeding up Initial Block Download #32043
base: master
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

**Code Coverage & Benchmarks** — for details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32043.

**Reviews** — see the guideline for information on the review process. If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

**Conflicts** — reviewers, this pull request conflicts with the following ones. If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Thanks for creating this. This should make it easier to navigate the other PRs and discuss the overall topic of IBD performance and benchmarking without necessarily needing to repeat it in the individual PRs. Would be useful to have concept ACKs/NACKs here from others who know more about performance and benchmarking. But from what I can tell the individual optimizations do not seem very complicated and seem like they should be justified. One suggestion for the PR description above would be to directly link to the PRs comprising this change in the summary, maybe pointing out any where review should be focused. Current list seems to be:
🚧 At least one of the CI tasks failed.

**Hints**: Try to run the tests locally, according to the documentation. However, a CI failure may still […]

Leave a comment here, if you need help tracking down a confusing failure.
Wouldn't these plots be easier to read with block height on the x-axis and time on the y-axis, giving a consistent domain (each case goes from height 0 to 880k or so) with a simple "lower is better" comparison? (rather than "further left is better") Plotting the difference between the various proposed commits and the baseline (…)
Fair; that might work better with time on the x-axis rather than height, something like:

```python
for time_commit in [time_commit_X, time_commit_Y, time_commit_Z]:
    for height in range(1, 850000):
        y = time_baseline[height] / time_commit[height]  # keep this one the same
        x = time_baseline[height]                        # changed from x = height
        add_datapoint(x, y)
```
Concept ACK, thank you for opening this. I am currently in an environment with slow internet, where despite having a relatively fast laptop, IBD is more than 2 orders of magnitude slower than the times in the OP. Opened #32051 today to address an issue I'm also seeing of very frequent disconnections and reconnections of trusted addnode peers during IBD.
Endianness doesn't affect the final size, so we can skip the byte-order conversion for `SizeComputer`. We can use `if constexpr` to route previous calls into the existing method, short-circuiting the existing serialization logic when we only need the serialized sizes.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

| ns/block | block/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 191,652.29 | 5,217.78 | 0.4% | 10.96 | `SerializeBlock`
| 10,323.55 | 96,865.92 | 0.2% | 11.01 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

| ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 614,847.32 | 1,626.42 | 0.0% | 8,015,883.64 | 2,207,628.07 | 3.631 | 1,517,035.62 | 0.5% | 10.56 | `SerializeBlock`
| 26,020.31 | 38,431.52 | 0.0% | 159,390.03 | 93,438.33 | 1.706 | 42,131.03 | 0.9% | 11.00 | `SizeComputerBlock`
Single-byte writes are used very often: for every `(u)int8_t`, `std::byte`, or `bool`, and for every `VarInt`'s first byte, which is also needed for every (pre)vector. It makes sense to bypass the generalized serialization infrastructure where it isn't needed:

* `AutoFile` write doesn't need to allocate a 4k buffer for a single byte now;
* `VectorWriter` and `DataStream` avoid memcpy/insert calls.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

| ns/block | block/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 174,569.19 | 5,728.39 | 0.6% | 10.89 | `SerializeBlock`
| 10,241.16 | 97,645.21 | 0.0% | 11.00 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

| ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 615,000.56 | 1,626.01 | 0.0% | 8,015,883.64 | 2,208,340.88 | 3.630 | 1,517,035.62 | 0.5% | 10.56 | `SerializeBlock`
| 25,676.76 | 38,945.72 | 0.0% | 159,390.03 | 92,202.10 | 1.729 | 42,131.03 | 0.9% | 11.00 | `SizeComputerBlock`
The `CheckTransaction` validation function in https://github.com/bitcoin/bitcoin/blob/master/src/consensus/tx_check.cpp#L41-L45 relies on a correct ordering relation for detecting duplicate transaction inputs. This update to the tests ensures:

* Accurate detection of duplicates: beyond trivial cases (e.g., two identical inputs), duplicates are detected correctly in more complex scenarios.
* Consistency across methods: sorted sets and hash-based sets behave identically when detecting duplicates for `COutPoint` and related values.
* Robust ordering and equality relations: the function maintains expected behavior for ordering and equality checks.

Using randomized testing with shuffled inputs (to avoid any remaining ordering bias), the enhanced test validates that `CheckTransaction` remains robust and reliable across various input configurations. It confirms behavior identical to a hashing-based duplicate detection mechanism, ensuring consistency and correctness.

To make sure the new branches in the follow-up commits will be covered, `basic_transaction_tests` was extended with a randomized test comparing against the old implementation (and an alternative duplicate-detection one). The iterations and ranges were chosen such that every new branch is expected to be hit at least once.
> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

| ns/block | block/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 372,743.63 | 2,682.81 | 1.1% | 10.99 | `CheckBlockBench`

| ns/op | op/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 3,304,694.54 | 302.60 | 0.5% | 11.05 | `DuplicateInputs`

> C++ compiler .......................... GNU 13.3.0

| ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 1,096,261.84 | 912.19 | 0.1% | 7,963,390.88 | 3,487,375.26 | 2.283 | 1,266,941.00 | 1.8% | 11.03 | `CheckBlockBench`

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 8,366,309.48 | 119.53 | 0.0% | 23,865,177.67 | 26,620,160.23 | 0.897 | 5,972,887.41 | 4.0% | 10.78 | `DuplicateInputs`
The newly introduced `ProcessTransactionBench` incorporates multiple steps of the validation pipeline, offering a more comprehensive view of `CheckBlock` performance within a realistic transaction-validation context. Previous microbenchmarks, such as `DeserializeAndCheckBlockTest` and `DuplicateInputs`, focused on isolated aspects of transaction and block validation. While these tests provided valuable insights for targeted profiling, they lacked context regarding the broader validation process, where interactions between components play a critical role.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

| ns/op | op/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 9,585.10 | 104,328.55 | 0.1% | 11.03 | `ProcessTransactionBench`

> C++ compiler .......................... GNU 13.3.0

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 56,199.57 | 17,793.73 | 0.1% | 229,263.01 | 178,766.31 | 1.282 | 15,509.97 | 0.5% | 10.91 | `ProcessTransactionBench`
`IsCoinBase` means a single input with null prevout, so it makes sense to restrict the duplicate check to non-coinbase transactions only. The behavior is the same as before, except that single-input transactions are no longer checked for duplicates (~70-90% of cases, see https://transactionfee.info/charts/transactions-1in). I've added braces to the conditions and loops to simplify review of the follow-up commits.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

| ns/block | block/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 335,917.12 | 2,976.92 | 1.3% | 11.01 | `CheckBlockBench`

| ns/op | op/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 3,286,337.42 | 304.29 | 1.1% | 10.90 | `DuplicateInputs`
| 9,561.02 | 104,591.35 | 0.2% | 11.02 | `ProcessTransactionBench`
No need to create a set for checking duplicates for two-input transactions.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

| ns/block | block/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 314,137.30 | 3,183.32 | 1.2% | 11.04 | `CheckBlockBench`

| ns/op | op/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 3,220,592.73 | 310.50 | 1.3% | 10.92 | `DuplicateInputs`
| 9,425.98 | 106,089.77 | 0.3% | 11.00 | `ProcessTransactionBench`
A pre-sized vector retains locality (enabling SIMD operations), speeding up sorting and equality checks. It's also simpler (and therefore more reliable) than a sorted set, and it causes less memory fragmentation.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

| ns/block | block/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 181,922.54 | 5,496.85 | 0.2% | 10.98 | `CheckBlockBench`

| ns/op | op/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 997,739.30 | 1,002.27 | 1.0% | 10.94 | `DuplicateInputs`
| 9,449.28 | 105,828.15 | 0.3% | 10.99 | `ProcessTransactionBench`

Co-authored-by: Pieter Wuille <pieter@wuille.net>
For the 2-input case we simply check them both, like we did with equality. For the general case, we take advantage of sorting, making invalid-value detection constant time instead of linear in the worst case.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

| ns/block | block/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 179,971.00 | 5,556.45 | 0.3% | 11.02 | `CheckBlockBench`

| ns/op | op/s | err% | total | benchmark
|---:|---:|---:|---:|:---
| 963,177.98 | 1,038.23 | 1.7% | 10.92 | `DuplicateInputs`
| 9,410.90 | 106,259.75 | 0.3% | 11.01 | `ProcessTransactionBench`

> C++ compiler .......................... GNU 13.3.0

| ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 834,855.94 | 1,197.81 | 0.0% | 6,518,548.86 | 2,656,039.78 | 2.454 | 919,160.84 | 1.5% | 10.78 | `CheckBlockBench`

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 4,261,492.75 | 234.66 | 0.0% | 17,379,823.40 | 13,559,793.33 | 1.282 | 4,265,714.28 | 3.4% | 11.00 | `DuplicateInputs`
| 55,819.53 | 17,914.88 | 0.1% | 227,828.15 | 177,520.09 | 1.283 | 15,184.36 | 0.4% | 10.91 | `ProcessTransactionBench`
Verifies that script types are correctly allocated using prevector's direct (stack) or indirect (heap) storage based on their size:

Direct (stack) allocated script types (size ≤ 28 bytes):
* OP_RETURN (small)
* P2WPKH
* P2SH
* P2PKH

Indirect (heap) allocated script types (size > 28 bytes):
* P2WSH
* P2TR
* P2PK
* MULTISIG (small)

This test provides a baseline for verifying changes to prevector's inline capacity.
The current `prevector` size of 28 bytes (chosen to fill the `sizeof(CScript)` aligned size) was introduced in 2015 (bitcoin#6914) before SegWit and Taproot. However, the increasingly common `P2WSH` and `P2TR` scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.

The core trade-off of this change is to eliminate heap allocations for common 34-36 byte scripts at the cost of increasing the base memory footprint of all `CScript` objects by 8 bytes (while still respecting the peak memory usage defined by `-dbcache`).

Increasing the `prevector` size allows these scripts to be stored inline, avoiding heap allocations, reducing potential memory fragmentation, and improving performance during cache flushes. Massif analysis confirms a lower stable memory usage after flushing, suggesting the elimination of heap allocations outweighs the larger base size for common workloads.

Due to memory alignment, increasing the `prevector` size to 36 bytes doesn't change the overall `sizeof(CScript)` compared to an increase to 34 bytes, allowing us to include `P2PK` scripts as well at no additional memory cost.

Performance benchmarks for AssumeUTXO load and flush show:
- Small dbcache (450MB): ~1% performance penalty due to more frequent flushes
- Large dbcache (4500MB+): ~6-7% performance improvement due to fewer heap allocations

Full IBD and reindex-chainstate with larger `dbcache` values also show an overall ~3% speedup.

Co-authored-by: Ava Chow <github@achow101.com>
Co-authored-by: Andrew Toth <andrewstoth@gmail.com>
🐙 This pull request conflicts with the target branch and needs rebase.
248b6a2 optimization: peel align-head and unroll body to 64 bytes (Lőrinc)
e7114fc optimization: migrate fixed-size obfuscation from `std::vector<std::byte>` to `uint64_t` (Lőrinc)
478d40a refactor: encapsulate `vector`/`array` keys into `Obfuscation` (Lőrinc)
377aab8 refactor: move `util::Xor` to `Obfuscation().Xor` (Lőrinc)
fa5d296 refactor: prepare mempool_persist for obfuscation key change (Lőrinc)
6bbf2d9 refactor: prepare `DBWrapper` for obfuscation key change (Lőrinc)
0b8bec8 scripted-diff: unify xor-vs-obfuscation nomenclature (Lőrinc)
9726979 bench: make ObfuscationBench more representative (Lőrinc)
618a30e test: compare util::Xor with randomized inputs against simple impl (Lőrinc)
a5141cd test: make sure dbwrapper obfuscation key is never obfuscated (Lőrinc)
54ab0bd refactor: commit to 8 byte obfuscation keys (Lőrinc)
7aa557a random: add fixed-size `std::array` generation (Lőrinc)

Pull request description:

This change is part of [[IBD] - Tracking PR for speeding up Initial Block Download](#32043)

### Summary

Block obfuscation is currently done byte-by-byte; this PR batches it into 64-bit primitives to speed up obfuscating bigger memory batches. This is especially relevant now that #31551 was merged, giving us bigger obfuscatable chunks. Since this obfuscation is optional, the speedup measured here depends on whether the key is a [random value](#31144 (comment)) or [completely turned off](#31144 (comment)) (i.e. XOR-ing with 0).

### Changes in testing, benchmarking and implementation

* Added new tests comparing randomized inputs against a trivial implementation and performing roundtrip checks with random chunks.
* Migrated `std::vector<std::byte>(8)` keys to plain `uint64_t`.
* Process unaligned bytes separately and unroll the body to 64 bytes.
### Assembly

Memory alignment is enforced by a small peel-loop (`std::memcpy` is optimized out on the tested platforms), with an `std::assume_aligned<8>` check; see the Godbolt listing at https://godbolt.org/z/59EMv7h6Y for details.

<details>
<summary>Details</summary>

| Target & Compiler | Stride (per hot-loop iter) | Main operation(s) in loop | Effective XORs / iter
| -- | -- | -- | --
| Clang x86-64 (trunk) | 64 bytes | 4 × movdqu → pxor → store | 8 × 64-bit
| GCC x86-64 (trunk) | 64 bytes | 4 × movdqu/pxor sequence, enabled by 8-way unroll | 8 × 64-bit
| GCC RV32 (trunk) | 8 bytes | copy 8 B to temp → 2 × 32-bit XOR → copy back | 1 × 64-bit (as 2 × 32-bit)
| GCC s390x (big-endian 14.2) | 64 bytes | 8 × XC (mem-mem 8-B XOR) with key cached on stack | 8 × 64-bit

</details>

### Endianness

The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption. Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order. Since CI lacks a big-endian machine, testing was done locally via Docker.
<details>
<summary>Details</summary>

```bash
brew install podman pigz
softwareupdate --install-rosetta
podman machine init
podman machine start
docker run --platform linux/s390x -it ubuntu:latest /bin/bash

apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev python3 && \
cd /mnt && git clone --depth=1 https://github.com/bitcoin/bitcoin.git && cd bitcoin && \
git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
./build/bin/test_bitcoin --run_test=streams_tests
```

</details>

### Measurements (micro benchmarks and full IBDs)

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc/clang -DCMAKE_CXX_COMPILER=g++/clang++ && \
> cmake --build build -j$(nproc) && \
> build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000

<details>
<summary>GNU 14.2.0</summary>

> Before:

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 0.84 | 1,184,138,235.64 | 0.0% | 9.01 | 3.03 | 2.971 | 1.00 | 0.1% | 5.50 | `ObfuscationBench`

> After (first optimizing commit):

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 0.04 | 28,365,698,819.44 | 0.0% | 0.34 | 0.13 | 2.714 | 0.07 | 0.0% | 5.33 | `ObfuscationBench`

> and (second optimizing commit):

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 0.03 | 32,464,658,919.11 | 0.0% | 0.50 | 0.11 | 4.474 | 0.08 | 0.0% | 5.29 | `ObfuscationBench`

</details>

<details>
<summary>Clang 20.1.7</summary>

> Before:

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 0.89 | 1,124,087,330.23 | 0.1% | 6.52 | 3.20 | 2.041 | 0.50 | 0.2% | 5.50 | `ObfuscationBench`

> After (first optimizing commit):

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 0.08 | 13,012,464,203.00 | 0.0% | 0.65 | 0.28 | 2.338 | 0.13 | 0.8% | 5.50 | `ObfuscationBench`

> and (second optimizing commit):

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---
| 0.02 | 41,231,547,045.17 | 0.0% | 0.30 | 0.09 | 3.463 | 0.02 | 0.0% | 5.47 | `ObfuscationBench`

</details>

i.e. 27.4x faster obfuscation with GCC, 36.7x faster with Clang.

For other benchmark speedups see https://corecheck.dev/bitcoin/bitcoin/pulls/31144

------

Running an IBD until 888888 blocks reveals a 4% speedup.
<details> <summary>Details</summary> SSD: ```bash COMMITS="8324a00bd4a6a5291c841f2d01162d8a014ddb02 5ddfd31"; \ STOP_HEIGHT=888888; DBCACHE=1000; \ CC=gcc; CXX=g++; \ BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \ (for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \ hyperfine \ --sort 'command' \ --runs 1 \ --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \ --parameter-list COMMIT ${COMMITS// /,} \ --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \ cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && \ cmake --build build -j$(nproc) --target bitcoind && \ ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \ --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \ "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0" ``` > 8324a00 test: Compare util::Xor with randomized inputs against simple impl > 5ddfd31 optimization: Xor 64 bits together instead of byte-by-byte ```python Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00) Time (abs ≡): 25033.413 s [User: 33953.984 s, System: 2613.604 s] Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 5ddfd31) Time (abs ≡): 24110.710 s [User: 33389.536 s, System: 2660.292 s] Relative speed comparison 1.04 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00) 1.00 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly 
-printtoconsole=0 (COMMIT = 5ddfd31) ``` > HDD: ```bash COMMITS="71eb6eaa740ad0b28737e90e59b89a8e951d90d9 4685403"; \ STOP_HEIGHT=888888; DBCACHE=4500; \ CC=gcc; CXX=g++; \ BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \ (for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \ hyperfine \ --sort 'command' \ --runs 2 \ --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \ --parameter-list COMMIT ${COMMITS// /,} \ --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \ cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \ ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \ --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \ "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0" ``` > 71eb6ea test: compare util::Xor with randomized inputs against simple impl > 4685403 optimization: migrate fixed-size obfuscation from `std::vector<std::byte>` to `uint64_t` ```python Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 71eb6ea) Time (mean ± σ): 37676.293 s ± 83.100 s [User: 36900.535 s, System: 2220.382 s] Range (min … max): 37617.533 s … 37735.053 s 2 runs Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 4685403) Time (mean ± σ): 36181.287 s ± 195.248 s [User: 34962.822 s, System: 1988.614 s] Range (min … max): 36043.226 s … 36319.349 s 2 runs Relative speed comparison 1.04 ± 0.01 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 
-blocksonly -printtoconsole=0 (COMMIT = 71eb6ea) 1.00 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 4685403) ``` </details> ACKs for top commit: achow101: ACK 248b6a2 maflcko: review ACK 248b6a2 🎻 ryanofsky: Code review ACK 248b6a2. Looks good! Thanks for adapting this and considering all the suggestions. I did leave more comments below but non are important and this looks good as-is Tree-SHA512: ef541cd8a1f1dc504613c4eaa708202e32ae5ac86f9c875e03bcdd6357121f6af0860ef83d513c473efa5445b701e59439d416effae1085a559716b0fd45ecd6
We're making progress, #31144 was just merged! 🎉
d5104cf prevector: store `P2WSH`/`P2TR`/`P2PK` scripts inline (Lőrinc)
5212150 test: assert `CScript` allocation characteristics (Lőrinc)
65ac7f6 refactor: modernize `CScriptBase` definition (Lőrinc)
756da2a refactor: extract `STATIC_SIZE` constant to prevector (Lőrinc)

Pull request description:

This change is part of [[IBD] - Tracking PR for speeding up Initial Block Download](#32043)

### Summary

The current `prevector` size of 28 bytes (chosen to fill the `sizeof(CScript)` aligned size) was introduced in 2015 (#6914) before `SegWit` and `TapRoot`. However, the increasingly common `P2WSH` and `P2TR` scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.

The core trade-off of this change is to eliminate heap allocations for common 34-36 byte scripts at the cost of increasing the base memory footprint of all `CScript` objects by 8 bytes (while still respecting peak memory usage defined by `-dbcache`).

### Context

Increasing the `prevector` size allows these scripts to be stored inline, avoiding heap allocations, reducing potential memory fragmentation, and improving performance during cache flushes. Massif analysis confirms a lower stable memory usage after flushing, suggesting the elimination of heap allocations outweighs the larger base size for common workloads. Due to memory alignment, increasing the prevector size to 36 bytes doesn't change the overall `sizeof(CScript)` compared to an increase to 34 bytes, allowing us to include `P2PK` scripts as well at no additional memory cost.
<details> <summary>Massif measurements</summary> > dbcache=440 Massif before, with a heap threshold of `28`: ```bash MB 744.1^# |#: ::::::@: ::::::: :@:: @::::::::::::::@@ |#: ::::::@::::: ::: :@:::@:::::: :: ::::@ |#: ::::::@::::: ::: :@:::@:::::: :: ::::@ |#: ::::::@::::: ::: : :@:::@:::::: :: ::::@ |#: ::::::@::::: ::: : :@:::@:::::: :: ::::@ |#: ::::::@::::: ::: : :@:::@:::::: :: ::::@ |#::::::::@::::: ::: : :@:::@:::::: :: ::::@ |#::::::::@::::: ::: :::@:::@:::::: :: ::::@ |#::::::::@::::: ::: :::@:::@:::::: :: ::::@ |#::::::::@::::: ::: :::@:::@:::::: :: ::::@ |#::::::::@::::: ::: :::@:::@:::::: :: ::::@ |#::::::::@::::: :::::::@:::@:::::: :: ::::@ |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@:::: |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@:::: |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@:::: |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@:::: |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@:::: |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@:::: |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@:::: 0 +----------------------------------------------------------------------->h 0 1.805 ``` and after, with a heap threshold of `36`: ```bash MB 744.2^ : |# : ::::::::::: : : :: ::: @@:::::: :: : |# : :::: :::::: : : :: ::: @ :: :: : : |# : :::: ::::::: : :@:: ::: @ :: :: : ::: |# : :::: ::::::: : :@:: ::: @ :: :: : : : |# : :::: ::::::: : :@:: ::: @ :: :: : : : |# : :::: ::::::: : :@:: ::: @ :: :: : : : |# :: :::: ::::::: : :@:: ::: @ :: :: : : : |# :: :::: ::::::: : :@:: ::::@ :: :: : : : |#:::: :::: ::::::: :::@:: ::::@ :: :: : : : |#: ::::::: ::::::: :::@:: ::::@ :: :: @: : : |#: ::::::: ::::::: :::@:::::::@ :: :: @: : : |#: ::::::: ::::::::::::@:::::::@ :: :: @: : : |#: ::::::: :::::::: :::@:::::::@ :: :: @: : : |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@:: |#: ::::::: :::::::: 
(ASCII Massif heap profile omitted; x-axis ~0-1.618 h)
```

---

> for `dbcache=4500`:

Massif before, with a heap threshold of `28`:

```bash
GB 4.565 peak (ASCII heap profile omitted; x-axis ~0-1.500 h)
```

and after, with a heap threshold of `36`:

```
GB 4.640 peak (ASCII heap profile omitted; x-axis ~0-1.360 h)
```

</details>

### Benchmarks and Memory

Performance benchmarks for `AssumeUTXO` load and flush show:

- Small dbcache (450MB): ~1-3% performance improvement (despite more frequent flushes)
- Large dbcache (4500MB): ~6-8% performance improvement due to fewer heap allocations (and basically the same number of flushes)
- Very large dbcache (4500MB): ~5-6% performance improvement due to fewer heap allocations (and the memory limit not being reached, so there's no memory penalty)

Full IBD and `-reindex-chainstate` also show an overall ~3-4% speedup (both for smaller and larger dbcache values).

We haven't investigated using different `prevector` sizes based on script type, though this could be explored in the future if needed.

### Historical explanation for the speedup (by [Anthony Towns](#32279 (comment)))

> I think the tradeoff is something like:
>
> * spends of p2pk, p2sh, p2pkh coins -- these cost 8 more bytes
> * spends of p2wpkh -- these cost 16 more bytes (sPK and scriptSig didn't need an allocation)
> * spends of p2wsh and p2tr -- these cost ~48 fewer bytes (save 64 byte allocation on 64bit system, lose 8 bytes for both scriptSig and sPK)
> * spends of nested p2wsh -- presumably save ~96 bytes, since the scriptSig would save an allocation, but I'm bundling it in the previous section
>
> Based on mainnet.observer stats for 2025-05-08, p2wpkh is about 55% of txs, p2tr is about 28%, p2pkh about 13%, p2wsh about 4% and the rest is noise, maybe? Those numbers net out to a saving of ~5.5 bytes per input. If p2wpkh rose from 55% to 80% and p2tr dropped to 20%, that would net to wasting ~3.2 bytes per input.
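The byte accounting in the quote can be re-derived mechanically (the script-type shares and per-spend byte deltas are taken verbatim from the comment above; the helper name is ours):

```cpp
#include <cassert>
#include <cmath>

// Net memory delta per input, weighted by script-type share.
// Deltas from the quoted comment: +8 bytes for p2pk/p2sh/p2pkh spends,
// +16 for p2wpkh, -48 for p2wsh/p2tr. Negative result = net saving.
double NetBytesPerInput(double p2wpkh, double p2pkh, double p2tr, double p2wsh)
{
    return p2wpkh * 16.0 + p2pkh * 8.0 + (p2tr + p2wsh) * -48.0;
}
```

With the 2025-05-08 shares this yields roughly -5.5 bytes per input (a saving), and with the hypothetical 80%/20% split roughly +3.2 bytes (a waste), matching the quote.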
ACKs for top commit:
  maflcko: review ACK d5104cf 🐺
  achow101: reACK d5104cf
  jonatack: Review ACK d5104cf
  andrewtoth: ACK d5104cf

Tree-SHA512: 7c5271ebaf4f6d91dc4b679ecbde4b7d0467579f072289f30da988a17c38a552d0b8cdf0e9c001739975dd019894c35e541908571527916cec56e04a8e214ae2
Thanks for reviewing and reproducing #32279 - it's also merged 🎉
Reviews and ACKs for the above 2 remaining ones would be very welcome. |
The remaining ones don't speed up IBD, do they? |
The big ones are merged, thanks for your help. Edit: rebased #31645 which does speed up a critical section of IBD measurably. |
We're calling CBlockHeader::GetHash hundreds of times for the same CBlockHeader object from ActivateBestChain. I chased the calls with this commit and found that all the results were from ProcessNewBlock -> ActivateBestChain.
Thanks @pstratem, I did almost exactly the same, the main problem is indeed the loop in https://github.com/bitcoin/bitcoin/blob/master/src/validation.cpp#L3498, probably bounded by |
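A minimal sketch of the memoization idea discussed here (a hypothetical `CachedHeader` wrapper, not the actual Bitcoin Core API): compute the expensive hash once and reuse it on subsequent calls from hot loops.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>

// Hypothetical wrapper: caches an expensive hash so repeated GetHash()-style
// calls (e.g. from the ActivateBestChain loop) only pay for hashing once.
struct CachedHeader {
    std::string raw;                      // stand-in for the serialized header
    mutable std::optional<size_t> cached; // must be invalidated if `raw` changes
    mutable int hash_calls = 0;           // instrumentation for this sketch

    size_t GetHash() const {
        if (!cached) {
            ++hash_calls;                           // expensive path taken
            cached = std::hash<std::string>{}(raw); // placeholder for SHA256d
        }
        return *cached;
    }
};
```

The real fix needs care around mutation: a cached hash is only valid while the header fields stay unchanged.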
I have compared v29.1rc1 with the latest master branch containing these optimizations:
Both branches still have
The performance will improve further after we adjust it before the release.

Running a `-reindex-chainstate` benchmark (i7, HDD):

COMMITS="565af03c37d8262632543a078b2d8d29459d0b91 dadf15f88cbad37538d85415ae5da12d4f0f1721"; \
STOP=909090; DBCACHE=5000; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
hyperfine \
--sort command \
--runs 2 \
--export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
--parameter-list COMMIT ${COMMITS// /,} \
--prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind && \
./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 10" \
--cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
"COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
Time (mean ± σ): 29308.463 s ± 72.560 s [User: 41477.069 s, System: 1134.925 s]
Range (min … max): 29257.156 s … 29359.771 s 2 runs
Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)
Time (mean ± σ): 26780.313 s ± 240.625 s [User: 38463.008 s, System: 1017.529 s]
Range (min … max): 26610.166 s … 26950.460 s 2 runs
Relative speed comparison
1.09 ± 0.01 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
1.00 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721) Running a full IBD (3 runs per commit) until block 909090 with 5GB memory reveals a 21% speedup (9h:06m vs 7h:22m)! The difference comes mostly from the lack of block writes (and fewer block reads) during reindexes. i9, SSD, full IBDCOMMITS="565af03c37d8262632543a078b2d8d29459d0b91 dadf15f88cbad37538d85415ae5da12d4f0f1721"; \
STOP=909090; DBCACHE=5000; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
hyperfine \
--sort command \
--runs 3 \
--export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
--parameter-list COMMIT ${COMMITS// /,} \
--prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind && \
./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 10" \
--cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
"COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
Time (mean ± σ): 32853.378 s ± 31.353 s [User: 54364.406 s, System: 2245.790 s]
Range (min … max): 32817.179 s … 32871.937 s 3 runs
Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)
Time (mean ± σ): 27046.236 s ± 631.042 s [User: 49185.359 s, System: 2613.586 s]
Range (min … max): 26545.657 s … 27755.087 s 3 runs
Relative speed comparison
1.21 ± 0.03 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
1.00 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721) There are 2 optimizations that we could still include in the next release, reviewers could help with them: |
During the last Core Dev meeting, it was proposed to create a tracking PR aggregating the individual IBD optimizations - to illustrate how these changes contribute to the broader performance improvement efforts.
Summary: 18% full IBD speedup
We don't have much low-hanging fruit anymore, but big speed improvements can still be achieved through many small, focused changes.
Many optimization opportunities are hiding in consensus-critical code - this tracking PR provides justification for why those should also be considered.
The unmerged changes here collectively achieve a ~18% speedup for full IBD (measured by multiple real runs until 886'000 blocks using 5GiB in-memory cache): from 8.59 hours on master to 7.25 hours for the PR.
Anyone can (and is encouraged to) reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5
Related issues:
PRs included here (in review priority order):
- `UndoWriteToDisk` and `WriteBlockToDisk` to reduce serialization calls #31490
- `AutoFile` serialization #31551
- `ReadBlock` #32487
- `P2WSH`/`P2TR`/`P2PK` scripts inline #32279

Changing trends
The UTXO count and the average block size have increased drastically in the past few years, giving a better overall picture of how Bitcoin behaves under real load today.

Profiling IBD, given these circumstances, revealed many new optimization opportunities.
Similar efforts in the past years
There were many efforts to make sure Bitcoin Core remains performant in light of these new trends; a few recent notable examples include:

- `UndoWriteToDisk` and `WriteBlockToDisk` to reduce serialization calls #31490
- refactor: migrate `bool GetCoin` to return `optional<Coin>` #30849
- refactor: prohibit direct flags access in CCoinsCacheEntry and remove invalid tests #30906 - refactors derisking/enabling follow-up optimizations

Reliable macro benchmarks
The measurements here were done on a high-end Intel i9-9900K CPU (8 cores/16 threads, 3.6GHz base, 5.0GHz boost), 64GB RAM, and a RAID configuration with multiple NVMe drives (total ~1.4TB fast storage) - a dedicated Hetzner Auction box running the latest Ubuntu. Sometimes a lower-end i7 with an HDD was used for comparison.
To make sure the setup reflected a real user's experience, we ran multiple full IBDs per commit (connecting to real nodes) until block 886'000 with a 5GiB in-memory cache, using hyperfine to measure the final time (assuming normal distribution and stabilizing the final result via statistical methods). This produced reliable results even when individual measurements varied; when hyperfine indicated that the measurements were all over the place, we reran the whole benchmark.
To reduce the instability of headers synchronization and peer acquisition, we first started `bitcoind` until block 1, followed by the actual benchmarks until block 886'000.

The top 2 PRs (#31551 and #31144) were measured together by multiple people with different settings (and varying results):
Also note that there is a separate effort to add a reliable macro-benchmarking suite to track the performance of the most critical use cases end-to-end (including IBD, compact blocks, UTXO iteration) - still WIP, not yet used here.
Current changes (in order of importance, reviews and reproducers are welcome):
Plotting the performance of the blocks from the produced `debug.log` files (taken from the last run for each commit - can differ slightly from the normalized average shown below), visualizing the effect of each commit:

debug.log visualizer
Baseline
Base commit was 88debb3.
8.59 hour IBD time
`AutoFile` serialization #31551

7.91 hour IBD time
We can serialize the blocks and undos to any `Stream` which implements the appropriate read/write methods. `AutoFile` is one of these, writing the results "directly" to disk (through the OS file cache). Batching these in memory first and reading/writing them to disk in a single operation is measurably faster (likely because of fewer native `fread` calls or less locking, as observed by @martinus in a similar change).

Differential flame graphs indicate that the before/after speed change is because of fewer `AutoFile` reads and writes.

7.60 hour IBD time
Current block obfuscations are done byte-by-byte; this PR batches them into 64-bit primitives to speed up obfuscating bigger memory batches.
This is especially relevant after #31551 where we end up with bigger obfuscatable chunks.
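The byte-vs-word batching can be sketched like this (a simplified stand-in for the actual obfuscation code; `std::memcpy` keeps the 64-bit loads/stores well-defined):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// XOR-obfuscate `data` with a repeating 8-byte key, one byte at a time.
void XorBytes(std::vector<unsigned char>& data, const unsigned char key[8])
{
    for (size_t i = 0; i < data.size(); ++i) data[i] ^= key[i % 8];
}

// Same result, but processed in 64-bit chunks, with a byte-wise tail.
void XorWords(std::vector<unsigned char>& data, const unsigned char key[8])
{
    uint64_t k;
    std::memcpy(&k, key, 8);
    size_t i = 0;
    for (; i + 8 <= data.size(); i += 8) {
        uint64_t chunk;
        std::memcpy(&chunk, data.data() + i, 8);
        chunk ^= k;                               // one XOR covers 8 bytes
        std::memcpy(data.data() + i, &chunk, 8);
    }
    for (; i < data.size(); ++i) data[i] ^= key[i % 8]; // leftover bytes
}
```

On 64-bit hardware the word loop touches memory 8 bytes at a time, which is where the speedup on large obfuscatable chunks comes from.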
7.50 hour IBD time
When the in-memory UTXO set is flushed to LevelDB (after IBD or AssumeUTXO load), it does so in batches to manage memory usage during the flush.
While a hidden -dbbatchsize config option exists to modify this value, this PR introduces dynamic calculation of the batch size based on the -dbcache setting. By using larger batches when more memory is available (i.e., higher -dbcache), we can reduce the overhead from numerous small writes, minimize constant overhead per batch, improve I/O efficiency (especially on HDDs), and potentially allow LevelDB to optimize writes more effectively (e.g. by sorting the keys before write).
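The dynamic sizing idea can be sketched as follows (the 1% ratio and the 256 MiB cap are illustrative assumptions of this sketch, not the PR's actual formula; 16 MiB is the existing `-dbbatchsize` default):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sizing rule: use ~1% of -dbcache per LevelDB write batch,
// but never less than the 16 MiB default and never more than 256 MiB.
size_t CalcBatchSize(size_t dbcache_bytes)
{
    const size_t kMin = size_t{16} << 20;  // 16 MiB, the fixed default
    const size_t kMax = size_t{256} << 20; // arbitrary upper bound for the sketch
    return std::clamp(dbcache_bytes / 100, kMin, kMax);
}
```

With a 450 MB cache this stays at the 16 MiB floor, while a 5 GB cache gets ~50 MiB batches - fewer, larger writes during the flush.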
Note that this PR mainly optimizes a critical section of IBD (memory to disk dump) - even if the effect on overall speed is modest:

7.41 hour IBD time
The commits merge similar (de)serialization methods and separate them internally with `if constexpr` - similarly to how it was done in #28203. This enabled further `SizeComputer` optimizations as well.
Other than these, since single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector), it makes sense to avoid unnecessary generalized serialization infrastructure.
7.31 hour IBD time
`CheckBlock`'s latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.

This PR improves performance and maintainability by introducing the following changes:

- `std::set` replaced with sorted `std::vector` for improved locality.

7.25 hour IBD time
Hashing a
uint256
key is done so often that a specialized optimization was extracted to SipHashUint256Extra. The constant salting operations were already extracted in the general case, this PR adjusts the main specialization similarly.P2WSH
/P2TR
/P2PK
scripts inline #32279The current
prevector
size of 28 bytes (chosen to fill thesizeof(CScript)
aligned size) was introduced in 2015 (#6914) before SegWit and TapRoot.However, the increasingly common P2WSH and P2TR scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.
The core trade-off of this change is to eliminate heap allocations for common 29-36 byte scripts at the cost of increasing the base memory footprint of all
CScript
objects by 8 bytes (while still respecting peak memory usage defined by -dbcache).Other similar efforts waiting for reviews or revives (not included in this tracking PR):
This PR is meant to stay in draft (not meant to be merged directly), to continually change based on comments received here and in the PRs.
Comments, reproducers and high-level discussions are welcome here - code reviews should rather be done in the individual PRs.