
Conversation

l0rinc
Contributor

@l0rinc l0rinc commented Mar 12, 2025

During the last Core Dev meeting, it was proposed to create a tracking PR aggregating the individual IBD optimizations - to illustrate how these changes contribute to the broader performance improvement efforts.

Summary: 18% full IBD speedup

We don't have much low-hanging fruit anymore, but big speed improvements can still be achieved through many small, focused changes.
Many optimization opportunities hide in consensus-critical code - this tracking PR provides justification for why those should also be considered.
The unmerged changes here collectively achieve a ~18% speedup for full IBD (measured over multiple real runs until block 886'000 with a 5GiB in-memory cache): from 8.59 hours on master to 7.25 hours for this PR.

Anyone can (and is encouraged to) reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5

Related issues:

PRs included here (in review priority order):

Changing trends

The UTXO count and average block size have increased drastically in the past few years, giving a better overall picture of how Bitcoin behaves under real load.
[image: UTXO count and average block size over time]
Profiling IBD, given these circumstances, revealed many new optimization opportunities.

Similar efforts in the past years

There were many efforts to make sure Bitcoin Core remains performant in light of these new trends; a few recent notable examples include:

Reliable macro benchmarks

The measurements here were done on a high-end Intel i9-9900K CPU (8 cores/16 threads, 3.6GHz base, 5.0GHz boost) with 64GB RAM and multiple NVMe drives in RAID (~1.4TB of fast storage) - a dedicated Hetzner auction box running the latest Ubuntu. Sometimes a lower-end i7 with an HDD was used for comparison.

To make sure the setup reflected a real user's experience, we ran multiple full IBDs per commit (connecting to real nodes) until block 886'000 with a 5GiB in-memory cache, using hyperfine to measure the final time (assuming a normal distribution and stabilizing the result via statistical methods). This produced reliable results even when individual measurements varied; when hyperfine indicated that the measurements were all over the place, we reran the whole benchmark.
To reduce the instability of headers synchronization and peer acquisition, we first ran bitcoind until block 1, followed by the actual benchmark until block 886'000.

The top 2 PRs (#31551 and #31144) were measured together by multiple people with different settings (and varying results):

Also note that there is a separate effort to add a reliable macro-benchmarking suite to track the performance of the most critical use cases end-to-end (including IBD, compact blocks, and UTXO iteration) - still WIP and not yet used here.

Current changes (in order of importance, reviews and reproducers are welcome):

Plotting block height against elapsed time from the produced debug.log files (taken from the last run for each commit - these can differ slightly from the normalized average shown below) visualizes the effect of each commit:

debug.log visualizer
import os
import sys
import re
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np


def process_log_files_and_plot(log_dir, output_file="block_height_progress.png"):
    """Parse debug-* logs in log_dir and plot per-commit IBD progress."""
    if not os.path.exists(log_dir) or not os.path.isdir(log_dir):
        print(f"Error: '{log_dir}' is not a valid directory", file=sys.stderr)
        return

    debug_files = [f for f in os.listdir(log_dir) if
                   f.startswith('debug-') and os.path.isfile(os.path.join(log_dir, f))]
    if not debug_files:
        print(f"Warning: No debug files found in '{log_dir}'", file=sys.stderr)
        return

    height_pattern = re.compile(r'UpdateTip:.*height=(\d+)')
    results = {}

    for filename in debug_files:
        filepath = os.path.join(log_dir, filename)
        print(f"Processing {filename}...", file=sys.stderr)

        update_tips = []
        first_timestamp = None
        line_count = tip_count = 0
        found_shutdown_done = False

        try:
            with open(filepath, 'r', errors='ignore') as file:
                for line_number, line in enumerate(file, 1):
                    line_count += 1
                    if line_count % 100000 == 0:
                        print(f"  Processed {line_count} lines, found {tip_count} UpdateTips...", file=sys.stderr)

                    # Ignore everything before the warmup run's "Shutdown: done"
                    # (the benchmark first syncs to block 1, then restarts).
                    if not found_shutdown_done:
                        if "Shutdown: done" in line:
                            found_shutdown_done = True
                            print(f"  Found 'Shutdown: done' at line {line_number}, starting to record",
                                  file=sys.stderr)
                        continue

                    if len(line) < 20 or "UpdateTip:" not in line:
                        continue

                    try:
                        timestamp = datetime.strptime(line[:20], "%Y-%m-%dT%H:%M:%SZ")
                        height_match = height_pattern.search(line)
                        if not height_match:
                            continue

                        height = int(height_match.group(1))
                        if first_timestamp is None:
                            first_timestamp = timestamp

                        update_tips.append((int((timestamp - first_timestamp).total_seconds()), height))
                        tip_count += 1
                    except ValueError:
                        continue
        except Exception as e:
            print(f"Error processing {filename}: {e}", file=sys.stderr)
            continue

        print(f"Finished processing {filename}: {line_count} lines, {tip_count} UpdateTips", file=sys.stderr)

        if update_tips:
            time_dict = {}
            for time, height in update_tips:
                time_dict[time] = height
            results[filename[6:14]] = sorted(time_dict.items())

    if not results:
        print("No valid data found in any files.", file=sys.stderr)
        return

    print(f"Creating plots with data from {len(results)} files", file=sys.stderr)

    sorted_results = []
    for name, pairs in results.items():
        if pairs:
            sorted_results.append((name, pairs[-1][0] / 3600, pairs))

    sorted_results.sort(key=lambda x: x[1], reverse=True)
    colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_results)))

    # Plot 1: Height vs Time
    plt.figure(figsize=(12, 8))

    final_points = []
    for idx, (name, last_time, pairs) in enumerate(sorted_results):
        times = [t / 3600 for t, _ in pairs]
        heights = [h for _, h in pairs]
        plt.plot(heights, times, label=f"{name} ({last_time:.2f}h)", color=colors[idx], linewidth=1)
        if pairs:
            final_points.append((last_time, pairs[-1][1], colors[idx]))

    for time, height, color in final_points:
        plt.axhline(y=time, color=color, linestyle='--', alpha=0.3)
        plt.axvline(x=height, color=color, linestyle='--', alpha=0.3)

    plt.title('Sync Time by Block Height')
    plt.xlabel('Block Height')
    plt.ylabel('Elapsed Time (hours)')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend(loc='center left')
    plt.tight_layout()

    plt.savefig(output_file.replace('.png', '_reversed.png'), dpi=300)

    # Plot 2: Performance Ratio by Time
    if len(sorted_results) > 1:
        plt.figure(figsize=(12, 8))

        baseline = sorted_results[0]
        baseline_time_by_height = {h: t for t, h in baseline[2]}

        for idx, (name, _, pairs) in enumerate(sorted_results[1:], 1):
            time_by_height = {h: t for t, h in pairs}

            common_heights = [h for h in baseline_time_by_height.keys()
                              if h >= 400000 and h in time_by_height]
            common_heights.sort()

            ratios = []
            base_times = []

            for h in common_heights:
                base_t = baseline_time_by_height[h]
                result_t = time_by_height[h]

                if result_t > 0:
                    ratios.append(base_t / result_t)
                    base_times.append(base_t / 3600)

            plt.plot(base_times, ratios,
                     label=f"{name} vs {baseline[0]}",
                     color=colors[idx], linewidth=1)

        plt.axhline(y=1, color='gray', linestyle='--', alpha=0.7)

        plt.title('Performance Improvement Over Time (Higher is Better)')
        plt.xlabel('Baseline Elapsed Time (hours)')
        plt.ylabel('Speedup Ratio (baseline_time / commit_time)')
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.legend(loc='best')
        plt.tight_layout()

        plt.savefig(output_file.replace('.png', '_time_ratio.png'), dpi=300)

    with open(output_file.replace('.png', '.csv'), 'w') as f:
        for name, _, pairs in sorted_results:
            f.write(f"{name},{','.join(f'{t}:{h}' for t, h in pairs)}\n")

    plt.show()


if __name__ == "__main__":
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    output_file = sys.argv[2] if len(sys.argv) > 2 else "block_height_progress.png"
    process_log_files_and_plot(log_dir, output_file)
[images: per-commit sync-time and speedup-ratio plots]

Baseline

Base commit was 88debb3.

8.59 hour IBD time
COMPILER=gcc COMMIT=88debb3e4297ef4ebc8966ffe599359bc7b231d0 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     30932.610 s ± 156.891 s    [User: 58248.505 s, System: 2142.974 s]
  Range (min … max):   30821.671 s … 31043.549 s    2 runs

7.91 hour IBD time
COMPILER=gcc COMMIT=6a8ce46e32dae2ffef2a73d2314ca33a2039186e ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     28501.588 s ± 119.886 s    [User: 56419.060 s, System: 1833.126 s]
  Range (min … max):   28416.815 s … 28586.361 s    2 runs

We can serialize the blocks and undos to any Stream which implements the appropriate read/write methods.
AutoFile is one of these, writing the results "directly" to disk (through the OS file cache). Batching these in memory first and reading/writing the batches to/from disk in one go is measurably faster (likely because of fewer native fread calls or less locking, as observed by @martinus in a similar change).
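A minimal Python sketch of the batching idea (illustrative only - the actual change batches block/undo (de)serialization around `AutoFile` in C++):

```python
import io

def write_unbatched(f, chunks):
    # One native write call per small chunk - more syscalls and locking.
    for chunk in chunks:
        f.write(chunk)

def write_batched(f, chunks):
    # Serialize into an in-memory buffer first, then write once.
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    f.write(buf.getvalue())
```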

Differential flame graphs indicate that the before/after speed change is because of fewer AutoFile reads and writes:
[flame graph: writes]
[flame graph: reads]


7.60 hour IBD time
COMPILER=gcc COMMIT=c5cc54d10187c9cb3a6cba8cc10f652b4f882e2a ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27394.210 s ± 565.877 s    [User: 54902.315 s, System: 1891.951 s]
  Range (min … max):   26994.075 s … 27794.346 s    2 runs

Block obfuscation is currently done byte-by-byte; this PR batches it into 64-bit primitives to speed up obfuscating bigger memory chunks.
This is especially relevant after #31551, where we end up with bigger obfuscatable chunks.
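A Python sketch of the difference (the real implementation works on `uint64_t` words in C++; an 8-byte key is assumed):

```python
def xor_bytewise(data: bytearray, key: bytes) -> None:
    # Current approach: one XOR per byte.
    for i in range(len(data)):
        data[i] ^= key[i % len(key)]

def xor_batched(data: bytearray, key: bytes) -> None:
    # Batched approach: XOR eight bytes at a time as a 64-bit word,
    # then finish the unaligned tail byte-by-byte.
    k64 = int.from_bytes(key, "little")
    n = len(data) - len(data) % 8
    for i in range(0, n, 8):
        word = int.from_bytes(data[i:i + 8], "little") ^ k64
        data[i:i + 8] = word.to_bytes(8, "little")
    for i in range(n, len(data)):
        data[i] ^= key[i % 8]
```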

[image: obfuscation calls during IBD without batching]


7.50 hour IBD time
COMPILER=gcc COMMIT=9b4be912d20222b3b275ef056c1494a15ccde3f5 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27019.086 s ± 112.340 s    [User: 54927.344 s, System: 1652.376 s]
  Range (min … max):   26939.649 s … 27098.522 s    2 runs

When the in-memory UTXO set is flushed to LevelDB (after IBD or AssumeUTXO load), it does so in batches to manage memory usage during the flush.
While a hidden -dbbatchsize config option exists to modify this value, this PR introduces dynamic calculation of the batch size based on the -dbcache setting. By using larger batches when more memory is available (i.e., higher -dbcache), we can reduce the overhead from numerous small writes, minimize constant overhead per batch, improve I/O efficiency (especially on HDDs), and potentially allow LevelDB to optimize writes more effectively (e.g. by sorting the keys before write).
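A hypothetical sketch of such a sizing rule (the floor and fraction here are made up for illustration - the actual derivation is in the PR):

```python
def dynamic_batch_size(dbcache_bytes: int,
                       floor: int = 1 << 24,      # 16 MiB minimum (assumed)
                       fraction: float = 0.01) -> int:
    # Scale the LevelDB write-batch with the configured cache instead of
    # using one fixed -dbbatchsize constant: larger cache, larger batches.
    return max(floor, int(dbcache_bytes * fraction))
```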

[image]

Note that this PR mainly optimizes a critical section of IBD (the memory-to-disk dump) - even if the effect on overall speed is modest:
[image]

[images]
7.41 hour IBD time
COMPILER=gcc COMMIT=817d7ac0767a3984295aa3cf6c961dcc5f29d571 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26711.460 s ± 244.118 s    [User: 54654.348 s, System: 1652.087 s]
  Range (min … max):   26538.843 s … 26884.077 s    2 runs

The commits merge similar (de)serialization methods and separate them internally with if constexpr - similarly to what was done in #28203. This enabled further SizeComputer optimizations as well.

Other than these, since single-byte writes are used very often (for every (u)int8_t, std::byte, or bool, and for every VarInt's first byte - which is also needed for every (pre)vector), it makes sense to avoid the unnecessary generalized serialization infrastructure there.
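A Python analogue of the shared-path idea (the C++ code dispatches with `if constexpr` at compile time; this duck-typed version only illustrates how one serializer can feed either a real stream or a size counter):

```python
import io

class SizeComputer:
    """Stream stand-in that only accumulates the would-be output length."""
    def __init__(self):
        self.size = 0
    def write(self, data: bytes) -> None:
        self.size += len(data)

def serialize_u32(stream, value: int) -> None:
    # One shared serialization path: a real stream receives the bytes,
    # a SizeComputer merely counts them.
    stream.write(value.to_bytes(4, "little"))

counter, buf = SizeComputer(), io.BytesIO()
serialize_u32(counter, 886_000)
serialize_u32(buf, 886_000)
assert counter.size == len(buf.getvalue()) == 4
```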


7.31 hour IBD time
COMPILER=gcc COMMIT=182745cec4c0baf2f3c8cff2f74f847eac3c4330 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26326.867 s ± 45.887 s    [User: 54367.156 s, System: 1619.348 s]
  Range (min … max):   26294.420 s … 26359.314 s    2 runs

CheckBlock's latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.

This PR improves performance and maintainability by introducing the following changes:

  • Simplified checks for the most common cases (1 or 2 inputs - 70-90% of transactions have a single input).
  • Optimized the general case by replacing std::set with a sorted std::vector for improved locality (see the sketch below).
  • Simplified Null prevout checks from linear to constant time.
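A Python sketch of the duplicate-input strategy described above (the real code operates on `COutPoint` values in C++):

```python
def has_duplicate_inputs(prevouts: list) -> bool:
    # Fast paths for the most common shapes: 1 input (~70-90% of
    # transactions) and 2 inputs.
    if len(prevouts) <= 1:
        return False
    if len(prevouts) == 2:
        return prevouts[0] == prevouts[1]
    # General case: sort a contiguous vector and scan neighbors -
    # better locality and fewer allocations than inserting into a set.
    ordered = sorted(prevouts)
    return any(a == b for a, b in zip(ordered, ordered[1:]))
```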

7.25 hour IBD time
COMPILER=gcc COMMIT=47d377bd0bb88dae6b34553a7789400170e0ccf6 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26084.429 s ± 473.611 s    [User: 54310.780 s, System: 1815.967 s]
  Range (min … max):   25749.536 s … 26419.323 s    2 runs

The in-memory representation of the UTXO set uses (salted) SipHash to avoid key-collision attacks.

Hashing a uint256 key is done so often that a specialized optimization was extracted to SipHashUint256Extra. The constant salting operations were already extracted in the general case; this PR adjusts the main specialization similarly.
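Not SipHash itself, but a Python analogy for hoisting the constant salting work out of the per-key path (here via `hashlib` midstate copies):

```python
import hashlib

SALT = bytes(16)  # stands in for the node's random SipHash salt

def hash_key_naive(key: bytes) -> bytes:
    # Re-absorbs the salt on every call.
    return hashlib.blake2b(SALT + key).digest()

_SALTED_STATE = hashlib.blake2b(SALT)  # salt absorbed once, up front

def hash_key_precomputed(key: bytes) -> bytes:
    h = _SALTED_STATE.copy()  # clone the precomputed midstate
    h.update(key)
    return h.digest()
```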


The current prevector size of 28 bytes (chosen to fill the sizeof(CScript) aligned size) was introduced in 2015 (#6914) before SegWit and TapRoot.
However, the increasingly common P2WSH and P2TR scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.

The core trade-off of this change is to eliminate heap allocations for common 29-36 byte scripts at the cost of increasing the base memory footprint of all CScript objects by 8 bytes (while still respecting peak memory usage defined by -dbcache).
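A toy Python model of the direct-vs-indirect storage trade-off (prevector's small-buffer optimization; capacities per the description above):

```python
class ToyPrevector:
    """Inline storage up to `capacity` bytes, heap beyond it."""
    def __init__(self, capacity: int = 36):  # was 28 before this change
        self.capacity = capacity
        self._inline = bytearray(capacity)   # always-present base footprint
        self._heap = None                    # allocated only when outgrown
        self._size = 0

    def assign(self, script: bytes) -> None:
        if len(script) <= self.capacity:     # 34-byte P2WSH/P2TR now fit
            self._inline[:len(script)] = script
            self._heap = None
        else:
            self._heap = bytearray(script)   # indirect: heap allocation
        self._size = len(script)
```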

[image]


Other similar efforts waiting for reviews or revives (not included in this tracking PR):


This PR is meant to stay in draft (it is not meant to be merged directly) and will change continually based on comments received here and in the individual PRs.
Comments, reproducers and high-level discussions are welcome here - code reviews should go to the individual PRs.

@DrahtBot
Contributor

DrahtBot commented Mar 12, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32043.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK jonatack

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32521 (policy: make pathological transactions packed with legacy sigops non-standard by darosior)
  • #32457 (bench: replace benchmark block with more representative one (413567 → 784588) by l0rinc)
  • #32296 (refactor: reenable implicit-integer-sign-change check for serialize.h by l0rinc)
  • #32279 ([IBD] prevector: store P2WSH/P2TR/P2PK scripts inline by l0rinc)
  • #31868 ([IBD] specialize block serialization by l0rinc)
  • #31860 (init: Take lock on blocks directory in BlockManager ctor by TheCharlatan)
  • #31682 ([IBD] specialize CheckBlock's input & coinbase checks by l0rinc)
  • #31144 ([IBD] multi-byte block obfuscation by l0rinc)
  • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
  • #29307 (util: explicitly close all AutoFiles that have been written by vasild)
  • #28531 (improve MallocUsage() accuracy by LarryRuane)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot DrahtBot changed the title [IBD] - Tracking PR for speeding up Initial Block Download [IBD] - Tracking PR for speeding up Initial Block Download Mar 12, 2025
@l0rinc l0rinc changed the title [IBD] - Tracking PR for speeding up Initial Block Download [IBD] Tracking PR for speeding up Initial Block Download Mar 12, 2025
@ryanofsky
Contributor

ryanofsky commented Mar 12, 2025

Thanks for creating this. This should make it easier to navigate the other PRs and discuss the overall topic of IBD performance and benchmarking without needing to necessarily repeat it in the individual PRs.

Would be useful to have concept ACKs/NACKs here from others who know more about performance and benchmarking. But from what I can tell the individual optimizations do not seem very complicated and seem like they should be justified.


One suggestion for the PR description above would be to directly link to the PRs comprising this change in the summary, maybe pointing out any where review should be focused. Current list seems to be:

@DrahtBot
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/38650495272

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@ajtowns
Contributor

ajtowns commented Mar 13, 2025

Plotting the performance of the blocks from the produced debug.log files shows (from the last run, can differ slightly from the normalized average shown below) the effect of each commit:

Wouldn't these plots be easier to read with block height on the x-axis and time on the y-axis, giving a consistent domain (each case goes from height 0 to 880k or so) with a simple "lower is better" comparison? (rather than "further left is better")

Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful? (Perhaps limited to height >= 400000)

@l0rinc
Contributor Author

l0rinc commented Mar 13, 2025

with a simple "lower is better" comparison

I like the idea, updated the description and the code:
[image]

Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful?

Something like this? Conceptually it seems useful, but here I don't know how to interpret it - seems too far zoomed in:
[image]
Looks even funnier without the >400k cap:
[image]

@ajtowns
Contributor

ajtowns commented Mar 13, 2025

Something like this? Conceptually seems useful, but here I don't know how to interpret it, seems too far zoomed in

Fair; that might work better with a time on the x-axis rather than height, something like:

for time_commit in [time_commit_X, time_commit_Y, time_commit_Z]:
    for height in range(1,850000):
        y = time_baseline[height] / time_commit[height] # keep this one the same
        x = time_baseline[height]  # changed from x = height
        add_datapoint(x,y)

@l0rinc
Contributor Author

l0rinc commented Mar 13, 2025

time on the x-axis rather than height

If we do that we don't even need to filter out the first 400k blocks since they're so insignificant.
Edit: you're right, it looks better to filter those out - I've updated the description with the images and code.

@jonatack
Member

jonatack commented Mar 13, 2025

Concept ACK, thank you for opening this.

I am currently in an environment with slow internet where, despite having a relatively fast laptop, IBD is more than 2 orders of magnitude slower than the times in the OP.

Opened #32051 today to address an issue I'm also seeing of very frequent disconnections+reconnections of trusted addnode peers during IBD.

l0rinc and others added 11 commits April 17, 2025 11:30
Endianness doesn't affect the final size, so we can skip the byte swapping for `SizeComputer`.
We can route previous calls into the existing method with `if constexpr`, short-circuiting the existing logic when we only need the serialized sizes.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          191,652.29 |            5,217.78 |    0.4% |     10.96 | `SerializeBlock`
|           10,323.55 |           96,865.92 |    0.2% |     11.01 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          614,847.32 |            1,626.42 |    0.0% |    8,015,883.64 |    2,207,628.07 |  3.631 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           26,020.31 |           38,431.52 |    0.0% |      159,390.03 |       93,438.33 |  1.706 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector).
It makes sense to avoid the generalized serialization infrastructure that isn't needed:
* `AutoFile` write doesn't need to allocate a 4k buffer for a single byte anymore;
* `VectorWriter` and `DataStream` avoids memcpy/insert calls.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          174,569.19 |            5,728.39 |    0.6% |     10.89 | `SerializeBlock`
|           10,241.16 |           97,645.21 |    0.0% |     11.00 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          615,000.56 |            1,626.01 |    0.0% |    8,015,883.64 |    2,208,340.88 |  3.630 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           25,676.76 |           38,945.72 |    0.0% |      159,390.03 |       92,202.10 |  1.729 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
The `CheckTransaction` validation function in https://github.com/bitcoin/bitcoin/blob/master/src/consensus/tx_check.cpp#L41-L45 relies on a correct ordering relation for detecting duplicate transaction inputs.

This update to the tests ensures that:
* Accurate detection of duplicates: Beyond trivial cases (e.g., two identical inputs), duplicates are detected correctly in more complex scenarios.
* Consistency across methods: Both sorted sets and hash-based sets behave identically when detecting duplicates for `COutPoint` and related values.
* Robust ordering and equality relations: The function maintains expected behavior for ordering and equality checks.

Using randomized testing with shuffled inputs (to avoid any remaining bias introduced), the enhanced test validates that `CheckTransaction` remains robust and reliable across various input configurations. It confirms identical behavior to a hashing-based duplicate detection mechanism, ensuring consistency and correctness.

To make sure the new branches in the follow-up commits will be covered, `basic_transaction_tests` was extended with a randomized test comparing against the old implementation (and also an alternative duplicate-detection method). The iterations and ranges were chosen such that every new branch is expected to be hit at least once.
> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          372,743.63 |            2,682.81 |    1.1% |     10.99 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        3,304,694.54 |              302.60 |    0.5% |     11.05 | `DuplicateInputs`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        1,096,261.84 |              912.19 |    0.1% |    7,963,390.88 |    3,487,375.26 |  2.283 |   1,266,941.00 |    1.8% |     11.03 | `CheckBlockBench`

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        8,366,309.48 |              119.53 |    0.0% |   23,865,177.67 |   26,620,160.23 |  0.897 |   5,972,887.41 |    4.0% |     10.78 | `DuplicateInputs`
The newly introduced `ProcessTransactionBench` incorporates multiple steps in the validation pipeline, offering a more comprehensive view of `CheckBlock` performance within a realistic transaction validation context.

Previous microbenchmarks, such as DeserializeAndCheckBlockTest and DuplicateInputs, focused on isolated aspects of transaction and block validation. While these tests provided valuable insights for targeted profiling, they lacked context regarding the broader validation process, where interactions between components play a critical role.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            9,585.10 |          104,328.55 |    0.1% |     11.03 | `ProcessTransactionBench`

> C++ compiler .......................... GNU 13.3.0

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           56,199.57 |           17,793.73 |    0.1% |      229,263.01 |      178,766.31 |  1.282 |      15,509.97 |    0.5% |     10.91 | `ProcessTransactionBench`
`IsCoinBase` means a single input with a NULL prevout, so it makes sense to restrict the duplicate check to non-coinbase transactions only.
The behavior is the same as before, except that single-input transactions aren't checked for duplicates anymore (~70-90% of the cases, see https://transactionfee.info/charts/transactions-1in).
I've added braces to the conditions and loops to simplify review of followup commits.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          335,917.12 |            2,976.92 |    1.3% |     11.01 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        3,286,337.42 |              304.29 |    1.1% |     10.90 | `DuplicateInputs`
|            9,561.02 |          104,591.35 |    0.2% |     11.02 | `ProcessTransactionBench`
No need to create a set for checking duplicates for two-input-transactions.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          314,137.30 |            3,183.32 |    1.2% |     11.04 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        3,220,592.73 |              310.50 |    1.3% |     10.92 | `DuplicateInputs`
|            9,425.98 |          106,089.77 |    0.3% |     11.00 | `ProcessTransactionBench`
A pre-sized vector retains locality (enabling SIMD operations), speeding up sorting and equality checks.
It's also simpler (therefore more reliable) than a sorted set. It also causes less memory fragmentation.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          181,922.54 |            5,496.85 |    0.2% |     10.98 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          997,739.30 |            1,002.27 |    1.0% |     10.94 | `DuplicateInputs`
|            9,449.28 |          105,828.15 |    0.3% |     10.99 | `ProcessTransactionBench`

Co-authored-by: Pieter Wuille <pieter@wuille.net>
For the 2 input case we simply check them both, like we did with equality.

For the general case, we take advantage of sorting, making invalid value detection constant time instead of linear in the worst case.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          179,971.00 |            5,556.45 |    0.3% |     11.02 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          963,177.98 |            1,038.23 |    1.7% |     10.92 | `DuplicateInputs`
|            9,410.90 |          106,259.75 |    0.3% |     11.01 | `ProcessTransactionBench`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          834,855.94 |            1,197.81 |    0.0% |    6,518,548.86 |    2,656,039.78 |  2.454 |     919,160.84 |    1.5% |     10.78 | `CheckBlockBench`

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        4,261,492.75 |              234.66 |    0.0% |   17,379,823.40 |   13,559,793.33 |  1.282 |   4,265,714.28 |    3.4% |     11.00 | `DuplicateInputs`
|           55,819.53 |           17,914.88 |    0.1% |      227,828.15 |      177,520.09 |  1.283 |      15,184.36 |    0.4% |     10.91 | `ProcessTransactionBench`
Verifies that script types are correctly allocated using prevector's direct (stack) or indirect (heap) storage based on their size:

Direct (stack) allocated script types (size ≤ 28 bytes):
* OP_RETURN (small)
* P2WPKH
* P2SH
* P2PKH

Indirect (heap) allocated script types (size > 28 bytes):
* P2WSH
* P2TR
* P2PK
* MULTISIG (small)

This test provides a baseline for verifying changes to prevector's inline capacity.
The current `prevector` size of 28 bytes (chosen to fill the `sizeof(CScript)` aligned size) was introduced in 2015 (bitcoin#6914) before SegWit and TapRoot.
However, the increasingly common `P2WSH` and `P2TR` scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.

The core trade-off of this change is to eliminate heap allocations for common 34-36 byte scripts at the cost of increasing the base memory footprint of all `CScript` objects by 8 bytes (while still respecting peak memory usage defined by `-dbcache`).

Increasing the `prevector` size allows these scripts to be stored on the stack, avoiding heap allocations, reducing potential memory fragmentation, and improving performance during cache flushes. Massif analysis confirms a lower stable memory usage after flushing, suggesting the elimination of heap allocations outweighs the larger base size for common workloads.

Due to memory alignment, increasing the `prevector` size to 36 bytes doesn't change the overall `sizeof(CScript)` compared to an increase to 34 bytes, allowing us to include `P2PK` scripts as well at no additional memory cost.

Performance benchmarks for AssumeUTXO load and flush show:
- Small dbcache (450MB): ~1% performance penalty due to more frequent flushes
- Large dbcache (4500MB+): ~6-7% performance improvement due to fewer heap allocations

Full IBD and reindex-chainstate with larger `dbcache` values also show an overall ~3% speedup.

Co-authored-by: Ava Chow <github@achow101.com>
Co-authored-by: Andrew Toth <andrewstoth@gmail.com>
@DrahtBot
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

achow101 added a commit that referenced this pull request Jul 19, 2025
248b6a2 optimization: peel align-head and unroll body to 64 bytes (Lőrinc)
e7114fc optimization: migrate fixed-size obfuscation from `std::vector<std::byte>` to `uint64_t` (Lőrinc)
478d40a refactor: encapsulate `vector`/`array` keys into `Obfuscation` (Lőrinc)
377aab8 refactor: move `util::Xor` to `Obfuscation().Xor` (Lőrinc)
fa5d296 refactor: prepare mempool_persist for obfuscation key change (Lőrinc)
6bbf2d9 refactor: prepare `DBWrapper` for obfuscation key change (Lőrinc)
0b8bec8 scripted-diff: unify xor-vs-obfuscation nomenclature (Lőrinc)
9726979 bench: make ObfuscationBench more representative (Lőrinc)
618a30e test: compare util::Xor with randomized inputs against simple impl (Lőrinc)
a5141cd test: make sure dbwrapper obfuscation key is never obfuscated (Lőrinc)
54ab0bd refactor: commit to 8 byte obfuscation keys (Lőrinc)
7aa557a random: add fixed-size `std::array` generation (Lőrinc)

Pull request description:

  This change is part of [[IBD] - Tracking PR for speeding up Initial Block Download](#32043)

  ### Summary

  Current block obfuscations are done byte-by-byte, this PR batches them to 64 bit primitives to speed up obfuscating bigger memory batches.
  This is especially relevant now that #31551 was merged, which leaves us with bigger obfuscatable chunks.

  Since this obfuscation is optional, the speedup measured here depends on whether it's a [random value](#31144 (comment)) or [completely turned off](#31144 (comment)) (i.e. XOR-ing with 0).

  ### Changes in testing, benchmarking and implementation

  * Added new tests comparing randomized inputs against a trivial implementation and performing roundtrip checks with random chunks.
  * Migrated `std::vector<std::byte>(8)` keys to plain `uint64_t`;
  * Process unaligned bytes separately and unroll body to 64 bytes.

  ### Assembly

  Memory alignment is enforced by a small peel-loop (`std::memcpy` is optimized out on tested platform), with an `std::assume_aligned<8>` check, see the Godbolt listing at https://godbolt.org/z/59EMv7h6Y for details

  <details>
  <summary>Details</summary>

  Target & Compiler | Stride (per hot-loop iter) | Main operation(s) in loop | Effective XORs / iter
  -- | -- | -- | --
  Clang x86-64 (trunk) | 64 bytes | 4 × movdqu → pxor → store | 8 × 64-bit
  GCC x86-64 (trunk) | 64 bytes | 4 × movdqu/pxor sequence, enabled by 8-way unroll | 8 × 64-bit
  GCC RV32 (trunk) | 8 bytes | copy 8 B to temp → 2 × 32-bit XOR → copy back | 1 × 64-bit (as 2 × 32-bit)
  GCC s390x (big-endian 14.2) | 64 bytes | 8 × XC (mem-mem 8-B XOR) with key cached on stack | 8 × 64-bit

  </details>

  ### Endianness

  The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption.
  Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order.
  Since CI lacks a big-endian machine, testing was done locally via Docker.
  <details>
  <summary>Details</summary>

  ```bash
  brew install podman pigz
  softwareupdate --install-rosetta
  podman machine init
  podman machine start
  docker run --platform linux/s390x -it ubuntu:latest /bin/bash
    apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev python3 && \
    cd /mnt && git clone --depth=1 https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
    cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
    ./build/bin/test_bitcoin --run_test=streams_tests
  ```

  </details>
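  For illustration, the realignment mentioned above boils down to a 64-bit rotation of the obfuscation key - a minimal Python sketch of the idea (not the project's code):

  ```python
  def rotl64(key: int, bits: int) -> int:
      # Rotate the 64-bit key left so it lines up again when a batch
      # stops mid-key - the one endianness-sensitive operation noted above.
      bits %= 64
      return ((key << bits) | (key >> (64 - bits))) & 0xFFFFFFFFFFFFFFFF
  ```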

  ### Measurements (micro benchmarks and full IBDs)

  > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc/clang -DCMAKE_CXX_COMPILER=g++/clang++ && \
    cmake --build build -j$(nproc) && \
    build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000

  <details>
  <summary>GNU 14.2.0</summary>

  > Before:

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.84 |    1,184,138,235.64 |    0.0% |            9.01 |            3.03 |  2.971 |           1.00 |    0.1% |      5.50 | `ObfuscationBench`

  > After (first optimizing commit):

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.04 |   28,365,698,819.44 |    0.0% |            0.34 |            0.13 |  2.714 |           0.07 |    0.0% |      5.33 | `ObfuscationBench`

  > and (second optimizing commit):

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.03 |   32,464,658,919.11 |    0.0% |            0.50 |            0.11 |  4.474 |           0.08 |    0.0% |      5.29 | `ObfuscationBench`

  </details>

  <details>
  <summary>Clang 20.1.7</summary>

  > Before:

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.89 |    1,124,087,330.23 |    0.1% |            6.52 |            3.20 |  2.041 |           0.50 |    0.2% |      5.50 | `ObfuscationBench`

  > After (first optimizing commit):

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.08 |   13,012,464,203.00 |    0.0% |            0.65 |            0.28 |  2.338 |           0.13 |    0.8% |      5.50 | `ObfuscationBench`

  > and (second optimizing commit):

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.02 |   41,231,547,045.17 |    0.0% |            0.30 |            0.09 |  3.463 |           0.02 |    0.0% |      5.47 | `ObfuscationBench`

  </details>

  i.e. 27.4x faster obfuscation with GCC, 36.7x faster with Clang

  For other benchmark speedups see  https://corecheck.dev/bitcoin/bitcoin/pulls/31144

  ------

  Running an IBD until block 888888 reveals a 4% speedup.

  <details>
  <summary>Details</summary>

  SSD:

  ```bash
  COMMITS="8324a00bd4a6a5291c841f2d01162d8a014ddb02 5ddfd31"; \
  STOP_HEIGHT=888888; DBCACHE=1000; \
  CC=gcc; CXX=g++; \
  BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
  (for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \
  hyperfine \
    --sort 'command' \
    --runs 1 \
    --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \
    --parameter-list COMMIT ${COMMITS// /,} \
    --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
      cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && \
      cmake --build build -j$(nproc) --target bitcoind && \
      ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \
    --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
    "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
  ```

  > 8324a00 test: Compare util::Xor with randomized inputs against simple impl
  > 5ddfd31 optimization: Xor 64 bits together instead of byte-by-byte

  ```python
  Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00)
    Time (abs ≡):        25033.413 s               [User: 33953.984 s, System: 2613.604 s]

  Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 5ddfd31)
    Time (abs ≡):        24110.710 s               [User: 33389.536 s, System: 2660.292 s]

  Relative speed comparison
          1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00)
          1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 5ddfd31)
  ```

  > HDD:

  ```bash
  COMMITS="71eb6eaa740ad0b28737e90e59b89a8e951d90d9 4685403"; \
  STOP_HEIGHT=888888; DBCACHE=4500; \
  CC=gcc; CXX=g++; \
  BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
  (for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \
  hyperfine \
    --sort 'command' \
    --runs 2 \
    --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \
    --parameter-list COMMIT ${COMMITS// /,} \
    --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
      cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
      ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \
    --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
    "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
  ```

  > 71eb6ea test: compare util::Xor with randomized inputs against simple impl
  > 4685403 optimization: migrate fixed-size obfuscation from `std::vector<std::byte>` to `uint64_t`

  ```python
  Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 71eb6ea)
    Time (mean ± σ):     37676.293 s ± 83.100 s    [User: 36900.535 s, System: 2220.382 s]
    Range (min … max):   37617.533 s … 37735.053 s    2 runs

  Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 4685403)
    Time (mean ± σ):     36181.287 s ± 195.248 s    [User: 34962.822 s, System: 1988.614 s]
    Range (min … max):   36043.226 s … 36319.349 s    2 runs

  Relative speed comparison
          1.04 ±  0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 71eb6ea)
          1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 4685403)
  ```

  </details>

ACKs for top commit:
  achow101:
    ACK 248b6a2
  maflcko:
    review ACK 248b6a2 🎻
  ryanofsky:
    Code review ACK 248b6a2. Looks good! Thanks for adapting this and considering all the suggestions. I did leave more comments below but none are important and this looks good as-is

Tree-SHA512: ef541cd8a1f1dc504613c4eaa708202e32ae5ac86f9c875e03bcdd6357121f6af0860ef83d513c473efa5445b701e59439d416effae1085a559716b0fd45ecd6
@l0rinc
Contributor Author

l0rinc commented Jul 20, 2025

We're making progress, #31144 was just merged! 🎉
The next ones that need some love are:

achow101 added a commit that referenced this pull request Jul 28, 2025
…line

d5104cf prevector: store `P2WSH`/`P2TR`/`P2PK` scripts inline (Lőrinc)
5212150 test: assert `CScript` allocation characteristics (Lőrinc)
65ac7f6 refactor: modernize `CScriptBase` definition (Lőrinc)
756da2a refactor: extract `STATIC_SIZE` constant to prevector (Lőrinc)

Pull request description:

  This change is part of [[IBD] - Tracking PR for speeding up Initial Block Download](#32043)

  ### Summary

  The current `prevector` size of 28 bytes (chosen to fill the `sizeof(CScript)` aligned size) was introduced in 2015 (#6914) before `SegWit` and `TapRoot`.
  However, the increasingly common `P2WSH` and `P2TR` scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.

  The core trade-off of this change is to eliminate heap allocations for common 34-36 byte scripts at the cost of increasing the base memory footprint of all `CScript` objects by 8 bytes (while still respecting peak memory usage defined by `-dbcache`).

  ### Context
  Increasing the `prevector` size allows these scripts to be stored inline, avoiding heap allocations, reducing potential memory fragmentation, and improving performance during cache flushes. Massif analysis confirms a lower stable memory usage after flushing, suggesting the elimination of heap allocations outweighs the larger base size for common workloads.

  Due to memory alignment, increasing the prevector size to 36 bytes doesn't change the overall `sizeof(CScript)` compared to an increase to 34 bytes, allowing us to include `P2PK` scripts as well at no additional memory cost.

  <details>
  <summary>Massif measurements</summary>

  > dbcache=440

  Massif before, with a heap threshold of `28`:
  ```bash
      MB
  744.1^#
       |#: ::::::@: :::::::   :@:: @::::::::::::::@@
       |#: ::::::@::::: :::   :@:::@:::::: :: ::::@
       |#: ::::::@::::: :::   :@:::@:::::: :: ::::@
       |#: ::::::@::::: ::: : :@:::@:::::: :: ::::@
       |#: ::::::@::::: ::: : :@:::@:::::: :: ::::@
       |#: ::::::@::::: ::: : :@:::@:::::: :: ::::@
       |#::::::::@::::: ::: : :@:::@:::::: :: ::::@
       |#::::::::@::::: ::: :::@:::@:::::: :: ::::@
       |#::::::::@::::: ::: :::@:::@:::::: :: ::::@
       |#::::::::@::::: ::: :::@:::@:::::: :: ::::@
       |#::::::::@::::: ::: :::@:::@:::::: :: ::::@
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
     0 +----------------------------------------------------------------------->h
       0                                                                   1.805
  ```

  and after, with a heap threshold of `36`:
  ```bash
      MB
  744.2^       :
       |#  :  :::::::::::   : : :: ::: @@:::::: ::  :
       |#  :  :::: ::::::   : : :: ::: @ :: ::  :   :
       |#  :  :::: :::::::  : :@:: ::: @ :: ::  : :::
       |#  :  :::: :::::::  : :@:: ::: @ :: ::  : : :
       |#  :  :::: :::::::  : :@:: ::: @ :: ::  : : :
       |#  :  :::: :::::::  : :@:: ::: @ :: ::  : : :
       |#  :: :::: :::::::  : :@:: ::: @ :: ::  : : :
       |#  :: :::: :::::::  : :@:: ::::@ :: ::  : : :
       |#:::: :::: :::::::  :::@:: ::::@ :: ::  : : :
       |#: ::::::: :::::::  :::@:: ::::@ :: :: @: : :
       |#: ::::::: :::::::  :::@:::::::@ :: :: @: : :
       |#: ::::::: ::::::::::::@:::::::@ :: :: @: : :
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
     0 +----------------------------------------------------------------------->h
       0                                                                   1.618
  ```

  ---

  > for `dbcache=4500`:

  Massif before, with a heap threshold of `28`:
  ```bash
      GB
  4.565^   ::
       | ##:   @@:::  :::: :@::::  :::: ::::
       | # :   @ ::   :::  :@: ::  : :: :::
       | # :   @ :: :::::  :@: ::  : :: :::
       | # :   @ :: : :::  :@: :: @: :: :::
       | # :   @ :: : :::  :@: :: @: :: :::
       | # :   @ :: : :::  :@: :: @: :: :::
       | # :   @ :: : :::  :@: :: @: :: :::
       | # : ::@ :: : :::  :@: :: @: :: :::
       | # : : @ :: : :::  :@: :: @: :: :::
       | # : : @ :: : :::  :@: :: @: ::::::
       | # : : @ :: : :::  :@: :: @: ::::::
       | # : : @ :: : :::  :@: :: @: ::::::
       | # : : @ :: : ::: ::@: :: @: ::::::
       | # : : @ :: : ::: ::@: :: @: ::::::
       | # : : @ :: : ::: ::@: :: @: ::::::
       | # : : @ :: : ::: ::@: :: @: :::::: @::
       | # : : @ :: : ::: ::@: :: @: :::::: @:
       | # : : @ :: : ::: ::@: :::@: :::::: @:
       | # : : @ :: : ::: ::@: :::@: :::::: @: :::::::::::::::::::::::::::::@:::
     0 +----------------------------------------------------------------------->h
       0                                                                   1.500
  ```

  and after, with a heap threshold of `36`:
  ```
      GB
  4.640^    :
       | ##::  :::::   ::::  ::::::@  ::::
       | # ::  : :::   ::::  :: :::@  ::::
       | # :: :: :::   ::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::  :@@
       | # :: :: :::  ::::: ::: :::@  :::::::@
       | # :: :: :::  ::::: ::: :::@  ::::: :@
       | # :: :: :::  ::::: ::: :::@::::::: :@
       | # ::::: :::  ::::: ::: :::@: ::::: :@
       | # ::::: :::  ::::: ::: :::@: ::::: :@
       | # ::::: :::::::::: ::: :::@: ::::: :@
       | # ::::: :::: ::::: ::: :::@: ::::: :@
       | # ::::: :::: ::::: ::: :::@: ::::: :@
       | # ::::: :::: ::::::::: :::@: ::::: :@
       | # ::::: :::: ::::::::: :::@: ::::: :@
       | # ::::: :::: ::::::::: :::@: ::::: :@ ::::::@:::@:::@::::@:::::@::::@::
     0 +----------------------------------------------------------------------->h
       0                                                                   1.360
  ```

  </details>

  ### Benchmarks and Memory

  Performance benchmarks for `AssumeUTXO` load and flush show:
  - Small dbcache (450MB): ~1-3% performance improvement (despite more frequent flushes)
  - Large dbcache (4500MB): ~6-8% performance improvement due to fewer heap allocations (and basically the same number of flushes)
  - Very large dbcache (45000MB): ~5-6% performance improvement due to fewer heap allocations (and the memory limit not being reached, so there's no memory penalty)

  Full IBD and `-reindex-chainstate` also show an overall ~3-4% speedup (both for smaller and larger dbcache values).

  We haven't investigated using different `prevector` sizes based on script type, though this could be explored in the future if needed.

  ### Historical explanation for the speedup (by [Anthony Towns](#32279 (comment)))

  > I think the tradeoff is something like:
  >
  > * spends of p2pk, p2sh, p2pkh coins -- these cost 8 more bytes
  > * spends of p2wpkh -- these cost 16 more bytes (sPK and scriptSig didn't need an allocation)
  > * spends of p2wsh and p2tr -- these cost ~48 fewer bytes (save 64 byte allocation on 64bit system, lose 8 bytes for both scriptSig and sPK)
  > * spends of nested p2wsh -- presumably save ~96 bytes, since the scriptSig would save an allocation, but I'm bundling it in the previous section
  >
  > Based on mainnet.observer stats for 2025-05-08, p2wpkh is about 55% of txs, p2tr is about 28%, p2pkh about 13%, p2wsh about 4% and the rest is noise, maybe? Those numbers net out to a saving of ~5.5 bytes per input. If p2wpkh rose from 55% to 80% and p2tr dropped to 20%, that would net to wasting ~3.2 bytes per input.
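
  To make the quoted arithmetic easy to verify, here is a quick back-of-the-envelope check (a sketch only; the shares and per-type byte costs are taken directly from the quote above):

  ```python
  # Positive cost = bytes lost per spend, negative = bytes saved
  # (shares from mainnet.observer, 2025-05-08, as quoted above).
  cost = {"p2wpkh": 16, "p2tr": -48, "p2pkh": 8, "p2wsh": -48}
  share = {"p2wpkh": 0.55, "p2tr": 0.28, "p2pkh": 0.13, "p2wsh": 0.04}
  print(sum(share[t] * cost[t] for t in cost))  # -5.52 -> ~5.5 bytes saved per input

  # Hypothetical future mix from the quote (p2wpkh 80%, p2tr 20%):
  print(0.80 * 16 + 0.20 * -48)  # 3.2 -> ~3.2 bytes wasted per input
  ```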

ACKs for top commit:
  maflcko:
    review ACK d5104cf 🐺
  achow101:
    reACK d5104cf
  jonatack:
    Review ACK d5104cf
  andrewtoth:
    ACK d5104cf

Tree-SHA512: 7c5271ebaf4f6d91dc4b679ecbde4b7d0467579f072289f30da988a17c38a552d0b8cdf0e9c001739975dd019894c35e541908571527916cec56e04a8e214ae2
@l0rinc
Contributor Author

l0rinc commented Jul 28, 2025

Thanks for reviewing and reproducing #32279 - it's also merged 🎉

Reviews and ACKs for the above 2 remaining ones would be very welcome.

@maflcko
Member

maflcko commented Jul 29, 2025

The remaining ones don't speed up IBD, do they?

@l0rinc
Contributor Author

l0rinc commented Jul 29, 2025

The big ones are merged, thanks for your help.
These are smaller ones - at least from an IBD perspective - but both are quite simple.

Edit: rebased #31645, which does measurably speed up a critical section of IBD.

@pstratem
Contributor

pstratem commented Aug 1, 2025

We're calling CBlockHeader::GetHash hundreds of times for the same CBlockHeader object from ActivateBestChain.

I chased the calls with this commit and found that all the results were from ProcessNewBlock → ActivateBestChain:

pstratem@7de2330
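
For readers who want to reproduce this kind of analysis, here is a hypothetical Python sketch of the idea (pstratem's actual commit instruments the C++ code directly): tally how often each header object gets hashed, so redundant call sites stand out.

```python
import hashlib
from collections import Counter

calls = Counter()  # header object id -> number of hash computations

def get_hash(header: bytes, obj_id: int) -> bytes:
    """Count every hash request made for the same (hypothetical) header object."""
    calls[obj_id] += 1
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

# After a run, the worst offenders surface as the largest counts:
# calls.most_common(10)
```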

@l0rinc
Contributor Author

l0rinc commented Aug 1, 2025

Thanks @pstratem, I did almost exactly the same; the main problem is indeed the loop in https://github.com/bitcoin/bitcoin/blob/master/src/validation.cpp#L3498, probably bounded by BLOCK_DOWNLOAD_WINDOW - hence the ~1000 repeated calls in the worst case.
I have a fix for most duplications and am running an IBD (and a reindex-chainstate) on my benchmarking servers to measure the effect of the deduplication. I expect at most a 1-2% change, but it may still be worth it.
I'll also investigate the effect of lowering BLOCK_DOWNLOAD_WINDOW.
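
As an illustration of the deduplication idea (a minimal sketch only, not the actual Bitcoin Core fix): compute the double-SHA256 of the serialized 80-byte header once and reuse it on subsequent calls.

```python
import hashlib
from functools import cached_property

class BlockHeader:
    """Sketch: lazily cache the block hash so repeated GetHash()-style
    calls on the same header object don't re-hash the same 80 bytes."""

    def __init__(self, raw_header: bytes):
        assert len(raw_header) == 80  # version, prev, merkle root, time, bits, nonce
        self._raw = raw_header

    @cached_property
    def hash(self) -> bytes:
        # A block hash is SHA256 applied twice to the serialized header.
        return hashlib.sha256(hashlib.sha256(self._raw).digest()).digest()
```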

@l0rinc
Contributor Author

l0rinc commented Aug 18, 2025

I have compared v29.1rc1 with the latest master branch containing these optimizations.

Both branches still have a defaultAssumeValid of 886'157, so performance will improve further once we adjust it before the release.


Running a `-reindex-chainstate` benchmark until block 909090 with a 5GB dbcache reveals a 9% speedup (8h:07m vs 7h:23m)!

i7, hdd, -reindex-chainstate
COMMITS="565af03c37d8262632543a078b2d8d29459d0b91 dadf15f88cbad37538d85415ae5da12d4f0f1721"; \
STOP=909090; DBCACHE=5000; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
hyperfine \
  --sort command \
  --runs 2 \
  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 10" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"

565af03 Merge #33056: [29.x] final changes for v29.1rc1
dadf15f Merge #33050: net, validation: don't punish peers for consensus-invalid txs

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
  Time (mean ± σ):     29308.463 s ± 72.560 s    [User: 41477.069 s, System: 1134.925 s]
  Range (min … max):   29257.156 s … 29359.771 s    2 runs

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)
  Time (mean ± σ):     26780.313 s ± 240.625 s    [User: 38463.008 s, System: 1017.529 s]
  Range (min … max):   26610.166 s … 26950.460 s    2 runs
  
Relative speed comparison
        1.09 ±  0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)

Running a full IBD (3 runs per commit) until block 909090 with a 5GB dbcache reveals a 21% speedup (9h:06m vs 7h:22m)! The speedup is larger than for `-reindex-chainstate` mostly because reindexes don't write blocks (and read fewer of them), so the block-storage optimizations only show up during a full IBD.

i9, SSD, full IBD
COMMITS="565af03c37d8262632543a078b2d8d29459d0b91 dadf15f88cbad37538d85415ae5da12d4f0f1721"; \
STOP=909090; DBCACHE=5000; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
hyperfine \
  --sort command \
  --runs 3 \
  --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 10" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"

565af03 Merge #33056: [29.x] final changes for v29.1rc1
dadf15f Merge #33050: net, validation: don't punish peers for consensus-invalid txs

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
  Time (mean ± σ):     32853.378 s ± 31.353 s    [User: 54364.406 s, System: 2245.790 s]
  Range (min … max):   32817.179 s … 32871.937 s    3 runs
 
Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)
  Time (mean ± σ):     27046.236 s ± 631.042 s    [User: 49185.359 s, System: 2613.586 s]
  Range (min … max):   26545.657 s … 27755.087 s    3 runs
 
Relative speed comparison
        1.21 ±  0.03  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)
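
As a quick sanity check, the hyperfine means above reproduce the headline numbers:

```python
# Mean wall-clock seconds copied from the hyperfine output above.
reindex = 29308.463 / 26780.313  # ~1.09x -> the reported ~9% reindex speedup
ibd = 32853.378 / 27046.236      # ~1.21x -> the reported ~21% full-IBD speedup
print(f"reindex: {reindex:.2f}x, full IBD: {ibd:.2f}x")
```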

There are 2 optimizations that we could still include in the next release; reviewers could help with them:
