
Conversation

l0rinc
Contributor

@l0rinc l0rinc commented Mar 12, 2025

During the last Core Dev meeting, it was proposed to create a tracking PR aggregating the individual IBD optimizations - to illustrate how these changes contribute to the broader performance improvement efforts.

Summary: 18% full IBD speedup

We don't have much low-hanging fruit anymore, but big speed improvements can still be achieved through many small, focused changes.
Many optimization opportunities hide in consensus-critical code - this tracking PR provides justification for why those should also be considered.
The unmerged changes here collectively achieve a ~18% speedup for full IBD (measured over multiple real runs until block 886'000 with a 5GiB in-memory cache): from 8.59 hours on master to 7.25 hours for this PR.

Anyone can (and is encouraged to) reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5

Related issues:

PRs included here (in review priority order):

Changing trends

The UTXO count and average block size have increased drastically in the past few years, giving a better overall picture of how Bitcoin behaves under real load.
[image: UTXO count and average block size over time]
Profiling IBD, given these circumstances, revealed many new optimization opportunities.

Similar efforts in the past years

There were many efforts to make sure Bitcoin Core remains performant in light of these new trends; a few recent notable examples include:

Reliable macro benchmarks

The measurements here were done on a high-end Intel i9-9900K CPU (8 cores/16 threads, 3.6GHz base, 5.0GHz boost) with 64GB RAM and multiple NVMe drives in RAID (~1.4TB of fast storage) - a dedicated Hetzner auction box running the latest Ubuntu. Sometimes a lower-end i7 with an HDD was used for comparison.

To make sure the setup reflected a real user's experience, we ran multiple full IBDs per commit (connecting to real nodes) until block 886'000 with a 5GiB in-memory cache, using hyperfine to measure the final time (assuming a normal distribution and stabilizing the result via statistical methods). This produced reliable results even when individual measurements varied; when hyperfine indicated that the measurements were all over the place, we reran the whole benchmark.
To reduce the instability of headers synchronization and peer acquisition, we first ran bitcoind until block 1, followed by the actual benchmark until block 886'000.

The top 2 PRs (#31551 and #31144) were measured together by multiple people with different settings (and varying results):

Also note that there is a separate effort to add a reliable macro-benchmarking suite to track the performance of the most critical use cases end-to-end (including IBD, compact blocks, and UTXO iteration) - still WIP and not yet used here.

Current changes (in order of importance, reviews and reproducers are welcome):

Plotting block height against elapsed time from the produced debug.log files (taken from the last run for each commit - these can differ slightly from the normalized average shown below) visualizes the effect of each commit:

debug.log visualizer
import os
import sys
import re
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np


def process_log_files_and_plot(log_dir, output_file="block_height_progress.png"):
    """Parse debug-* logs in log_dir and plot per-commit IBD progress."""
    if not os.path.exists(log_dir) or not os.path.isdir(log_dir):
        print(f"Error: '{log_dir}' is not a valid directory", file=sys.stderr)
        return

    debug_files = [f for f in os.listdir(log_dir) if
                   f.startswith('debug-') and os.path.isfile(os.path.join(log_dir, f))]
    if not debug_files:
        print(f"Warning: No debug files found in '{log_dir}'", file=sys.stderr)
        return

    height_pattern = re.compile(r'UpdateTip:.*height=(\d+)')
    results = {}

    for filename in debug_files:
        filepath = os.path.join(log_dir, filename)
        print(f"Processing {filename}...", file=sys.stderr)

        update_tips = []
        first_timestamp = None
        line_count = tip_count = 0
        found_shutdown_done = False

        try:
            with open(filepath, 'r', errors='ignore') as file:
                for line_number, line in enumerate(file, 1):
                    line_count += 1
                    if line_count % 100000 == 0:
                        print(f"  Processed {line_count} lines, found {tip_count} UpdateTips...", file=sys.stderr)

                    # Ignore everything before the warmup run's "Shutdown: done"
                    # (the benchmark first syncs to block 1, then restarts).
                    if not found_shutdown_done:
                        if "Shutdown: done" in line:
                            found_shutdown_done = True
                            print(f"  Found 'Shutdown: done' at line {line_number}, starting to record",
                                  file=sys.stderr)
                        continue

                    if len(line) < 20 or "UpdateTip:" not in line:
                        continue

                    try:
                        timestamp = datetime.strptime(line[:20], "%Y-%m-%dT%H:%M:%SZ")
                        height_match = height_pattern.search(line)
                        if not height_match:
                            continue

                        height = int(height_match.group(1))
                        if first_timestamp is None:
                            first_timestamp = timestamp

                        update_tips.append((int((timestamp - first_timestamp).total_seconds()), height))
                        tip_count += 1
                    except ValueError:
                        continue
        except Exception as e:
            print(f"Error processing {filename}: {e}", file=sys.stderr)
            continue

        print(f"Finished processing {filename}: {line_count} lines, {tip_count} UpdateTips", file=sys.stderr)

        if update_tips:
            time_dict = {}
            for time, height in update_tips:
                time_dict[time] = height
            results[filename[6:14]] = sorted(time_dict.items())

    if not results:
        print("No valid data found in any files.", file=sys.stderr)
        return

    print(f"Creating plots with data from {len(results)} files", file=sys.stderr)

    sorted_results = []
    for name, pairs in results.items():
        if pairs:
            sorted_results.append((name, pairs[-1][0] / 3600, pairs))

    sorted_results.sort(key=lambda x: x[1], reverse=True)
    colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_results)))

    # Plot 1: Height vs Time
    plt.figure(figsize=(12, 8))

    final_points = []
    for idx, (name, last_time, pairs) in enumerate(sorted_results):
        times = [t / 3600 for t, _ in pairs]
        heights = [h for _, h in pairs]
        plt.plot(heights, times, label=f"{name} ({last_time:.2f}h)", color=colors[idx], linewidth=1)
        if pairs:
            final_points.append((last_time, pairs[-1][1], colors[idx]))

    for time, height, color in final_points:
        plt.axhline(y=time, color=color, linestyle='--', alpha=0.3)
        plt.axvline(x=height, color=color, linestyle='--', alpha=0.3)

    plt.title('Sync Time by Block Height')
    plt.xlabel('Block Height')
    plt.ylabel('Elapsed Time (hours)')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend(loc='center left')
    plt.tight_layout()

    plt.savefig(output_file.replace('.png', '_reversed.png'), dpi=300)

    # Plot 2: Performance Ratio by Time
    if len(sorted_results) > 1:
        plt.figure(figsize=(12, 8))

        baseline = sorted_results[0]
        baseline_time_by_height = {h: t for t, h in baseline[2]}

        for idx, (name, _, pairs) in enumerate(sorted_results[1:], 1):
            time_by_height = {h: t for t, h in pairs}

            common_heights = [h for h in baseline_time_by_height.keys()
                              if h >= 400000 and h in time_by_height]
            common_heights.sort()

            ratios = []
            base_times = []

            for h in common_heights:
                base_t = baseline_time_by_height[h]
                result_t = time_by_height[h]

                if result_t > 0:
                    ratios.append(base_t / result_t)
                    base_times.append(base_t / 3600)

            plt.plot(base_times, ratios,
                     label=f"{name} vs {baseline[0]}",
                     color=colors[idx], linewidth=1)

        plt.axhline(y=1, color='gray', linestyle='--', alpha=0.7)

        plt.title('Performance Improvement Over Time (Higher is Better)')
        plt.xlabel('Baseline Elapsed Time (hours)')
        plt.ylabel('Speedup Ratio (baseline_time / commit_time)')
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.legend(loc='best')
        plt.tight_layout()

        plt.savefig(output_file.replace('.png', '_time_ratio.png'), dpi=300)

    with open(output_file.replace('.png', '.csv'), 'w') as f:
        for name, _, pairs in sorted_results:
            f.write(f"{name},{','.join(f'{t}:{h}' for t, h in pairs)}\n")

    plt.show()


if __name__ == "__main__":
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    output_file = sys.argv[2] if len(sys.argv) > 2 else "block_height_progress.png"
    process_log_files_and_plot(log_dir, output_file)
[images: per-commit sync-time and speedup-ratio plots]

Baseline

Base commit was 88debb3.

8.59 hour IBD time
COMPILER=gcc COMMIT=88debb3e4297ef4ebc8966ffe599359bc7b231d0 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     30932.610 s ± 156.891 s    [User: 58248.505 s, System: 2142.974 s]
  Range (min … max):   30821.671 s … 31043.549 s    2 runs

7.91 hour IBD time
COMPILER=gcc COMMIT=6a8ce46e32dae2ffef2a73d2314ca33a2039186e ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     28501.588 s ± 119.886 s    [User: 56419.060 s, System: 1833.126 s]
  Range (min … max):   28416.815 s … 28586.361 s    2 runs

We can serialize the blocks and undos to any Stream which implements the appropriate read/write methods.
AutoFile is one of these, writing the results "directly" to disk (through the OS file cache). Batching these in memory first and reading/writing the batches to/from disk in one go is measurably faster (likely because of fewer native fread calls or less locking, as observed by @martinus in a similar change).
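A minimal Python sketch of the batching idea (illustrative only - the actual change batches block/undo (de)serialization around `AutoFile` in C++):

```python
import io

def write_unbatched(f, chunks):
    # One native write call per small chunk - more syscalls and locking.
    for chunk in chunks:
        f.write(chunk)

def write_batched(f, chunks):
    # Serialize into an in-memory buffer first, then write once.
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    f.write(buf.getvalue())
```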

Differential flame graphs indicate that the before/after speed change is because of fewer AutoFile reads and writes:
[flame graph: writes]
[flame graph: reads]


7.60 hour IBD time
COMPILER=gcc COMMIT=c5cc54d10187c9cb3a6cba8cc10f652b4f882e2a ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27394.210 s ± 565.877 s    [User: 54902.315 s, System: 1891.951 s]
  Range (min … max):   26994.075 s … 27794.346 s    2 runs

Block obfuscation is currently done byte-by-byte; this PR batches it into 64-bit primitives to speed up obfuscating bigger memory chunks.
This is especially relevant after #31551, where we end up with bigger obfuscatable chunks.
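A Python sketch of the difference (the real implementation works on `uint64_t` words in C++; an 8-byte key is assumed):

```python
def xor_bytewise(data: bytearray, key: bytes) -> None:
    # Current approach: one XOR per byte.
    for i in range(len(data)):
        data[i] ^= key[i % len(key)]

def xor_batched(data: bytearray, key: bytes) -> None:
    # Batched approach: XOR eight bytes at a time as a 64-bit word,
    # then finish the unaligned tail byte-by-byte.
    k64 = int.from_bytes(key, "little")
    n = len(data) - len(data) % 8
    for i in range(0, n, 8):
        word = int.from_bytes(data[i:i + 8], "little") ^ k64
        data[i:i + 8] = word.to_bytes(8, "little")
    for i in range(n, len(data)):
        data[i] ^= key[i % 8]
```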

[image: obfuscation calls during IBD without batching]


7.50 hour IBD time
COMPILER=gcc COMMIT=9b4be912d20222b3b275ef056c1494a15ccde3f5 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27019.086 s ± 112.340 s    [User: 54927.344 s, System: 1652.376 s]
  Range (min … max):   26939.649 s … 27098.522 s    2 runs

When the in-memory UTXO set is flushed to LevelDB (after IBD or AssumeUTXO load), it does so in batches to manage memory usage during the flush.
While a hidden -dbbatchsize config option exists to modify this value, this PR introduces dynamic calculation of the batch size based on the -dbcache setting. By using larger batches when more memory is available (i.e., higher -dbcache), we can reduce the overhead from numerous small writes, minimize constant overhead per batch, improve I/O efficiency (especially on HDDs), and potentially allow LevelDB to optimize writes more effectively (e.g. by sorting the keys before write).
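A hypothetical sketch of such a sizing rule (the floor and fraction here are made up for illustration - the actual derivation is in the PR):

```python
def dynamic_batch_size(dbcache_bytes: int,
                       floor: int = 1 << 24,      # 16 MiB minimum (assumed)
                       fraction: float = 0.01) -> int:
    # Scale the LevelDB write-batch with the configured cache instead of
    # using one fixed -dbbatchsize constant: larger cache, larger batches.
    return max(floor, int(dbcache_bytes * fraction))
```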

[image]

Note that this PR mainly optimizes a critical section of IBD (the memory-to-disk dump) - even if the effect on overall speed is modest:
[image]

[images]
7.41 hour IBD time
COMPILER=gcc COMMIT=817d7ac0767a3984295aa3cf6c961dcc5f29d571 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26711.460 s ± 244.118 s    [User: 54654.348 s, System: 1652.087 s]
  Range (min … max):   26538.843 s … 26884.077 s    2 runs

The commits merge similar (de)serialization methods and separate them internally with if constexpr - similarly to what was done in #28203. This enabled further SizeComputer optimizations as well.

Other than these, since single-byte writes are used very often (for every (u)int8_t, std::byte, or bool, and for every VarInt's first byte - which is also needed for every (pre)vector), it makes sense to avoid the unnecessary generalized serialization infrastructure there.
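A Python analogue of the shared-path idea (the C++ code dispatches with `if constexpr` at compile time; this duck-typed version only illustrates how one serializer can feed either a real stream or a size counter):

```python
import io

class SizeComputer:
    """Stream stand-in that only accumulates the would-be output length."""
    def __init__(self):
        self.size = 0
    def write(self, data: bytes) -> None:
        self.size += len(data)

def serialize_u32(stream, value: int) -> None:
    # One shared serialization path: a real stream receives the bytes,
    # a SizeComputer merely counts them.
    stream.write(value.to_bytes(4, "little"))

counter, buf = SizeComputer(), io.BytesIO()
serialize_u32(counter, 886_000)
serialize_u32(buf, 886_000)
assert counter.size == len(buf.getvalue()) == 4
```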


7.31 hour IBD time
COMPILER=gcc COMMIT=182745cec4c0baf2f3c8cff2f74f847eac3c4330 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26326.867 s ± 45.887 s    [User: 54367.156 s, System: 1619.348 s]
  Range (min … max):   26294.420 s … 26359.314 s    2 runs

CheckBlock's latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.

This PR improves performance and maintainability by introducing the following changes:

  • Simplified checks for the most common cases (1 or 2 inputs - 70-90% of transactions have a single input).
  • Optimized the general case by replacing std::set with a sorted std::vector for improved locality (see the sketch below).
  • Simplified Null prevout checks from linear to constant time.
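A Python sketch of the duplicate-input strategy described above (the real code operates on `COutPoint` values in C++):

```python
def has_duplicate_inputs(prevouts: list) -> bool:
    # Fast paths for the most common shapes: 1 input (~70-90% of
    # transactions) and 2 inputs.
    if len(prevouts) <= 1:
        return False
    if len(prevouts) == 2:
        return prevouts[0] == prevouts[1]
    # General case: sort a contiguous vector and scan neighbors -
    # better locality and fewer allocations than inserting into a set.
    ordered = sorted(prevouts)
    return any(a == b for a, b in zip(ordered, ordered[1:]))
```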

7.25 hour IBD time
COMPILER=gcc COMMIT=47d377bd0bb88dae6b34553a7789400170e0ccf6 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26084.429 s ± 473.611 s    [User: 54310.780 s, System: 1815.967 s]
  Range (min … max):   25749.536 s … 26419.323 s    2 runs

The in-memory representation of the UTXO set uses (salted) SipHash to avoid key-collision attacks.

Hashing a uint256 key is done so often that a specialized optimization was extracted to SipHashUint256Extra. The constant salting operations were already extracted in the general case; this PR adjusts the main specialization similarly.
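Not SipHash itself, but a Python analogy for hoisting the constant salting work out of the per-key path (here via `hashlib` midstate copies):

```python
import hashlib

SALT = bytes(16)  # stands in for the node's random SipHash salt

def hash_key_naive(key: bytes) -> bytes:
    # Re-absorbs the salt on every call.
    return hashlib.blake2b(SALT + key).digest()

_SALTED_STATE = hashlib.blake2b(SALT)  # salt absorbed once, up front

def hash_key_precomputed(key: bytes) -> bytes:
    h = _SALTED_STATE.copy()  # clone the precomputed midstate
    h.update(key)
    return h.digest()
```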


The current prevector size of 28 bytes (chosen to fill the sizeof(CScript) aligned size) was introduced in 2015 (#6914) before SegWit and TapRoot.
However, the increasingly common P2WSH and P2TR scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.

The core trade-off of this change is to eliminate heap allocations for common 29-36 byte scripts at the cost of increasing the base memory footprint of all CScript objects by 8 bytes (while still respecting peak memory usage defined by -dbcache).
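A toy Python model of the direct-vs-indirect storage trade-off (prevector's small-buffer optimization; capacities per the description above):

```python
class ToyPrevector:
    """Inline storage up to `capacity` bytes, heap beyond it."""
    def __init__(self, capacity: int = 36):  # was 28 before this change
        self.capacity = capacity
        self._inline = bytearray(capacity)   # always-present base footprint
        self._heap = None                    # allocated only when outgrown
        self._size = 0

    def assign(self, script: bytes) -> None:
        if len(script) <= self.capacity:     # 34-byte P2WSH/P2TR now fit
            self._inline[:len(script)] = script
            self._heap = None
        else:
            self._heap = bytearray(script)   # indirect: heap allocation
        self._size = len(script)
```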

[image]


Other similar efforts waiting for reviews or revives (not included in this tracking PR):


This PR is meant to stay in draft (it is not meant to be merged directly) and will change continually based on comments received here and in the individual PRs.
Comments, reproducers and high-level discussions are welcome here - code reviews should go to the individual PRs.

@DrahtBot
Contributor

DrahtBot commented Mar 12, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32043.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK jonatack

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32521 (policy: make pathological transactions packed with legacy sigops non-standard by darosior)
  • #32457 (bench: replace benchmark block with more representative one (413567 → 784588) by l0rinc)
  • #32296 (refactor: reenable implicit-integer-sign-change check for serialize.h by l0rinc)
  • #32279 ([IBD] prevector: store P2WSH/P2TR/P2PK scripts inline by l0rinc)
  • #31868 ([IBD] specialize block serialization by l0rinc)
  • #31860 (init: Take lock on blocks directory in BlockManager ctor by TheCharlatan)
  • #31682 ([IBD] specialize CheckBlock's input & coinbase checks by l0rinc)
  • #31144 ([IBD] multi-byte block obfuscation by l0rinc)
  • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
  • #29307 (util: explicitly close all AutoFiles that have been written by vasild)
  • #28531 (improve MallocUsage() accuracy by LarryRuane)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot DrahtBot changed the title [IBD] - Tracking PR for speeding up Initial Block Download [IBD] - Tracking PR for speeding up Initial Block Download Mar 12, 2025
@l0rinc l0rinc changed the title [IBD] - Tracking PR for speeding up Initial Block Download [IBD] Tracking PR for speeding up Initial Block Download Mar 12, 2025
@ryanofsky
Contributor

ryanofsky commented Mar 12, 2025

Thanks for creating this. This should make it easier to navigate the other PRs and discuss the overall topic of IBD performance and benchmarking without needing to necessarily repeat it in the individual PRs.

Would be useful to have concept ACKs/NACKs here from others who know more about performance and benchmarking. But from what I can tell the individual optimizations do not seem very complicated and seem like they should be justified.


One suggestion for the PR description above would be to directly link to the PRs comprising this change in the summary, maybe pointing out any where review should be focused. Current list seems to be:

@DrahtBot
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/38650495272

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@ajtowns
Contributor

ajtowns commented Mar 13, 2025

Plotting the performance of the blocks from the produced debug.log files shows (from the last run, can differ slightly from the normalized average shown below) the effect of each commit:

Wouldn't these plots be easier to read with block height on the x-axis and time on the y-axis, giving a consistent domain (each case goes from height 0 to 880k or so) with a simple "lower is better" comparison? (rather than "further left is better")

Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful? (Perhaps limited to height >= 400000)

@l0rinc
Contributor Author

l0rinc commented Mar 13, 2025

with a simple "lower is better" comparison

I like the idea, updated the description and the code:
[image]

Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful?

Something like this? Conceptually it seems useful, but here I don't know how to interpret it - seems too far zoomed in:
[image]
Looks even funnier without the >400k cap:
[image]

@ajtowns
Contributor

ajtowns commented Mar 13, 2025

Something like this? Conceptually seems useful, but here I don't know how to interpret it, seems too far zoomed in

Fair; that might work better with a time on the x-axis rather than height, something like:

for time_commit in [time_commit_X, time_commit_Y, time_commit_Z]:
    for height in range(1,850000):
        y = time_baseline[height] / time_commit[height] # keep this one the same
        x = time_baseline[height]  # changed from x = height
        add_datapoint(x,y)

@l0rinc
Contributor Author

l0rinc commented Mar 13, 2025

time on the x-axis rather than height

If we do that we don't even need to filter out the first 400k blocks since they're so insignificant.
Edit: you're right, it looks better to filter those out - I've updated the description with the images and code.

@jonatack
Member

jonatack commented Mar 13, 2025

Concept ACK, thank you for opening this.

I am currently in an environment with slow internet where, despite having a relatively fast laptop, IBD is more than 2 orders of magnitude slower than the times in the OP.

Opened #32051 today to address an issue I'm also seeing of very frequent disconnections+reconnections of trusted addnode peers during IBD.

l0rinc and others added 11 commits April 17, 2025 11:30
Endianness doesn't affect the final size, so we can skip the byte swapping for `SizeComputer`.
We can route previous calls into the existing method with `if constexpr`, short-circuiting the existing logic when we only need the serialized sizes.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          191,652.29 |            5,217.78 |    0.4% |     10.96 | `SerializeBlock`
|           10,323.55 |           96,865.92 |    0.2% |     11.01 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          614,847.32 |            1,626.42 |    0.0% |    8,015,883.64 |    2,207,628.07 |  3.631 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           26,020.31 |           38,431.52 |    0.0% |      159,390.03 |       93,438.33 |  1.706 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector).
It makes sense to avoid the generalized serialization infrastructure that isn't needed:
* `AutoFile` write doesn't need to allocate a 4k buffer for a single byte anymore;
* `VectorWriter` and `DataStream` avoids memcpy/insert calls.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          174,569.19 |            5,728.39 |    0.6% |     10.89 | `SerializeBlock`
|           10,241.16 |           97,645.21 |    0.0% |     11.00 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          615,000.56 |            1,626.01 |    0.0% |    8,015,883.64 |    2,208,340.88 |  3.630 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           25,676.76 |           38,945.72 |    0.0% |      159,390.03 |       92,202.10 |  1.729 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
The `CheckTransaction` validation function in https://github.com/bitcoin/bitcoin/blob/master/src/consensus/tx_check.cpp#L41-L45 relies on a correct ordering relation for detecting duplicate transaction inputs.

This update to the tests ensures that:
* Accurate detection of duplicates: Beyond trivial cases (e.g., two identical inputs), duplicates are detected correctly in more complex scenarios.
* Consistency across methods: Both sorted sets and hash-based sets behave identically when detecting duplicates for `COutPoint` and related values.
* Robust ordering and equality relations: The function maintains expected behavior for ordering and equality checks.

Using randomized testing with shuffled inputs (to avoid any remaining bias introduced), the enhanced test validates that `CheckTransaction` remains robust and reliable across various input configurations. It confirms identical behavior to a hashing-based duplicate detection mechanism, ensuring consistency and correctness.

To make sure the new branches in the follow-up commits will be covered, `basic_transaction_tests` was extended with a randomized test comparing against the old implementation (and also an alternative duplicate-detection method). The iterations and ranges were chosen such that every new branch is expected to be hit at least once.
> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          372,743.63 |            2,682.81 |    1.1% |     10.99 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        3,304,694.54 |              302.60 |    0.5% |     11.05 | `DuplicateInputs`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        1,096,261.84 |              912.19 |    0.1% |    7,963,390.88 |    3,487,375.26 |  2.283 |   1,266,941.00 |    1.8% |     11.03 | `CheckBlockBench`

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        8,366,309.48 |              119.53 |    0.0% |   23,865,177.67 |   26,620,160.23 |  0.897 |   5,972,887.41 |    4.0% |     10.78 | `DuplicateInputs`
The newly introduced `ProcessTransactionBench` incorporates multiple steps in the validation pipeline, offering a more comprehensive view of `CheckBlock` performance within a realistic transaction validation context.

Previous microbenchmarks, such as DeserializeAndCheckBlockTest and DuplicateInputs, focused on isolated aspects of transaction and block validation. While these tests provided valuable insights for targeted profiling, they lacked context regarding the broader validation process, where interactions between components play a critical role.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            9,585.10 |          104,328.55 |    0.1% |     11.03 | `ProcessTransactionBench`

> C++ compiler .......................... GNU 13.3.0

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           56,199.57 |           17,793.73 |    0.1% |      229,263.01 |      178,766.31 |  1.282 |      15,509.97 |    0.5% |     10.91 | `ProcessTransactionBench`
`IsCoinBase` means a single input with a NULL prevout, so it makes sense to restrict the duplicate check to non-coinbase transactions only.
The behavior is the same as before, except that single-input transactions aren't checked for duplicates anymore (~70-90% of the cases, see https://transactionfee.info/charts/transactions-1in).
I've added braces to the conditions and loops to simplify review of followup commits.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          335,917.12 |            2,976.92 |    1.3% |     11.01 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        3,286,337.42 |              304.29 |    1.1% |     10.90 | `DuplicateInputs`
|            9,561.02 |          104,591.35 |    0.2% |     11.02 | `ProcessTransactionBench`
No need to create a set for checking duplicates for two-input-transactions.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          314,137.30 |            3,183.32 |    1.2% |     11.04 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        3,220,592.73 |              310.50 |    1.3% |     10.92 | `DuplicateInputs`
|            9,425.98 |          106,089.77 |    0.3% |     11.00 | `ProcessTransactionBench`
A pre-sized vector retains locality (enabling SIMD operations), speeding up sorting and equality checks.
It's also simpler (therefore more reliable) than a sorted set. It also causes less memory fragmentation.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          181,922.54 |            5,496.85 |    0.2% |     10.98 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          997,739.30 |            1,002.27 |    1.0% |     10.94 | `DuplicateInputs`
|            9,449.28 |          105,828.15 |    0.3% |     10.99 | `ProcessTransactionBench`

Co-authored-by: Pieter Wuille <pieter@wuille.net>
For the 2 input case we simply check them both, like we did with equality.

For the general case, we take advantage of sorting, making invalid value detection constant time instead of linear in the worst case.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000

> C++ compiler .......................... AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          179,971.00 |            5,556.45 |    0.3% |     11.02 | `CheckBlockBench`

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          963,177.98 |            1,038.23 |    1.7% |     10.92 | `DuplicateInputs`
|            9,410.90 |          106,259.75 |    0.3% |     11.01 | `ProcessTransactionBench`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          834,855.94 |            1,197.81 |    0.0% |    6,518,548.86 |    2,656,039.78 |  2.454 |     919,160.84 |    1.5% |     10.78 | `CheckBlockBench`

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        4,261,492.75 |              234.66 |    0.0% |   17,379,823.40 |   13,559,793.33 |  1.282 |   4,265,714.28 |    3.4% |     11.00 | `DuplicateInputs`
|           55,819.53 |           17,914.88 |    0.1% |      227,828.15 |      177,520.09 |  1.283 |      15,184.36 |    0.4% |     10.91 | `ProcessTransactionBench`
Verifies that script types are correctly allocated using prevector's direct (stack) or indirect (heap) storage based on their size:

Direct (stack) allocated script types (size ≤ 28 bytes):
* OP_RETURN (small)
* P2WPKH
* P2SH
* P2PKH

Indirect (heap) allocated script types (size > 28 bytes):
* P2WSH
* P2TR
* P2PK
* MULTISIG (small)

This test provides a baseline for verifying changes to prevector's inline capacity.
The current `prevector` size of 28 bytes (chosen to fill the `sizeof(CScript)` aligned size) was introduced in 2015 (bitcoin#6914) before SegWit and TapRoot.
However, the increasingly common `P2WSH` and `P2TR` scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.

The core trade-off of this change is to eliminate heap allocations for common 34-36 byte scripts at the cost of increasing the base memory footprint of all `CScript` objects by 8 bytes (while still respecting peak memory usage defined by `-dbcache`).

Increasing the `prevector` size allows these scripts to be stored on the stack, avoiding heap allocations, reducing potential memory fragmentation, and improving performance during cache flushes. Massif analysis confirms a lower stable memory usage after flushing, suggesting the elimination of heap allocations outweighs the larger base size for common workloads.

Due to memory alignment, increasing the `prevector` size to 36 bytes doesn't change the overall `sizeof(CScript)` compared to an increase to 34 bytes, allowing us to include `P2PK` scripts as well at no additional memory cost.

Performance benchmarks for AssumeUTXO load and flush show:
- Small dbcache (450MB): ~1% performance penalty due to more frequent flushes
- Large dbcache (4500MB+): ~6-7% performance improvement due to fewer heap allocations

Full IBD and reindex-chainstate with larger `dbcache` values also show an overall ~3% speedup.

Co-authored-by: Ava Chow <github@achow101.com>
Co-authored-by: Andrew Toth <andrewstoth@gmail.com>
@DrahtBot
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

achow101 added a commit that referenced this pull request Jul 19, 2025
248b6a2 optimization: peel align-head and unroll body to 64 bytes (Lőrinc)
e7114fc optimization: migrate fixed-size obfuscation from `std::vector<std::byte>` to `uint64_t` (Lőrinc)
478d40a refactor: encapsulate `vector`/`array` keys into `Obfuscation` (Lőrinc)
377aab8 refactor: move `util::Xor` to `Obfuscation().Xor` (Lőrinc)
fa5d296 refactor: prepare mempool_persist for obfuscation key change (Lőrinc)
6bbf2d9 refactor: prepare `DBWrapper` for obfuscation key change (Lőrinc)
0b8bec8 scripted-diff: unify xor-vs-obfuscation nomenclature (Lőrinc)
9726979 bench: make ObfuscationBench more representative (Lőrinc)
618a30e test: compare util::Xor with randomized inputs against simple impl (Lőrinc)
a5141cd test: make sure dbwrapper obfuscation key is never obfuscated (Lőrinc)
54ab0bd refactor: commit to 8 byte obfuscation keys (Lőrinc)
7aa557a random: add fixed-size `std::array` generation (Lőrinc)

Pull request description:

  This change is part of [[IBD] - Tracking PR for speeding up Initial Block Download](#32043)

  ### Summary

  Current block obfuscations are done byte-by-byte, this PR batches them to 64 bit primitives to speed up obfuscating bigger memory batches.
  This is especially relevant now that #31551 was merged, which leaves us with bigger obfuscatable chunks.

  Since this obfuscation is optional, the speedup measured here depends on whether it's a [random value](#31144 (comment)) or [completely turned off](#31144 (comment)) (i.e. XOR-ing with 0).

  ### Changes in testing, benchmarking and implementation

  * Added new tests comparing randomized inputs against a trivial implementation and performing roundtrip checks with random chunks.
  * Migrated `std::vector<std::byte>(8)` keys to plain `uint64_t`;
  * Process unaligned bytes separately and unroll body to 64 bytes.

  ### Assembly

  Memory alignment is enforced by a small peel-loop (`std::memcpy` is optimized out on tested platform), with an `std::assume_aligned<8>` check, see the Godbolt listing at https://godbolt.org/z/59EMv7h6Y for details

  <details>
  <summary>Details</summary>

  Target & Compiler | Stride (per hot-loop iter) | Main operation(s) in loop | Effective XORs / iter
  -- | -- | -- | --
  Clang x86-64 (trunk) | 64 bytes | 4 × movdqu → pxor → store | 8 × 64-bit
  GCC x86-64 (trunk) | 64 bytes | 4 × movdqu/pxor sequence, enabled by 8-way unroll | 8 × 64-bit
  GCC RV32 (trunk) | 8 bytes | copy 8 B to temp → 2 × 32-bit XOR → copy back | 1 × 64-bit (as 2 × 32-bit)
  GCC s390x (big-endian 14.2) | 64 bytes | 8 × XC (mem-mem 8-B XOR) with key cached on stack | 8 × 64-bit

  </details>

  ### Endianness

  The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption.
  Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order.
  Since CI lacks a big-endian machine, testing was done locally via Docker.
  <details>
  <summary>Details</summary>

  ```bash
  brew install podman pigz
  softwareupdate --install-rosetta
  podman machine init
  podman machine start
  docker run --platform linux/s390x -it ubuntu:latest /bin/bash
    apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev python3 && \
    cd /mnt && git clone --depth=1 https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
    cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
    ./build/bin/test_bitcoin --run_test=streams_tests
  ```

  </details>
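  For illustration, the realignment mentioned above boils down to a 64-bit rotation of the obfuscation key - a minimal Python sketch of the idea (not the project's code):

  ```python
  def rotl64(key: int, bits: int) -> int:
      # Rotate the 64-bit key left so it lines up again when a batch
      # stops mid-key - the one endianness-sensitive operation noted above.
      bits %= 64
      return ((key << bits) | (key >> (64 - bits))) & 0xFFFFFFFFFFFFFFFF
  ```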

  ### Measurements (micro benchmarks and full IBDs)

  > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc/clang -DCMAKE_CXX_COMPILER=g++/clang++ && \
    cmake --build build -j$(nproc) && \
    build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000

  <details>
  <summary>GNU 14.2.0</summary>

  > Before:

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.84 |    1,184,138,235.64 |    0.0% |            9.01 |            3.03 |  2.971 |           1.00 |    0.1% |      5.50 | `ObfuscationBench`

  > After (first optimizing commit):

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.04 |   28,365,698,819.44 |    0.0% |            0.34 |            0.13 |  2.714 |           0.07 |    0.0% |      5.33 | `ObfuscationBench`

  > and (second optimizing commit):

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.03 |   32,464,658,919.11 |    0.0% |            0.50 |            0.11 |  4.474 |           0.08 |    0.0% |      5.29 | `ObfuscationBench`

  </details>

  <details>
  <summary>Clang 20.1.7</summary>

  > Before:

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.89 |    1,124,087,330.23 |    0.1% |            6.52 |            3.20 |  2.041 |           0.50 |    0.2% |      5.50 | `ObfuscationBench`

  > After (first optimizing commit):

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.08 |   13,012,464,203.00 |    0.0% |            0.65 |            0.28 |  2.338 |           0.13 |    0.8% |      5.50 | `ObfuscationBench`

  > and (second optimizing commit):

  |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
  |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
  |                0.02 |   41,231,547,045.17 |    0.0% |            0.30 |            0.09 |  3.463 |           0.02 |    0.0% |      5.47 | `ObfuscationBench`

  </details>

  i.e. 27.4x faster obfuscation with GCC, 36.7x faster with Clang

  For other benchmark speedups see  https://corecheck.dev/bitcoin/bitcoin/pulls/31144

  ------

  Running an IBD until block 888888 reveals a 4% speedup.

  <details>
  <summary>Details</summary>

  SSD:

  ```bash
  COMMITS="8324a00bd4a6a5291c841f2d01162d8a014ddb02 5ddfd31"; \
  STOP_HEIGHT=888888; DBCACHE=1000; \
  CC=gcc; CXX=g++; \
  BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
  (for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \
  hyperfine \
    --sort 'command' \
    --runs 1 \
    --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \
    --parameter-list COMMIT ${COMMITS// /,} \
    --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
      cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && \
      cmake --build build -j$(nproc) --target bitcoind && \
      ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \
    --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
    "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
  ```

  > 8324a00 test: Compare util::Xor with randomized inputs against simple impl
  > 5ddfd31 optimization: Xor 64 bits together instead of byte-by-byte

  ```python
  Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00)
    Time (abs ≡):        25033.413 s               [User: 33953.984 s, System: 2613.604 s]

  Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 5ddfd31)
    Time (abs ≡):        24110.710 s               [User: 33389.536 s, System: 2660.292 s]

  Relative speed comparison
          1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00)
          1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 5ddfd31)
  ```

  > HDD:

  ```bash
  COMMITS="71eb6eaa740ad0b28737e90e59b89a8e951d90d9 4685403"; \
  STOP_HEIGHT=888888; DBCACHE=4500; \
  CC=gcc; CXX=g++; \
  BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
  (for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \
  hyperfine \
    --sort 'command' \
    --runs 2 \
    --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \
    --parameter-list COMMIT ${COMMITS// /,} \
    --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
      cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
      ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \
    --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
    "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
  ```

  > 71eb6ea test: compare util::Xor with randomized inputs against simple impl
  > 4685403 optimization: migrate fixed-size obfuscation from `std::vector<std::byte>` to `uint64_t`

  ```python
  Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 71eb6ea)
    Time (mean ± σ):     37676.293 s ± 83.100 s    [User: 36900.535 s, System: 2220.382 s]
    Range (min … max):   37617.533 s … 37735.053 s    2 runs

  Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 4685403)
    Time (mean ± σ):     36181.287 s ± 195.248 s    [User: 34962.822 s, System: 1988.614 s]
    Range (min … max):   36043.226 s … 36319.349 s    2 runs

  Relative speed comparison
          1.04 ±  0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 71eb6ea)
          1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 4685403)
  ```

  </details>

ACKs for top commit:
  achow101:
    ACK 248b6a2
  maflcko:
    review ACK 248b6a2 🎻
  ryanofsky:
    Code review ACK 248b6a2. Looks good! Thanks for adapting this and considering all the suggestions. I did leave more comments below but none are important and this looks good as-is

Tree-SHA512: ef541cd8a1f1dc504613c4eaa708202e32ae5ac86f9c875e03bcdd6357121f6af0860ef83d513c473efa5445b701e59439d416effae1085a559716b0fd45ecd6
@l0rinc
Contributor Author

l0rinc commented Jul 20, 2025

We're making progress, #31144 was just merged! 🎉
The next ones that need some love are:

achow101 added a commit that referenced this pull request Jul 28, 2025
…line

d5104cf prevector: store `P2WSH`/`P2TR`/`P2PK` scripts inline (Lőrinc)
5212150 test: assert `CScript` allocation characteristics (Lőrinc)
65ac7f6 refactor: modernize `CScriptBase` definition (Lőrinc)
756da2a refactor: extract `STATIC_SIZE` constant to prevector (Lőrinc)

Pull request description:

  This change is part of [[IBD] - Tracking PR for speeding up Initial Block Download](#32043)

  ### Summary

  The current `prevector` size of 28 bytes (chosen to fill the `sizeof(CScript)` aligned size) was introduced in 2015 (#6914) before `SegWit` and `TapRoot`.
  However, the increasingly common `P2WSH` and `P2TR` scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.

  The core trade-off of this change is to eliminate heap allocations for common 34-36 byte scripts at the cost of increasing the base memory footprint of all `CScript` objects by 8 bytes (while still respecting peak memory usage defined by `-dbcache`).

  ### Context
  Increasing the `prevector` size allows these scripts to be stored inline, avoiding heap allocations, reducing potential memory fragmentation, and improving performance during cache flushes. Massif analysis confirms a lower stable memory usage after flushing, suggesting the elimination of heap allocations outweighs the larger base size for common workloads.

  Due to memory alignment, increasing the prevector size to 36 bytes doesn't change the overall `sizeof(CScript)` compared to an increase to 34 bytes, allowing us to include `P2PK` scripts as well at no additional memory cost.

  <details>
  <summary>Massif measurements</summary>

  > dbcache=440

  Massif before, with a heap threshold of `28`:
  ```bash
      MB
  744.1^#
       |#: ::::::@: :::::::   :@:: @::::::::::::::@@
       |#: ::::::@::::: :::   :@:::@:::::: :: ::::@
       |#: ::::::@::::: :::   :@:::@:::::: :: ::::@
       |#: ::::::@::::: ::: : :@:::@:::::: :: ::::@
       |#: ::::::@::::: ::: : :@:::@:::::: :: ::::@
       |#: ::::::@::::: ::: : :@:::@:::::: :: ::::@
       |#::::::::@::::: ::: : :@:::@:::::: :: ::::@
       |#::::::::@::::: ::: :::@:::@:::::: :: ::::@
       |#::::::::@::::: ::: :::@:::@:::::: :: ::::@
       |#::::::::@::::: ::: :::@:::@:::::: :: ::::@
       |#::::::::@::::: ::: :::@:::@:::::: :: ::::@
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
       |#::::::::@::::: :::::::@:::@:::::: :: ::::@ :::::@:::::@:::::@:::::@::::
     0 +----------------------------------------------------------------------->h
       0                                                                   1.805
  ```

  and after, with a heap threshold of `36`:
  ```bash
      MB
  744.2^       :
       |#  :  :::::::::::   : : :: ::: @@:::::: ::  :
       |#  :  :::: ::::::   : : :: ::: @ :: ::  :   :
       |#  :  :::: :::::::  : :@:: ::: @ :: ::  : :::
       |#  :  :::: :::::::  : :@:: ::: @ :: ::  : : :
       |#  :  :::: :::::::  : :@:: ::: @ :: ::  : : :
       |#  :  :::: :::::::  : :@:: ::: @ :: ::  : : :
       |#  :: :::: :::::::  : :@:: ::: @ :: ::  : : :
       |#  :: :::: :::::::  : :@:: ::::@ :: ::  : : :
       |#:::: :::: :::::::  :::@:: ::::@ :: ::  : : :
       |#: ::::::: :::::::  :::@:: ::::@ :: :: @: : :
       |#: ::::::: :::::::  :::@:::::::@ :: :: @: : :
       |#: ::::::: ::::::::::::@:::::::@ :: :: @: : :
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
       |#: ::::::: :::::::: :::@:::::::@ :: :: @: : :::@:::@::::@::::::@:::::@::
     0 +----------------------------------------------------------------------->h
       0                                                                   1.618
  ```

  ---

  > for `dbcache=4500`:

  Massif before, with a heap threshold of `28`:
  ```bash
      GB
  4.565^   ::
       | ##:   @@:::  :::: :@::::  :::: ::::
       | # :   @ ::   :::  :@: ::  : :: :::
       | # :   @ :: :::::  :@: ::  : :: :::
       | # :   @ :: : :::  :@: :: @: :: :::
       | # :   @ :: : :::  :@: :: @: :: :::
       | # :   @ :: : :::  :@: :: @: :: :::
       | # :   @ :: : :::  :@: :: @: :: :::
       | # : ::@ :: : :::  :@: :: @: :: :::
       | # : : @ :: : :::  :@: :: @: :: :::
       | # : : @ :: : :::  :@: :: @: ::::::
       | # : : @ :: : :::  :@: :: @: ::::::
       | # : : @ :: : :::  :@: :: @: ::::::
       | # : : @ :: : ::: ::@: :: @: ::::::
       | # : : @ :: : ::: ::@: :: @: ::::::
       | # : : @ :: : ::: ::@: :: @: ::::::
       | # : : @ :: : ::: ::@: :: @: :::::: @::
       | # : : @ :: : ::: ::@: :: @: :::::: @:
       | # : : @ :: : ::: ::@: :::@: :::::: @:
       | # : : @ :: : ::: ::@: :::@: :::::: @: :::::::::::::::::::::::::::::@:::
     0 +----------------------------------------------------------------------->h
       0                                                                   1.500
  ```

  and after, with a heap threshold of `36`:
  ```
      GB
  4.640^    :
       | ##::  :::::   ::::  ::::::@  ::::
       | # ::  : :::   ::::  :: :::@  ::::
       | # :: :: :::   ::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::
       | # :: :: :::  :::::  :: :::@  ::::  :@@
       | # :: :: :::  ::::: ::: :::@  :::::::@
       | # :: :: :::  ::::: ::: :::@  ::::: :@
       | # :: :: :::  ::::: ::: :::@::::::: :@
       | # ::::: :::  ::::: ::: :::@: ::::: :@
       | # ::::: :::  ::::: ::: :::@: ::::: :@
       | # ::::: :::::::::: ::: :::@: ::::: :@
       | # ::::: :::: ::::: ::: :::@: ::::: :@
       | # ::::: :::: ::::: ::: :::@: ::::: :@
       | # ::::: :::: ::::::::: :::@: ::::: :@
       | # ::::: :::: ::::::::: :::@: ::::: :@
       | # ::::: :::: ::::::::: :::@: ::::: :@ ::::::@:::@:::@::::@:::::@::::@::
     0 +----------------------------------------------------------------------->h
       0                                                                   1.360
  ```

  </details>

  ### Benchmarks and Memory

  Performance benchmarks for `AssumeUTXO` load and flush show:
  - Small dbcache (450MB): ~1-3% performance improvement (despite more frequent flushes)
  - Large dbcache (4500MB): ~6-8% performance improvement due to fewer heap allocations (and basically the same number of flushes)
  - Very large dbcache (45000MB): ~5-6% performance improvement due to fewer heap allocations (and the memory limit not being reached, so there's no memory penalty)

  Full IBD and `-reindex-chainstate` also show an overall ~3-4% speedup (both for smaller and larger dbcache values).

  We haven't investigated using different `prevector` sizes based on script type, though this could be explored in the future if needed.

  ### Historical explanation for the speedup (by [Anthony Towns](#32279 (comment)))

  > I think the tradeoff is something like:
  >
  > * spends of p2pk, p2sh, p2pkh coins -- these cost 8 more bytes
  > * spends of p2wpkh -- these cost 16 more bytes (sPK and scriptSig didn't need an allocation)
  > * spends of p2wsh and p2tr -- these cost ~48 fewer bytes (save 64 byte allocation on 64bit system, lose 8 bytes for both scriptSig and sPK)
  > * spends of nested p2wsh -- presumably save ~96 bytes, since the scriptSig would save an allocation, but I'm bundling it in the previous section
  >
  > Based on mainnet.observer stats for 2025-05-08, p2wpkh is about 55% of txs, p2tr is about 28%, p2pkh about 13%, p2wsh about 4% and the rest is noise, maybe? Those numbers net out to a saving of ~5.5 bytes per input. If p2wpkh rose from 55% to 80% and p2tr dropped to 20%, that would net to wasting ~3.2 bytes per input.
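
  To make the quoted arithmetic easy to verify, here is a quick back-of-the-envelope check (a sketch only; the shares and per-type byte costs are taken directly from the quote above):

  ```python
  # Positive cost = bytes lost per spend, negative = bytes saved
  # (shares from mainnet.observer, 2025-05-08, as quoted above).
  cost = {"p2wpkh": 16, "p2tr": -48, "p2pkh": 8, "p2wsh": -48}
  share = {"p2wpkh": 0.55, "p2tr": 0.28, "p2pkh": 0.13, "p2wsh": 0.04}
  print(sum(share[t] * cost[t] for t in cost))  # -5.52 -> ~5.5 bytes saved per input

  # Hypothetical future mix from the quote (p2wpkh 80%, p2tr 20%):
  print(0.80 * 16 + 0.20 * -48)  # 3.2 -> ~3.2 bytes wasted per input
  ```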

ACKs for top commit:
  maflcko:
    review ACK d5104cf 🐺
  achow101:
    reACK d5104cf
  jonatack:
    Review ACK d5104cf
  andrewtoth:
    ACK d5104cf

Tree-SHA512: 7c5271ebaf4f6d91dc4b679ecbde4b7d0467579f072289f30da988a17c38a552d0b8cdf0e9c001739975dd019894c35e541908571527916cec56e04a8e214ae2
@l0rinc
Contributor Author

l0rinc commented Jul 28, 2025

Thanks for reviewing and reproducing #32279 - it's also merged 🎉

Reviews and ACKs for the above 2 remaining ones would be very welcome.

@maflcko
Member

maflcko commented Jul 29, 2025

The remaining ones don't speed up IBD, do they?

@l0rinc
Contributor Author

l0rinc commented Jul 29, 2025

The big ones are merged, thanks for your help.
These are smaller ones - at least from an IBD perspective - but both are quite simple.

Edit: rebased #31645, which does measurably speed up a critical section of IBD.

@pstratem
Contributor

pstratem commented Aug 1, 2025

We're calling CBlockHeader::GetHash hundreds of times for the same CBlockHeader object from ActivateBestChain.

I chased the calls with this commit and found that all the results were from ProcessNewBlock → ActivateBestChain:

pstratem@7de2330
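
For readers who want to reproduce this kind of analysis, here is a hypothetical Python sketch of the idea (pstratem's actual commit instruments the C++ code directly): tally how often each header object gets hashed, so redundant call sites stand out.

```python
import hashlib
from collections import Counter

calls = Counter()  # header object id -> number of hash computations

def get_hash(header: bytes, obj_id: int) -> bytes:
    """Count every hash request made for the same (hypothetical) header object."""
    calls[obj_id] += 1
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

# After a run, the worst offenders surface as the largest counts:
# calls.most_common(10)
```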

@l0rinc
Contributor Author

l0rinc commented Aug 1, 2025

Thanks @pstratem, I did almost exactly the same; the main problem is indeed the loop in https://github.com/bitcoin/bitcoin/blob/master/src/validation.cpp#L3498, probably bounded by BLOCK_DOWNLOAD_WINDOW - hence the ~1000 repeated calls in the worst case.
I have a fix for most duplications and am running an IBD (and a reindex-chainstate) on my benchmarking servers to measure the effect of the deduplication. I expect at most a 1-2% change, but it may still be worth it.
I'll also investigate the effect of lowering BLOCK_DOWNLOAD_WINDOW.
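
As an illustration of the deduplication idea (a minimal sketch only, not the actual Bitcoin Core fix): compute the double-SHA256 of the serialized 80-byte header once and reuse it on subsequent calls.

```python
import hashlib
from functools import cached_property

class BlockHeader:
    """Sketch: lazily cache the block hash so repeated GetHash()-style
    calls on the same header object don't re-hash the same 80 bytes."""

    def __init__(self, raw_header: bytes):
        assert len(raw_header) == 80  # version, prev, merkle root, time, bits, nonce
        self._raw = raw_header

    @cached_property
    def hash(self) -> bytes:
        # A block hash is SHA256 applied twice to the serialized header.
        return hashlib.sha256(hashlib.sha256(self._raw).digest()).digest()
```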

@l0rinc
Contributor Author

l0rinc commented Aug 18, 2025

I have compared v29.1rc1 with the latest master branch containing these optimizations.

Both branches still have a defaultAssumeValid of 886'157, so performance will improve further once we adjust it before the release.


Running a `-reindex-chainstate` benchmark until block 909090 with a 5GB dbcache reveals a 9% speedup (8h:07m vs 7h:23m)!

i7, hdd, -reindex-chainstate
COMMITS="565af03c37d8262632543a078b2d8d29459d0b91 dadf15f88cbad37538d85415ae5da12d4f0f1721"; \
STOP=909090; DBCACHE=5000; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
hyperfine \
  --sort command \
  --runs 2 \
  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 10" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"

565af03 Merge #33056: [29.x] final changes for v29.1rc1
dadf15f Merge #33050: net, validation: don't punish peers for consensus-invalid txs

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
  Time (mean ± σ):     29308.463 s ± 72.560 s    [User: 41477.069 s, System: 1134.925 s]
  Range (min … max):   29257.156 s … 29359.771 s    2 runs

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)
  Time (mean ± σ):     26780.313 s ± 240.625 s    [User: 38463.008 s, System: 1017.529 s]
  Range (min … max):   26610.166 s … 26950.460 s    2 runs
  
Relative speed comparison
        1.09 ±  0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)

Running a full IBD (3 runs per commit) until block 909090 with a 5GB dbcache reveals a 21% speedup (9h:06m vs 7h:22m)! The speedup is larger than for `-reindex-chainstate` mostly because reindexes don't write blocks (and read fewer of them), so the block-storage optimizations only show up during a full IBD.

i9, SSD, full IBD
COMMITS="565af03c37d8262632543a078b2d8d29459d0b91 dadf15f88cbad37538d85415ae5da12d4f0f1721"; \
STOP=909090; DBCACHE=5000; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
hyperfine \
  --sort command \
  --runs 3 \
  --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 10" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"

565af03 Merge #33056: [29.x] final changes for v29.1rc1
dadf15f Merge #33050: net, validation: don't punish peers for consensus-invalid txs

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
  Time (mean ± σ):     32853.378 s ± 31.353 s    [User: 54364.406 s, System: 2245.790 s]
  Range (min … max):   32817.179 s … 32871.937 s    3 runs
 
Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)
  Time (mean ± σ):     27046.236 s ± 631.042 s    [User: 49185.359 s, System: 2613.586 s]
  Range (min … max):   26545.657 s … 27755.087 s    3 runs
 
Relative speed comparison
        1.21 ±  0.03  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = 565af03c37d8262632543a078b2d8d29459d0b91)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=909090 -dbcache=5000 -blocksonly -printtoconsole=0 (COMMIT = dadf15f88cbad37538d85415ae5da12d4f0f1721)
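
As a quick sanity check, the hyperfine means above reproduce the headline numbers:

```python
# Mean wall-clock seconds copied from the hyperfine output above.
reindex = 29308.463 / 26780.313  # ~1.09x -> the reported ~9% reindex speedup
ibd = 32853.378 / 27046.236      # ~1.21x -> the reported ~21% full-IBD speedup
print(f"reindex: {reindex:.2f}x, full IBD: {ibd:.2f}x")
```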

There are 2 optimizations that we could still include in the next release; reviewers could help with them:
