[IBD] specialize block serialization #31868

l0rinc · 2025-02-14T16:48:23Z

This change is part of [IBD] - Tracking PR for speeding up Initial Block Download

This PR is drafted until I remeasure everything after the recent merges and I need to find a way to simplify the 1 byte writes more nicely, I don't like all the specializations.

Summary

This PR contain a few different optimization I found by IBD profiling, and via the newly added block seralization benchmarks. It also takes advantage of the recently merged std::span changes enabling propagating static extents.

The commits merge similar (de)serialization methods, and separates them internally with if constexpr - similarly to how it has been done here before. This enabled further SizeComputer optimizations as well.

Context

Other than these, since single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector), it makes sense to avoid the generalized serialization infrastructure that isn't needed:

AutoFile write doesn't need to allocate 4k buffer for a single byte now;
VectorWriter and DataStream avoids memcpy/insert calls;
CSHA256::Write can avoid memcpy.

DeserializeBlock is dominated by the hash calculations so the optimizations barely affect it.

Measurements

C compiler ............................ AppleClang 16.0.0.16000026

Before:

ns/block	block/s	err%	total	benchmark
195,610.62	5,112.20	0.3%	11.00	`SerializeBlock`
12,061.83	82,906.19	0.1%	11.01	`SizeComputerBlock`

After:

ns/block	block/s	err%	total	benchmark
174,569.19	5,728.39	0.6%	10.89	`SerializeBlock`
10,241.16	97,645.21	0.0%	11.00	`SizeComputerBlock`

SerializeBlock - ~12.% faster
SizeComputerBlock - ~17.7% faster

C++ compiler .......................... GNU 13.3.0

Before:

ns/block	block/s	err%	ins/block	cyc/block	IPC	bra/block	miss%	total	benchmark
867,857.55	1,152.26	0.0%	8,015,883.90	3,116,099.08	2.572	1,517,035.87	0.5%	10.81	`SerializeBlock`
30,928.27	32,332.88	0.0%	221,683.03	111,055.84	1.996	53,037.03	0.8%	11.03	`SizeComputerBlock`

After:

ns/block	block/s	err%	ins/block	cyc/block	IPC	bra/block	miss%	total	benchmark
615,000.56	1,626.01	0.0%	8,015,883.64	2,208,340.88	3.630	1,517,035.62	0.5%	10.56	`SerializeBlock`
25,676.76	38,945.72	0.0%	159,390.03	92,202.10	1.729	42,131.03	0.9%	11.00	`SizeComputerBlock`

SerializeBlock - ~41.1% faster
SizeComputerBlock - ~20.4% faster

While this wasn't the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

Details

COMMITS="05314bde0b06b820225f10c6529b5afae128ff81 1cd94ec2511874ec68b92db34ad7ec7d9534fed1"; \
STOP_HEIGHT=880000; DBCACHE=10000; \
C_COMPILER=gcc; CXX_COMPILER=g++; \
hyperfine \
--export-json "/mnt/my_storage/ibd-${COMMITS// /-}-${STOP_HEIGHT}-${DBCACHE}-${C_COMPILER}.json" \
--runs 3 \
--parameter-list COMMIT ${COMMITS// /,} \
--prepare "killall bitcoind || true; rm -rf /mnt/my_storage/BitcoinData/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF -DCMAKE_C_COMPILER=$C_COMPILER -DCMAKE_CXX_COMPILER=$CXX_COMPILER && cmake --build build -j$(nproc) --target bitcoind && ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -printtoconsole=0 || true" \
--cleanup "cp /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s).log || true" \
"COMPILER=$C_COMPILER COMMIT={COMMIT} ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -prune=550 -printtoconsole=0"
Benchmark 1: COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
  Time (mean ± σ):     33647.918 s ± 508.655 s    [User: 71503.409 s, System: 4404.899 s]
  Range (min … max):   33283.439 s … 34229.026 s    3 runs
 
Benchmark 2: COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
  Time (mean ± σ):     33062.491 s ± 183.335 s    [User: 71246.532 s, System: 4318.490 s]
  Range (min … max):   32888.211 s … 33253.706 s    3 runs
 
Summary
  COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0 ran
    1.02 ± 0.02 times faster than COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0

DrahtBot · 2025-02-14T16:48:28Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31868.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#33026 (test, refactor: Embedded ASmap selected preparatory work by fjahr)
#32729 (test,refactor: extract script template helpers & widen sigop count coverage by l0rinc)
#31682 ([IBD] specialize CheckBlock's input & coinbase checks by l0rinc)
#28792 (Embed default ASMap as binary dump header file by fjahr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

src/streams.h

l0rinc · 2025-03-09T23:59:46Z

Drafting until #31519 is merged, as recommended in #31868 (comment)

DrahtBot · 2025-03-10T00:09:11Z

🚧 At least one of the CI tasks failed.
_{Debug: https://github.com/bitcoin/bitcoin/runs/38458137649}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

TheCharlatan

This looks ok, albeit that I'm seeing more modest performance improvements on my end.

TheCharlatan · 2025-04-17T13:58:17Z

src/crypto/sha256.cpp

@@ -723,6 +723,21 @@ CSHA256& CSHA256::Write(const unsigned char* data, size_t len)
    }
    return *this;
 }
+CSHA256& CSHA256::Write(unsigned char data)


I'm not sure about all these crypto changes. Will this be noticable at all? Why only do it for the sha256 hasher? Maybe do these once/if the other single byte steam writer changes are accepted.

maflcko · 2025-04-17T15:40:56Z

While this wasn't the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

Also, what code part is hitting this during IBD, given that reads and writes are now buffered?

l0rinc · 2025-04-17T15:47:52Z

what code part is hitting this during IBD

During serialization we have very many 1 byte writes, this change avoids the heavy allocations.
But drafting until I have time to properly remeasure everything after the recent merge.

maflcko · 2025-04-17T15:50:50Z

I understand that there are many 1-byte writes, but they go into the BufferedReader/Writer, which is not modified here. The Buffered* then passes them to AutoFile, which is modified here, but never receives 1-byte writes anymore

TheCharlatan · 2025-05-01T15:43:00Z

If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

I think block serialization being fast could be useful for future kernel users. Reading and de-serializing blocks is one of the common operations I'd expect outside of just pure block validation.

maflcko · 2025-05-04T13:15:13Z

If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

I think block serialization being fast could be useful for future kernel users. Reading and de-serializing blocks is one of the common operations I'd expect outside of just pure block validation.

Ok, seems fine. It would be good to list the main motivation(s) in the OP. Currently, it only says IBD and that IBD isn't the main motivation. However, I don't understand how this could have any effect at all, from looking at the code, see #31868 (comment).

l0rinc · 2025-05-04T13:49:40Z

I will work on this more in the upcoming weeks, drafted it for now since I don't like the current single-byte specializations, have to find a better way. Suggestions are welcome.

maflcko · 2025-05-04T14:46:48Z

My guess is that the concept specialization here at most could have an effect on xor'd IO? This is because anything not xor'd won't need a buffer to (un)do the xor before reading or writing.

Thus, it just seems easier to use the existing BufferedReader/Writer in places that could benefit from it?

However, for block serialization they are already used.

Again, no objection here, but it would be good to explain:

What is the goal of the changes here, and what real-world usage do they affect
How is it achieved and how do the benchmarks represent the real-world usage

Right now:

It only states IBD as motivation and also that IBD isn't the motivation. Also the IBD speedup measurement seems outdated.
The added benchmark is about test-only code paths (DataStream) or p2p code paths (VectorWriter). It is unclear what the AutoFile changes are about. Also, it is unclear how this could possibly affect the kernel, given that the kernel doesn't call into test-only or p2p-only code paths and instead uses BufferedReader/Writer.

l0rinc · 2025-07-29T23:34:29Z

Rebased to solve conflict, the PR is still in draft mode until I remeasure its effect on IBD - and I need to find a nicer way to do the single-byte specializations, since they have a measurable effect but make the code super-ugly... Suggestions are welcome.
In the meantime I have split out the first refactor into #33093.

…addata16be` 0431a69 cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` (Lőrinc) Pull request description: Remove dead code from serialization logic - extracted from #31868 ACKs for top commit: maflcko: lgtm ACK 0431a69 achow101: ACK 0431a69 jlest01: ACK 0431a69 Tree-SHA512: 1881a164b2a91bb6033770db625f10b845bfb17a9898efb7612ca29ba113175b29e345beb84488f388b4639ade98df98e3411e663149bcbaec6e3abeffe1cbef

Measure both full block serialization and size computation via `SizeComputer`. `SizeComputer` returns the exact final size of the serialized content without writing any bytes. > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000 > C compiler ............................ AppleClang 16.0.0.16000026 | ns/block | block/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 195,610.62 | 5,112.20 | 0.3% | 11.00 | `SerializeBlock` | 12,061.83 | 82,906.19 | 0.1% | 11.01 | `SizeComputerBlock` > C++ compiler .......................... GNU 13.3.0 | ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 867,857.55 | 1,152.26 | 0.0% | 8,015,883.90 | 3,116,099.08 | 2.572 | 1,517,035.87 | 0.5% | 10.81 | `SerializeBlock` | 30,928.27 | 32,332.88 | 0.0% | 221,683.03 | 111,055.84 | 1.996 | 53,037.03 | 0.8% | 11.03 | `SizeComputerBlock`

Merged multiple template methods into single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can be optimized because of C++20 constexpr conditions). This unifies related methods that were only bound before by similar signatures - and enables `SizeComputer` optimizations later

Endianness doesn’t affect final size, so skip it in `SizeComputer`. Fold existing overloads into one implementation, short‑circuiting logic when only the serialized size is needed. > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000 > C compiler ............................ AppleClang 16.0.0.16000026 | ns/block | block/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 191,652.29 | 5,217.78 | 0.4% | 10.96 | `SerializeBlock` | 10,323.55 | 96,865.92 | 0.2% | 11.01 | `SizeComputerBlock` > C++ compiler .......................... GNU 13.3.0 | ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 614,847.32 | 1,626.42 | 0.0% | 8,015,883.64 | 2,207,628.07 | 3.631 | 1,517,035.62 | 0.5% | 10.56 | `SerializeBlock` | 26,020.31 | 38,431.52 | 0.0% | 159,390.03 | 93,438.33 | 1.706 | 42,131.03 | 0.9% | 11.00 | `SizeComputerBlock`

Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector). It makes sense to avoid the generalized serialization infrastructure that isn't needed: * AutoFile write doesn't need to allocate 4k buffer for a single byte now; * `VectorWriter` and `DataStream` avoids memcpy/insert calls. > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000 > C compiler ............................ AppleClang 16.0.0.16000026 | ns/block | block/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 174,569.19 | 5,728.39 | 0.6% | 10.89 | `SerializeBlock` | 10,241.16 | 97,645.21 | 0.0% | 11.00 | `SizeComputerBlock` > C++ compiler .......................... GNU 13.3.0 | ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 615,000.56 | 1,626.01 | 0.0% | 8,015,883.64 | 2,208,340.88 | 3.630 | 1,517,035.62 | 0.5% | 10.56 | `SerializeBlock` | 25,676.76 | 38,945.72 | 0.0% | 159,390.03 | 92,202.10 | 1.729 | 42,131.03 | 0.9% | 11.00 | `SizeComputerBlock`

l0rinc mentioned this pull request Feb 14, 2025

reduce template bloat in primitive serialization bitcoin-dev-tools/benchcoin#116

Closed

theuni reviewed Feb 14, 2025

View reviewed changes

src/streams.h Outdated Show resolved Hide resolved

This was referenced Feb 15, 2025

[IBD] specialize CheckBlock's input & coinbase checks #31682

Open

[IBD] batch block reads/writes during AutoFile serialization #31551

Merged

refactor: Use std::span over Span #31519

Merged

[IBD] multi-byte block obfuscation #31144

Merged

DrahtBot added CI failed and removed CI failed labels Feb 16, 2025

l0rinc mentioned this pull request Feb 19, 2025

refactor: modernize outdated trait patterns using helper aliases (C++14/C++17) #31904

Merged

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from a7db42f to c5bbf57 Compare March 9, 2025 23:58

l0rinc marked this pull request as draft March 9, 2025 23:59

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from c5bbf57 to 19d20ce Compare March 10, 2025 00:09

DrahtBot added the CI failed label Mar 10, 2025

l0rinc force-pushed the lorinc/block-serialization-optimizations branch 2 times, most recently from f98fb61 to a9d311c Compare March 10, 2025 14:54

l0rinc marked this pull request as ready for review April 17, 2025 12:51

TheCharlatan reviewed Apr 17, 2025

View reviewed changes

l0rinc marked this pull request as draft April 17, 2025 15:43

DrahtBot mentioned this pull request Apr 17, 2025

refactor: reenable implicit-integer-sign-change check for serialize.h #32296

Merged

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from 5317366 to 073e28b Compare April 19, 2025 18:55

DrahtBot mentioned this pull request Apr 23, 2025

Embed default ASMap as binary dump header file #28792

Open

DrahtBot mentioned this pull request May 9, 2025

bench: replace benchmark block with more representative one (413567 → 784588) #32457

Closed

DrahtBot added CI failed Needs rebase labels May 16, 2025

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from 073e28b to a84f12e Compare July 29, 2025 23:26

l0rinc mentioned this pull request Jul 29, 2025

refactor: remove unused ser_writedata16be and ser_readdata16be #33093

Merged

DrahtBot removed the Needs rebase label Jul 30, 2025

DrahtBot mentioned this pull request Jul 30, 2025

test, refactor: Embedded ASmap selected preparatory work #33026

Open

DrahtBot removed the CI failed label Jul 30, 2025

DrahtBot mentioned this pull request Jul 30, 2025

test,refactor: extract script template helpers & widen sigop count coverage #32729

Open

l0rinc added 5 commits July 31, 2025 18:32

refactor: add explicit static extent to spans

f75a494

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from a84f12e to f4cb559 Compare August 1, 2025 01:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[IBD] specialize block serialization #31868

[IBD] specialize block serialization #31868

Uh oh!

l0rinc commented Feb 14, 2025 •

edited

Loading

Uh oh!

DrahtBot commented Feb 14, 2025 •

edited

Loading

Uh oh!

Uh oh!

l0rinc commented Mar 9, 2025

Uh oh!

DrahtBot commented Mar 10, 2025

Uh oh!

TheCharlatan left a comment

Uh oh!

TheCharlatan Apr 17, 2025

Uh oh!

maflcko commented Apr 17, 2025

Uh oh!

l0rinc commented Apr 17, 2025

Uh oh!

maflcko commented Apr 17, 2025

Uh oh!

TheCharlatan commented May 1, 2025

Uh oh!

maflcko commented May 4, 2025

Uh oh!

l0rinc commented May 4, 2025

Uh oh!

maflcko commented May 4, 2025

Uh oh!

l0rinc commented Jul 29, 2025

Uh oh!

Uh oh!

[IBD] specialize block serialization #31868

Are you sure you want to change the base?

[IBD] specialize block serialization #31868

Uh oh!

Conversation

l0rinc commented Feb 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Measurements

Uh oh!

DrahtBot commented Feb 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage & Benchmarks

Reviews

Conflicts

Uh oh!

Uh oh!

l0rinc commented Mar 9, 2025

Uh oh!

DrahtBot commented Mar 10, 2025

Uh oh!

TheCharlatan left a comment

Choose a reason for hiding this comment

Uh oh!

TheCharlatan Apr 17, 2025

Choose a reason for hiding this comment

Uh oh!

maflcko commented Apr 17, 2025

Uh oh!

l0rinc commented Apr 17, 2025

Uh oh!

maflcko commented Apr 17, 2025

Uh oh!

TheCharlatan commented May 1, 2025

Uh oh!

maflcko commented May 4, 2025

Uh oh!

l0rinc commented May 4, 2025

Uh oh!

maflcko commented May 4, 2025

Uh oh!

l0rinc commented Jul 29, 2025

Uh oh!

Uh oh!

l0rinc commented Feb 14, 2025 •

edited

Loading

DrahtBot commented Feb 14, 2025 •

edited

Loading