Skip to content

Conversation

l0rinc
Copy link
Contributor

@l0rinc l0rinc commented Feb 14, 2025

This change is part of [IBD] - Tracking PR for speeding up Initial Block Download


This PR is drafted until I remeasure everything after the recent merges and I need to find a way to simplify the 1 byte writes more nicely, I don't like all the specializations.


Summary

This PR contain a few different optimization I found by IBD profiling, and via the newly added block seralization benchmarks. It also takes advantage of the recently merged std::span changes enabling propagating static extents.

The commits merge similar (de)serialization methods, and separates them internally with if constexpr - similarly to how it has been done here before. This enabled further SizeComputer optimizations as well.

Context

Other than these, since single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector), it makes sense to avoid the generalized serialization infrastructure that isn't needed:

  • AutoFile write doesn't need to allocate 4k buffer for a single byte now;
  • VectorWriter and DataStream avoids memcpy/insert calls;
  • CSHA256::Write can avoid memcpy.

DeserializeBlock is dominated by the hash calculations so the optimizations barely affect it.

Measurements

C compiler ............................ AppleClang 16.0.0.16000026

Before:

ns/block block/s err% total benchmark
195,610.62 5,112.20 0.3% 11.00 SerializeBlock
12,061.83 82,906.19 0.1% 11.01 SizeComputerBlock

After:

ns/block block/s err% total benchmark
174,569.19 5,728.39 0.6% 10.89 SerializeBlock
10,241.16 97,645.21 0.0% 11.00 SizeComputerBlock

SerializeBlock - ~12.% faster
SizeComputerBlock - ~17.7% faster


C++ compiler .......................... GNU 13.3.0

Before:

ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
867,857.55 1,152.26 0.0% 8,015,883.90 3,116,099.08 2.572 1,517,035.87 0.5% 10.81 SerializeBlock
30,928.27 32,332.88 0.0% 221,683.03 111,055.84 1.996 53,037.03 0.8% 11.03 SizeComputerBlock

After:

ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
615,000.56 1,626.01 0.0% 8,015,883.64 2,208,340.88 3.630 1,517,035.62 0.5% 10.56 SerializeBlock
25,676.76 38,945.72 0.0% 159,390.03 92,202.10 1.729 42,131.03 0.9% 11.00 SizeComputerBlock

SerializeBlock - ~41.1% faster
SizeComputerBlock - ~20.4% faster


While this wasn't the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

Details
COMMITS="05314bde0b06b820225f10c6529b5afae128ff81 1cd94ec2511874ec68b92db34ad7ec7d9534fed1"; \
STOP_HEIGHT=880000; DBCACHE=10000; \
C_COMPILER=gcc; CXX_COMPILER=g++; \
hyperfine \
--export-json "/mnt/my_storage/ibd-${COMMITS// /-}-${STOP_HEIGHT}-${DBCACHE}-${C_COMPILER}.json" \
--runs 3 \
--parameter-list COMMIT ${COMMITS// /,} \
--prepare "killall bitcoind || true; rm -rf /mnt/my_storage/BitcoinData/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF -DCMAKE_C_COMPILER=$C_COMPILER -DCMAKE_CXX_COMPILER=$CXX_COMPILER && cmake --build build -j$(nproc) --target bitcoind && ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -printtoconsole=0 || true" \
--cleanup "cp /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s).log || true" \
"COMPILER=$C_COMPILER COMMIT={COMMIT} ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -prune=550 -printtoconsole=0"
Benchmark 1: COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
  Time (mean ± σ):     33647.918 s ± 508.655 s    [User: 71503.409 s, System: 4404.899 s]
  Range (min … max):   33283.439 s … 34229.026 s    3 runs
 
Benchmark 2: COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
  Time (mean ± σ):     33062.491 s ± 183.335 s    [User: 71246.532 s, System: 4318.490 s]
  Range (min … max):   32888.211 s … 33253.706 s    3 runs
 
Summary
  COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0 ran
    1.02 ± 0.02 times faster than COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0

@DrahtBot
Copy link
Contributor

DrahtBot commented Feb 14, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31868.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #33026 (test, refactor: Embedded ASmap selected preparatory work by fjahr)
  • #32729 (test,refactor: extract script template helpers & widen sigop count coverage by l0rinc)
  • #31682 ([IBD] specialize CheckBlock's input & coinbase checks by l0rinc)
  • #28792 (Embed default ASMap as binary dump header file by fjahr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@l0rinc
Copy link
Contributor Author

l0rinc commented Mar 9, 2025

Drafting until #31519 is merged, as recommended in #31868 (comment)

@l0rinc l0rinc marked this pull request as draft March 9, 2025 23:59
@l0rinc l0rinc force-pushed the lorinc/block-serialization-optimizations branch from c5bbf57 to 19d20ce Compare March 10, 2025 00:09
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/38458137649

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@l0rinc l0rinc marked this pull request as ready for review April 17, 2025 12:51
Copy link
Contributor

@TheCharlatan TheCharlatan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok, albeit that I'm seeing more modest performance improvements on my end.

@@ -723,6 +723,21 @@ CSHA256& CSHA256::Write(const unsigned char* data, size_t len)
}
return *this;
}
CSHA256& CSHA256::Write(unsigned char data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about all these crypto changes. Will this be noticable at all? Why only do it for the sha256 hasher? Maybe do these once/if the other single byte steam writer changes are accepted.

@maflcko
Copy link
Member

maflcko commented Apr 17, 2025

While this wasn't the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

Also, what code part is hitting this during IBD, given that reads and writes are now buffered?

@l0rinc l0rinc marked this pull request as draft April 17, 2025 15:43
@l0rinc
Copy link
Contributor Author

l0rinc commented Apr 17, 2025

what code part is hitting this during IBD

During serialization we have very many 1 byte writes, this change avoids the heavy allocations.
But drafting until I have time to properly remeasure everything after the recent merge.

@maflcko
Copy link
Member

maflcko commented Apr 17, 2025

I understand that there are many 1-byte writes, but they go into the BufferedReader/Writer, which is not modified here. The Buffered* then passes them to AutoFile, which is modified here, but never receives 1-byte writes anymore

@TheCharlatan
Copy link
Contributor

If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

I think block serialization being fast could be useful for future kernel users. Reading and de-serializing blocks is one of the common operations I'd expect outside of just pure block validation.

@maflcko
Copy link
Member

maflcko commented May 4, 2025

If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

I think block serialization being fast could be useful for future kernel users. Reading and de-serializing blocks is one of the common operations I'd expect outside of just pure block validation.

Ok, seems fine. It would be good to list the main motivation(s) in the OP. Currently, it only says IBD and that IBD isn't the main motivation. However, I don't understand how this could have any effect at all, from looking at the code, see #31868 (comment).

@l0rinc
Copy link
Contributor Author

l0rinc commented May 4, 2025

I will work on this more in the upcoming weeks, drafted it for now since I don't like the current single-byte specializations, have to find a better way. Suggestions are welcome.

@maflcko
Copy link
Member

maflcko commented May 4, 2025

My guess is that the concept specialization here at most could have an effect on xor'd IO? This is because anything not xor'd won't need a buffer to (un)do the xor before reading or writing.

Thus, it just seems easier to use the existing BufferedReader/Writer in places that could benefit from it?

However, for block serialization they are already used.

Again, no objection here, but it would be good to explain:

  • What is the goal of the changes here, and what real-world usage do they affect
  • How is it achieved and how do the benchmarks represent the real-world usage

Right now:

  • It only states IBD as motivation and also that IBD isn't the motivation. Also the IBD speedup measurement seems outdated.
  • The added benchmark is about test-only code paths (DataStream) or p2p code paths (VectorWriter). It is unclear what the AutoFile changes are about. Also, it is unclear how this could possibly affect the kernel, given that the kernel doesn't call into test-only or p2p-only code paths and instead uses BufferedReader/Writer.

@l0rinc
Copy link
Contributor Author

l0rinc commented Jul 29, 2025

Rebased to solve conflict, the PR is still in draft mode until I remeasure its effect on IBD - and I need to find a nicer way to do the single-byte specializations, since they have a measurable effect but make the code super-ugly... Suggestions are welcome.
In the meantime I have split out the first refactor into #33093.

achow101 added a commit that referenced this pull request Jul 31, 2025
…addata16be`

0431a69 cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` (Lőrinc)

Pull request description:

  Remove dead code from serialization logic - extracted from #31868

ACKs for top commit:
  maflcko:
    lgtm ACK 0431a69
  achow101:
    ACK 0431a69
  jlest01:
    ACK 0431a69

Tree-SHA512: 1881a164b2a91bb6033770db625f10b845bfb17a9898efb7612ca29ba113175b29e345beb84488f388b4639ade98df98e3411e663149bcbaec6e3abeffe1cbef
l0rinc added 5 commits July 31, 2025 18:32
Measure both full block serialization and size computation via `SizeComputer`.
`SizeComputer` returns the exact final size of the serialized content without writing any bytes.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          195,610.62 |            5,112.20 |    0.3% |     11.00 | `SerializeBlock`
|           12,061.83 |           82,906.19 |    0.1% |     11.01 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          867,857.55 |            1,152.26 |    0.0% |    8,015,883.90 |    3,116,099.08 |  2.572 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
|           30,928.27 |           32,332.88 |    0.0% |      221,683.03 |      111,055.84 |  1.996 |      53,037.03 |    0.8% |     11.03 | `SizeComputerBlock`
Merged multiple template methods into single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can be optimized because of C++20 constexpr conditions).
This unifies related methods that were only bound before by similar signatures - and enables `SizeComputer` optimizations later
Endianness doesn’t affect final size, so skip it in `SizeComputer`.
Fold existing overloads into one implementation, short‑circuiting logic when only the serialized size is needed.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          191,652.29 |            5,217.78 |    0.4% |     10.96 | `SerializeBlock`
|           10,323.55 |           96,865.92 |    0.2% |     11.01 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          614,847.32 |            1,626.42 |    0.0% |    8,015,883.64 |    2,207,628.07 |  3.631 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           26,020.31 |           38,431.52 |    0.0% |      159,390.03 |       93,438.33 |  1.706 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector).
It makes sense to avoid the generalized serialization infrastructure that isn't needed:
* AutoFile write doesn't need to allocate 4k buffer for a single byte now;
* `VectorWriter` and `DataStream` avoids memcpy/insert calls.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          174,569.19 |            5,728.39 |    0.6% |     10.89 | `SerializeBlock`
|           10,241.16 |           97,645.21 |    0.0% |     11.00 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          615,000.56 |            1,626.01 |    0.0% |    8,015,883.64 |    2,208,340.88 |  3.630 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           25,676.76 |           38,945.72 |    0.0% |      159,390.03 |       92,202.10 |  1.729 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
@l0rinc l0rinc force-pushed the lorinc/block-serialization-optimizations branch from a84f12e to f4cb559 Compare August 1, 2025 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants