-
Notifications
You must be signed in to change notification settings - Fork 37.7k
util/blockstorage: Add TRACE_RAII
, slightly faster -reindex-chainstate with CBufferedFile
#28226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/28226. ReviewsSee the guideline for information on the review process. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
So that CAutoFile is replaceable with CBufferedFile
That way we can easily get reliable statistics about the serialization code, without interference.
…File While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. This replaces the use of CAutoFile with CBufferedFile with a small buffer, leading to much fewer calls to fread(). With the TRACE_RAII in place it is easier to reliably measure performance of the critical part, just measuring `-reindex-chainstate` runtime can be quite noisy. Nevertheless, I ran this command which took about 30 minutes on my PC: ``` sync && sudo /sbin/sysctl vm.drop_caches=3 && ~/git/github.com/martinus/bitcoin/src/bitcoind -dbcache=20000 -reindex-chainstate -printtoconsole=0 -stopatheight=500000 ``` The measured time spent in unserializing blocks are: * 308.227s mergebase * 277.827s this commit This 9% improvement seems to translate into roughly 1.2% total runtime improvement, when running the benchmark 10 times and showing the average runtime: ``` Benchmark 1: ./bitcoind -dbcache=20000 -reindex-chainstate -printtoconsole=0 -stopatheight=500000 (mergebase) Time (mean ± σ): 1847.622 s ± 13.448 s [User: 1702.768 s, System: 54.059 s] Range (min … max): 1835.319 s … 1877.065 s 10 runs Benchmark 2: ./bitcoind -dbcache=20000 -reindex-chainstate -printtoconsole=0 -stopatheight=500000 (branch) Time (mean ± σ): 1824.660 s ± 14.394 s [User: 1679.417 s, System: 53.965 s] Range (min … max): 1798.764 s … 1848.940 s 10 runs ``` I have produced these results with `hyperfine`, with this command: ``` hyperfine \ 400 Mbps --parameter-list commit 0f3a6a7,ce26cb6025c47bb4b3e51a579689635da7ca1f6b \ --setup 'git checkout {commit} && make -j$(nproc)' \ --prepare 'sync; sudo /sbin/sysctl vm.drop_caches=3' \ './bitcoind -dbcache=20000 -reindex-chainstate -printtoconsole=0 -stopatheight=500000' ```
47c05d8
to
e926795
Compare
Shouldn't |
I think most of the slowdown is due to the locking required by each I don't mind dropping the change to |
It may be better to modify [1] I may submit a related pull to better enforce this at compile-time. |
🐙 This pull request conflicts with the target branch and needs rebase. |
There hasn't been much activity lately and the patch still needs rebase. What is the status here?
|
I tried to remove BufferedFile, but it seemed non-trivial. I think it is needed to recover from datadir corruption caused by previous releases of Bitcoin Core. I wonder how relevant this still is, and whether it is simpler to require those affected to re-do an IBD. Another small benefit would be that the fuzz timeout in |
⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?
|
🤔 There hasn't been much activity lately and the CI seems to be failing. If no one reviewed the current pull request by commit hash, a rebase can be considered. While the CI failure may be a false positive, the CI hasn't been running for some time, so there may be a real issue hiding as well. A rebase triggers the latest CI and makes sure that no silent merge conflicts have snuck in. |
⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?
|
2 similar comments
⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?
|
⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?
|
@martinus, are you still working on this? If not, I'd like to take over, I've also noticed the same during profiling... |
@l0rinc I'm not working on it any more, feel free to take over! |
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com>
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com>
For reference, |
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com>
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com>
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com>
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com>
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com> | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 6,107,470.52 | 163.73 | 0.0% | 65,474,032.49 | 21,525,453.92 | 3.042 | 3,986,348.13 | 0.3% | 10.56 | `ReadBlockFromDiskTest` | 960,629.11 | 1,040.98 | 0.0% | 9,011,495.94 | 3,234,205.43 | 2.786 | 1,002,472.91 | 0.0% | 11.00 | `ReadRawBlockFromDiskTest`
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com> | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 6,107,470.52 | 163.73 | 0.0% | 65,474,032.49 | 21,525,453.92 | 3.042 | 3,986,348.13 | 0.3% | 10.56 | `ReadBlockFromDiskTest`
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com> | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 6,107,470.52 | 163.73 | 0.0% | 65,474,032.49 | 21,525,453.92 | 3.042 | 3,986,348.13 | 0.3% | 10.56 | `ReadBlockFromDiskTest`
This commit revives bitcoin#28226: > While profiling `-reindex-changestate` I saw lots of fread() calls in in ReadBlockFromDisk which can be slow. > This replaces the use of `AutoFile` with `BufferedFile` with a small buffer, leading to much fewer calls to fread()." Running `ReadBlockFromDiskTest` with varying buffer sizes show that `16 KiB` is more effective than the original 1 KiB - achieving a roughly 25% performance improvement compared to master. Co-authored-by: Martin Leitner-Ankerl <martin.ankerl@gmail.com> | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 6,107,470.52 | 163.73 | 0.0% | 65,474,032.49 | 21,525,453.92 | 3.042 | 3,986,348.13 | 0.3% | 10.56 | `ReadBlockFromDiskTest`
…ization 8d801e3 optimization: bulk serialization writes in `WriteBlockUndo` and `WriteBlock` (Lőrinc) 520965e optimization: bulk serialization reads in `UndoRead`, `ReadBlock` (Lőrinc) 056cb3c refactor: clear up blockstorage/streams in preparation for optimization (Lőrinc) 67fcc64 log: unify error messages for (read/write)[undo]block (Lőrinc) a4de160 scripted-diff: shorten BLOCK_SERIALIZATION_HEADER_SIZE constant (Lőrinc) 6640dd5 Narrow scope of undofile write to avoid possible resource management issue (Lőrinc) 3197155 refactor: collect block read operations into try block (Lőrinc) c77e310 refactor: rename leftover WriteBlockBench (Lőrinc) Pull request description: This change is part of [[IBD] - Tracking PR for speeding up Initial Block Download](#32043) ### Summary We can serialize the blocks and undos to any `Stream` which implements the appropriate read/write methods. `AutoFile` is one of these, writing the results "directly" to disk (through the OS file cache). Batching these in memory first and reading/writing these to disk is measurably faster (likely because of fewer native fread calls or less locking, as [observed](#28226 (comment)) by Martinus in a similar change). ### Unlocking new optimization opportunities Buffered writes will also enable batched obfuscation calculations (implemented in #31144) - especially since currently we need to copy the write input's std::span to do the obfuscation on it, and batching enables doing the operations on the internal buffer directly. ### Measurements (micro benchmarks, full IBDs and reindexes) Microbenchmarks for `[Read|Write]BlockBench` show a ~**30%**/**168%** speedup with `macOS/Clang`, and ~**19%**/**24%** with `Linux/GCC` (the follow-up XOR batching improves these further): <details> <summary>macOS Sequoia - Clang 19.1.7</summary> > Before: | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 2,271,441.67 | 440.25 | 0.1% | 11.00 | `ReadBlockBench` | 5,149,564.31 | 194.19 | 0.8% | 10.95 | `WriteBlockBench` > After: | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 1,738,683.04 | 575.15 | 0.2% | 11.04 | `ReadBlockBench` | 3,052,658.88 | 327.58 | 1.0% | 10.91 | `WriteBlockBench` </details> <details> <summary>Ubuntu 24 - GNU 13.3.0</summary> > Before: | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 6,895,987.11 | 145.01 | 0.0% | 71,055,269.86 | 23,977,374.37 | 2.963 | 5,074,828.78 | 0.4% | 22.00 | `ReadBlockBench` | 5,152,973.58 | 194.06 | 2.2% | 19,350,886.41 | 8,784,539.75 | 2.203 | 3,079,335.21 | 0.4% | 23.18 | `WriteBlockBench` > After: | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 5,771,882.71 | 173.25 | 0.0% | 65,741,889.82 | 20,453,232.33 | 3.214 | 3,971,321.75 | 0.3% | 22.01 | `ReadBlockBench` | 4,145,681.13 | 241.21 | 4.0% | 15,337,596.85 | 5,732,186.47 | 2.676 | 2,239,662.64 | 0.1% | 23.94 | `WriteBlockBench` </details> 2 full IBD runs against master (compiled with GCC where the gains seem more modest) for **888888** blocks (seeded from real nodes) indicates a ~**7%** total speedup. <details> <summary>Details</summary> ```bash COMMITS="d2b72b13699cf460ffbcb1028bcf5f3b07d3b73a 652b4e3"; \ STOP_HEIGHT=888888; DBCACHE=1000; \ C_COMPILER=gcc; CXX_COMPILER=g++; \ BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \ (for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \ hyperfine \ --sort 'command' \ --runs 2 \ --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$C_COMPILER.json" \ --parameter-list COMMIT ${COMMITS// /,} \ --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \ cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF -DCMAKE_C_COMPILER=$C_COMPILER -DCMAKE_CXX_COMPILER=$CXX_COMPILER && \ cmake --build build -j$(nproc) --target bitcoind && \ ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \ --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \ "COMPILER=$C_COMPILER COMMIT=${COMMIT:0:10} ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0" d2b72b1 refactor: rename leftover WriteBlockBench 652b4e3 optimization: Bulk serialization writes in `WriteBlockUndo` and `WriteBlock` Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = d2b72b1) Time (mean ± σ): 41528.104 s ± 354.003 s [User: 44324.407 s, System: 3074.829 s] Range (min … max): 41277.786 s … 41778.421 s 2 runs Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 652b4e3) Time (mean ± σ): 38771.457 s ± 441.941 s [User: 41930.651 s, System: 3222.664 s] Range (min … max): 38458.957 s … 39083.957 s 2 runs Relative speed comparison 1.07 ± 0.02 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = d2b72b1) 1.00 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 652b4e3) ``` </details> ACKs for top commit: maflcko: re-ACK 8d801e3 🐦 achow101: ACK 8d801e3 ryanofsky: Code review ACK 8d801e3. Most notable change is switching from BufferedReader to ReadRawBlock for block reads, which makes sense, and there are also various cleanups in blockstorage and test code. hodlinator: re-ACK 8d801e3 Tree-SHA512: 24e1dee653b927b760c0ba3c69d1aba15fa5d9c4536ad11cfc2d70196ae16b9228ecc3056eef70923364257d72dc929882e73e69c6c426e28139d31299d08adc
TLDR:
TRACE_RAII
macro to easily trace runtime of a code block.CBufferedFile
inBlockManager::ReadBlockFromDisk
is slightly faster:-reindex-chainstate
While profiling
-reindex-changestate
I saw lots offread()
calls in inBlockManager::ReadBlockFromDisk
. This replaces the use ofCAutoFile
withCBufferedFile
with a small buffer, leading to much fewer calls tofread()
, which gives a little speedup.I measured runtime of the synchronization with the
TRACE_RAII
macro. I ran this command which took about 30 minutes on my PC, with and without CBufferedFile:The measured time spent in unserializing blocks are:
CAutoFile
CBufferedFile
It is a bit hard to measure the total effect on
-reindex-chainstate
due to random fluctuations in the benchmark. For somewhat reliable results I've run the benchmark 10 times, using hyperfine:The results are:
So the 9% improvement in unserializing seems to translate to roughly 1.2% total runtime improvement in
-reindex-chainstate
on my machine.