validation: fetch block inputs on parallel threads 10% faster IBD #31132

andrewtoth · 2024-10-22T14:40:28Z

When fetching inputs in ConnectBlock, each input is fetched from the cache in series. A cache miss means a round trip to the disk db to fetch the outpoint and insert it into the cache. Since the db is locked from being written during ConnectTip, we can fetch all block inputs missing from the cache in parallel on multiple threads before entering ConnectBlock. Using this strategy resulted in a 10% faster IBD.

Doing IBD with 16 vcores from a local peer with default settings, stopping at height 850k:

	Mean [s]	Min [s]	Max [s]	Relative
branch	17065.138 ± 117.439	16982.096	17148.181	1.00
master	18731.509 ± 94.142	18731.509	18864.646	1.10

For later blocks this change makes block connection even faster. Doing an assumeutxo from block 840k to 850k with 15 worker threads, this change is 26% faster. With just a single worker thread, this same benchmark is 6% faster.
Benchmark and flame graph with 15 worker threads
Benchmark and flame graph with 1 worker thread

I have fuzzed for over 500 million iterations with the provided fuzz harness with no issues.

This approach is heavily inspired by CCheckQueue, but we could not easily reuse it since it only checks for validity and doesn't allow us to store results. So, this PR creates a new InputFetcher that loops through all inputs of a block on the main thread and adds their outpoints to a shared vector. After writing, the main thread and worker threads assign ranges of outpoints from the vector and fetch them from the db, and then push the resulting coins onto a thread local vector. Once the threads have finished reading all inputs, the main thread loops through all thread local vectors and inserts the results into the cache.

This PR uses the -par value for the number of threads, which defaults to the number of vcores on the machine or 15 whichever is fewer. This is the same value used for CCheckQueue, so any users that specifically have the multi threaded validation disabled by using -par=1 will also have this feature disabled. This also means the maximum number of input fetching threads is capped at 15.

Since InputFetcher::FetchInputs is blocking, a follow-up can update this to share the thread pool between CCheckQueue and InputFetcher.

DrahtBot · 2024-10-22T14:40:31Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31132.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept ACK	l0rinc, ismaelsadeeq

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#32313 (coins: fix cachedCoinsUsage accounting in CCoinsViewCache by l0rinc)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot · 2024-10-22T14:45:47Z

🚧 At least one of the CI tasks failed.
_{Debug: https://github.com/bitcoin/bitcoin/runs/31894441286}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

l0rinc

Concept ACK

I'm still missing tests and benchmarks here and I think we need to find better default values for SSD and HDD parallelism, and I'd be interested in how coroutines would perform here instead of trying to find the best batching size manually.

src/inputfetcher.h

src/validation.cpp

src/inputfetcher.h

l0rinc · 2024-10-23T12:35:29Z

src/inputfetcher.h

+    bool m_request_stop GUARDED_BY(m_mutex){false};
+
+    /** Internal function that does the fetching from disk. */
+    void Loop() noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)


We're basically mimicking RocksDB's MultiGet here - but prewarming the cache instead in separate get requests, since we can't really access LevelDB's internals.

Since splitting into buckets isn't trivial and since MultiGet seems to rely on C++20 coroutines (which wasn't available in 2012 when CCheckQueue was written), I'm wondering how much simpler this fetching would be if we had lightweight suspendible threads instead: https://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html#multiget

I think it would be similar in complexity, we would still need all the locking mechanisms to prevent multithreaded access.

What would really be great is if we had a similar construction to Rust's std::sync::mpsc.

Can you tell me why we need to prevent multithreaded access exactly? We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread, right?

How would mpsc solve this better? Do you think we need work stealing to make it perfectly parallel? Wouldn't coroutines already achieve the same?

I haven't yet experimented with them, but as far as I understand it, coroutines are just programming paradigm, not magic; they don't do anything of their own, besides making things that were already possible easier to write. In particular, you still need a thread pool or some mechanism for scheduling how to run them,

We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread

If the vectors are thread local, then how can the main thread access them at the end to write them? We also want to be writing throughout while the workers are fetching, not just at the end.

How would mpsc solve this better?

Instead of each worker thread having a local queue of results, which they then append to the global results queue, they could just push each result to the channel individually. The main thread could just pull results off the channel as they arrive, instead of waiting to be awoken by a worker thread that appended all its results to the global queue.

work stealing

That is a concept for async rust, or std::async::mpsc. We can do all this without introducing an async runtime. But, this is getting off topic.

coroutines are just programming paradigm, not magic

That's also what I was counting on! :D

In RocksDB they have high and low priority work (I assume that's just added to the front or the back of a background work deque) – this could align well with @furszy's suggestion for mixing different kinds of background work units.

I haven't used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime. Combined with an appropriate scheduling mechanism (such as a thread pool), we could maximize both CPU and IO usage, if I'm not mistaken.
Instead of each thread handling just one task, it could suspend a coroutine while waiting on IO (e.g., a database fetch) and resume it later, effectively maximizing CPU and IO work without needing to know the exact details of the work.

If the vectors are thread local

The vector would still be global, but each thread would only access a single bucket (i.e. global vector of vectors, with each thread from the pool writing only to vector[thread_id], which contains a vector of fetched coins).
When all the work is finished, we'd iterate over the global vector and merge the results into the cache on a single thread.
As mentioned, sorting the outpoints before fetching could help improve data locality and reduce lock contention, and the coroutines above would help with work stealing, ensuring that all threads finish roughly at the same time.

Is there anything prohibiting us from doing something like this to minimize synchronization and lock contention during the fetch phase? I understand some synchronization would still be needed during the merge, but this could help reduce global locks and unnecessary synchronization throughout the process.

I haven't used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime.

That needs async I/O, and is unrelated to coroutines, as far as I understand it. Coroutines just help with keeping track of what to do when the reads come back inside rocksdb.

As long as LevelDB (or whatever database engine we use) internally does not use async I/O, there will be one (waiting) thread per parallel outstanding read request from the database.

Is there anything prohibiting us from doing something like this to minimize synchronization and lock contention during the fetch phase? I understand some synchronization would still be needed during the merge, but this could help reduce global locks and unnecessary synchronization throughout the process.

As far as I know, the advantage of coroutines over threads is faster context switching, since it doesn’t go through the operating system kernel. This advantage only becomes apparent under extremely high concurrency, such as hundreds of thousands of concurrent tasks. Using coroutines does not eliminate the need for synchronization mechanisms where they are inherently required.

furszy

Cool idea.

Since the inputs fetcher call is blocking, instead of creating a new set of worker threads, what do you think about re-using the existing script validation ones (or any other unused worker threads) by implementing a general-purpose thread pool shared among the validation checks?
The script validation checks and the inputs fetching mechanism are never done concurrently because you need the inputs in order to verify the scripts. So, they could share workers.

This should be benchmarked because it might add some overhead but, #26966 introduces such structure inside 401f21b, which we could pull off and use for validation.

andrewtoth · 2024-10-23T14:48:28Z

implementing a general-purpose thread pool shared among the validation checks?

Nice, yes that would be great! That would simplify this PR a lot if it could just schedule tasks on worker threads and receive the responses, instead of implementing all the sync code itself.

#26966 introduces such structure inside 401f21b, which we could pull off and use for validation.

Concept ACK!

l0rinc · 2024-10-24T18:12:38Z

Finished benching on a HDD until 860k on Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, CPU = 8:

Summary
'COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0' ran
 1.02 times faster than 'COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=
0'

f278ca4 coins: allow emplacing non-dirty coins internally (39993.343777768874 seconds = 11.1 hours)
e9e23b5 validation: fetch block inputs in parallel (40929.84310861388 seconds = 11.3 hours)

~~So likely on HDD we shouldn't use so many threads, apparently it slows down IBD.~~
Maybe we could add a new config option (iothreads or iothreadmultiplier or something).
The defaults should likely depend on whether it's an SSD or HDD.

Edit:

Previous results

"command": "COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
"times": [39993.343777768874],

"command": "COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
"times": [40929.84310861388],

I have retried the same with half the parallelism (rebased, but no other change in the end, otherwise the results would be hard to interpret):

"command": "COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
"times": [40579.00445769842],

So it's a tiny bit faster than before (surprisingly stable for an actual IBD with real peers), but still slower-than/same-as before, so not sure why it's not faster.

Edit:

Running it on a HDD with a low dbcache value reproduces the original result:

benchmark

hyperfine --runs 1 --show-output --export-json /mnt/my_storage/ibd_full-threaded-inputs-3.json --parameter-list COMMIT 92fc718592be55812b2c73a3bf57599fc81425fa,8207d372b2fac24af0f8999b30e71e88d40b3a13 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'

8207d372b2 validation: fetch block inputs in parallel
92fc718592 coins: allow emplacing non-dirty coins internally
Summary
  'COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0' ran
    1.16 times faster than 'COMMIT=92fc718592be55812b2c73a3bf57599fc81425fa ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'

andrewtoth · 2024-10-24T18:31:14Z

So likely on HDD we shouldn't use so many threads, apparently it slows down IBD.

I'm not sure we can conclude that from your benchmark. It used a very high dbcache setting, which makes the effect of this change less important. It also is syncing from untrusted network peers, so there is some variance which could also account for the 2% difference.

DrahtBot · 2024-10-24T19:25:38Z

🚧 At least one of the CI tasks failed.
_{Debug: https://github.com/bitcoin/bitcoin/runs/32027275494}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

DrahtBot · 2024-10-26T18:52:01Z

🚧 At least one of the CI tasks failed.
_{Debug: https://github.com/bitcoin/bitcoin/runs/32107893176}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

andrewtoth · 2024-12-04T01:02:47Z

Rebased. Since #30039 reading inputs is much faster, so the effect of this is somewhat less significant (17% -> 10%). It's still a significant speedup though so still worth it. Especially for worst case where the cache is completely empty, like on startup or right after it gets flushed due to size.

It is also refactored significantly. The main thread now writes everything before notifying threads, and then joins in working. This lets us do significantly less work in the critical section and parallelize more checks.

DrahtBot · 2024-12-04T01:29:50Z

🚧 At least one of the CI tasks failed.
_{Debug: https://github.com/bitcoin/bitcoin/runs/33884531020}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

maflcko · 2025-06-10T07:02:30Z

Looks like the CI started failing, due to too many threads being launched in the functional tests with that parallelism? As the threads may open files, this could be hitting the max open files limit? Or maybe it is a different limit hit?

HowHsu · 2025-06-26T16:42:59Z

Hi folks, this looks great, since if all the prevout coins of all transactions of a block are loaded in advance, then the optimization in #32791 makes sense.

DrahtBot added the Validation label Oct 22, 2024

andrewtoth force-pushed the threaded-inputs branch from 6d96d87 to 9fcd08e Compare October 22, 2024 14:45

DrahtBot added the CI failed label Oct 22, 2024

andrewtoth changed the title ~~validation: fetch block inputs parallel threads ~17% faster IBD~~ validation: fetch block inputs on parallel threads ~17% faster IBD Oct 22, 2024

andrewtoth force-pushed the threaded-inputs branch from 9fcd08e to fe0fe59 Compare October 22, 2024 14:59

andrewtoth mentioned this pull request Oct 22, 2024

Don't wipe coins cache when full and instead evict LRU clean entries #31102

Closed

andrewtoth force-pushed the threaded-inputs branch 2 times, most recently from e2feb0c to e9e23b5 Compare October 22, 2024 18:46

DrahtBot removed the CI failed label Oct 22, 2024

DrahtBot mentioned this pull request Oct 23, 2024

refactor: prohibit direct flags access in CCoinsCacheEntry and remove invalid tests #30906

Merged

l0rinc suggested changes Oct 23, 2024

View reviewed changes

furszy reviewed Oct 23, 2024

View reviewed changes

andrewtoth force-pushed the threaded-inputs branch 2 times, most recently from 942f300 to 8207d37 Compare October 24, 2024 19:25

DrahtBot added CI failed and removed CI failed labels Oct 24, 2024

andrewtoth force-pushed the threaded-inputs branch 2 times, most recently from a0f6902 to a00dbc5 Compare October 26, 2024 18:51

DrahtBot added the CI failed label Oct 26, 2024

andrewtoth force-pushed the threaded-inputs branch 4 times, most recently from 7115abe to f7ab210 Compare October 26, 2024 23:07

andrewtoth changed the title ~~validation: fetch block inputs on parallel threads ~17% faster IBD~~ validation: fetch block inputs on parallel threads 10% faster IBD Dec 4, 2024

andrewtoth force-pushed the threaded-inputs branch from c2e3f6e to 9d33a3a Compare December 4, 2024 01:29

DrahtBot added the CI failed label Dec 4, 2024

andrewtoth force-pushed the threaded-inputs branch from 9d33a3a to a544477 Compare December 4, 2024 01:56

DrahtBot removed Needs rebase CI failed labels Dec 4, 2024

andrewtoth mentioned this pull request Dec 4, 2024

Experiment: Reuse coin fetches in CheckTxInputs bitcoin-dev-tools/benchcoin#57

Open

DrahtBot added the Needs rebase label Dec 5, 2024

andrewtoth added 6 commits December 4, 2024 19:42

coins: allow emplacing non-dirty coins internally

07d3078

coins: add inputfetcher

d73eacc

tests: add inputfetcher tests

cb5188c

bench: add inputfetcher bench

054b783

fuzz: add inputfetcher fuzz harness

e8c038c

validation: fetch block inputs in parallel

b2da764

andrewtoth force-pushed the threaded-inputs branch from a544477 to b2da764 Compare December 5, 2024 00:43

DrahtBot removed the Needs rebase label Dec 5, 2024

DrahtBot mentioned this pull request Dec 10, 2024

kernel, logging: Pass Logger instances to kernel objects #30342

Draft

DrahtBot mentioned this pull request Jan 22, 2025

Use number of dirty cache entries in flush warnings/logs #31703

Open

l0rinc mentioned this pull request Mar 12, 2025

[IBD] Tracking PR for speeding up Initial Block Download #32043

Draft

11 tasks

DrahtBot mentioned this pull request Mar 12, 2025

[EXPERIMENTAL] Schnorr batch verification for blocks #29491

Draft

5 tasks

DrahtBot mentioned this pull request Apr 20, 2025

coins: fix cachedCoinsUsage accounting in CCoinsViewCache #32313

Open

DrahtBot added the CI failed label Jun 2, 2025

DrahtBot mentioned this pull request Jun 21, 2025

checkqueue: implement a new scriptcheck worker pool with atomic variables #32791

Draft

HowHsu mentioned this pull request Jun 29, 2025

checkqueue: set MAX_SCRIPTCHECK_THREADS to nCores - 1 #32692

Draft

DrahtBot mentioned this pull request Jul 8, 2025

POC: IPC tracing interface #32898

Closed

maflcko mentioned this pull request Aug 21, 2025

ci: Migrate CI to hosted Cirrus Runners #32989

Open

validation: fetch block inputs on parallel threads 10% faster IBD #31132

Are you sure you want to change the base?

validation: fetch block inputs on parallel threads 10% faster IBD #31132

Conversation

andrewtoth commented Oct 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DrahtBot commented Oct 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage & Benchmarks

Reviews

Conflicts

Uh oh!

DrahtBot commented Oct 22, 2024

Uh oh!

l0rinc left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

l0rinc Oct 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andrewtoth Oct 23, 2024

Choose a reason for hiding this comment

Uh oh!

l0rinc Oct 23, 2024

Choose a reason for hiding this comment

Uh oh!

sipa Oct 23, 2024

Choose a reason for hiding this comment

Uh oh!

andrewtoth Oct 23, 2024

Choose a reason for hiding this comment

Uh oh!

l0rinc Oct 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sipa Oct 23, 2024

Choose a reason for hiding this comment

Uh oh!

HowHsu Jun 28, 2025

Choose a reason for hiding this comment

Uh oh!

furszy left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andrewtoth commented Oct 23, 2024

Uh oh!

l0rinc commented Oct 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrewtoth commented Oct 24, 2024

Uh oh!

DrahtBot commented Oct 24, 2024

Uh oh!

DrahtBot commented Oct 26, 2024

Uh oh!

andrewtoth commented Dec 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DrahtBot commented Dec 4, 2024

Uh oh!

maflcko commented Jun 10, 2025

Uh oh!

HowHsu commented Jun 26, 2025

Uh oh!

Uh oh!

andrewtoth commented Oct 22, 2024 •

edited

Loading

DrahtBot commented Oct 22, 2024 •

edited

Loading

l0rinc left a comment •

edited

Loading

l0rinc Oct 23, 2024 •

edited

Loading

l0rinc Oct 23, 2024 •

edited

Loading

furszy left a comment •

edited

Loading

l0rinc commented Oct 24, 2024 •

edited

Loading

andrewtoth commented Dec 4, 2024 •

edited

Loading