
Conversation

fjahr
Contributor

@fjahr fjahr commented Feb 27, 2024

This is a simple draft implementation of Schnorr signature batch verification in blocks, put here for conceptual discussion and collaborative benchmarking. Most of the code added here is from the secp256k1 implementation.

The secp256k1 code is itself still under review; please spend some time giving feedback on the API etc. if you can:

TODOs

  • Fix functional/feature_taproot.py
  • Extensive benchmarking using ConnectBlock() tracing
  • Add unit tests
  • Batch taproot tweak verification as well
  • Conceptual discussion of the form in which this could realistically be introduced into the code base and, later, a release

@DrahtBot
Contributor

DrahtBot commented Feb 27, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/29491.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #33101 (cmake: Proactively avoid use of SECP256K1_DISABLE_SHARED by hebasto)
  • #32724 (Musig2 tests by w0xlt)
  • #32575 (refactor: Split multithreaded case out of CheckInputScripts by fjahr)
  • #32317 (kernel: Separate UTXO set access from validation functions by TheCharlatan)
  • #31244 (descriptors: MuSig2 by achow101)
  • #31132 (validation: fetch block inputs on parallel threads 10% faster IBD by andrewtoth)
  • #29675 (wallet: Be able to receive and spend inputs involving MuSig2 aggregate keys by achow101)
  • #29247 (CAT in Tapscript (BIP-347) by arminsabouri)
  • #28690 (build: Introduce internal kernel library by TheCharlatan)
  • #28201 (Silent Payments: sending by josibake)
  • #28122 (Silent Payments: Implement BIP352 by josibake)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

  • In include/secp256k1_batch.h comment: “batch_add_xonpub_tweak_check” → “batch_add_xonlypub_tweak_check” [function name typo]
  • In src/secp256k1/src/bench.c printf: “Schnorr sigining algorithm” → “Schnorr signing algorithm” [spelling error]


@fjahr fjahr force-pushed the 2024-02-batch-validation-updated branch from 454783c to 0a5e612 on February 27, 2024 17:10
@DrahtBot
Contributor

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.

Leave a comment here, if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/22041704016

@Sjors
Member

Sjors commented Feb 28, 2024

I guess this is most interesting to test against the last 50K blocks on mainnet, i.e. since early 2023, given the high percentage of taproot transactions? Such testing could benefit from an assumeutxo snapshot, saving everyone the pain of rewinding. I can bake one, though ideally I'd like #26045 to land first.

Is the current PR already using batches? It's unclear to me by glancing over validation.cpp if and how it's collecting up to max_batch_size{106} signatures before calling batch.Verify().

Maybe add a -schnorrbatchverify startup argument so more people can try it once it gets through review.

@fjahr
Contributor Author

fjahr commented Feb 28, 2024

I guess this is most interesting to test against the last 50K blocks on mainnet, i.e. since early 2023, given the high percentage of taproot transactions? Such testing could benefit from an assumeutxo snapshot, saving everyone the pain of rewinding. I can bake one, though ideally I'd like #26045 to land first.

Yeah, that would be great. I have to get the tracing part to work in my setup again before I can start.

Is the current PR already using batches? It's unclear to me by glancing over validation.cpp if and how it's collecting up to max_batch_size{106} signatures before calling batch.Verify().

Yes, the rough architecture is as follows (sorry for not writing a better explanation above, but it's still very rough and I expect it to change). The batch object is constructed and then handed down from ConnectBlock() to CheckInputScripts(), which hands it to the CScriptCheck constructor. When the CScriptCheck instance is executed and the batch object is present, BatchingCachingTransactionSignatureChecker is used, which only differs from the default CachingTransactionSignatureChecker in that its implementation of VerifySchnorrSignature() adds the Schnorr sig to the batch object instead of verifying it right there. (Here we have an issue because the function is no longer doing what the name says, but it is a much simpler change this way and I find it hard to predict where we will land with this in the end.) Then, back in ConnectBlock(), the batch object is verified after all transactions have been iterated over.

The 106 is an implementation detail from secp256k1 that I am not sure really needs to be exposed here. But what matters for the question is: if there are more than 106 sigs added to the batch, the verification for the already added sigs will happen when the next sig is added, and if the verification fails the adding itself will fail. So, calling back to the last paragraph, every 106 sigs the naming of VerifySchnorrSignature() is actually still correct ;) and the one verify call at the end just verifies the last n % 106 sigs left for that block. I have not put any effort into documenting this properly here yet because the secp256k1 API is not finalized and the details on this particular topic might still change, see for example bitcoin-core/secp256k1#1134 (review)
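
To make this concrete, here is a minimal, self-contained sketch of the defer/flush/verify flow. The names are made up for illustration; neither this branch nor the experimental secp256k1 batch API look exactly like this. As a worked example, a block with 350 Schnorr sigs would trigger three implicit verifications of 106 sigs each while adding, and the final verify call would cover the remaining 350 % 106 = 32 sigs.

// Toy illustration only: hypothetical names, not this PR's classes or the secp256k1 API.
#include <cstddef>
#include <functional>
#include <vector>

struct Sig { /* signature, pubkey and message hash would live here */ };

class ToyBatchVerifier
{
    static constexpr size_t CAPACITY{106}; // mirrors secp256k1's internal batch size
    std::vector<Sig> m_pending;
    std::function<bool(const std::vector<Sig>&)> m_verify_all; // stand-in for real batch verification

public:
    explicit ToyBatchVerifier(std::function<bool(const std::vector<Sig>&)> verify_all)
        : m_verify_all(std::move(verify_all)) {}

    // Called from the signature checker instead of verifying immediately. If the batch
    // is already full, the pending sigs are verified first; a failure makes the add fail.
    bool Add(Sig sig)
    {
        if (m_pending.size() == CAPACITY && !Verify()) return false;
        m_pending.push_back(std::move(sig));
        return true;
    }

    // Called once per block in ConnectBlock() after all inputs have been processed;
    // only the last n % CAPACITY signatures are still pending at that point.
    bool Verify()
    {
        const bool ok{m_pending.empty() || m_verify_all(m_pending)};
        m_pending.clear();
        return ok;
    }
};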

Everything should be conditional on the presence of a batch object, which defaults to nullptr; when it is absent, all the other functions should work as usual, for example when used outside of the context of a new block.

Maybe add a -schnorrbatchverify startup argument so more people can try it once it gets through review.

Yeah, I have been thinking about that, or maybe even a compile-time option.
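
For reference, such an opt-in startup option would presumably follow the usual ArgsManager pattern. This is only a hypothetical sketch; nothing like it exists in the branch yet, and the option name and wiring are placeholders:

// Hypothetical: register an opt-in flag and read it where ConnectBlock() decides
// whether to construct a batch object at all.
argsman.AddArg("-schnorrbatchverify",
               "Use experimental batch verification of Schnorr signatures during block validation (default: 0)",
               ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);

const bool use_batch_verify{args.GetBoolArg("-schnorrbatchverify", /*fDefault=*/false)};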

@fjahr fjahr force-pushed the 2024-02-batch-validation-updated branch from 0a5e612 to 1825555 on February 28, 2024 13:14
@fjahr
Contributor Author

fjahr commented Feb 28, 2024

@Sjors maybe what confused you is that BatchingCachingTransactionSignatureChecker::VerifySchnorrSignatureBatch() was unused. That was actually a leftover from an earlier design spike and I just forgot to remove it. It's gone now.

@Sjors
Member

Sjors commented Feb 28, 2024

if there are more than 106 sigs added to the batch the verification for the already added sigs will happen when the next sig is added and if the verification fails the adding itself will fail.

That clarifies a lot, and should be documented :-)

@0xB10C
Contributor

0xB10C commented May 13, 2024

@willcl-ark and I ran an IBD benchmark vs master for this PR: IBD from a local node to height 840000, currently only looking at the connect_block duration.

In the current draft implementation:

  • master is currently faster
  • batch validation is attempted and decreases performance even before taproot activated
  • after taproot activation, batch validation is significantly slower than non-batch validation

Used this bpftrace script to collect CSV data:

#!/usr/bin/env bpftrace
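// Print one CSV row per connected block: height,tx,inputs,sigops,duration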

usdt:./src/bitcoind:validation:block_connected
{
  $height = (int32) arg1;
  $tx = (uint64) arg2;
  $ins = (int32) arg3;
  $sigops = (int64) arg4;
  $duration = (int64) arg5; // in nanoseconds

  printf("%d,%ld,%d,%ld,%ld\n", $height, $tx, $ins, $sigops, $duration);
}

@willcl-ark
Member

batch validation is attempted and decreases performance even before taproot activated

From a quick skim of the changes ISTM that we always use the BatchingCachingTransactionSignatureChecker, with no switch to the CachingTransactionSignatureChecker, but this could well be because this is only in a WIP state. It doesn't account for why it's currently slower in all cases, though.

@fjahr fjahr force-pushed the 2024-02-batch-validation-updated branch from 1825555 to 6f271d7 on December 5, 2024 13:57
@DrahtBot DrahtBot mentioned this pull request Dec 6, 2024
@andrewtoth
Contributor

It seems in this approach the m_batch_mutex is held for the entirety of adding, which will cause lock contention. Verify is then a blocking call on the main thread, which negates the multithreading.

I think a better approach would be to have a thread-local batch for each worker thread and the main thread, so each thread can add all Schnorr sigs without locking, and then each thread can verify its own batch in parallel at the end.
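
Roughly, the idea would look something like this (self-contained sketch with made-up names; the real CCheckQueue and secp256k1 integration are more involved):

#include <atomic>
#include <thread>
#include <vector>

struct ToyBatch
{
    // ...per-thread storage for queued Schnorr sigs...
    void Add(/* sig, pubkey, msghash */) {}
    bool Verify() { /* batch-verify everything this thread queued */ return true; }
};

bool CheckBlockSketch(int n_workers)
{
    std::vector<ToyBatch> batches(n_workers); // one batch per worker, so Add() needs no lock
    std::atomic<bool> all_ok{true};

    std::vector<std::thread> workers;
    for (int i = 0; i < n_workers; ++i) {
        workers.emplace_back([&, i] {
            // ...pull script checks from the queue and add Schnorr sigs to batches[i]...
            if (!batches[i].Verify()) all_ok = false; // each thread verifies its own batch at the end
        });
    }
    for (auto& t : workers) t.join();
    return all_ok; // reflects every per-thread batch; no single blocking Verify() on the main thread
}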

@fjahr fjahr force-pushed the 2024-02-batch-validation-updated branch from 6f271d7 to cafdd14 on March 11, 2025 23:43
@fjahr fjahr changed the title [DO NOT MERGE] Schnorr batch verification for blocks [EXPERIMENTAL] Schnorr batch verification for blocks Mar 11, 2025
@Eunovo
Contributor

Eunovo commented May 6, 2025

Can you share the setup for these benchmarks so I can test them? Thanks!

OS is Ubuntu 24.04.2 LTS

Output from lscpu

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                AuthenticAMD
  Model name:             AMD Ryzen 9 8945HS w/ Radeon 780M Graphics
    CPU family:           25
    Model:                117
    Thread(s) per core:   2
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             2
    Frequency boost:      enabled
    CPU(s) scaling MHz:   83%
    CPU max MHz:          5263.0000
    CPU min MHz:          400.0000

I'm running the IBD benchmark on a different machine. I'll share details when results are in.

EDIT
It seems you were referring to the command I ran. I used the ConnectBlockAllSchnorr microbenchmark; run it with build/bin/bench_bitcoin -filter=ConnectBlockAllSchnorr. For IBD benchmarking, I'm running bitcoind -assumevalid=0 -par={par} -stopatheight=880000 (I'm testing 1, 2, 4, 6, 8, 12 threads) with Benchkit.

Note that Benchkit is still experimental; it loads the 840000 assumeutxo snapshot and runs hyperfine.

@fjahr fjahr force-pushed the 2024-02-batch-validation-updated branch from 4ab16a0 to 4d9834e on May 18, 2025 23:57
@fjahr
Contributor Author

fjahr commented May 18, 2025

It seems you were referring to the command I ran. I used the ConnectBlockAllSchnorr microbenchmark; run it with build/bin/bench_bitcoin -filter=ConnectBlockAllSchnorr. For IBD benchmarking, I'm running bitcoind -assumevalid=0 -par={par} -stopatheight=880000 (I'm testing 1, 2, 4, 6, 8, 12 threads) with Benchkit.

I couldn't figure out how to test the different thread configurations for the microbench with Benchkit. Are you using Benchkit there as well or are you just setting worker_threads_num in test/util/setup_common.cpp? That's what I have been doing for now.

Copy operations in https://github.com/bitcoin/bitcoin/pull/29491/files#diff-612abe4a74f2e9f6903cebff057d47bfe4f8a57507c175743f4b62fde0d3cc58R95-R99

I am not sure how to do this completely without copying; using a span is unsafe because vChecks is cleared. Using emplace_back is possible and I am using it now, but in the benchmarks I didn't see a noticeable improvement from this. Can you be a bit more specific about what you had in mind?

Doing batch-verification work within the critical section https://github.com/bitcoin/bitcoin/pull/29491/files#diff-0f68d4063c2fb11000f9048e46478e26c0ea2622f7866d00b574f429c3a4db61R151

I have implemented this and see a performance boost from it for the case with 15 worker threads (I didn't test all the other thread configurations yet). On my machine I now get these results (running build/bin/bench_bitcoin -filter=ConnectBlockAllSchnorr -min-time=1000):

novo (15 worker threads)

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        9,285,356.00 |              107.70 |    1.6% |      1.09 | `ConnectBlockAllSchnorr`

this latest state 9d460e82e1404f7cfb29613d3852bbb99e0071ad (15 worker threads)

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        7,973,364.58 |              125.42 |    1.3% |      1.10 | `ConnectBlockAllSchnorr`

Thank you for the continued valuable feedback; I have made you a co-author of the relevant commit.

A few comments on the latest code in the branch you linked to (0493ba3):

  • Tracking if anything has been added to the batch and skipping verify makes sense as an idea, but I have some issues with it: First of all, using the return value of secp256k1_batch_add_schnorrsig seems wrong because it will return false if something has gone wrong in adding the signature, and that will usually mean that we have a problem; at least in the block verification process it definitely means that. So taking that to mean "we don't have to verify anymore, it's fine" doesn't seem right. Additionally, in my opinion we shouldn't have to track this in Bitcoin Core because I believe secp256k1 should handle it. Such behavior should happen closest to the actual data IMO, and if that is actually the case, calling verify should be costless. In the current secp code the actual verification steps seem to be skipped here: https://github.com/bitcoin-core/secp256k1/pull/1134/files#diff-5a633b880f8591184912dc3e49b3dffc2af714b2fc27404d0a1b12f6a2df7a71R191 so it would be surprising to me if adding m_needs_verify gives a real performance boost. Did you benchmark this change in isolation and notice a performance impact? If you did, I would be interested to dig into why that is. And fwiw, I could also be persuaded to not trust secp with this and track it ourselves if this is what we usually do in core, but I would need to look for some comparable reference code and understand the reasoning for this (seemingly) duplicate behavior.
  • IIUC you are using queue.empty() as the indicator that all the todos have been added. If I remember correctly, I tried this at some point too, but it turned out that there is no guarantee for that; it could be that the todos are just not added fast enough, hence the need for m_complete. So this means some threads could be verifying multiple times within a block. I don't see why this would be causing problems with the verification though; I think the result is still correct in this regard. But the question is how this impacts performance. Here as well I would find it kind of surprising if this is faster in practice, but maybe my thinking is wrong and your benchmarks tell us we should switch to this approach instead, letting threads verify their batch when they have the chance instead of letting them go to sleep. But I also think we need to wait until we have pippenger as well because that could influence these results significantly.
  • Running the loop one more time only with nNow = 0 to trigger the batch verification is very simple, but I don't see how you ensure that all the thread batches have been verified before the master thread returns. It looks to me like as soon as the master thread has verified its batch it simply returns, no matter how many (or any) other threads have verified their batches already, and any failures in these batches would be missed. This is what I track with m_total_verified here in the code (see the generic sketch below). Additionally, setting nNow = 0 has the unwanted side effect that in the cleanup phase of the next iteration the loop seems to be in the very first iteration again, and this leads to running nTotal++; giving the impression that there are more threads available. I guess this doesn't cause issues only because there is no tracking of how many of the total threads have already verified their batch.
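
For illustration, the counting idea boils down to something like this (generic, self-contained sketch with hypothetical names; the actual checkqueue logic is more involved):

#include <condition_variable>
#include <mutex>

class VerifyTracker
{
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_total_verified{0};

public:
    // Each thread (workers and master) calls this after verifying its own batch.
    void MarkVerified()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            ++m_total_verified;
        }
        m_cv.notify_all();
    }

    // The master blocks here until every participating thread has reported in,
    // so failures in worker batches cannot be missed.
    void WaitForAll(int n_threads)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return m_total_verified >= n_threads; });
    }
};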

A few more comments on this latest push:

  • I have also been warming up to the idea of having the thread-local batch object in Loop rather than in a checkqueue member; this saves us a few lines of code and I didn't see an impact on performance.
  • Additionally, I have been looking into some C++20 multithreading features since the latest push and tested using std::barrier instead of the counter, but so far I didn't see an improvement in performance or code clarity from it.
  • I have added a unit test for BatchSchnorrVerifier.
  • Just FYI, I don't think rebasing currently makes sense for me here because this would require rebasing the secp batch branch and the author is working on doing that shortly themselves so I would rather wait for that.

@Eunovo
Contributor

Eunovo commented Jun 12, 2025

I couldn't figure out how to test the different thread configurations for the microbench with Benchkit. Are you using Benchkit there as well or are you just setting worker_threads_num in test/util/setup_common.cpp? That's what I have been doing for now.

Sorry, I didn't state that. That's what I did for the microbench. I only use Benchkit for IBD benchmarks.

I'm going to review and test the latest changes.

@Eunovo
Contributor

Eunovo commented Jun 12, 2025

I am not sure how to do this completely without copy, using span is unsafe because vChecks is cleared. Using emplace_back is possible and I am using it now but in the benchmarks I didn’t see noticable improvement from this. Can you be a bit more specific what you had in mind?

I was thinking it might be better to move away from the collecting sigs idea if the copy operations present a significant slowdown. I'll see if I can detect any performance changes in testing.

  • Tracking if anything has been added to the batch and skipping verify makes sense as an idea but I have some issues with it: First of all, using the return value of secp256k1_batch_add_schnorrsig seems wrong because this will return false if something has gone wrong in adding the signature and this will usually mean that we have a problem, at least in the block verification process it definitely means that. So taking that to mean we don’t have to verify anymore, it’s fine doesn’t seem right. Additionally in my opinion we shouldn’t have to track this in bitcoin core because I believe secp256k1 should handle this. Such behavior should happen closest to the actual data IMO and if this is actually the case calling verify should be costless. In the current secp code the actual verification steps seem to be skipped here: https://github.com/bitcoin-core/secp256k1/pull/1134/files#diff-5a633b880f8591184912dc3e49b3dffc2af714b2fc27404d0a1b12f6a2df7a71R191 so it would be surprising to me if adding m_needs_verify gives a real performance boost. Did you benchmark this change in isolation and notice a performance impact? If you did see that I would be interested to dig into why that is. And fwiw, I could also be pursuaded to not trust secp with this and track this ourselves if this is what we usually do in core but I will need to look for some comparable reference code and understand the reasoning for this (seemingly) duplicate behavior.

Makes sense. I agree. We do not need to do this in core.

  • IIUC you are using queue.empty() as the indicator that all the todos have been added, if I remember correctly I tried this at some point too but it turned out that there is no guarantee for that and it could be that the todos are just not added fast enough, hence the need for m_complete. So this means some threads could be verifying multiple times within a block. I don’t see why this would be causing problems with the verification though, I think the result is still correct in this regard. But the question is how this impacts performance. Here as well I would find it kind of surprising if this is faster in practice but maybe my thinking is wrong and your benchmarks tell us we should switch to this approach instead, letting threads verify their batch when they have the chance instead of letting them go to sleep. But I also think we need to wait until we have pippenger as well because that could influence these results significantly I think.
  • Running the loop one more time only with nNow = 0 to trigger the batch verification is very simple but I don’t see how you ensure that all the thread batches have been verified before the master thread returns. It looks to me like as soon as the master thread has verified its batch it simply returns, no matter how many (or any) other threads have verified their batches already and any failures in these batches would be missed. This is what I track with m_total_verified here in the code. Additionally, setting nNow = 0 has the unwanted side-effect that in the cleanup phase of the next iteration the loop seems to be in the very first iteration again and this leads to running nTotal++; giving the impression that there are more threads available. I guess this doesn’t cause issues only because there is no tracking of how many of the total threads have already verified their batch.

It seems like the major reason for the performance drop was the batch-verification work being performed in the critical section. I'll try to reproduce your latest results.

}

secp256k1_xonly_pubkey pubkey_parsed;
(void)secp256k1_xonly_pubkey_parse(secp256k1_context_static, &pubkey_parsed, pubkey.data());

bcd51a4: why discard the return value of secp256k1_xonly_pubkey_parse()?

fjahr and others added 12 commits July 24, 2025 16:48
410abb205a batch: Generate speedup graphs
e5df505bad batch, extrakeys: Add benchmarks
642901f57a batch: Add tests for batch_add_* APIs
6378b6bcdc batch,ecmult: Add tests for core batch APIs and strauss_batch refactor
e5bbca63bf batch: Add example
f818fc0e4b batch: Add batch_add_* APIs
b47d8f3877 batch, ecmult: Add batch_verify and refactor strauss_batch
535a94b6d2 batch: Add create and destroy APIs
800233a2d4 batch: Initialize an experimental batch module
REVERT: b9313c6e1a Merge bitcoin-core/secp256k1#1708: release cleanup: bump version after 0.7.0
REVERT: a660a4976e Merge bitcoin-core/secp256k1#1707: release: Prepare for 0.7.0
REVERT: 7ab8b0cc01 release cleanup: bump version after 0.7.0
REVERT: a3e742d947 release: Prepare for 0.7.0
REVERT: f67b0ac1a0 ci: Don't hardcode ABI version
REVERT: 020ee60495 Merge bitcoin-core/secp256k1#1706: musig/tests: initialize keypair
REVERT: cde4130898 musig/tests: initialize keypair
REVERT: 6037833c9e Merge bitcoin-core/secp256k1#1702: changelog: update
REVERT: 40b4a06520 changelog: update
REVERT: 5e74086dc8 Merge bitcoin-core/secp256k1#1705: musig/test: Remove dead code
REVERT: 7c3380423c Merge bitcoin-core/secp256k1#1696: build: Refactor visibility logic and add override
REVERT: 8d967a602b musig/test: Remove dead code
REVERT: 983711cd6d musig/tests: Refactor vectors_signverify
REVERT: 73a695958a Merge bitcoin-core/secp256k1#1704: cmake: Make `secp256k1_objs` inherit interface defines from `secp256k1`
REVERT: bf082221ff cmake: Make `secp256k1_objs` inherit interface defines from `secp256k1`
REVERT: c82d84bb86 build: add CMake option for disabling symbol visibility attributes
REVERT: ce7923874f build: Add SECP256K1_NO_API_VISIBILITY_ATTRIBUTES
REVERT: e5297f6d79 build: Refactor visibility logic
REVERT: cbbbf3bd6e Merge bitcoin-core/secp256k1#1699: ci: enable musig module for native macOS arm64 job
REVERT: 943479a7a3 Merge bitcoin-core/secp256k1#1694: Revert "cmake: configure libsecp256k1.pc during install"
REVERT: 3352f9d667 ci: enable musig module for native macOS arm64 job
REVERT: ad60ef7ea7 Merge bitcoin-core/secp256k1#1689: ci: Convert `arm64` Cirrus tasks to GHA jobs
REVERT: c498779096 Merge bitcoin-core/secp256k1#1687: cmake: support the use of launchers in ctest -S scripts
REVERT: 44b205e9ee Revert "cmake: configure libsecp256k1.pc during install"
REVERT: 0dfe387dbe cmake: support the use of launchers in ctest -S scripts
REVERT: 89096c234d Merge bitcoin-core/secp256k1#1692: cmake: configure libsecp256k1.pc during install
REVERT: 7106dce6fd cmake: configure libsecp256k1.pc during install
REVERT: 004f57fcd8 ci: Move Valgrind build for `arm64` from Cirrus to GHA
REVERT: 5fafdfc30f ci: Move `gcc-snapshot` build for `arm64` from Cirrus to GHA
REVERT: e814b79a8b ci: Switch `arm64_debian` from QEMU to native `arm64` Docker image
REVERT: bcf77346b9 ci: Add `arm64` architecture to `docker_cache` job
REVERT: b77aae9226 ci: Rename Docker image tag to reflect architecture

git-subtree-dir: src/secp256k1
git-subtree-split: 410abb205a93b5c84a00a4e9e478c852b6dc6d69
… verification

Co-authored-by: Eunovo <eunovo9@gmail.com>
@DrahtBot
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.
