Skip to content

Conversation

l0rinc
Copy link
Contributor

@l0rinc l0rinc commented May 9, 2025

Draft, until I investigate if we can generate a similar block instead of adding a real one to the repo


Summary

This PR replaces our benchmark's reference block with one that's more modern and representative of current usage patterns.

Context

The current benchmark block was mined in 2016 and added in PR #9049. Since it predates many modern script types, our benchmarks don't accurately reflect current network conditions.

Suggestion

We're replacing it with block 784588 from 2023, which provides a better balance - it's recent enough to include modern script types while still containing legacy scripts typically encountered during IBD.

The PR consists of two commits:

  • first documenting the current block's script type distribution;
  • then replacing it with the new block and updating assertions accordingly.

l0rinc added 2 commits May 9, 2025 11:14
This commit documents the current benchmark-base block's properties, to highlight the differences with the replacement block in the next commit.
https://mempool.space/block/413567 was mined in 2016, added as a benchmark-base in bitcoin#9049.
It lacks modern script types, making the benchmarks unrepresentative of current usage.
In this commit we're replacing it with https://mempool.space/block/784588 from 2023. This block was selected because it's old enough to include legacy script types encountered during IBD, while also containing modern script types in proportions that better reflect current block composition.
@DrahtBot
Copy link
Contributor

DrahtBot commented May 9, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32457.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32554 (RFC: bench: replace embedded raw block with configurable block generator by l0rinc)
  • #32532 (script: short-circuit GetLegacySigOpCount for known scripts by l0rinc)
  • #31682 ([IBD] specialize CheckBlock's input & coinbase checks by l0rinc)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot DrahtBot added the Tests label May 9, 2025
@laanwj
Copy link
Member

laanwj commented May 9, 2025

Agree with the rationale of this PR, but having 1MB+ binary files in the repo is really meh.

@l0rinc
Copy link
Contributor Author

l0rinc commented May 9, 2025

Agree - do you have a better idea?

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a benchmark that needs this? If yes, going for synthetic, but representative (and easily adjustable) data may be a better choice for that benchmark.

#include <span.h>
#include <util/strencodings.h>

#include <vector>

static void HexStrBench(benchmark::Bench& bench)
{
auto const& data = benchmark::data::block413567;
auto const& data = benchmark::data::block_784588;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a block, this could just be random bytes from a fast random context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this one definitely, but in the other cases I'm worried about introducing a strong bias.
It's not like we're changing these very often - but I'll investigate anyway, let's see how close we can get without adding 1.5 Mb to the repo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not just the amount of data, we're still scared from the xz backdoor incident 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understandable, but that's why I added the hashes here, to make it self-validating.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hahaha agree it would be extremely far-fetched to put data in a specific block, just to add it in the repository two years later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this one definitely, but in the other cases I'm worried about introducing a strong bias.

Again, it would be good to list the benchmark that needs this. Also, serialization itself shouldn't care if the data is synthetic (random) or if it exactly matches a real past block. If you worry about a bias, it should actually be easier to provide synthetic data, than to try to find a fitting past block. In any case, there will always be a bias, even if the data is fully synthetic, as the real chain progresses and we probably don't want to update this for every release. For the benchmarks where it doesn't matter, I'd say to just leave them as-is. For the benchmarks where it matters, it would be good to explain why and then find a solution for each benchmark.

@laanwj
Copy link
Member

laanwj commented May 9, 2025

Is there a benchmark that needs this? If yes, going for synthetic, but representative (and easily adjustable) data may be a better choice for that benchmark.

Yes, as there is a lot of random data in a block whose exact value isn't important to benchmarking (only that it's always the same), it seems possible to deterministically construct a similar block from code.

@l0rinc
Copy link
Contributor Author

l0rinc commented May 18, 2025

Added a random block generator in #32554 - let me know if it makes sense so I can close this one.

@l0rinc
Copy link
Contributor Author

l0rinc commented May 21, 2025

Closing in favor of #32554

@l0rinc l0rinc closed this May 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants