-
Notifications
You must be signed in to change notification settings - Fork 37.7k
bench: replace benchmark block with more representative one (413567 → 784588) #32457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This commit documents the current benchmark-base block's properties, to highlight the differences with the replacement block in the next commit.
https://mempool.space/block/413567 was mined in 2016, added as a benchmark-base in bitcoin#9049. It lacks modern script types, making the benchmarks unrepresentative of current usage. In this commit we're replacing it with https://mempool.space/block/784588 from 2023. This block was selected because it's old enough to include legacy script types encountered during IBD, while also containing modern script types in proportions that better reflect current block composition.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32457. ReviewsSee the guideline for information on the review process. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
Agree with the rationale of this PR, but having 1MB+ binary files in the repo is really meh. |
Agree - do you have a better idea? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a benchmark that needs this? If yes, going for synthetic, but representative (and easily adjustable) data may be a better choice for that benchmark.
#include <span.h> | ||
#include <util/strencodings.h> | ||
|
||
#include <vector> | ||
|
||
static void HexStrBench(benchmark::Bench& bench) | ||
{ | ||
auto const& data = benchmark::data::block413567; | ||
auto const& data = benchmark::data::block_784588; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a block, this could just be random bytes from a fast random context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this one definitely, but in the other cases I'm worried about introducing a strong bias.
It's not like we're changing these very often - but I'll investigate anyway, let's see how close we can get without adding 1.5 Mb to the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not just the amount of data, we're still scared from the xz backdoor incident 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understandable, but that's why I added the hashes here, to make it self-validating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hahaha agree it would be extremely far-fetched to put data in a specific block, just to add it in the repository two years later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this one definitely, but in the other cases I'm worried about introducing a strong bias.
Again, it would be good to list the benchmark that needs this. Also, serialization itself shouldn't care if the data is synthetic (random) or if it exactly matches a real past block. If you worry about a bias, it should actually be easier to provide synthetic data, than to try to find a fitting past block. In any case, there will always be a bias, even if the data is fully synthetic, as the real chain progresses and we probably don't want to update this for every release. For the benchmarks where it doesn't matter, I'd say to just leave them as-is. For the benchmarks where it matters, it would be good to explain why and then find a solution for each benchmark.
Yes, as there is a lot of random data in a block whose exact value isn't important to benchmarking (only that it's always the same), it seems possible to deterministically construct a similar block from code. |
Added a random block generator in #32554 - let me know if it makes sense so I can close this one. |
Closing in favor of #32554 |
Draft, until I investigate if we can generate a similar block instead of adding a real one to the repo
Summary
This PR replaces our benchmark's reference block with one that's more modern and representative of current usage patterns.
Context
The current benchmark block was mined in 2016 and added in PR #9049. Since it predates many modern script types, our benchmarks don't accurately reflect current network conditions.
Suggestion
We're replacing it with block 784588 from 2023, which provides a better balance - it's recent enough to include modern script types while still containing legacy scripts typically encountered during IBD.
The PR consists of two commits: