
Conversation

@matthew1001 (Contributor) commented Aug 16, 2024

PR description

Introduces a new (experimental) "Bonsai Archive" DB mode which creates a full archive of the chain it syncs with. This allows JSON-RPC calls to be made with historic blocks as context, for example eth_getBalance to get the balance of an account at a historic block, or eth_call to simulate a transaction at a given block in history.

The PR is intended to provide part of the functionality currently offered by the (now deprecated) FOREST DB mode. Specifically, it allows state to be queried at an arbitrary block in history, but does not currently offer eth_getProof for that state. A subsequent PR will implement eth_getProof for historic blocks.
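
For illustration, querying historic state over JSON-RPC might look like the following (a minimal sketch, assuming the node exposes JSON-RPC on the default port 8545; the account and block number reuse the example from the design summary below, 37834 = 0x93ca):

curl -X POST --data '{"jsonrpc":"2.0","method":"eth_getBalance","params":["0x0e79065B5F11b5BD1e62B935A600976ffF3754B9","0x93ca"],"id":1}' http://localhost:8545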

Summary of the overall design & changes

This PR builds on PR #5865 which proved the basic concept of archiving state in the Bonsai flat DB by suffixing entries with the block in which they were changed.

For example the state for account 0x0e79065B5F11b5BD1e62B935A600976ffF3754B9 at block 37834 is stored as

<account-hash><block-num-hex> = 0x9ab656e8fa2a1029964289c9a189083db258ca4b46ebaa374477e069b8f47dec00000000000093ca
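
A minimal sketch of how such a key can be composed (illustrative helper, not code from the PR, using the Tuweni Bytes API that appears elsewhere in this changeset):

```java
import org.apache.tuweni.bytes.Bytes;
import org.apache.tuweni.bytes.Bytes32;

// Illustrative: append the 8-byte big-endian block number to the 32-byte
// account hash. Bytes.ofUnsignedLong(37834) yields 0x00000000000093ca,
// matching the suffix in the example key above.
static Bytes archiveKey(final Bytes32 accountHash, final long blockNumber) {
  return Bytes.concatenate(accountHash, Bytes.ofUnsignedLong(blockNumber));
}
```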

In order to minimise performance degradation over time, historic state and storage entries in the DB are "archived" by moving them into a separate DB segment.

Where account state is stored in segment ACCOUNT_INFO_STATE, state that has been archived is stored in ACCOUNT_INFO_STATE_ARCHIVE. Likewise where storage is held in segment ACCOUNT_STORAGE_STORAGE, archived storage entries are stored in ACCOUNT_STORAGE_ARCHIVE.

An example Rocks DB query to retrieve the state of the example account above would be:

ldb --db=. get --column_family=ACCOUNT_INFO_STATE_ARCHIVE --key_hex --value_hex 0x9ab656e8fa2a1029964289c9a189083db258ca4b46ebaa374477e069b8f47dec00000000000093ca

Creating a Bonsai Archive node

The PR introduces an entirely new data storage format (as opposed to making it a configuration option of the existing BONSAI storage format).

To create a Bonsai archive node, simply set --data-storage-format=x_bonsai_archive when creating it.

An existing FOREST or BONSAI node cannot be migrated to BONSAI_ARCHIVE mode.
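
For example, a minimal launch command might look like this (a sketch only; the network flag is a placeholder, and FULL sync is discussed under "Sync time" below):

besu --network=holesky --sync-mode=FULL --data-storage-format=x_bonsai_archive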

Storage requirements

An archive node intrinsically requires more storage space than a non-archive node, since every state update is retained in the archive DB segments as outlined above. An archive node for the Holesky testnet, as of raising this PR, requires approximately 160 GiB of storage.

Sync time

In order to create an archive of an entire chain, FULL sync mode must be used. This PR does not prevent SNAP syncing an archive node, but this will result in only a partial archive of the chain.

While the node is performing a FULL sync with the chain, it is also migrating entries from the regular DB segments to the archive DB segments. Overall this increases the time it takes to create the archive node; for a public chain, completing syncing and archiving might take a week or more.

@matthew1001 force-pushed the multi-version-flat-db-rebase branch 2 times, most recently from 88f3968 to 7d4a524 on August 20, 2024 10:19
@matthew1001 changed the title from "Multi version flat db rebase" to "Bonsai archive feature" on Sep 4, 2024
@matthew1001 force-pushed the multi-version-flat-db-rebase branch 7 times, most recently from 782ae60 to 5752732 on October 2, 2024 17:10
@matthew1001 force-pushed the multi-version-flat-db-rebase branch 4 times, most recently from 5b06b50 to dce531e on October 4, 2024 16:11
@matthew1001 marked this pull request as ready for review on October 7, 2024 16:31
jframe and others added 15 commits October 8, 2024 08:38
Signed-off-by: Jason Frame <jason.frame@consensys.net>
…se constructor that reuses worldStateStorage so that we don't lose values in the EvmToolSpecTests

Signed-off-by: Jason Frame <jason.frame@consensys.net>
Signed-off-by: Jason Frame <jason.frame@consensys.net>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
…d state, and freeze it

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
…ten for blocks and move account state to new DB segment

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
…t block state has been frozen for

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
…age from the freezer segment

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
macfarla added a commit to macfarla/besu that referenced this pull request Mar 18, 2025
Signed-off-by: Sally MacFarlane <macfarla.github@gmail.com>
macfarla added a commit that referenced this pull request Mar 28, 2025
* check for null

Signed-off-by: Sally MacFarlane <macfarla.github@gmail.com>

* more nuanced change with help from #7475

Signed-off-by: Sally MacFarlane <macfarla.github@gmail.com>

* changelog

Signed-off-by: Sally MacFarlane <macfarla.github@gmail.com>

---------

Signed-off-by: Sally MacFarlane <macfarla.github@gmail.com>

This PR is stale because it has been open for 30 days with no activity.

@github-actions bot added the Stale label on Apr 11, 2025

This PR was closed because it has been inactive for 14 days since being marked as stale.

@matthew1001 force-pushed the multi-version-flat-db-rebase branch from 17c2d8b to 9079602 on May 29, 2025 09:08
@matkt (Contributor) left a comment

Just added some small comments,

I would prefer to have only BONSAI with an archive flag, but maybe we can do that in another PR.

For me it's OK; I will wait for feedback from @garyschulte

.getComposedWorldStateStorage()
.startTransaction();
genesisStateTX.put(
TRIE_BRANCH_STORAGE, WORLD_BLOCK_NUMBER_KEY, Bytes.ofUnsignedLong(0).toArrayUnsafe());
Contributor

Is there a risk that this value is initialized to zero for a node that was already fully synced before this PR and then restarted with the new code?

Maybe instead of checking if this specific field is empty, we should check if the node is already synced? That way, a previously synced node would just update the value during the next block import, avoiding incorrect initialization.

Contributor Author

I'm not sure the code currently supports starting an already-synced node with a new data storage format. If the node previously started with --data-storage-format=BONSAI, starting it again with X_BONSAI_ARCHIVE would fail. So I believe this is safe for any node that is using the new storage format.

Contributor

I'm talking about a node started with BONSAI and restarted with BONSAI plus this PR. It seems we are calling ensureGenesisArchiveContext in BONSAI mode too, no?

Contributor Author

I see what you mean. I don't think we'd be calling writeAccountsTo() on genesis state for an already-synced Bonsai node, would we?

.map(MutableWorldState::disableTrie)
.flatMap(
worldState ->
rollMutableArchiveStateToBlockHash( // This is a tiny action for archive
Contributor

👌

accountFound =
storage
.getNearestBefore(ACCOUNT_INFO_STATE_ARCHIVE, keyNearest)
.filter(found -> accountHash.commonPrefixLength(found.key()) >= accountHash.size())
Contributor

Maybe we can compute accountHash.commonPrefixLength(found.key()) only once?

Contributor Author

You mean generate accountHash.commonPrefixLength(found.key()) before calling .getNearestBefore, so the code becomes something like:

int comPrefLength = accountHash.commonPrefixLength(found.key());
...

      .filter(found -> comPrefLength >= accountHash.size())

Contributor Author

Just realised that's not possible since found doesn't exist until we're streaming the results. So I'm not quite sure what change you're suggesting?

.filter(
found ->
!Arrays.areEqual(
DELETED_ACCOUNT_VALUE, found.value().orElse(DELETED_ACCOUNT_VALUE)))
Contributor

Is it important to keep the deleted value if we have ACCOUNT_INFO_STATE_ARCHIVE?

Contributor Author

Is it important to keep the deleted value if we have ACCOUNT_INFO_STATE_ARCHIVE?

No, I think that's a good point. I've updated the logic in a new commit so the deleted key filter is applied regardless of which DB table the entry is retrieved from.
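
The resulting shape is roughly the following (a condensed sketch reusing the identifiers from the snippets quoted above, and assuming the lookups return Optional, as the chained filters suggest; this is an illustration of the fix, not the committed code):

```java
// Sketch: consult the primary segment first, fall back to the archive segment,
// then apply the deleted-entry (tombstone) filter once so it covers both paths.
final Optional<NearestKeyValue> accountFound =
    storage
        .getNearestBefore(ACCOUNT_INFO_STATE, keyNearest)
        .filter(found -> accountHash.commonPrefixLength(found.key()) >= accountHash.size())
        .or(
            () ->
                storage
                    .getNearestBefore(ACCOUNT_INFO_STATE_ARCHIVE, keyNearest)
                    .filter(
                        found ->
                            accountHash.commonPrefixLength(found.key()) >= accountHash.size()))
        .filter(
            found ->
                !Arrays.areEqual(
                    DELETED_ACCOUNT_VALUE, found.value().orElse(DELETED_ACCOUNT_VALUE)));
```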

@matthew1001 (Contributor Author)

I would prefer to have only BONSAI with an archive flag, but maybe we can do that in another PR.

Yeah I think there was some discussion about that early on in the archive work and IIRC Gary favoured a specific data storage format rather than a config option on BONSAI.

@matkt (Contributor) left a comment

Just waiting for Gary's approval also before merging

@matthew1001 (Contributor Author)

Just waiting for Gary's approval also before merging

@garyschulte what do you think?

Signed-off-by: Matthew Whitehead <matthew.whitehead@kaleido.io>
@garyschulte (Contributor) left a comment

Looks really great. I left comments and questions regarding feature/function choices, but all of the changes seem to be compartmentalized to the new experimental archive storage mode, so we can address here or in a future PR IMO.

For the record, I think we need a separate storage format for archive, as you have it in this PR. The archive segments, metadata, and behavior are significantly different to BONSAI. If we don't differentiate between them as separate formats the potential to corrupt a very-expensive-to-sync archive is just one accidental config change away.

I have my misgivings about communicating the block context via a worldstate storage key (rather than treating it as an ephemeral parameter), BUT it seems the implications of that behavior are limited only to X_BONSAI_ARCHIVE. We can discover how this works in operational testing.

FWIW, I have a couple of test nodes using this PR: a full sync Bonsai node that is right around block 12 million, and an archive node started at the same time that is around block 6 million. There are probably opportunities for performance improvement in future PRs.

Thanks for keeping this PR updated and waiting for prague 🙏

@@ -118,6 +118,7 @@ public void run() {
switch (dataStorageFormat) {
case FOREST -> 1;
case BONSAI -> 2;
case X_BONSAI_ARCHIVE -> 3;
Contributor

This is probably OK, but we should never have a Bonsai archive node with the prior metadata format, since the metadata change predates the Bonsai archive feature. It might make sense to just throw here.

Comment on lines +68 to +72
} else if (dataStorageConfiguration.getDataStorageFormat()
== DataStorageFormat.X_BONSAI_ARCHIVE) {
LOG.info("setting FlatDbStrategy to ARCHIVE");
transaction.put(
TRIE_BRANCH_STORAGE, FLAT_DB_MODE, FlatDbMode.ARCHIVE.getVersion().toArrayUnsafe());
Contributor

Do we actually want to do this? If we are transitioning from partial to archive, we implicitly do not have a real archive node. Perhaps there could be some value in running an archive node "since" a block height, but I would imagine an archive sync strategy would be a better fit than a snap sync strategy that transitions to archive when complete 🤔

Contributor Author

Yeah, the archive + proofs PR #8669 does some refactoring in this area, because I also wasn't totally happy with the way partial/full/archive were handled here. So for now I'm going to leave it unchanged in this PR, because I think the follow-on PR much improves things.

Comment on lines +139 to +145
// If there isn't a match look in the archive DB segment
if (nearestAccount.isEmpty()) {
accountFound =
storage
.getNearestBefore(ACCOUNT_INFO_STATE_ARCHIVE, keyNearest)
.filter(found -> accountHash.commonPrefixLength(found.key()) >= accountHash.size());

Contributor

If we are looking in the context of MAX_BLOCK_SUFFIX, we shouldn't have to defer to the archive, right? Current/unchanged account values should not be moved to the archive unless something has gone wrong with archiving, right? If we can skip this step we should see substantially faster "zero read" cases for non-historical queries.

Contributor Author

Yes, agreed. I think calculateArchiveKeyWithMaxSuffix() needs renaming, since it only uses MAX_BLOCK_SUFFIX if an archive context block isn't set. So what's needed is some additional logic to check whether we ended up defaulting to MAX_BLOCK_SUFFIX, and if so skip the archive lookup (plus the possible function rename). If it's OK, I'd like to start getting other people looking at the code and tracking archive bugs etc. as standalone issues, so for this and the other comment I'm going to raise an issue rather than fix it in this PR.
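
A rough sketch of the guard being described (the usedMaxBlockSuffix flag and the method name are hypothetical, introduced here purely to illustrate the proposed skip; identifiers otherwise follow the quoted snippets):

```java
// Hypothetical guard: skip the archive segment when the key defaulted to
// MAX_BLOCK_SUFFIX, i.e. when this is a head-state rather than a historical query.
private Optional<NearestKeyValue> findNearestAccount(
    final Bytes keyNearest, final Hash accountHash, final boolean usedMaxBlockSuffix) {
  final Optional<NearestKeyValue> nearestAccount =
      storage
          .getNearestBefore(ACCOUNT_INFO_STATE, keyNearest)
          .filter(found -> accountHash.commonPrefixLength(found.key()) >= accountHash.size());
  if (nearestAccount.isPresent() || usedMaxBlockSuffix) {
    return nearestAccount; // head state answered the query, or nothing to archive-check
  }
  // Genuinely historical lookup: defer to the archive segment.
  return storage
      .getNearestBefore(ACCOUNT_INFO_STATE_ARCHIVE, keyNearest)
      .filter(found -> accountHash.commonPrefixLength(found.key()) >= accountHash.size());
}
```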

Contributor

Yup, definitely reasonable to defer to a subsequent issue. I think we can skip the archive lookup not only for MAX_BLOCK_SUFFIX, but also for block heights above the high-water mark for archiving. If we expose the current archive block number via the bonsai context, we can make that decision very cheaply and get a significant boost for zero reads 👍

Contributor Author

I've raised #8850 to track this fix

Comment on lines +333 to +334
// If there isn't a match look in the archive DB segment
if (nearestStorage.isEmpty()) {
Contributor

Same as the comment above regarding zero-read performance.

Comment on lines +198 to +199
// TODO - rename calculateRootHash() to be clearer that it updates state, it doesn't just
// calculate a hash
Contributor

This is only half true, since an Optional.empty() state updater would mean no update. But yeah, it is kinda side-effect-y.

Signed-off-by: Matthew Whitehead <matthew.whitehead@kaleido.io>
Signed-off-by: Matt Whitehead <matthew.whitehead@kaleido.io>
Signed-off-by: Matthew Whitehead <matthew.whitehead@kaleido.io>
@matthew1001 enabled auto-merge (squash) on June 23, 2025 09:35
@matthew1001 merged commit a9957ce into hyperledger:main on Jun 23, 2025
48 checks passed