
Conversation

@dapplion (Collaborator) commented Dec 28, 2024

Issue Addressed

The head tracker is a persisted piece of state that must be kept in sync with fork choice. It has been a source of pruning issues in the past, so we want to remove it.

When implementing tree-states in the hot DB we have to change the pruning routine (more details below), so we want to make those changes first, in isolation.

Closes #1785

Proposed Changes

Current DB migration routine

  • Locate abandoned heads with head tracker
  • Use a roots iterator to collect the ancestors of those heads that can be pruned
  • Delete those abandoned blocks / states
  • Migrate the newly finalized chain to the freezer

In summary, it computes what it has to delete and keeps the rest, then migrates data to the freezer. If the abandoned-forks routine has a bug, it can break the freezer migration.
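
For contrast, here is a rough sketch of this delete-first flow over plain in-memory maps, with `u64` standing in for `Hash256` block roots so it stays self-contained. The names (`Entry`, `abandoned_roots`, `canonical_roots`) are illustrative, not Lighthouse's actual store API.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-in for a hot-DB block/state entry; not Lighthouse's real types.
struct Entry {
    parent_block_root: u64,
}

// Sketch of the delete-first approach: starting from each abandoned head reported
// by the head tracker, walk parent pointers back, collecting roots as prunable
// until the canonical (finalized and descendants) chain is reached.
fn abandoned_roots(
    entries: &HashMap<u64, Entry>,
    abandoned_heads: &[u64],
    canonical_roots: &HashSet<u64>,
) -> HashSet<u64> {
    let mut to_prune = HashSet::new();
    for &head in abandoned_heads {
        let mut cursor = head;
        while let Some(entry) = entries.get(&cursor) {
            // Stop as soon as we rejoin the canonical chain: it must be kept.
            if canonical_roots.contains(&cursor) {
                break;
            }
            to_prune.insert(cursor);
            cursor = entry.parent_block_root;
        }
    }
    to_prune
}
```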

Proposed migration routine (this PR)

  • Migrate the newly finalized chain to the freezer
  • Load all state summaries from disk
  • From those, knowing only the head and finalized block, compute two sets: (1) descendants of the finalized block, (2) the newly finalized chain
  • Iterate over all summaries and delete any summary that does not belong to set (1) or (2)

This strategy is more sound, as it simply inspects what is in the hot DB, computes what it has to keep and deletes the rest. Because it does not rely on extra pieces of data, we can drop the head tracker and the pruning checkpoint. Since the freezer migration now happens first, we won't have pruning issues as long as the computation of the sets to keep is correct.
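
A rough sketch of that keep-first computation, again over in-memory data rather than the real store API (`Summary`, `roots_to_delete` and the `u64` block roots are illustrative stand-ins; the actual PR works over `HotStateSummary` records and `Hash256` roots, and set (2) is derived here from the finalized root alone for simplicity):

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-in for a hot state summary; not Lighthouse's real type.
struct Summary {
    parent_block_root: u64,
}

// Sketch of the keep-first computation: from the hot summaries (keyed by block root)
// and the finalized block root, build (1) descendants of finalized and (2) the newly
// finalized chain, then mark everything outside both sets for deletion.
fn roots_to_delete(summaries: &HashMap<u64, Summary>, finalized: u64) -> HashSet<u64> {
    // Set (2): ancestors of the finalized block that still have hot summaries.
    let mut newly_finalized = HashSet::new();
    let mut cursor = finalized;
    while let Some(summary) = summaries.get(&cursor) {
        newly_finalized.insert(cursor);
        cursor = summary.parent_block_root;
    }

    // Set (1): a summary descends from finalized if following parent pointers reaches
    // the finalized root (or a root already known to descend from it).
    let mut descendants: HashSet<u64> = HashSet::new();
    for &root in summaries.keys() {
        let mut chain = Vec::new();
        let mut cursor = root;
        loop {
            if cursor == finalized || descendants.contains(&cursor) {
                descendants.extend(chain);
                break;
            }
            match summaries.get(&cursor) {
                Some(summary) => {
                    chain.push(cursor);
                    cursor = summary.parent_block_root;
                }
                // Walked out of the hot DB without reaching finalized: not a descendant.
                None => break,
            }
        }
    }

    // Everything in neither set belongs to an abandoned fork and can be deleted.
    summaries
        .keys()
        .copied()
        .filter(|root| !descendants.contains(root) && !newly_finalized.contains(root))
        .collect()
}
```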

@dapplion (Collaborator, Author)

@michaelsproul If we stop updating the head_tracker in the persisted head, how can we downgrade safely? If you run an old version of Lighthouse, it will read the persisted head, which contains an empty head tracker.

@dapplion added the work-in-progress and tree-states labels Jan 2, 2025
@dapplion (Collaborator, Author) commented Jan 2, 2025

The test store_tests::finalizes_non_epoch_start_slot panics at:

---- store_tests::finalizes_non_epoch_start_slot stdout ----
thread 'store_tests::finalizes_non_epoch_start_slot' panicked at /Users/lion/code/sigp/lighthouse/beacon_node/beacon_chain/src/test_utils.rs:836:14:
called `Result::unwrap()` on an `Err` value: HotColdDBError(MissingEpochBoundaryState(0x8da1c38d20071452e39a518001b61aa94ad871c6b95896203ddb0b68cfb89e5f))
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: beacon_chain::test_utils::BeaconChainHarness<beacon_chain::builder::Witness<slot_clock::manual_slot_clock::ManualSlotClock,beacon_chain::eth1_chain::CachingEth1Backend<E>,E,Hot,Cold>>::hot_state_exists
   4: beacon_chain_tests::store_tests::finalizes_non_epoch_start_slot::{{closure}}
   5: tokio::runtime::runtime::Runtime::block_on
   6: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This is part of an assertion in the test calling hot_state_exists, which now returns an Err instead of an Ok(bool).

This state is not pruned, but it is the new_finalized_state_hash:

Jan 02 10:31:34.615 DEBG Starting database pruning, new_finalized_state_hash: 0x8da1c38d20071452e39a518001b61aa94ad871c6b95896203ddb0b68cfb89e5f, new_finalized_checkpoint: Checkpoint { epoch: Epoch(1), root: 0x401d573d920244cb9b9988da85f8093f6bedb1e68156774b87c173f6ded6623b }, module: beacon_chain::migrate:485
Jan 02 10:31:34.619 DEBG Extra pruning information, states_to_prune: {}, blocks_to_prune: {}, states_to_prune_count: 0, blocks_to_prune_count: 0, finalized_and_descendant_block_roots: 40, state_summaries_count: 40, newly_finalized_states_min_slot: 8, newly_finalized_state_roots: 1, newly_finalized_blocks_min_slot: 7, newly_finalized_blocks: 1, new_finalized_checkpoint: Checkpoint { epoch: Epoch(1), root: 0x401d573d920244cb9b9988da85f8093f6bedb1e68156774b87c173f6ded6623b }, module: beacon_chain::migrate:645

Michael: figure out which of the hot_state_exists cases triggers, plus what state root and slot

mergify bot commented Jan 31, 2025

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

@michaelsproul self-assigned this Jan 31, 2025
@michaelsproul added the backwards-incompat label Feb 3, 2025
@dapplion (Collaborator, Author) commented Feb 3, 2025

@michaelsproul I have:

  • Fixed conflicts
  • Added migration
  • Removed any hot DB ops from migrate_db; this looks like an oversight from when the PR was initially created
  • Addressed the case where the finalized block is skipped; we now keep only descendants of the finalized checkpoint

@chong-he (Member) commented Feb 12, 2025

I've done some testing using a binary built from commit 12fa5a8.

  1. The node can fully sync as an archive node, although this error appears from time to time:
ERRO Error fetching block for peer, error: DBError(HotColdDBError(MissingExecutionPayload(0x8147657ef7e6e0331b59ae888714ba8d60094d499a5522e7efd9f75ea2bad298))), block_root: 0x8147657ef7e6e0331b59ae888714ba8d60094d499a5522e7efd9f75ea2bad298, module: network::network_beacon_processor::rpc_methods:817

The execution client (Nethermind) is running normally, perhaps it's something to do with the DB?

  2. Upgrading to V23 caused a panic.
    Steps to reproduce: run Lighthouse V6 with checkpoint sync, wait for a minute or two, stop Lighthouse V6, change the binary to this PR, and the following error appears:
Feb 10 03:09:30.723 CRIT Task panic. This is a bug!              advice: Please check above for a backtrace and notify the developers, backtrace:    0: lighthouse::run::{{closure}}

The last line of the backtrace shows:

29: clone3
          at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

(I have uploaded the full backtrace error in the link shared with you previously.)

When trying to start again with the PR binary, it failed:

Feb 10 03:09:48.935 INFO Blob DB initialized                     oldest_data_column_slot: None, oldest_blob_slot: Some(Slot(3589184)), path: "/home/ck/.lighthouse/holesky/beacon/blobs_db", service: freezer_db
Feb 10 03:09:48.940 CRIT Failed to start beacon node             reason: Unable to open database: SszDecodeError(InvalidByteLength { len: 32, expected: 64 })
Feb 10 03:09:48.940 INFO Internal shutdown received              reason: Failed to start beacon node

If we change back to the V6 binary, it produces this error:

Feb 10 03:10:25.051 INFO Blob DB initialized                     oldest_data_column_slot: None, oldest_blob_slot: Some(Slot(3589184)), path: "/home/ck/.lighthouse/holesky/beacon/blobs_db", service: freezer_db
Feb 10 03:10:27.614 INFO Starting checkpoint sync                remote_url: https://holesky.beaconstate.ethstaker.cc/, service: beacon
Feb 10 03:10:51.567 INFO Downloading genesis state               info: this may take some time on testnets with large validator counts, timeout: 180s, server: https://holesky.beaconstate.ethstaker.cc/, service: beacon
Feb 10 03:11:08.982 INFO Loaded checkpoint block and state       block_root: 0x6bb4a025905caa2b674e0d871afeaae9f6e15b8b32d6481b51981dd05ceb42f6, state_slot: 3604480, block_slot: 3604480, service: beacon
Feb 10 03:11:08.985 INFO Loaded deposit tree snapshot            deposits loaded: 483386, service: beacon
Feb 10 03:11:17.146 CRIT Failed to start beacon node             reason: Failed to initialize anchor info: AnchorInfoConcurrentMutation
Feb 10 03:11:17.146 INFO Internal shutdown received              reason: Failed to start beacon node

We can see that it is starting checkpoint sync again (even though the database is there and I don't use the flag --purge-db-force). It should skip checkpoint sync, but it doesn't.

It sort of "ignores" the corrupted database and starts a fresh checkpoint sync on its own.

  3. Downgrading the database schema caused a panic. Steps to reproduce: run a node with the PR, then stop the node and apply the downgrade command:
    /usr/local/bin/lighthouse6744 db migrate --to 22 --network holesky --datadir /path

The last line of the backtrace:

  18: _start
, message: None, location: /home/ck/testing/lighthouse/beacon_node/beacon_chain/src/schema_change/migration_schema_v23.rs:108:5

@michaelsproul (Member)

I've resolved conflicts with unstable after the merge of tracing. We should double check that I didn't mess up any logs or delete any code in that merge resolution, as it touched a lot of stuff.

@michaelsproul (Member) commented Mar 13, 2025

TODO:

  • can probably delete prune_old_hot_states 11cfa1c
  • can delete load_state_temporary_flag and friends 11cfa1c
  • delete all temporary flags in the upgrade migration f4dd6fe

@michaelsproul added the ready-for-review and v7.1.0 labels and removed the work-in-progress label Apr 3, 2025
@michaelsproul (Member) left a comment

Merging this down so we can simplify the conga line.

This has been running well on our infra.

@michaelsproul added the under-review label and removed the ready-for-review label Apr 3, 2025
@michaelsproul (Member)

Just going to do one more rollout and ensure logs are OK.

@michaelsproul (Member)

Yep this is looking solid. Merging! 🎉

@michaelsproul added the ready-for-merge label and removed the under-review label Apr 7, 2025
mergify bot added a commit that referenced this pull request Apr 7, 2025
mergify bot merged commit 70850fe into unstable Apr 7, 2025
31 checks passed
mergify bot deleted the drop-headtracker branch April 7, 2025 04:23
mergify bot pushed a commit that referenced this pull request Apr 7, 2025
Having merged the drop-headtracker PR, we now have a DB schema change in `unstable` compared to `release-v7.0.0`:

- #6744

There is a DB downgrade available; however, it needs to be applied manually and is usually a bit of a hassle.

This PR bumps the version on `unstable` to `v7.1.0-beta.0` _without_ actually cutting a `v7.1.0-beta.0` release, so that we can tell at a glance which schema version a node is using.
mergify bot pushed a commit that referenced this pull request Jun 19, 2025
This PR implements #5978 (tree-states), but on the hot DB. It allows Lighthouse to massively reduce its disk footprint during non-finality, and reduces overall I/O in all cases.

Closes #6580

Conga into #6744

### TODOs

- [x] Fix OOM in CI #7176
- [x] optimise store_hot_state to avoid storing a duplicate state if the summary already exists (should be safe from races now that pruning is cleaner)
- [x] misspelled: get_ancenstor_state_root
- [x] get_ancestor_state_root should use state summaries
- [x] Prevent split from changing during ancestor calc
- [x] Use same hierarchy for hot and cold

### TODO Good optimization for future PRs

- [ ] On the migration, if the latest hot snapshot is aligned with the cold snapshot, migrate the diffs instead of the full states.
```
align slot  time
10485760    Nov-26-2024
12582912    Sep-14-2025
14680064    Jul-02-2026
```

### TODO Things that might be good to have

- [ ] Rename anchor_slot https://github.com/sigp/lighthouse/compare/tree-states-hot-rebase-oom...dapplion:lighthouse:tree-states-hot-anchor-slot-rename?expand=1
- [ ] Make anchor fields non-public so that they must be mutated through a method, to prevent unwanted changes to the anchor_slot

### NOTTODO

- [ ] Use fork-choice and a new method [`descendants_of_checkpoint`](ca2388e#diff-046fbdb517ca16b80e4464c2c824cf001a74a0a94ac0065e635768ac391062a8) to filter only the state summaries that descend from the finalized checkpoint