
Conversation

@dapplion (Collaborator) commented Dec 28, 2024

Issue Addressed

The head tracker is a persisted piece of state that must be kept in sync with fork choice. It has been a source of pruning issues in the past, so we want to remove it.

When implementing tree-states in the hot DB we have to change the pruning routine (more details below), so we want to make those changes first, in isolation.

Closes #1785

Proposed Changes

Current DB migration routine

  • Locate abandoned heads with head tracker
  • Use a roots iterator to collect the ancestors of those heads that can be pruned
  • Delete those abandoned blocks / states
  • Migrate the newly finalized chain to the freezer

In summary, it computes what it has to delete and keeps the rest, then migrates data to the freezer. If the abandoned-forks routine has a bug, it can break the freezer migration.
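
For contrast, here is a rough sketch of this delete-first flow over plain in-memory maps, with `u64` standing in for `Hash256` block roots so it stays self-contained. The names (`Entry`, `abandoned_roots`, `canonical_roots`) are illustrative, not Lighthouse's actual store API.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-in for a hot-DB block/state entry; not Lighthouse's real types.
struct Entry {
    parent_block_root: u64,
}

// Sketch of the delete-first approach: starting from each abandoned head reported
// by the head tracker, walk parent pointers back, collecting roots as prunable
// until the canonical (finalized and descendants) chain is reached.
fn abandoned_roots(
    entries: &HashMap<u64, Entry>,
    abandoned_heads: &[u64],
    canonical_roots: &HashSet<u64>,
) -> HashSet<u64> {
    let mut to_prune = HashSet::new();
    for &head in abandoned_heads {
        let mut cursor = head;
        while let Some(entry) = entries.get(&cursor) {
            // Stop as soon as we rejoin the canonical chain: it must be kept.
            if canonical_roots.contains(&cursor) {
                break;
            }
            to_prune.insert(cursor);
            cursor = entry.parent_block_root;
        }
    }
    to_prune
}
```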

Proposed migration routine (this PR)

  • Migrate the newly finalized chain to the freezer
  • Load all state summaries from disk
  • From those, knowing only the head and finalized block, compute two sets: (1) descendants of the finalized block, (2) the newly finalized chain
  • Iterate over all summaries and delete any summary that does not belong to set (1) or (2)

This strategy is more sound, as it simply inspects what is in the hot DB, computes what it has to keep and deletes the rest. Because it does not rely on extra pieces of data, we can drop the head tracker and the pruning checkpoint. Since the freezer migration now happens first, we won't have pruning issues as long as the computation of the sets to keep is correct.
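
A rough sketch of that keep-first computation, again over in-memory data rather than the real store API (`Summary`, `roots_to_delete` and the `u64` block roots are illustrative stand-ins; the actual PR works over `HotStateSummary` records and `Hash256` roots, and set (2) is derived here from the finalized root alone for simplicity):

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-in for a hot state summary; not Lighthouse's real type.
struct Summary {
    parent_block_root: u64,
}

// Sketch of the keep-first computation: from the hot summaries (keyed by block root)
// and the finalized block root, build (1) descendants of finalized and (2) the newly
// finalized chain, then mark everything outside both sets for deletion.
fn roots_to_delete(summaries: &HashMap<u64, Summary>, finalized: u64) -> HashSet<u64> {
    // Set (2): ancestors of the finalized block that still have hot summaries.
    let mut newly_finalized = HashSet::new();
    let mut cursor = finalized;
    while let Some(summary) = summaries.get(&cursor) {
        newly_finalized.insert(cursor);
        cursor = summary.parent_block_root;
    }

    // Set (1): a summary descends from finalized if following parent pointers reaches
    // the finalized root (or a root already known to descend from it).
    let mut descendants: HashSet<u64> = HashSet::new();
    for &root in summaries.keys() {
        let mut chain = Vec::new();
        let mut cursor = root;
        loop {
            if cursor == finalized || descendants.contains(&cursor) {
                descendants.extend(chain);
                break;
            }
            match summaries.get(&cursor) {
                Some(summary) => {
                    chain.push(cursor);
                    cursor = summary.parent_block_root;
                }
                // Walked out of the hot DB without reaching finalized: not a descendant.
                None => break,
            }
        }
    }

    // Everything in neither set belongs to an abandoned fork and can be deleted.
    summaries
        .keys()
        .copied()
        .filter(|root| !descendants.contains(root) && !newly_finalized.contains(root))
        .collect()
}
```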

@dapplion (Collaborator, Author)

@michaelsproul If we stop updating the head_tracker in the persisted head, how can we downgrade safely? If you run an old version of Lighthouse, it will read the persisted head, which contains an empty head tracker.

@dapplion added the work-in-progress and tree-states labels Jan 2, 2025
@dapplion (Collaborator, Author) commented Jan 2, 2025

The test store_tests::finalizes_non_epoch_start_slot panics at:

---- store_tests::finalizes_non_epoch_start_slot stdout ----
thread 'store_tests::finalizes_non_epoch_start_slot' panicked at /Users/lion/code/sigp/lighthouse/beacon_node/beacon_chain/src/test_utils.rs:836:14:
called `Result::unwrap()` on an `Err` value: HotColdDBError(MissingEpochBoundaryState(0x8da1c38d20071452e39a518001b61aa94ad871c6b95896203ddb0b68cfb89e5f))
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: beacon_chain::test_utils::BeaconChainHarness<beacon_chain::builder::Witness<slot_clock::manual_slot_clock::ManualSlotClock,beacon_chain::eth1_chain::CachingEth1Backend<E>,E,Hot,Cold>>::hot_state_exists
   4: beacon_chain_tests::store_tests::finalizes_non_epoch_start_slot::{{closure}}
   5: tokio::runtime::runtime::Runtime::block_on
   6: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This is part of an assertion in the test calling hot_state_exists, which now returns an Err instead of an Ok(bool).

This state is not pruned, but it is the new_finalized_state_hash:

Jan 02 10:31:34.615 DEBG Starting database pruning, new_finalized_state_hash: 0x8da1c38d20071452e39a518001b61aa94ad871c6b95896203ddb0b68cfb89e5f, new_finalized_checkpoint: Checkpoint { epoch: Epoch(1), root: 0x401d573d920244cb9b9988da85f8093f6bedb1e68156774b87c173f6ded6623b }, module: beacon_chain::migrate:485
Jan 02 10:31:34.619 DEBG Extra pruning information, states_to_prune: {}, blocks_to_prune: {}, states_to_prune_count: 0, blocks_to_prune_count: 0, finalized_and_descendant_block_roots: 40, state_summaries_count: 40, newly_finalized_states_min_slot: 8, newly_finalized_state_roots: 1, newly_finalized_blocks_min_slot: 7, newly_finalized_blocks: 1, new_finalized_checkpoint: Checkpoint { epoch: Epoch(1), root: 0x401d573d920244cb9b9988da85f8093f6bedb1e68156774b87c173f6ded6623b }, module: beacon_chain::migrate:645

Michael: figure out which of the hot_state_exists cases triggers, plus what state root and slot

mergify bot commented Jan 31, 2025

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

@michaelsproul self-assigned this Jan 31, 2025
@michaelsproul added the backwards-incompat label Feb 3, 2025
@dapplion (Collaborator, Author) commented Feb 3, 2025

@michaelsproul I have:

  • Fixed conflicts
  • Added migration
  • Removed any hot DB ops from migrate_db; this looks like an oversight from when the PR was initially created
  • Addressed the case where the finalized block is skipped; we now keep only descendants of the finalized checkpoint

@chong-he (Member) commented Feb 12, 2025

I've done some testing using a binary built from commit 12fa5a8.

  1. The node can fully sync as an archive node, although this error appears from time to time:
ERRO Error fetching block for peer, error: DBError(HotColdDBError(MissingExecutionPayload(0x8147657ef7e6e0331b59ae888714ba8d60094d499a5522e7efd9f75ea2bad298))), block_root: 0x8147657ef7e6e0331b59ae888714ba8d60094d499a5522e7efd9f75ea2bad298, module: network::network_beacon_processor::rpc_methods:817

The execution client (Nethermind) is running normally, perhaps it's something to do with the DB?

  2. Upgrading to V23 caused a panic.
    Steps to reproduce: run Lighthouse V6 with checkpoint sync, wait for a minute or two, stop Lighthouse V6, change the binary to this PR, and the following error appears:
Feb 10 03:09:30.723 CRIT Task panic. This is a bug!              advice: Please check above for a backtrace and notify the developers, backtrace:    0: lighthouse::run::{{closure}}

The last line of the backtrace shows:

29: clone3
          at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

(I have uploaded the full backtrace error in the link shared with you previously.)

When trying to start again with the PR binary, it failed:

Feb 10 03:09:48.935 INFO Blob DB initialized                     oldest_data_column_slot: None, oldest_blob_slot: Some(Slot(3589184)), path: "/home/ck/.lighthouse/holesky/beacon/blobs_db", service: freezer_db
Feb 10 03:09:48.940 CRIT Failed to start beacon node             reason: Unable to open database: SszDecodeError(InvalidByteLength { len: 32, expected: 64 })
Feb 10 03:09:48.940 INFO Internal shutdown received              reason: Failed to start beacon node

If we change back to the V6 binary, it produces this error:

Feb 10 03:10:25.051 INFO Blob DB initialized                     oldest_data_column_slot: None, oldest_blob_slot: Some(Slot(3589184)), path: "/home/ck/.lighthouse/holesky/beacon/blobs_db", service: freezer_db
Feb 10 03:10:27.614 INFO Starting checkpoint sync                remote_url: https://holesky.beaconstate.ethstaker.cc/, service: beacon
Feb 10 03:10:51.567 INFO Downloading genesis state               info: this may take some time on testnets with large validator counts, timeout: 180s, server: https://holesky.beaconstate.ethstaker.cc/, service: beacon
Feb 10 03:11:08.982 INFO Loaded checkpoint block and state       block_root: 0x6bb4a025905caa2b674e0d871afeaae9f6e15b8b32d6481b51981dd05ceb42f6, state_slot: 3604480, block_slot: 3604480, service: beacon
Feb 10 03:11:08.985 INFO Loaded deposit tree snapshot            deposits loaded: 483386, service: beacon
Feb 10 03:11:17.146 CRIT Failed to start beacon node             reason: Failed to initialize anchor info: AnchorInfoConcurrentMutation
Feb 10 03:11:17.146 INFO Internal shutdown received              reason: Failed to start beacon node

We can see that it is starting checkpoint sync again (even though the database is there and I don't use the flag --purge-db-force). It should skip checkpoint sync, but it doesn't.

It sort of "ignores" the corrupted database and starts a fresh checkpoint sync on its own.

  3. Downgrading the database schema caused a panic. Steps to reproduce: run a node with the PR, then stop the node and apply the downgrade command:
    /usr/local/bin/lighthouse6744 db migrate --to 22 --network holesky --datadir /path

The last line of the backtrace:

  18: _start
, message: None, location: /home/ck/testing/lighthouse/beacon_node/beacon_chain/src/schema_change/migration_schema_v23.rs:108:5

@michaelsproul (Member)

I've resolved conflicts with unstable after the merge of tracing. We should double check that I didn't mess up any logs or delete any code in that merge resolution, as it touched a lot of stuff.

@michaelsproul (Member) commented Mar 13, 2025

TODO:

  • can probably delete prune_old_hot_states 11cfa1c
  • can delete load_state_temporary_flag and friends 11cfa1c
  • delete all temporary flags in the upgrade migration f4dd6fe

@michaelsproul added the ready-for-review and v7.1.0 labels and removed the work-in-progress label Apr 3, 2025
@michaelsproul (Member) left a comment

Merging this down so we can simplify the conga line.

This has been running well on our infra.

@michaelsproul added the under-review label and removed the ready-for-review label Apr 3, 2025
@michaelsproul (Member)

Just going to do one more rollout and ensure logs are OK.

@michaelsproul (Member)

Yep this is looking solid. Merging! 🎉

@michaelsproul added the ready-for-merge label and removed the under-review label Apr 7, 2025
mergify bot added a commit that referenced this pull request Apr 7, 2025
mergify bot merged commit 70850fe into unstable Apr 7, 2025
31 checks passed
mergify bot deleted the drop-headtracker branch April 7, 2025 04:23
mergify bot pushed a commit that referenced this pull request Apr 7, 2025
Having merged the drop-headtracker PR, we now have a DB schema change in `unstable` compared to `release-v7.0.0`:

- #6744

There is a DB downgrade available; however, it needs to be applied manually and is usually a bit of a hassle.

This PR bumps the version on `unstable` to `v7.1.0-beta.0` _without_ actually cutting a `v7.1.0-beta.0` release, so that we can tell at a glance which schema version a node is using.
mergify bot pushed a commit that referenced this pull request Jun 19, 2025
This PR implements #5978 (tree-states), but on the hot DB. It allows Lighthouse to massively reduce its disk footprint during non-finality, and reduces overall I/O in all cases.

Closes #6580

Conga into #6744

### TODOs

- [x] Fix OOM in CI #7176
- [x] optimise store_hot_state to avoid storing a duplicate state if the summary already exists (should be safe from races now that pruning is cleaner)
- [x] misspelled: get_ancenstor_state_root
- [x] get_ancestor_state_root should use state summaries
- [x] Prevent split from changing during ancestor calc
- [x] Use same hierarchy for hot and cold

### TODO Good optimization for future PRs

- [ ] On the migration, if the latest hot snapshot is aligned with the cold snapshot, migrate the diffs instead of the full states.
```
align slot  time
10485760    Nov-26-2024
12582912    Sep-14-2025
14680064    Jul-02-2026
```

### TODO Things that might be good to have

- [ ] Rename anchor_slot https://github.com/sigp/lighthouse/compare/tree-states-hot-rebase-oom...dapplion:lighthouse:tree-states-hot-anchor-slot-rename?expand=1
- [ ] Make anchor fields non-public so that they must be mutated through a method, to prevent unwanted changes to the anchor_slot

### NOTTODO

- [ ] Use fork-choice and a new method [`descendants_of_checkpoint`](ca2388e#diff-046fbdb517ca16b80e4464c2c824cf001a74a0a94ac0065e635768ac391062a8) to filter only the state summaries that descend from the finalized checkpoint