
Lighthouse OOM mitigations #7053

@michaelsproul

Short term plan:

  1. Move banned block checks higher in block verification to prevent repeat state lookups (before every instance of load_parent in block_verification.rs)
  2. Encourage use of --state-cache-size 4 to avoid bad state cache pruning logic that is keeping 128x 180MB epoch boundary states around (~24GB of states).
  3. (DONE) Remove block root lookups from status processing. We are getting killed looking up old states to compute the block root. We need a more aggressive version of this PR: Optimise status processing #5481.

Point (1) is intended to fix an OOM that happens to nodes that are in sync and forced to process junk.
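A minimal sketch of the reordering in point (1), assuming a simplified verifier. The names `BlockRoot`, `StateStore`, `load_parent_state` and `verify_block` are hypothetical stand-ins and are not Lighthouse's actual types or functions; the only idea taken from the issue is that the banned-block check should run before the expensive parent state lookup.

```rust
use std::collections::HashSet;

/// Hypothetical 32-byte block root; stands in for the real hash type.
type BlockRoot = [u8; 32];

#[derive(Debug, PartialEq)]
enum VerifyError {
    BannedBlock,
}

/// Hypothetical store that counts how often the expensive lookup runs.
struct StateStore {
    loads: usize,
}

impl StateStore {
    /// Models the costly parent state lookup (the role `load_parent` plays in the issue).
    fn load_parent_state(&mut self, _parent: &BlockRoot) -> Result<(), VerifyError> {
        self.loads += 1;
        Ok(())
    }
}

/// Rejects banned blocks before touching the state store.
fn verify_block(
    root: &BlockRoot,
    parent: &BlockRoot,
    banned: &HashSet<BlockRoot>,
    store: &mut StateStore,
) -> Result<(), VerifyError> {
    // Cheap set lookup first: a banned block never triggers a state load.
    if banned.contains(root) {
        return Err(VerifyError::BannedBlock);
    }
    // Only blocks that pass the check pay for the parent state lookup.
    store.load_parent_state(parent)?;
    Ok(())
}
```

With this ordering, repeated copies of the same invalid block cost one hash-set lookup each rather than one state load each.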

Point (2) fixes OOMs during head sync due to lots of epoch boundary states being retained.
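The memory estimate behind point (2) can be checked with a back-of-the-envelope calculation. This sketch assumes every cached state costs the full 180MB (no memory shared between states) and that `--state-cache-size` caps the number of cached states; `cache_footprint_mb` is a hypothetical helper, not part of Lighthouse.

```rust
/// Approximate size of one epoch boundary state in MB, taken from the issue (180MB+).
const STATE_SIZE_MB: u64 = 180;

/// Worst-case cache footprint in MB, assuming no memory is shared between states.
fn cache_footprint_mb(cached_states: u64) -> u64 {
    cached_states * STATE_SIZE_MB
}
```

Under these assumptions, `cache_footprint_mb(128)` gives 23,040 MB, in line with the ~24GB figure above given that states are 180MB or larger, while `cache_footprint_mb(4)` gives 720 MB.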

To investigate later:

  1. Why are epoch boundary state diffs so large (180MB+), given that we should be basing them off each other while syncing sequential blocks? Answer: balances and inactivity_scores.
  2. Is an earlier invalid block check sufficient to prevent OOM while synced? Are there other states or valid side chains which are forcing us to load states and use too much memory?
  3. Why is sync sending us so many copies of the invalid block? Is there parallelism that is causing the OOM near the head?

Future plans (long-term fixes):

  1. Implement the PromiseCache concept used for attestation committees for beacon states. This is quite subtle to get right; a version was previously attempted but abandoned (Unify and lower state caches #5313). Tracking issue: Improve & unify parallel de-duplication caches #5112
  2. Implement size-based pruning for the state cache. This is possible with my WIP changes from: State cache memory size WIP #6532. However, that code is quite immature and the pruning itself is expensive (1.5s-4s or more), so we cannot ship this quickly. There is also some subtlety around deciding which states to prune based on size (we could use a similar heuristic to the existing cull method on the 20% largest states).
  3. Re-think pruning logic in cull so that it doesn't hang on to so many useless epoch boundary states.
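To illustrate the de-duplication idea behind plan (1), here is a minimal sketch of a cache in which concurrent requests for the same key trigger only one computation. `DedupCache` and `get_or_compute` are hypothetical names, and this is not Lighthouse's PromiseCache implementation; it uses only the standard library's `OnceLock` and has no eviction or error handling, which are among the subtleties a real version would need to address.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical de-duplicating cache: one slot per key, filled at most once.
struct DedupCache<K, V> {
    slots: Mutex<HashMap<K, Arc<OnceLock<V>>>>,
}

impl<K: Eq + Hash + Clone, V: Clone> DedupCache<K, V> {
    fn new() -> Self {
        Self {
            slots: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the cached value for `key`, computing it at most once.
    /// Callers that arrive while the computation is running block until it finishes.
    fn get_or_compute<F: FnOnce() -> V>(&self, key: &K, compute: F) -> V {
        // Hold the map lock only long enough to find or create the slot,
        // so an expensive computation for one key does not block other keys.
        let slot = {
            let mut slots = self.slots.lock().unwrap();
            slots.entry(key.clone()).or_default().clone()
        };
        // `OnceLock::get_or_init` guarantees that only one closure runs per slot.
        slot.get_or_init(compute).clone()
    }
}
```

For something as large as a beacon state, `V` would likely be an `Arc` around the state so that the final `clone` is cheap; that choice is an assumption of this sketch.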

Labels: optimization (Something to make Lighthouse run more efficiently), v7.0.0 (New release c. Q1 2025), v7.0.0-beta.clean (Clean release post Holesky rescue)