Skip to content

Conversation

jamesob
Copy link
Contributor

@jamesob jamesob commented Feb 1, 2022

This is part of the assumeutxo project (parent PR: #15606)


This changeset adds logic for detecting and activating snapshot chainstates on start, and for completing the snapshot validation process once the background chainstate reaches the base block of the assumed-valid chain. It also handles removing the snapshot chainstate data on-disk in the event that the snapshot fails to validate.

As detailed in assumeutxo.md:

Once the tip of the background chainstate hits the base block of the snapshot chainstate, we stop use of the background chainstate by setting m_stop_use, in CompleteSnapshotValidation(), which is checked in ActivateBestChain()). We hash the background chainstate's UTXO set contents and ensure it matches the compiled value in CMainParams::m_assumeutxo_data.

The background chainstate data lingers on disk until shutdown, when in ChainstateManager::Reset(), the background chainstate is cleaned up with ValidatedSnapshotShutdownCleanup(), which renames the chainstate_[hash] datadir as chainstate.

Failure consideration: if bitcoind unexpectedly halts after m_stop_use is set on the background chainstate but before CompleteSnapshotValidation() can finish, the need to complete snapshot validation will be detected on subsequent init by ChainstateManager::CheckForUncleanShutdown().


Most of this change is unittested, though some logic (anything assuming on-disk leveldb data) can't be tested without some rework of unittest setup, given in-memory leveldb instances are used. I can look into how difficult this would be.

This change adds the ability to unittest using on-disk leveldb coins dbs.

Possible follow-ups

@DrahtBot
Copy link
Contributor

DrahtBot commented Feb 2, 2022

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #25598 (doc: assumeutxo: format tables to be aligned in plain-text by theStack)
  • #25564 (Move DEFAULT_MAX_ORPHAN_TRANSACTIONS to node/txorphanage.h by MarcoFalke)
  • #25527 ([kernel 3c/n] Decouple validation cache initialization from ArgsManager by dongcarl)
  • #25077 (Fix chain tip data race and corrupt rest response by MarcoFalke)
  • #25038 (policy: nVersion=3 and Package RBF by glozow)
  • #22663 (dbwrapper: properly destroy objects in case CDBWrapper ctor throws by Crypt-iQ)
  • #19888 (rpc, test: Improve getblockstats for unspendables by fjahr)
  • #17786 (refactor: Nuke policy/fees->mempool circular dependencies by hebasto)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are test failures

@jamesob jamesob force-pushed the 2022-02-au-init-and-completion branch 2 times, most recently from 8cdafd7 to 0952728 Compare February 2, 2022 15:12
@jamesob
Copy link
Contributor Author

jamesob commented Feb 2, 2022

Thanks for the quick feedback Mco.

The test failure was actually due to removing a commit that I'd forgotten the use for. Basically, because we're now calling MaybeRebalanceCaches in LoadChainstate, that prompts a FlushStateToDisk call before the block index is loaded, which trips some UB.

The bounds check in d70b8de fixes this.

@jamesob jamesob force-pushed the 2022-02-au-init-and-completion branch 2 times, most recently from 52ba06d to dbee719 Compare February 2, 2022 19:54
@jamesob jamesob force-pushed the 2022-02-au-init-and-completion branch from dbee719 to 7e2b212 Compare February 2, 2022 20:08
@jamesob jamesob force-pushed the 2022-02-au-init-and-completion branch from 7e2b212 to 86b252c Compare February 8, 2022 00:59
@jamesob
Copy link
Contributor Author

jamesob commented Feb 8, 2022

Okay, I've pushed a revision that addresses my previous comments and more comprehensively (that is to say: at all) tests the major failure modes. I had to parameterize the shutdown behavior during snapshot completion to get this to work within the unittests.

Pending CI pass, this should be ready for continued review.

@jamesob jamesob force-pushed the 2022-02-au-init-and-completion branch 2 times, most recently from 1ac03ee to 9f89221 Compare February 8, 2022 21:04
@jamesob
Copy link
Contributor Author

jamesob commented Feb 8, 2022

Alright, CI is finally passing. I had to add 5d5dfaa because an existing unittest that advanced an IBD chain past the snapshot it generated started failing. This isn't something that should happen in practice, but was done to test an unrelated feature (UpdateTip behavior on a background chainstate).

// removing. Without those headers, we can't activate the snapshot below.
LOCK(::cs_main);
uint256 gen_hash = node.chainman->ActiveChainstate().m_chain[0]->GetBlockHash();
node.chainman->Reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like #24299 was merged, so you might have to inline the needed resets here (or create a helper function for it).

@jamesob jamesob force-pushed the 2022-02-au-init-and-completion branch from 9f89221 to 51d0dd3 Compare March 7, 2022 16:52
@jamesob
Copy link
Contributor Author

jamesob commented Mar 7, 2022

Alright, have rebased; basically I just took the old ChainstateManager::Reset() method and renamed it to cleanup(), and applied @jonatack -style changes (lock assertion/annotation vs. LOCK).

@jamesob jamesob force-pushed the 2022-02-au-init-and-completion branch from d003388 to 84f7884 Compare July 20, 2022 14:43
jamesob and others added 18 commits July 20, 2022 11:03
This is used in subsequent commits. It allows us to clean up UTXO
snapshot chainstate after background validation completes.
This changes the snapshot's leveldb chainstate dir name from
`chainstate_[blockhash]` to `chainstate_snapshot`. This simplifies
later logic that loads snapshot data, and enforces the limitation
of a single snapshot at any given time.

Since we still need to persis the blockhash of the base block, we
write that out to a file (`chainstate_snapshot/base_blockhash`) for
later use during initialization, so that we can reinitialize the
snapshot chainstate.

Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
Add functionality for activating a snapshot-based chainstate if one is
detected on-disk.

Also cautiously initialize chainstate cache usages so that we don't
somehow blow past our cache allowances during initialization, then
rebalance at the end of init.

Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
Used in later commits to remove leveldb directories for
- invalid snapshot chainstates, and
- background-vaildation chainstates that have finished serving their
  purpose.
If a UTXO snapshot fails to validate, don't leave the resulting datadir
on disk as this will confuse initialization on next startup and we'll
get an assertion error.
If we call FlushBlockFile() without having intitialized the block index
with LoadBlockIndexDB(), we may be indexing into an empty vector.

Specifically this is an issue when we call MaybeRebalanceCaches() during
chainstate init before the block index has been loaded, which calls
FlushBlockFile().

Also add an assert to avoid undefined behavior.

Co-authored-by: Russell Yanofsky <russ@yanofsky.org>
and remove m_snapshot_validated. This state can now be inferred by the
number of isUsable chainstates.

m_stop_use is used to signal that a chainstate should no longer be used
by validation logic; it is used as a sentinel when background validation
completes or if the snapshot chainstate is found to be invalid.

isUsable is a convenience method that incorporates m_stop_use.
Used in subsequent commits. Also cleans up asserts in
coins_views-related convenience methods to be more exact.
Moves chainstate initialization into its own function. This is necessary
to later support a more readable way of handling background-validation
chainstate cleanup during init, since the chainstate initialization
functions may need to be repeated after moving leveldb filesystem
content around.

This commit isn't strictly necessary, but the alternative is to (ab)use
the `while` loop in init.cpp with a `continue` on the basis of a
specific ChainstateLoadingError return value from LoadChainstate. Not
only is this harder to read, but it can't be unittested.

The approach here lets us consolidate background-validation cleanup to
LoadChainstate, and therefore exercise it within tests.

This commit is most easily reviewed with

  git diff --color-moved=dimmed-zebra --color-moved-ws=ignore-space-change
Trigger completion when a background validation chainstate reaches the
same height as a UTXO snapshot, and handle cleaning up the chainstate on
subsequent startup.
I found this useful during unittest debugging.
For use in next commit.

Most easily reviewed with
`--color-moved=dimmed_zebra --color-moved-ws=allow-indentation-change`.
This CreateAndActivateUTXOSnapshot parameter is necessary once we
perform snapshot completion within ABC, since the existing UpdateTip
test will fail because the IBD chain that has generated the snapshot
will exceed the base of the snapshot.

Being able to test snapshots being loaded into a mostly-uninitialized
datadir allows for more realistic unittest scenarios.
Used when testing cleanup of on-disk chainstate data for snapshot
testcases. Also necessary for simulating node restart in .cpp tests.
in TestingSetup(). This is used in the following commit to test
reinitializing chainstates after snapshot validation and cleanup.

Best reviewed with `git diff --color-moved=dimmed-zebra`.
Include notes about the `chainstate_snapshot` rename as well as
updates for the included code.
@jamesob jamesob force-pushed the 2022-02-au-init-and-completion branch from 84f7884 to ac8c9d8 Compare July 20, 2022 16:08
Copy link
Contributor Author

@jamesob jamesob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased and addressed feedback. Thanks Russ and Antoine.

I'm approaching "thermonuclear levels of burnout" on this one, as the phrase goes. I'm getting bludgeoned by rebases, so any help reviewing this is appreciated.

I've looked at splitting this out into two PRs and while it is possible it will not be trivial for me to do. If any potential reviewers feel strongly about doing that, I can.


bool WriteSnapshotBaseBlockhash(CChainState& snapshot_chainstate)
{
AssertLockHeld(::cs_main);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think conceptually this should hold cs_main, since we're writing to a chainstate directory. Splitting cs_main out from per-chainstate locks should be a larger follow-on effort, and I think we can delay reworking this call until then.

// DestroyDB() (in DeleteCoinsDBFromDisk()) will fail. See `leveldb::~DBImpl()`.
// Destructing the chainstate (and so resetting the coinsviews object) does this.
snapshot_chainstate.reset();
bool removed = DeleteCoinsDBFromDisk(*snapshot_datadir, /*is_snapshot=*/true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find! I've added test coverage, and the tests now fail when this line is deleted.

@jamesob
Copy link
Contributor Author

jamesob commented Jul 21, 2022

Okay, I think it's pretty clear that the right move is to split this PR up into two separate ones (initialization vs. completion). I'll be opening those shortly.

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK ac8c9d8. Rebase and a lot of suggested cleanups since the last review, nothing major

I'm still a little confused about MaybeCompleteSnapshotValidation error handling, but in the worst case problems might just make debugging & error recovery less convenient so this is not a blocker

if (this != &m_chainman.ActiveChainstate()) {
// This call may set `m_stop_use`, which is referenced immediately afterwards in
// ActivateBestChain.
m_chainman.MaybeCompleteSnapshotValidation();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #24232 (comment)

In commit "validation: add ChainMan logic for completing UTXO snapshot validation" (1eb88f8)

Based upon my (re)reading of MaybeCompleteSnapshotValidation(), the snapshot cannot be found invalid (without some kind of code error) without calling handle_invalid_snapshot(), which will mark the snapshot chainstate as unusable and remove the associated coinsdb.

It seems like it would be possible to add some assertions about the return value or chain state here to verify that. But if you think that is overkill it would be good at least update the comment to be explicit that you are intentionally ignoring the return value. Could prefix "If this fails..." with "Intentionally ignoring return value. If this fails..."

If MaybeCompleteSnapshotValidation always cleans up after itself I guess I also don't understand why it is not an error if it fails here, but it is an error if it fails the other place it is called on startup.

If might be clearer if MaybeCompleteSnapshotValidation just returned an error code and callers would handle cleanup to make code clearer and easier to reason about.

uint256 snapshot_blockhash = *SnapshotBlockhash();

if (index_new.GetBlockHash() != snapshot_blockhash) {
LogPrintf("[snapshot] supposed base block %s does not match the " /* Continued */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "validation: add ChainMan logic for completing UTXO snapshot validation" (1eb88f8)

This seems unclear what is supposed or why. Would seem clearer to say "Snapshot base block {snapshot_blockhash} height {snapshot_base_height} doesn't match validated block {index_new.hash} height {index_new.height}"

achow101 added a commit to bitcoin-core/gui that referenced this pull request Oct 13, 2022
bf95976 doc: add note about snapshot chainstate init (James O'Beirne)
e4d7995 test: add testcases for snapshot initialization (James O'Beirne)
cced4e7 test: move-only-ish: factor out LoadVerifyActivateChainstate() (James O'Beirne)
51fc924 test: allow on-disk coins and block tree dbs in tests (James O'Beirne)
3c36139 test: add reset_chainstate parameter for snapshot unittests (James O'Beirne)
00b357c validation: add ResetChainstates() (James O'Beirne)
3a29dfb move-only: test: make snapshot chainstate setup reusable (James O'Beirne)
8153bd9 blockmanager: avoid undefined behavior during FlushBlockFile (James O'Beirne)
ad67ff3 validation: remove snapshot datadirs upon validation failure (James O'Beirne)
34d1590 add utilities for deleting on-disk leveldb data (James O'Beirne)
252abd1 init: add utxo snapshot detection (James O'Beirne)
f9f1735 validation: rename snapshot chainstate dir (James O'Beirne)
d14bebf db: add StoragePath to CDBWrapper/CCoinsViewDB (James O'Beirne)

Pull request description:

  This is part of the [assumeutxo project](https://github.com/bitcoin/bitcoin/projects/11) (parent PR: bitcoin/bitcoin#15606)

  ---

  Half of the replacement for #24232. The original PR grew larger than expected throughout the review process.

  This change adds the ability to initialize a snapshot-based chainstate during init if one is detected on disk. This is of course unused as of now (aside from in unittests) given that we haven't yet enabled actually loading snapshots.

  Don't be scared! There are some big move-only commits in here.

  Accompanying changes include:

  - moving the snapshot coinsdb directory from being called `chainstate_[base blockhash]` to `chainstate_snapshot`, since we only support one snapshot in use at a time. This simplifies some logic, but it necessitates writing that base blockhash out to a file within the coinsdb dir. See [discussion here](bitcoin/bitcoin#24232 (comment)).
  - adding a simple fix in `FlushBlockFile()` that avoids a crash when attemping to flush to disk before `LoadBlockIndexDB()` is called, which happens when calling `MaybeRebalanceCaches()` during multiple chainstate init.
  - improving the unittest to allow testing with on-disk chainstates - necessary to test a simulated restart and re-initialization.

ACKs for top commit:
  naumenkogs:
    utACK bf95976
  ariard:
    Code Review ACK bf95976
  ryanofsky:
    Code review ACK bf95976. Changes since last review: rebasing, switching from CAutoFile to AutoFile, adding comments, switching from BOOST_CHECK to Assert in test util, using chainman.GetMutex() in tests, destroying one ChainstateManager before creating a new one in tests
  fjahr:
    utACK bf95976
  aureleoules:
    ACK bf95976

Tree-SHA512: 15ae75caf19f8d12a12d2647c52897904d27b265a7af6b4ae7b858592eeadb8f9da6c2394b6baebec90adc28742c053e3eb506119577dae7c1e722ebb3b7bcc0
achow101 added a commit to bitcoin-core/gui that referenced this pull request Mar 7, 2023
…tion

2b373fe docs: update assumeutxo.md (James O'Beirne)
87a1108 test: add snapshot completion unittests (James O'Beirne)
d70919a refactor: make MempoolMutex() public (James O'Beirne)
7300ced log: add LoadBlockIndex() message for assumedvalid blocks (James O'Beirne)
d96c59c validation: add ChainMan logic for completing UTXO snapshot validation (James O'Beirne)
f2a4f33 move-only-ish: init: factor out chainstate initialization (James O'Beirne)
637a90b add Chainstate::HasCoinsViews() (James O'Beirne)
c29f26b validation: add CChainState::m_disabled and ChainMan::isUsable (James O'Beirne)
5ee22cd add ChainstateManager.GetSnapshot{BaseHeight,BaseBlock}() (James O'Beirne)

Pull request description:

  This is part of the [assumeutxo project](https://github.com/bitcoin/bitcoin/projects/11) (parent PR: bitcoin/bitcoin#15606)

  Part two of replacing bitcoin/bitcoin#24232.

  ---

  When a user activates a snapshot, the serialized UTXO set data is used to create an "assumed-valid" chainstate, which becomes active in an attempt to get the node to network tip as quickly as possible. Simultaneously in the background, the already-existing chainstate continues "conventional" IBD to both accumulate full block data and serve as a belt-and-suspenders to validate the assumed-valid chainstate.

  Once the background chainstate's tip reaches the base block of the snapshot used, we set `m_stop_use` on that chainstate and immediately take the hash of its UTXO set; we verify that this matches the assumeutxo value in the source code. Note that while we ultimately want to remove this background chainstate, we don't do so until the following initialization process, when we again check the UTXO set hash of the background chainstate, and if it continues to match, we remove the (now unnecessary) background chainstate, and move the (previously) assumed-valid chainstate into its place. We then reinitialize the chainstate in the normal way.

  As noted in previous comments, we could do the filesystem operations "inline" immediately when the background validation completes, but that's basically just an optimization that saves disk space until the next restart. It didn't strike me as worth the risk of moving chainstate data around on disk during runtime of the node, though maybe my concerns are overblown.

  The final result of this completion process is a fully-validated chain, where the only evidence that the user synced using assumeutxo is the existence of a `base_blockhash` file in the `chainstate` directory.

ACKs for top commit:
  achow101:
    ACK 2b373fe

Tree-SHA512: a204e1d6e6932dd83c799af3606b01a9faf893f04e9ee1a36d63f2f1ccfa9118bdc1c107d86976aa0312814267e6a42074bf3e2bf1dead4b2513efc6d955e13d
@bitcoin bitcoin locked and limited conversation to collaborators Aug 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants