Skip to content

Conversation

mzumsande
Copy link
Contributor

When the block index database is non-contiguous due to file corruption (i.e. it contains indexes of height x-1 and x+1, but not x), bitcoind can currently crash with an assert in BuildSkip() / GetAncestor() during BlockManager::LoadBlockIndex():

bitcoind: chain.cpp:112: const CBlockIndex* CBlockIndex::GetAncestor(int) const: Assertion `pindexWalk->pprev' failed.

This PR changes it such that we instead return an InitError to the user.

I stumbled upon this because I noticed that the file perturbation in feature_init.py wasn't working as intended, which is fixed in the second commit:

  • Opening the file twice in one with statement would lead to tf_read being empty, so the test wouldn't perturb anything but replace the file with a new one. Fixed by first opening for read, then for write.
  • We need to restore the previous state after perturbations, so that only the current perturbation is active and not a mix of the current and previous ones.
  • I also added checkblocks=200 to the startup parameters so that corruption in earlier blocks of blk00000.dat is detected during init verification and not ignored.

After fixing feature_init.py like that I'd run into the assert mentioned above (so running the testfix from the second commit without the first one is a way to reproduce it).

@DrahtBot
Copy link
Contributor

DrahtBot commented Jun 4, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK furszy, fjahr, achow101
Concept ACK jonatack, TheCharlatan

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

@Brotcrunsher
Copy link
Contributor

Little disclaimer: It's my first Review for Bitcoin Core. So treat my input with care, I might have no clue what I am talking about.

What I don't like about this change is that previously the user at least had a chance of finding out what's wrong, without debugging the code. Now he won't have any info of what went wrong other than "Error loading block database", which can have multiple other reasons. Is there a way for us to communicate this better? Maybe a log? Or can we throw here? The sad thing is that the return seems to be intermangled with none error cases like ShutdownRequested.

Comment on lines 263 to 268
CBlockIndex* previous_index{nullptr};
for (CBlockIndex* pindex : vSortedByHeight) {
if (ShutdownRequested()) return false;
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) return false; // corruption, non-contiguous block indexes
previous_index = pindex;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't vSortedByHeight contain blocks from different chains at the same height?

Copy link
Contributor Author

@mzumsande mzumsande Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with "different chains"? vSortedHeight could contain blocks with the same height x (forks), but I can't see how these could have made it into the block index without having a predecessor of height x - 1 at the time they were added. Also, these shouldn't make the added check fail.

I do wonder though if it would be possible / more general to directly check that each block of height > 0 has a predecessor in the chain, instead of doing what the PR currently suggests. I'll look into that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with "different chains"? vSortedHeight could contain blocks with the same height x (forks), but I can't see how these could have made it into the block index without having a predecessor of height x - 1 at the time they were added. Also, these shouldn't make the added check fail.

yeah ok. nvm. Got confused by the early return (I thought it in the other way around).
The check will have the previous block at the same height, so pindex->nHeight > previous_index->nHeight + 1 will be false.

I do wonder though if it would be possible / more general to directly check that each block of height > 0 has a predecessor in the chain, instead of doing what the PR currently suggests. I'll look into that.

Wouldn't that be broken if we ever add a parallel headers download process?
Which IIRC we don't have because they are downloaded from a single peer at time. But if we implement it, we could get orphan block indexes pretty easily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that be broken if we ever add a parallel headers download process?
Which IIRC we don't have because they are downloaded from a single peer at time. But if we implement it, we could get orphan block indexes pretty easily.

Yes, but only we had a very naive parallel headers download process. I would imagine that if we ever parallelized headers download, we'd take care not to accept orphaned block indexes into our main block index (for DoS reasons) but would keep those headers in some kind of intermediate buffer until we can connect them.

net_processing makes sure that we only save headers that we connect (code).

@mzumsande
Copy link
Contributor Author

Now he won't have any info of what went wrong other than "Error loading block database", which can have multiple other reasons. Is there a way for us to communicate this better? Maybe a log? Or can we throw here?

Thanks! Yes, adding a log seems like a good idea, I'll do that with the next push!

@mzumsande
Copy link
Contributor Author

Added a Log error, and made the commit message more precise.

I do wonder though if it would be possible / more general to directly check that each block of height > 0 has a predecessor in the chain, instead of doing what the PR currently suggests.

Since the current approach wouldn't catch all possible cases of a non-contiguous block index (but the one encountered in feature_init.py !), still looking to into having a stricter check - opinions welcome!

Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

for (CBlockIndex* pindex : vSortedByHeight) {
if (ShutdownRequested()) return false;
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) {
return error("%s: block index is non-contiguous, index of height %d missing", __func__, previous_index->nHeight + 1);
Copy link
Member

@jonatack jonatack Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've probably confused myself by how I'm checking this, but this doesn't seem to be the error site that is being hit? The diff below passes. (Your updated test does indeed fail on master and pass after building this pull, as described in the OP.)

test diff

diff --git a/test/functional/feature_init.py b/test/functional/feature_init.py
index 194ba9ed265..38bd3d5db0f 100755
--- a/test/functional/feature_init.py
+++ b/test/functional/feature_init.py
@@ -107,6 +107,12 @@ class InitStressTest(BitcoinTestFramework):
             'blocks/blk*.dat': 'Corrupted block database detected.',
         }
 
+        errors_logged = [
+            "ERROR: LoadBlockIndex: block index is non-contiguous",
+            "LevelDB read failure: Corruption: not an sstable (bad magic number)",
+            "ReadBlockFromDisk(CBlock&, CBlockIndex*): GetHash() doesn't match index for CBlockIndex",
+        ]
+
         for file_patt, err_fragment in files_to_delete.items():
             target_files = list(node.chain_path.glob(file_patt))
 
@@ -126,6 +132,7 @@ class InitStressTest(BitcoinTestFramework):
             self.stop_node(0)
 
         self.log.info("Test startup errors after perturbing certain essential files")
+        i = 0
         for file_patt, err_fragment in files_to_perturb.items():
             shutil.copytree(node.chain_path / "blocks", node.chain_path / "blocks_bak")
             shutil.copytree(node.chain_path / "chainstate", node.chain_path / "chainstate_bak")
@@ -139,8 +146,10 @@ class InitStressTest(BitcoinTestFramework):
                     tweaked_contents[50:350] = b'1' * 300
                 with open(target_file, "wb") as tf_write:
                     tf_write.write(bytes(tweaked_contents))

-            start_expecting_error(err_fragment)
+            self.log.info(f"{i} - {err_fragment} - {errors_logged[i]}")
+            with node.assert_debug_log([errors_logged[i]]):
+                start_expecting_error(err_fragment)
+            i += 1
 
             shutil.rmtree(node.chain_path / "blocks")
             shutil.rmtree(node.chain_path / "chainstate")

Copy link
Contributor Author

@mzumsande mzumsande Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff below passes.

Where does it pass for you?
I tried to apply the diff too, and for me it passes on this branch and fails after removing 04d5337 (not because the log message isn't found, but because the bitcoin core crashes with the assert before that).
That is what I would have expected, since the diff adds an additional check for the added log "ERROR: LoadBlockIndex: block index is non-contiguous", which is supposed to be hit as a result of the perturbation of the block files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (coming from a non-native speaker): isn't non-continuous the more fitting description (meaning "(un)interrupted sequence") rather than non-contiguous (meaning "(not) sharing a common boundary")?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a native speaker either, but for me "contiguous" has more of a spatial connotation (like blocks building a chain by being attached to each other), whereas "continuous" has a more temporal connotation.

@TheCharlatan
Copy link
Contributor

Concept ACK

@mzumsande mzumsande force-pushed the 202306_feature_init_fix branch from 5352a98 to 25c1b21 Compare June 28, 2023 14:43
@mzumsande
Copy link
Contributor Author

Could rebase for CI?

done

If a height is missing we are facing a non-contiguous block index db, and could previously
hit an assert in GetAncestor() called from BuildSkip() instead of returning an error.
@mzumsande mzumsande force-pushed the 202306_feature_init_fix branch from 25c1b21 to 9d33681 Compare July 18, 2023 15:37
Comment on lines +265 to +268
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) {
return error("%s: block index is non-contiguous, index of height %d missing", __func__, previous_index->nHeight + 1);
}
previous_index = pindex;
Copy link
Member

@furszy furszy Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ad66ca1:

What if the genesis block index is missing?
The first round of the loop would set previous_index to block_index_1. Second one to block_index_2, etc. Making LoadBlockIndex() pass?.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, but the specific case of a missing (or incorrect) genesis block index should get caught in an extra check right after, see here.

There are other cases that wouldn't be caught with the proposed solution: We could have the valid chain and also some extra block indices that don't connect.
A stricter check would therefore be to require each block index of height x (except genesis) to have a predecessor of height x - 1. I didn't add it because these kind of extra blocks wouldn't actually hurt anything, plus maybe there are some of these block indices out there because of some bug in some ancient version of core.

Simultaneously opening the file in read and write mode would
lead to opening of an empty file instead of perturbing the existing
file.
Also, revert to the previous state after each perturbation so that
each perturbation is applied in isolation.
@mzumsande mzumsande force-pushed the 202306_feature_init_fix branch from 9d33681 to d27b9a2 Compare July 18, 2023 20:17
@mzumsande
Copy link
Contributor Author

9d33681 to d27b9a2:
Added an explanation why the tweaked_contents window of feature_init.py was changed: Since the genesis block is not checked by the -checkblocks verification, the perturbation window must be chosen such that a higher block in blk*.dat is affected.

Copy link
Member

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code ACK d27b9a2

Comment on lines +139 to +142
# Since the genesis block is not checked by -checkblocks, the
# perturbation window must be chosen such that a higher block
# in blk*.dat is affected.
tweaked_contents[150:350] = b'1' * 200
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe could randomize (or restructure) this a bit in a follow-up. Perturbing different parts of the block content might let us detect and add coverage for different errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially sounds like a good first issue

Comment on lines +130 to +131
shutil.copytree(node.chain_path / "blocks", node.chain_path / "blocks_bak")
shutil.copytree(node.chain_path / "chainstate", node.chain_path / "chainstate_bak")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: couldn't this be copied outside of the for-loop only once?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could but then the moves below would need to be copies instead and then there would need to be another two lines needed for cleanup of the left-over files after the loop, so I think it's alright to keep as is.

Copy link
Contributor

@fjahr fjahr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK d27b9a2

I have left a few suggestions for improvement but these could be done in a follow-up, along with suggestions from @furszy . If this is merged, I would probably give this to someone else as a good first issue.

Comment on lines +130 to +131
shutil.copytree(node.chain_path / "blocks", node.chain_path / "blocks_bak")
shutil.copytree(node.chain_path / "chainstate", node.chain_path / "chainstate_bak")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could but then the moves below would need to be copies instead and then there would need to be another two lines needed for cleanup of the left-over files after the loop, so I think it's alright to keep as is.

Comment on lines +139 to +142
# Since the genesis block is not checked by -checkblocks, the
# perturbation window must be chosen such that a higher block
# in blk*.dat is affected.
tweaked_contents[150:350] = b'1' * 200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially sounds like a good first issue

target_files = list(node.chain_path.glob(file_patt))

for target_file in target_files:
self.log.info(f"Perturbing file to ensure failure {target_file}")
with open(target_file, "rb") as tf_read, open(target_file, "wb") as tf_write:
with open(target_file, "rb") as tf_read:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't see why we are reading the content if we don't care about it anyway here. I think we can just override it and that simplifies things. I drafted that in this commit: fjahr@84132ab

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that is simpler!

for (CBlockIndex* pindex : vSortedByHeight) {
if (m_interrupt) return false;
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If you retouch I would like it if there was an additional comment how this could happen. The error message below explains what this checks but with these more obscure issues it's also great to know where it might come from without having to git blame it.

for (CBlockIndex* pindex : vSortedByHeight) {
if (ShutdownRequested()) return false;
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) {
return error("%s: block index is non-contiguous, index of height %d missing", __func__, previous_index->nHeight + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (coming from a non-native speaker): isn't non-continuous the more fitting description (meaning "(un)interrupted sequence") rather than non-contiguous (meaning "(not) sharing a common boundary")?

@fanquake fanquake requested a review from ryanofsky October 2, 2023 09:45
@achow101
Copy link
Member

achow101 commented Oct 4, 2023

ACK d27b9a2

@achow101 achow101 merged commit ab163b0 into bitcoin:master Oct 4, 2023
Copy link
Contributor Author

@mzumsande mzumsande left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews! I didn't get to address them before the merge (and I think some of the open suggestions could be incorporated in a follow-up).
I agree that besides the code simplifications, there is some room for improvement, and opened #28603 for this!

for (CBlockIndex* pindex : vSortedByHeight) {
if (ShutdownRequested()) return false;
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) {
return error("%s: block index is non-contiguous, index of height %d missing", __func__, previous_index->nHeight + 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a native speaker either, but for me "contiguous" has more of a spatial connotation (like blocks building a chain by being attached to each other), whereas "continuous" has a more temporal connotation.

target_files = list(node.chain_path.glob(file_patt))

for target_file in target_files:
self.log.info(f"Perturbing file to ensure failure {target_file}")
with open(target_file, "rb") as tf_read, open(target_file, "wb") as tf_write:
with open(target_file, "rb") as tf_read:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that is simpler!

Frank-GER pushed a commit to syscoin/syscoin that referenced this pull request Oct 13, 2023
…guous, fix feature_init.py file perturbation

d27b9a2 test: fix feature_init.py file perturbation (Martin Zumsande)
ad66ca1 init: abort loading of blockindex in case of missing height. (Martin Zumsande)

Pull request description:

  When the block index database is non-contiguous due to file corruption (i.e. it contains indexes of height `x-1` and `x+1`, but not `x`), bitcoind can currently crash with an assert in `BuildSkip()` / `GetAncestor()` during `BlockManager::LoadBlockIndex()`:
  ```
  bitcoind: chain.cpp:112: const CBlockIndex* CBlockIndex::GetAncestor(int) const: Assertion `pindexWalk->pprev' failed.
  ```
  This PR changes it such that we instead return an `InitError` to the user.

  I stumbled upon this because I noticed that the file perturbation in `feature_init.py`  wasn't working as intended, which is fixed in the second commit:
  * Opening the file twice in one `with` statement would lead to `tf_read` being empty, so the test wouldn't perturb anything but replace the file with a new one. Fixed by first opening for read, then for write.
  * We need to restore the previous state after perturbations, so that only the current perturbation is active and not a mix of the current and previous ones.
  * I also added `checkblocks=200` to the startup parameters so that corruption in earlier blocks of `blk00000.dat` is detected during init verification and not ignored.

  After fixing `feature_init.py` like that I'd run into the `assert` mentioned above (so running the testfix from the second commit without the first one is a way to reproduce it).

ACKs for top commit:
  achow101:
    ACK d27b9a2
  furszy:
    Code ACK d27b9a2
  fjahr:
    Code review ACK d27b9a2

Tree-SHA512: 2e54da6030c5813c86bd58f816401e090bb43c5b834764a5e3c0e55dbfe09e423f88042cab823db3742088204b274d4ad2abf58a3832a4b18328b11a30bf7094
achow101 added a commit that referenced this pull request Nov 6, 2023
5ab6419 test: randomized perturbing in feature_init (L0la L33tz)
64b80d5 test: simplify feature_init (Fabian Jahr)

Pull request description:

  Fixes #28603

  Added suggested simplifications and implemented randomization

ACKs for top commit:
  theStack:
    Light ACK 5ab6419
  maflcko:
    lgtm ACK 5ab6419
  achow101:
    ACK 5ab6419

Tree-SHA512: e6f43eef7f8dd12c7fccbe437cb430dc9d383825d7ab2caa0382d061f88dec6d28522e1ec78f3f58f26d35cba93512fa21e330c48d06b1d8141a16f07050af5a
@mzumsande mzumsande deleted the 202306_feature_init_fix branch April 29, 2024 19:52
kwvg added a commit to kwvg/dash that referenced this pull request Jan 11, 2025
kwvg added a commit to kwvg/dash that referenced this pull request Jan 11, 2025
kwvg added a commit to kwvg/dash that referenced this pull request Jan 14, 2025
kwvg added a commit to kwvg/dash that referenced this pull request Jan 14, 2025
kwvg added a commit to kwvg/dash that referenced this pull request Jan 14, 2025
kwvg added a commit to kwvg/dash that referenced this pull request Jan 14, 2025
PastaPastaPasta added a commit to dashpay/dash that referenced this pull request Jan 15, 2025
, bitcoin#23736, bitcoin#23855, bitcoin#21464, bitcoin#24403, bitcoin#24968, bitcoin#24933, bitcoin#25358, bitcoin#27823, bitcoin#28253, bitcoin#28612 (auxiliary backports: part 22)

ce284ca merge bitcoin#28612: followups to bitcoin#27823 (Kittywhiskers Van Gogh)
ad7769b merge bitcoin#28253: display abrupt shutdown errors in console output (Kittywhiskers Van Gogh)
4965f0a merge bitcoin#27823: return error when block index is non-contiguous, fix feature_init.py file perturbation (Kittywhiskers Van Gogh)
9d375af merge bitcoin#25358: passing a value below 5 MB to -maxmempool should throw an error (Kittywhiskers Van Gogh)
530f9b2 merge bitcoin#24933: Replace non-threadsafe strerror (Kittywhiskers Van Gogh)
f53466d merge bitcoin#24968: Move TxOrphange tests to orphange_tests.cpp (Kittywhiskers Van Gogh)
5da523c merge bitcoin#24403: Avoid implicit-integer-sign-change in VerifyLoadedChainstate (Kittywhiskers Van Gogh)
5069a0b merge bitcoin#21464: Mempool Update Cut-Through Optimization (Kittywhiskers Van Gogh)
c908a22 merge bitcoin#23855: Post-"Chainstate loading sequence coalescence" fixups (Kittywhiskers Van Gogh)
5b2852f merge bitcoin#23736: call VerifyLoadedChainstate during ChainTestingSetup (Kittywhiskers Van Gogh)
1f5a7c1 trivial: fix padding in configure print (UdjinM6)
ed65719 merge bitcoin#23557: remove Bashism (Kittywhiskers Van Gogh)
e2a83f9 merge bitcoin#23498: remove unnecessary block rehashing prior to solving (Kittywhiskers Van Gogh)
57af825 merge bitcoin#23065: Allow UTXO locks to be written to wallet DB (Kittywhiskers Van Gogh)

Pull request description:

  ## Additional Information

  * In [bitcoin#23065](bitcoin#23065), the credit recalculation routine used in `CWallet::(Un)lockCoin` is spun-off into `CWallet::RecalculateMixedCredit()` and will retain its previous behavior of being unconditionally called (i.e. even if persistent (un)locking fails).

  ## Breaking Changes

  None expected.

  ## Checklist

  - [x] I have performed a self-review of my own code
  - [x] I have commented my code, particularly in hard-to-understand areas **(note: N/A)**
  - [x] I have added or updated relevant unit/integration/functional/e2e tests
  - [x] I have made corresponding changes to the documentation
  - [x] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_

ACKs for top commit:
  UdjinM6:
    utACK ce284ca
  PastaPastaPasta:
    utACK ce284ca

Tree-SHA512: c07d8d4ac56a15107c22972ce9ccc73a72cdee9b872c87ffa2b594b590f97d7d99ca53220353e37cbcaea0e312c031dfcdd5da31dde0a4247fffafc15c392700
@bitcoin bitcoin locked and limited conversation to collaborators Apr 29, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants