-
Notifications
You must be signed in to change notification settings - Fork 37.7k
init: return error when block index is non-contiguous, fix feature_init.py file perturbation #27823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsNo conflicts as of last run. |
Little disclaimer: It's my first Review for Bitcoin Core. So treat my input with care, I might have no clue what I am talking about. What I don't like about this change is that previously the user at least had a chance of finding out what's wrong, without debugging the code. Now he won't have any info of what went wrong other than "Error loading block database", which can have multiple other reasons. Is there a way for us to communicate this better? Maybe a log? Or can we throw here? The sad thing is that the return seems to be intermangled with none error cases like |
fe00062
to
6d853ab
Compare
src/node/blockstorage.cpp
Outdated
CBlockIndex* previous_index{nullptr}; | ||
for (CBlockIndex* pindex : vSortedByHeight) { | ||
if (ShutdownRequested()) return false; | ||
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) return false; // corruption, non-contiguous block indexes | ||
previous_index = pindex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't vSortedByHeight
contain blocks from different chains at the same height?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean with "different chains"? vSortedHeight
could contain blocks with the same height x (forks), but I can't see how these could have made it into the block index without having a predecessor of height x - 1 at the time they were added. Also, these shouldn't make the added check fail.
I do wonder though if it would be possible / more general to directly check that each block of height > 0 has a predecessor in the chain, instead of doing what the PR currently suggests. I'll look into that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean with "different chains"?
vSortedHeight
could contain blocks with the same height x (forks), but I can't see how these could have made it into the block index without having a predecessor of height x - 1 at the time they were added. Also, these shouldn't make the added check fail.
yeah ok. nvm. Got confused by the early return (I thought it in the other way around).
The check will have the previous block at the same height, so pindex->nHeight > previous_index->nHeight + 1
will be false.
I do wonder though if it would be possible / more general to directly check that each block of height > 0 has a predecessor in the chain, instead of doing what the PR currently suggests. I'll look into that.
Wouldn't that be broken if we ever add a parallel headers download process?
Which IIRC we don't have because they are downloaded from a single peer at time. But if we implement it, we could get orphan block indexes pretty easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that be broken if we ever add a parallel headers download process?
Which IIRC we don't have because they are downloaded from a single peer at time. But if we implement it, we could get orphan block indexes pretty easily.
Yes, but only we had a very naive parallel headers download process. I would imagine that if we ever parallelized headers download, we'd take care not to accept orphaned block indexes into our main block index (for DoS reasons) but would keep those headers in some kind of intermediate buffer until we can connect them.
net_processing
makes sure that we only save headers that we connect (code).
Thanks! Yes, adding a log seems like a good idea, I'll do that with the next push! |
6d853ab
to
5352a98
Compare
Added a Log error, and made the commit message more precise.
Since the current approach wouldn't catch all possible cases of a non-contiguous block index (but the one encountered in feature_init.py !), still looking to into having a stricter check - opinions welcome! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK
for (CBlockIndex* pindex : vSortedByHeight) { | ||
if (ShutdownRequested()) return false; | ||
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) { | ||
return error("%s: block index is non-contiguous, index of height %d missing", __func__, previous_index->nHeight + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've probably confused myself by how I'm checking this, but this doesn't seem to be the error site that is being hit? The diff below passes. (Your updated test does indeed fail on master and pass after building this pull, as described in the OP.)
test diff
diff --git a/test/functional/feature_init.py b/test/functional/feature_init.py
index 194ba9ed265..38bd3d5db0f 100755
--- a/test/functional/feature_init.py
+++ b/test/functional/feature_init.py
@@ -107,6 +107,12 @@ class InitStressTest(BitcoinTestFramework):
'blocks/blk*.dat': 'Corrupted block database detected.',
}
+ errors_logged = [
+ "ERROR: LoadBlockIndex: block index is non-contiguous",
+ "LevelDB read failure: Corruption: not an sstable (bad magic number)",
+ "ReadBlockFromDisk(CBlock&, CBlockIndex*): GetHash() doesn't match index for CBlockIndex",
+ ]
+
for file_patt, err_fragment in files_to_delete.items():
target_files = list(node.chain_path.glob(file_patt))
@@ -126,6 +132,7 @@ class InitStressTest(BitcoinTestFramework):
self.stop_node(0)
self.log.info("Test startup errors after perturbing certain essential files")
+ i = 0
for file_patt, err_fragment in files_to_perturb.items():
shutil.copytree(node.chain_path / "blocks", node.chain_path / "blocks_bak")
shutil.copytree(node.chain_path / "chainstate", node.chain_path / "chainstate_bak")
@@ -139,8 +146,10 @@ class InitStressTest(BitcoinTestFramework):
tweaked_contents[50:350] = b'1' * 300
with open(target_file, "wb") as tf_write:
tf_write.write(bytes(tweaked_contents))
- start_expecting_error(err_fragment)
+ self.log.info(f"{i} - {err_fragment} - {errors_logged[i]}")
+ with node.assert_debug_log([errors_logged[i]]):
+ start_expecting_error(err_fragment)
+ i += 1
shutil.rmtree(node.chain_path / "blocks")
shutil.rmtree(node.chain_path / "chainstate")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff below passes.
Where does it pass for you?
I tried to apply the diff too, and for me it passes on this branch and fails after removing 04d5337 (not because the log message isn't found, but because the bitcoin core crashes with the assert before that).
That is what I would have expected, since the diff adds an additional check for the added log "ERROR: LoadBlockIndex: block index is non-contiguous", which is supposed to be hit as a result of the perturbation of the block files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (coming from a non-native speaker): isn't non-continuous the more fitting description (meaning "(un)interrupted sequence") rather than non-contiguous (meaning "(not) sharing a common boundary")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a native speaker either, but for me "contiguous" has more of a spatial connotation (like blocks building a chain by being attached to each other), whereas "continuous" has a more temporal connotation.
Concept ACK |
5352a98
to
25c1b21
Compare
done |
If a height is missing we are facing a non-contiguous block index db, and could previously hit an assert in GetAncestor() called from BuildSkip() instead of returning an error.
25c1b21
to
9d33681
Compare
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) { | ||
return error("%s: block index is non-contiguous, index of height %d missing", __func__, previous_index->nHeight + 1); | ||
} | ||
previous_index = pindex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ad66ca1:
What if the genesis block index is missing?
The first round of the loop would set previous_index
to block_index_1. Second one to block_index_2, etc. Making LoadBlockIndex()
pass?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true, but the specific case of a missing (or incorrect) genesis block index should get caught in an extra check right after, see here.
There are other cases that wouldn't be caught with the proposed solution: We could have the valid chain and also some extra block indices that don't connect.
A stricter check would therefore be to require each block index of height x
(except genesis) to have a predecessor of height x - 1
. I didn't add it because these kind of extra blocks wouldn't actually hurt anything, plus maybe there are some of these block indices out there because of some bug in some ancient version of core.
Simultaneously opening the file in read and write mode would lead to opening of an empty file instead of perturbing the existing file. Also, revert to the previous state after each perturbation so that each perturbation is applied in isolation.
9d33681
to
d27b9a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code ACK d27b9a2
# Since the genesis block is not checked by -checkblocks, the | ||
# perturbation window must be chosen such that a higher block | ||
# in blk*.dat is affected. | ||
tweaked_contents[150:350] = b'1' * 200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe could randomize (or restructure) this a bit in a follow-up. Perturbing different parts of the block content might let us detect and add coverage for different errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially sounds like a good first issue
shutil.copytree(node.chain_path / "blocks", node.chain_path / "blocks_bak") | ||
shutil.copytree(node.chain_path / "chainstate", node.chain_path / "chainstate_bak") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: couldn't this be copied outside of the for-loop only once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could but then the move
s below would need to be copies instead and then there would need to be another two lines needed for cleanup of the left-over files after the loop, so I think it's alright to keep as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shutil.copytree(node.chain_path / "blocks", node.chain_path / "blocks_bak") | ||
shutil.copytree(node.chain_path / "chainstate", node.chain_path / "chainstate_bak") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could but then the move
s below would need to be copies instead and then there would need to be another two lines needed for cleanup of the left-over files after the loop, so I think it's alright to keep as is.
# Since the genesis block is not checked by -checkblocks, the | ||
# perturbation window must be chosen such that a higher block | ||
# in blk*.dat is affected. | ||
tweaked_contents[150:350] = b'1' * 200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially sounds like a good first issue
target_files = list(node.chain_path.glob(file_patt)) | ||
|
||
for target_file in target_files: | ||
self.log.info(f"Perturbing file to ensure failure {target_file}") | ||
with open(target_file, "rb") as tf_read, open(target_file, "wb") as tf_write: | ||
with open(target_file, "rb") as tf_read: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I don't see why we are reading the content if we don't care about it anyway here. I think we can just override it and that simplifies things. I drafted that in this commit: fjahr@84132ab
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, that is simpler!
for (CBlockIndex* pindex : vSortedByHeight) { | ||
if (m_interrupt) return false; | ||
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If you retouch I would like it if there was an additional comment how this could happen. The error message below explains what this checks but with these more obscure issues it's also great to know where it might come from without having to git blame it.
for (CBlockIndex* pindex : vSortedByHeight) { | ||
if (ShutdownRequested()) return false; | ||
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) { | ||
return error("%s: block index is non-contiguous, index of height %d missing", __func__, previous_index->nHeight + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (coming from a non-native speaker): isn't non-continuous the more fitting description (meaning "(un)interrupted sequence") rather than non-contiguous (meaning "(not) sharing a common boundary")?
ACK d27b9a2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews! I didn't get to address them before the merge (and I think some of the open suggestions could be incorporated in a follow-up).
I agree that besides the code simplifications, there is some room for improvement, and opened #28603 for this!
for (CBlockIndex* pindex : vSortedByHeight) { | ||
if (ShutdownRequested()) return false; | ||
if (previous_index && pindex->nHeight > previous_index->nHeight + 1) { | ||
return error("%s: block index is non-contiguous, index of height %d missing", __func__, previous_index->nHeight + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a native speaker either, but for me "contiguous" has more of a spatial connotation (like blocks building a chain by being attached to each other), whereas "continuous" has a more temporal connotation.
target_files = list(node.chain_path.glob(file_patt)) | ||
|
||
for target_file in target_files: | ||
self.log.info(f"Perturbing file to ensure failure {target_file}") | ||
with open(target_file, "rb") as tf_read, open(target_file, "wb") as tf_write: | ||
with open(target_file, "rb") as tf_read: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, that is simpler!
…guous, fix feature_init.py file perturbation d27b9a2 test: fix feature_init.py file perturbation (Martin Zumsande) ad66ca1 init: abort loading of blockindex in case of missing height. (Martin Zumsande) Pull request description: When the block index database is non-contiguous due to file corruption (i.e. it contains indexes of height `x-1` and `x+1`, but not `x`), bitcoind can currently crash with an assert in `BuildSkip()` / `GetAncestor()` during `BlockManager::LoadBlockIndex()`: ``` bitcoind: chain.cpp:112: const CBlockIndex* CBlockIndex::GetAncestor(int) const: Assertion `pindexWalk->pprev' failed. ``` This PR changes it such that we instead return an `InitError` to the user. I stumbled upon this because I noticed that the file perturbation in `feature_init.py` wasn't working as intended, which is fixed in the second commit: * Opening the file twice in one `with` statement would lead to `tf_read` being empty, so the test wouldn't perturb anything but replace the file with a new one. Fixed by first opening for read, then for write. * We need to restore the previous state after perturbations, so that only the current perturbation is active and not a mix of the current and previous ones. * I also added `checkblocks=200` to the startup parameters so that corruption in earlier blocks of `blk00000.dat` is detected during init verification and not ignored. After fixing `feature_init.py` like that I'd run into the `assert` mentioned above (so running the testfix from the second commit without the first one is a way to reproduce it). ACKs for top commit: achow101: ACK d27b9a2 furszy: Code ACK d27b9a2 fjahr: Code review ACK d27b9a2 Tree-SHA512: 2e54da6030c5813c86bd58f816401e090bb43c5b834764a5e3c0e55dbfe09e423f88042cab823db3742088204b274d4ad2abf58a3832a4b18328b11a30bf7094
5ab6419 test: randomized perturbing in feature_init (L0la L33tz) 64b80d5 test: simplify feature_init (Fabian Jahr) Pull request description: Fixes #28603 Added suggested simplifications and implemented randomization ACKs for top commit: theStack: Light ACK 5ab6419 maflcko: lgtm ACK 5ab6419 achow101: ACK 5ab6419 Tree-SHA512: e6f43eef7f8dd12c7fccbe437cb430dc9d383825d7ab2caa0382d061f88dec6d28522e1ec78f3f58f26d35cba93512fa21e330c48d06b1d8141a16f07050af5a
… fix feature_init.py file perturbation
… fix feature_init.py file perturbation
… fix feature_init.py file perturbation
, bitcoin#23736, bitcoin#23855, bitcoin#21464, bitcoin#24403, bitcoin#24968, bitcoin#24933, bitcoin#25358, bitcoin#27823, bitcoin#28253, bitcoin#28612 (auxiliary backports: part 22) ce284ca merge bitcoin#28612: followups to bitcoin#27823 (Kittywhiskers Van Gogh) ad7769b merge bitcoin#28253: display abrupt shutdown errors in console output (Kittywhiskers Van Gogh) 4965f0a merge bitcoin#27823: return error when block index is non-contiguous, fix feature_init.py file perturbation (Kittywhiskers Van Gogh) 9d375af merge bitcoin#25358: passing a value below 5 MB to -maxmempool should throw an error (Kittywhiskers Van Gogh) 530f9b2 merge bitcoin#24933: Replace non-threadsafe strerror (Kittywhiskers Van Gogh) f53466d merge bitcoin#24968: Move TxOrphange tests to orphange_tests.cpp (Kittywhiskers Van Gogh) 5da523c merge bitcoin#24403: Avoid implicit-integer-sign-change in VerifyLoadedChainstate (Kittywhiskers Van Gogh) 5069a0b merge bitcoin#21464: Mempool Update Cut-Through Optimization (Kittywhiskers Van Gogh) c908a22 merge bitcoin#23855: Post-"Chainstate loading sequence coalescence" fixups (Kittywhiskers Van Gogh) 5b2852f merge bitcoin#23736: call VerifyLoadedChainstate during ChainTestingSetup (Kittywhiskers Van Gogh) 1f5a7c1 trivial: fix padding in configure print (UdjinM6) ed65719 merge bitcoin#23557: remove Bashism (Kittywhiskers Van Gogh) e2a83f9 merge bitcoin#23498: remove unnecessary block rehashing prior to solving (Kittywhiskers Van Gogh) 57af825 merge bitcoin#23065: Allow UTXO locks to be written to wallet DB (Kittywhiskers Van Gogh) Pull request description: ## Additional Information * In [bitcoin#23065](bitcoin#23065), the credit recalculation routine used in `CWallet::(Un)lockCoin` is spun-off into `CWallet::RecalculateMixedCredit()` and will retain its previous behavior of being unconditionally called (i.e. even if persistent (un)locking fails). ## Breaking Changes None expected. ## Checklist - [x] I have performed a self-review of my own code - [x] I have commented my code, particularly in hard-to-understand areas **(note: N/A)** - [x] I have added or updated relevant unit/integration/functional/e2e tests - [x] I have made corresponding changes to the documentation - [x] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_ ACKs for top commit: UdjinM6: utACK ce284ca PastaPastaPasta: utACK ce284ca Tree-SHA512: c07d8d4ac56a15107c22972ce9ccc73a72cdee9b872c87ffa2b594b590f97d7d99ca53220353e37cbcaea0e312c031dfcdd5da31dde0a4247fffafc15c392700
When the block index database is non-contiguous due to file corruption (i.e. it contains indexes of height
x-1
andx+1
, but notx
), bitcoind can currently crash with an assert inBuildSkip()
/GetAncestor()
duringBlockManager::LoadBlockIndex()
:This PR changes it such that we instead return an
InitError
to the user.I stumbled upon this because I noticed that the file perturbation in
feature_init.py
wasn't working as intended, which is fixed in the second commit:with
statement would lead totf_read
being empty, so the test wouldn't perturb anything but replace the file with a new one. Fixed by first opening for read, then for write.checkblocks=200
to the startup parameters so that corruption in earlier blocks ofblk00000.dat
is detected during init verification and not ignored.After fixing
feature_init.py
like that I'd run into theassert
mentioned above (so running the testfix from the second commit without the first one is a way to reproduce it).