-
Notifications
You must be signed in to change notification settings - Fork 37.7k
test: Avoid intermittent failures in feature_init #28831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. |
@@ -1,11 +1,10 @@ | |||
#!/usr/bin/env python3 | |||
# Copyright (c) 2021-2022 The Bitcoin Core developers | |||
# Copyright (c) 2021-present The Bitcoin Core developers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
running this command
grep -nr --text "\-present" ./src
I only see these files using present
./src/streams.cpp:1:// Copyright (c) 2009-present The Bitcoin Core developers
./src/kernel/mempool_removal_reason.cpp:1:// Copyright (c) 2016-present The Bitcoin Core developers
./src/kernel/mempool_removal_reason.h:1:// Copyright (c) 2016-present The Bitcoin Core developers
if this is something we're going towards would we want to update the copy right of all files whenever there is a change to the file in any future PR's?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scripted-diff for all files should be easy to rebase, see the thread: #26817 (comment)
Let me know if I should drop it here and leave it for later. I don't really mind either way, as I mostly care about not running into intermittent issues.
lgtm ACK 44445ae |
Can you link to the intermittent failures? Why not reduce the values instead? |
You can simply run the test in a loop to see it fail. For example:
|
Yes, this patch is reducing the value. |
Also, on a second thought, I don't really understand the goal of the randomized byte corruptions. Are they supposed to increase the test coverage for different code paths? If yes, I think it makes sense to list those code paths. Otherwise, if only a single error-code-path is covered, I think a deterministic test is better. I understand that the good first issue mentioned that advanced perturbations should be tested as well, but may require pulling in a leveldb parser. I don't think it makes sense to pull in a leveldb parser for this, unless there is a good reason. (Recall that Bitcoin Core does not support running on hardware that corrupts the datadir). If there was a "honest" pertubation caused by a user setting or different leveldb version that causes issues, that'd be a different case, and it would be a good reason to check that specifically. The same holds for mutating block files. Other than that, I don't think it makes sense to add a test for this. And in any case, adding a test for this, or modifying the existing test, should be a separate follow-up from fixing a bug by restoring the previous test. |
Maybe as a side-effect but it's rolling back the randomization completely. What I mean is reducing the max value of |
Why? Did you see my previous comment? |
It's fine if you want to remove the randomization as well as you say in the other comment but that's not what I meant when I said "Why not reduce the values instead?". You replied to that "Yes, this patch is reducing the value.". But that is not the only thing that it does and that's why it's not what I suggested. What I suggested is reducing the max value in |
If the test fails because the leveldb is smaller than the block files, I would propose keeping the randomization for the block files and adding a second deterministic test for leveldb? |
Hm, this doesn't fail for me on master.
|
Jup, that is what I mean when I say we don't want to pull in a leveldb dependency into this test, unless there is a reason. leveldb has different code paths depending on the operating system. You are using macOS and I was using Linux. I presume you'll be able to reproduce if you also use Linux. Alternatively, you can try to run the test in a loop on macOS, but I can't do that, because I don't have macOS. |
Yes, I considered this and I think it could make sense. However, it may require reading the file size and using that as the maximum, or other changes. So for now, I made it trivial to review by just reverting, so that the CI and tests are no longer failing intermittently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the test fails because the leveldb is smaller than the block files, I would propose keeping the randomization for the block files and adding a second deterministic test for leveldb?
That was also my initial thought when reading this PR's description. In contrast to the chainstate leveldb files, the block files are fully in our control and possible faulty behaviour caused by perturbation should (hopefully) be identical on all operating systems, as there is no external library involved.
I agree with @maflcko though that the highest priority now is to fix the CI, so reverting seems to be a reasonable first step. If someone is working on a follow-up introducing the random perturbation, feel free to ping me as reviewer.
ACK 44445ae
Same here, happy to review |
I couldn't reproduce the failure on Ubuntu 20.04 either.
@theStack Can you link to a CI failure? Maybe I can reproduce it with that info. |
Did you run the test in a loop on Linux? So far two people reported local issues. There are also CI failures, for example https://cirrus-ci.com/task/5031659184062464?logs=ci#L4069 (This one also shows a memory leak due to the corruption, so that'd be another reason to not fiddle with levedb files, as it will cause an CI error, even if Bitcoin Core shut down with the correct error message) |
So I was actually able to reproduce this by just hardcoding the max values of the random range:
failure details
I tried to look at the file sizes of the ldb files involved but the file where the test fails is deleted, not really sure why as of now ls result
Instead I edited the log message to include a file size for the ldb files:
failure result
So there are smaller ldb files that are perturbed first that don't trigger the issue, it's actually a bigger one where the failure happens. I haven't looked into this deeper yet but my best guess would be that ldb doesn't care about this last, big file and doesn't check it? |
No, that's not it, otherwise the hardcoded test wouldn't fail. If we don't understand what leveldb is doing then I guess we should probably remove perturbing its files altogether. |
sgtm, for a follow-up or alternative pull. The goal of this pull is not to change any test, but only to restore the previous code, which didn't intermittently fail. |
ACK 44445ae |
The code not only modifies block dat files, but also leveldb files, which may be of smaller size. Such corruption may not force leveldb to abort, according to the intermittent test failures.
Fix the intermittent test failures by reverting 5ab6419 .