Skip to content

Conversation

maflcko
Copy link
Member

@maflcko maflcko commented Oct 18, 2021

The most frequent failure in functional tests are intermittent races. Fixing such bugs is cumbersome because it involves:

  • Noticing the failure
  • Fetching and reading the log to determine the test case that failed
  • Adding a self.sync_all() where it was forgotten
  • Spamming out a pr and waiting for review, which is already sparse

Also, writing a linter to catch those is not possible, nor is review effective in finding these bugs prior to merge.

Fix all future intermittent races caused by a missing sync_block call by calling sync_all implicitly after each generate*, unless opted out. This ensures that the code is race-free (with regards to blocks) when the tests pass once, instead of our current approach where the code can never be guaranteed to be race-free.

There are some scripted-diff cleanups (see #22567), but they will be submitted in a follow-up to reduce the conflicts in this pull.

@practicalswift
Copy link
Contributor

Concept ACK

@DrahtBot DrahtBot added the Tests label Oct 18, 2021
@DrahtBot
Copy link
Contributor

DrahtBot commented Oct 18, 2021

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #23371 (test: MiniWallet: add P2TR support and use it per default by theStack)
  • #23075 (refactoring: Fee estimation functional test cleanups by darosior)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

def no_op(self):
pass

def generate(self, generator, *args, sync_fun=None, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest making it a required parameter, otherwise there's a high risk of backports breaking.

Copy link
Member Author

@maflcko maflcko Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we'll backport any major test changes to 22.x (or earlier) at this point, and hopefully this pull is merged before the next major release, so backports to 23.x aren't an issue.

If reviewers insist that this should be done, I suggest doing it in a follow up as a scripted-diff to make it easy to review and keep the conflicts in this pull low.

@hebasto
Copy link
Member

hebasto commented Nov 8, 2021

Concept ACK.

Copy link
Contributor

@lsilva01 lsilva01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tACK facc352 on Ubuntu 20.04

Copy link
Contributor

@brunoerg brunoerg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tACK facc352 on MacOS 11.6

Nice. It solves the failure I was getting when running test_runner. I think there are some other tests you could apply it like rpc_packages.py.

@maflcko maflcko merged commit 94db963 into bitcoin:master Nov 9, 2021
@maflcko maflcko deleted the 2110-testSyncImp branch November 9, 2021 09:01
@maflcko
Copy link
Member Author

maflcko commented Nov 9, 2021

I think there are some other tests you could apply it like rpc_packages.py.

Can you explain this a bit more? Or maybe create a follow-up yourself?

@hebasto
Copy link
Member

hebasto commented Nov 9, 2021

FWIW, was testing this PR, and with the following patch:

diff --git a/.cirrus.yml b/.cirrus.yml
index f7962e951..93866de06 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -164,7 +164,7 @@ task:
     - netsh int ipv4 set dynamicport tcp start=1025 num=64511
     - netsh int ipv6 set dynamicport tcp start=1025 num=64511
     # Exclude feature_dbcrash for now due to timeout
-    - python test\functional\test_runner.py --nocleanup --ci --quiet --combinedlogslen=4000 --jobs=4 --timeout-factor=8 --failfast --extended --exclude feature_dbcrash
+    - python test\functional\test_runner.py --nocleanup --ci --quiet --combinedlogslen=4000 --jobs=4 --failfast --extended --exclude feature_dbcrash
 
 task:
   name: 'ARM [unit tests, no functional tests] [bullseye]'
diff --git a/ci/test/00_setup_env.sh b/ci/test/00_setup_env.sh
index 8a9d808f5..0685d1f0c 100755
--- a/ci/test/00_setup_env.sh
+++ b/ci/test/00_setup_env.sh
@@ -41,7 +41,7 @@ export RUN_SECURITY_TESTS=${RUN_SECURITY_TESTS:-false}
 # By how much to scale the test_runner timeouts (option --timeout-factor).
 # This is needed because some ci machines have slow CPU or disk, so sanitizers
 # might be slow or a reindex might be waiting on disk IO.
-export TEST_RUNNER_TIMEOUT_FACTOR=${TEST_RUNNER_TIMEOUT_FACTOR:-40}
+export TEST_RUNNER_TIMEOUT_FACTOR=${TEST_RUNNER_TIMEOUT_FACTOR:-1}
 export TEST_RUNNER_ENV=${TEST_RUNNER_ENV:-}
 export RUN_FUZZ_TESTS=${RUN_FUZZ_TESTS:-false}
 export EXPECTED_TESTS_DURATION_IN_SECONDS=${EXPECTED_TESTS_DURATION_IN_SECONDS:-1000}

all tasks, except for the native Windows one, pass.

In the Windows task two tests fail due timeouts:

feature_pruning.py                                 | ✖ Failed  | 1381 s
rpc_misc.py                                        | ✖ Failed  | 63 s

@maflcko
Copy link
Member Author

maflcko commented Nov 9, 2021

Thanks for testing. Do you have logs for the two timeouts?

@hebasto
Copy link
Member

hebasto commented Nov 9, 2021

Thanks for testing. Do you have logs for the two timeouts?

https://api.cirrus-ci.com/v1/task/4512194625798144/logs/functional_tests.log

@maflcko
Copy link
Member Author

maflcko commented Nov 9, 2021

The pruning one has a reorg and a call to verifychain, which may take a long time, so this is expected:

                                   test_framework.authproxy.JSONRPCException: 'verifychain' RPC took longer than 60.000000 seconds. Consider using larger timeout for calls that take longer to return. (-344)

The rpc_misc one looks like an unrelated issue.

@hebasto
Copy link
Member

hebasto commented Nov 9, 2021

The rpc_misc one looks like an unrelated issue.

Could be easy tuned:

@@ -84,7 +84,7 @@ class RpcMiscTest(BitcoinTestFramework):
 
         # Restart the node with indices and wait for them to sync
         self.restart_node(0, ["-txindex", "-blockfilterindex", "-coinstatsindex"])
-        self.wait_until(lambda: all(i["synced"] for i in node.getindexinfo().values()))
+        self.wait_until(lambda: all(i["synced"] for i in node.getindexinfo().values()), timeout=120)
 
         # Returns a list of all running indices by default
         values = {"synced": True, "best_block_height": 200}

@maflcko
Copy link
Member Author

maflcko commented Nov 9, 2021

The sync threads all exit within a second, so I don't understand why raising the timeout would fix this issue.

@hebasto
Copy link
Member

hebasto commented Nov 9, 2021

@MarcoFalke

The sync threads all exit within a second...

That is not the case in CI environment. See my demo branch, and its CI log:

2021-11-09T11:28:10.405000Z TestFramework (INFO): Total time taken by `wait_until` call: 52.64929151535034 seconds.

@maflcko
Copy link
Member Author

maflcko commented Nov 9, 2021

Oh you are right.

Any idea why coinstatsindex takes more than a minute on windows, but finishes in less than a second on linux?

@hebasto
Copy link
Member

hebasto commented Nov 9, 2021

Any idea why coinstatsindex takes more than a minute on windows, but finishes in less than a second on linux?

Have no idea.

Locally, on Windows 10 Pro 21H1, it takes about 30 seconds.

sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Nov 9, 2021
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Nov 9, 2021
maflcko pushed a commit to bitcoin-core/gui that referenced this pull request Nov 15, 2021
…on native Windows CI task

23c3dcb ci: Drop --failfast in functional tests on native Windows CI task (Hennadii Stepanov)

Pull request description:

  As it was [discussed](bitcoin/bitcoin#22980 (comment)) in bitcoin/bitcoin#22980:
  > seeing _all_ of the failed tests can give us a useful hint for debugging (was thinking about that while reviewing and testing bitcoin/bitcoin#23300).

  There is a [concern](bitcoin/bitcoin#22980 (comment)) about such approach:
  > If there is a CI failure, it will be good to know the result as early as possible after opening the pull request.

  But, [OTOH](bitcoin/bitcoin#22980 (comment)):
  > the average amount of saved time for such an approach [using `--failfast`] is less significant than it could appear.

ACKs for top commit:
  MarcoFalke:
    cr ACK 23c3dcb seems fine to give this a try

Tree-SHA512: d28f5712b4edfdbcef48b0633017da9172cef1835bcea51eaeeabf15c133f6bb6888227afc130279c3899365a4fd0f200fb9b0c4cb1ff80f21cbc766b8907764
@bitcoin bitcoin locked and limited conversation to collaborators Nov 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants