-
Notifications
You must be signed in to change notification settings - Fork 37.7k
test: fix zmq test flakiness, improve speed #21008
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsNo conflicts as of last run. |
Asked to mark this as draft since the test now fails on two cirrus instances. Have to investigate deeper what the problem is and how to properly reproduce it... locally 1000 test runs passed successfully. |
You might want to try with |
fd93eea
to
c02f9a1
Compare
@MarcoFalke: Thanks, that helped a lot! On master, running via The PR is ready for review now. The problem of my original approach was that most tests needed nodes 0 and 1 to have the same tip. So, after this robust "sync up" setup approach of repeatedly generating blocks, node 1 has to catch up. Added a parameter "sync_blocks" for that purpose, as for the last test, the chains are already different and synchronization is not possible in a trivial way. Now, I had locally a few hundred test runs with |
Concept ACK, will be great to robustify this test. So far your script has run the test 25 times without errors for me (other than an unrelated issue that the test runner
unrelated error with --valgrind test runner option
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost-ACK. I was unable to make the test fail with valgrind both here and on master. The test does run ~2x faster on this branch, which is great.
c02f9a1
to
9a9453d
Compare
Force-pushed with changes suggested by @jonatack (#21008 (comment) and #21008 (comment)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 9a9453d
Might be good if @instagibbs had a look.
(IDK if he prefers instagibbs or Gregory Sanders for the commit credit)
The ZMQSubscriber reception methods currently assert that the first received publisher message has a sequence number of zero. In order to fix the current test flakiness via "syncing up" to nodes in the setup phase, we have to cope with the situation that messages get lost and the first actual received message has a sequence number larger than zero.
After connecting the subscriber sockets to the node, there is no guarantee that the node's zmq publisher interfaces are ready yet, which means that potentially the first expected notification messages could get lost and the test fails. Currently this is handled by just waiting for a short period of time (200ms), which works most of the time but is still problematic, as in some rare cases the setup time takes much longer, even in the range of multiple seconds. The solution in this commit approaches the problem by using a more robust method of syncing up, originally proposed by instagibbs: 1. Generate a block on the node 2. Try to receive a notification on all subscribers 3. If all subscribers get a message within the timeout (1 second), we are done, otherwise repeat starting from step 1
Speeds up the zmq test roughly by a factor of 2x (~20 sec. instead of ~40 sec.) and also avoids timeouts on the synchronization methods (sync_mempool() / sync_blocks()) that happened with a slight chance. This is due to the fact that there is no upper bound on the trickle relay time, so even the default of 60s is sometimes too low. Fixed by enabling immediate tx relay on node1.
9a9453d
to
ef21fb7
Compare
Force-pushed with a rebase on master and suggestions by jonatack (#21008 (comment) and #21008 (comment)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Light ACK ef21fb7 with the caveat that I was unable to make the test fail with valgrind both here and on master, so I can't vouch that it actually fixes the CI flakiness. The test does run ~2x faster with this.
Thanks for adding the comment. It would be good for the tests to not be order-dependent.
ef21fb7 zmq test: speedup test by whitelisting peers (immediate tx relay) (Sebastian Falbesoner) 5c65463 zmq test: fix flakiness by using more robust sync method (Sebastian Falbesoner) 8666033 zmq test: accept arbitrary sequence start number in ZMQSubscriber (Sebastian Falbesoner) 6014d6e zmq test: dedup message reception handling in ZMQSubscriber (Sebastian Falbesoner) Pull request description: Fixes bitcoin#20934 by using the "sync up" method described in bitcoin#20538 (comment). After improving robustness with this approach (commits 1-3), it turned out that there were still some fails, but those were unrelated to zmq: Out of 500 runs, 3 times `sync_mempool()` or `sync_blocks()` timed out, which can happen because the trickle relay time has no upper bound -- hence in rare cases, it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side-effect also gives us a rough 2x speedup for the test. For further details, also see the explanations in the commit messages. There is no guarantee that the test is still not flaky, but it would help if potential reviewers would run the following script locally and report how many runs failed (feel free to do less than 1000 runs, as this takes quite a long if ran with `--valgrind`): ``` #!/bin/sh OUTPUT_FILE=./zmq_results echo ===== repeated zmq test ===== > $OUTPUT_FILE for i in `seq 1000`; do echo ------------------------ echo ----- test run $i ----- echo ------------------------ echo --- $i --- >> $OUTPUT_FILE ./test/functional/interface_zmq.py --valgrind if [ $? -ne 0 ]; then echo "FAILED. /o\\" >> $OUTPUT_FILE else echo "PASSED. \\o/" >> $OUTPUT_FILE fi done echo Failed test runs: grep FAILED $OUTPUT_FILE | wc -l ``` ACKs for top commit: jonatack: Light ACK ef21fb7 with the caveat that I was unable to make the test fail with valgrind both here and on master, so I can't vouch that it actually fixes the CI flakiness. The test does run ~2x faster with this. Tree-SHA512: 7a1e7592fbbd98e69e1e1294486b91253e589c72b3c6bbb7f587028ec07cca59b7d984e4ebf256c4bc3e8a529ec77d31842f3dd874038aea0b684abfea50306a
…ated block 8a8c638 zmq test: fix sync-up by matching notification to generated block (Sebastian Falbesoner) Pull request description: This is a follow-up PR for #21008, fixes #21216. In the course of investigating the problem with jnewbery (analyzing the Cirrus log https://cirrus-ci.com/task/4660108304056320), it turned out that the "sync up" procedure of repeatedly generating a block and waiting for a notification with timeout is too brittle in its current form, as the following scenario could happen: - generate block A - receive notification, timeout happens => repeat procedure - generate block B - node publishes block A notification - receive notification, we receive the one caused by block A (!!!) => sync-up procedure is completed - node publishes block B notification - the actual test starts - on the first notification reception, the one caused by block B is received, rather than the one actually caused by test code => assertion failure This change in the PR ensures that after each test block generation, we wait for the notification that is actually caused by that block and ignore others from possibly earlier blocks. The matching is kind of ugly, it assumes that one out of four components in the block is contained in the notification: the block hash, the tx id, the raw block data or the raw transaction data. (Unfortunately we have to support all publisher topics.) I'm aware that this is quite a lot of code now only for establishing a robust test setup. OTOH I wouldn't know of a better method right now, suggestions are very welcome. Note for potential reviewers: for both reproducing the issue on master branch and verifying on PR branch, one can simply generate two blocks in the sync-up procedure rather than one. ACKs for top commit: MarcoFalke: Concept ACK 8a8c638 Tree-SHA512: a2eb78ba06dfd0fda7b1c111b6bbfb5dab4ab08500cc19c7ea02c3239495d5c74cc7d45250a8b3ecc78ca42d97ee6719bf73db8a137839e5e09a3cfcf08ed29e
Summary: PR description: > Fixes [[bitcoin/bitcoin#20934 | core#20934]] by using the "sync up" method described in [[bitcoin/bitcoin#20538 | core#20538]]. > > After improving robustness with this approach (commits 1-3), it turned out that there were still some fails, but those were unrelated to zmq: Out of 500 runs, 3 times sync_mempool() or sync_blocks() timed out, which can happen because the trickle relay time has no upper bound -- hence in rare cases, it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side-effect also gives us a rough 2x speedup for the test. > > For further details, also see the explanations in the commit messages. Note that the changes in commit 4, the speedup from 45s to 14s runtime using `noban` permissions, were already applied in Bitcoin ABC in D10311. This is a backport of [[bitcoin/bitcoin#21008 | core#21008]] [1/3] bitcoin/bitcoin@6014d6e Depends on D10588 Test Plan: `ninja check-functional` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D10589
Summary: The ZMQSubscriber reception methods currently assert that the first received publisher message has a sequence number of zero. In order to fix the current test flakiness via "syncing up" to nodes in the setup phase, we have to cope with the situation that messages get lost and the first actual received message has a sequence number larger than zero. This is a backport of [[bitcoin/bitcoin#21008 | core#21008]] [2/3] bitcoin/bitcoin@8666033 Depends on D10589 Test Plan: `ninja check-functional` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D10590
Summary: After connecting the subscriber sockets to the node, there is no guarantee that the node's zmq publisher interfaces are ready yet, which means that potentially the first expected notification messages could get lost and the test fails. Currently this is handled by just waiting for a short period of time (200ms), which works most of the time but is still problematic, as in some rare cases the setup time takes much longer, even in the range of multiple seconds. The solution in this commit approaches the problem by using a more robust method of syncing up, originally proposed by instagibbs: 1. Generate a block on the node 2. Try to receive a notification on all subscribers 3. If all subscribers get a message within the timeout (1 second), we are done, otherwise repeat starting from step 1 This is a backport of [[bitcoin/bitcoin#21008 | core#21008]] [3/3] bitcoin/bitcoin@5c65463 Depends on D10590 Test Plan: `ninja check-functional` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D10591
Fixes #20934 by using the "sync up" method described in #20538 (comment).
After improving robustness with this approach (commits 1-3), it turned out that there were still some fails, but those were unrelated to zmq: Out of 500 runs, 3 times
sync_mempool()
orsync_blocks()
timed out, which can happen because the trickle relay time has no upper bound -- hence in rare cases, it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side-effect also gives us a rough 2x speedup for the test.For further details, also see the explanations in the commit messages.
There is no guarantee that the test is still not flaky, but it would help if potential reviewers would run the following script locally and report how many runs failed (feel free to do less than 1000 runs, as this takes quite a long if ran with
--valgrind
):