Skip to content

Conversation

theStack
Copy link
Contributor

@theStack theStack commented Jan 26, 2021

Fixes #20934 by using the "sync up" method described in #20538 (comment).

After improving robustness with this approach (commits 1-3), it turned out that there were still some fails, but those were unrelated to zmq: Out of 500 runs, 3 times sync_mempool() or sync_blocks() timed out, which can happen because the trickle relay time has no upper bound -- hence in rare cases, it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side-effect also gives us a rough 2x speedup for the test.

For further details, also see the explanations in the commit messages.

There is no guarantee that the test is still not flaky, but it would help if potential reviewers would run the following script locally and report how many runs failed (feel free to do less than 1000 runs, as this takes quite a long if ran with --valgrind):

#!/bin/sh
OUTPUT_FILE=./zmq_results
echo ===== repeated zmq test ===== > $OUTPUT_FILE

for i in `seq 1000`; do
    echo ------------------------
    echo ----- test run $i -----
    echo ------------------------
    echo --- $i --- >> $OUTPUT_FILE
    ./test/functional/interface_zmq.py --valgrind
    if [ $? -ne 0 ]; then
        echo "FAILED. /o\\" >> $OUTPUT_FILE
    else
        echo "PASSED. \\o/" >> $OUTPUT_FILE
    fi
done

echo Failed test runs:
grep FAILED $OUTPUT_FILE | wc -l

@fanquake fanquake added the Tests label Jan 26, 2021
@DrahtBot
Copy link
Contributor

DrahtBot commented Jan 26, 2021

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

@theStack
Copy link
Contributor Author

Asked to mark this as draft since the test now fails on two cirrus instances. Have to investigate deeper what the problem is and how to properly reproduce it... locally 1000 test runs passed successfully.

@maflcko
Copy link
Member

maflcko commented Jan 26, 2021

You might want to try with --valgrind, which makes races more likely to happen locally

@theStack theStack force-pushed the 2021-test-zmq-fix_flakiness branch from fd93eea to c02f9a1 Compare January 27, 2021 18:19
@theStack
Copy link
Contributor Author

theStack commented Jan 27, 2021

@MarcoFalke: Thanks, that helped a lot! On master, running via --valgrind leads to a failed test run quite quickly on my machine.

The PR is ready for review now. The problem of my original approach was that most tests needed nodes 0 and 1 to have the same tip. So, after this robust "sync up" setup approach of repeatedly generating blocks, node 1 has to catch up. Added a parameter "sync_blocks" for that purpose, as for the last test, the chains are already different and synchronization is not possible in a trivial way. Now, I had locally a few hundred test runs with --valgrind and all of them passed.
Also adapted the PR description to include a script, that reviewers can use to test the robustness.

@theStack theStack marked this pull request as ready for review January 27, 2021 19:54
@jonatack
Copy link
Member

jonatack commented Jan 28, 2021

Concept ACK, will be great to robustify this test.

So far your script has run the test 25 times without errors for me (other than an unrelated issue that the test runner --valgrind option raises for me in general (not just on this test), but valgrind test/functional/interface_zmq.py works fine).

------------------------
----- test run 25 -----
------------------------
==42527== Memcheck, a memory error detector
==42527== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==42527== Using Valgrind-3.16.0.GIT and LibVEX; rerun with -h for copyright info
==42527== Command: test/functional/interface_zmq.py
==42527== 
2021-01-28T18:10:51.208000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_i65ciqkh
2021-01-28T18:10:54.197000Z TestFramework (INFO): Generate 5 blocks (and 5 coinbase txes)
2021-01-28T18:10:55.327000Z TestFramework (INFO): Wait for tx from second node
2021-01-28T18:10:56.502000Z TestFramework (INFO): Test the getzmqnotifications RPC
2021-01-28T18:10:56.505000Z TestFramework (INFO): Testing 'sequence' publisher
2021-01-28T18:10:57.900000Z TestFramework (INFO): Wait for tx from second node
2021-01-28T18:10:59.026000Z TestFramework (INFO): Testing sequence notifications with mempool sequence values
2021-01-28T18:10:59.027000Z TestFramework (INFO): Testing RBF notification
2021-01-28T18:11:26.168000Z TestFramework (INFO): Testing reorg notifications
2021-01-28T18:11:29.256000Z TestFramework (INFO): Evict mempool transaction by block conflict
2021-01-28T18:11:30.558000Z TestFramework (INFO): Testing 'mempool sync' usage of sequence notifier
2021-01-28T18:11:53.212000Z TestFramework (INFO): Stopping nodes
2021-01-28T18:11:53.469000Z TestFramework (INFO): Cleaning up /tmp/bitcoin_func_test_i65ciqkh on exit
2021-01-28T18:11:53.470000Z TestFramework (INFO): Tests successful
------------------------
----- test run 26 -----
------------------------
unrelated error with --valgrind test runner option

------------------------
----- test run 1 -----
------------------------
2021-01-28T17:22:07.128000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_awm_44bl
2021-01-28T17:22:34.997000Z TestFramework (ERROR): Unexpected exception caught during testing
Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 108, in _request
    return self._get_response()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 168, in _get_response
    http_response = self.__conn.getresponse()
  File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
    response.begin()
  File "/usr/lib/python3.9/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.9/http/client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 125, in main
    self.setup()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 262, in setup
    self.setup_network()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 356, in setup_network
    self.setup_nodes()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 383, in setup_nodes
    self.import_deterministic_coinbase_privkeys()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 400, in import_deterministic_coinbase_privkeys
    self.init_wallet(i)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 407, in init_wallet
    n.createwallet(wallet_name=wallet_name, descriptors=self.options.descriptors, load_on_startup=True)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_node.py", line 676, in createwallet
    return self.__getattr__('createwallet')(wallet_name, disable_private_keys, blank, passphrase, avoid_reuse, descriptors, load_on_startup)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 144, in __call__
    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 113, in _request
    self.__conn.request(method, path, postdata, headers)
  File "/usr/lib/python3.9/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.9/http/client.py", line 950, in send
    self.connect()
  File "/usr/lib/python3.9/http/client.py", line 921, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.9/socket.py", line 843, in create_connection
    raise err
  File "/usr/lib/python3.9/socket.py", line 831, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
2021-01-28T17:22:35.049000Z TestFramework (INFO): Stopping nodes
2021-01-28T17:22:35.050000Z TestFramework.node0 (ERROR): Unable to stop node.
Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_node.py", line 320, in stop_node
    self.stop(wait=wait)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 144, in __call__
    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 107, in _request
    self.__conn.request(method, path, postdata, headers)
  File "/usr/lib/python3.9/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1266, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.9/http/client.py", line 1092, in putrequest
    raise CannotSendRequest(self.__state)
http.client.CannotSendRequest: Request-sent
Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/./test/functional/interface_zmq.py", line 529, in <module>
    ZMQTest().main()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 149, in main
    exit_code = self.shutdown()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 278, in shutdown
    self.stop_nodes()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 525, in stop_nodes
    node.stop_node(wait=wait, wait_until_stopped=False)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_node.py", line 334, in stop_node
    raise AssertionError("Unexpected stderr {} != {}".format(stderr, expected_stderr))
AssertionError: Unexpected stderr ==37264== Thread 10 b-httpworker.3:
==37264== Conditional jump or move depends on uninitialised value(s)
==37264==    at 0xB0DEC5: __log_putr.isra.2 (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB0F090: __log_put (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB99C1D: __crdel_metasub_log (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAD526A: __db_log_page (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB4068A: __bam_new_subdb (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAE7A6D: __db_init_subdb (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB044ED: __fop_subdb_setup (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAE6EB0: __db_open (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAE1719: __db_open_pp (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAC1D3A: Db::open(DbTxn*, char const*, char const*, DBTYPE, unsigned int, int) (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0x873618: BerkeleyDatabase::Open() (bdb.cpp:345)
==37264==    by 0x873184: BerkeleyBatch::BerkeleyBatch(BerkeleyDatabase&, bool, bool) (bdb.cpp:308)
==37264== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:__log_putr.isra.2
   fun:__log_put
   fun:__crdel_metasub_log
   fun:__db_log_page
   fun:__bam_new_subdb
   fun:__db_init_subdb
   fun:__fop_subdb_setup
   fun:__db_open
   fun:__db_open_pp
   fun:_ZN2Db4openEP5DbTxnPKcS3_6DBTYPEji
   fun:_ZN16BerkeleyDatabase4OpenEv
   fun:_ZN13BerkeleyBatchC1ER16BerkeleyDatabasebb
}
==37264== 
==37264== Exit program on first error (--exit-on-first-error=yes) != 
[node 1] Cleaning up leftover process
[node 0] Cleaning up leftover process

Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost-ACK. I was unable to make the test fail with valgrind both here and on master. The test does run ~2x faster on this branch, which is great.

@theStack theStack force-pushed the 2021-test-zmq-fix_flakiness branch from c02f9a1 to 9a9453d Compare January 29, 2021 01:14
@theStack
Copy link
Contributor Author

Force-pushed with changes suggested by @jonatack (#21008 (comment) and #21008 (comment)).

Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 9a9453d

Might be good if @instagibbs had a look.

(IDK if he prefers instagibbs or Gregory Sanders for the commit credit)

The ZMQSubscriber reception methods currently assert that the first
received publisher message has a sequence number of zero. In order to
fix the current test flakiness via "syncing up" to nodes in the setup
phase, we have to cope with the situation that messages get lost and the
first actual received message has a sequence number larger than zero.
After connecting the subscriber sockets to the node, there is no
guarantee that the node's zmq publisher interfaces are ready yet, which
means that potentially the first expected notification messages could
get lost and the test fails. Currently this is handled by just waiting
for a short period of time (200ms), which works most of the time but is
still problematic, as in some rare cases the setup time takes much
longer, even in the range of multiple seconds.

The solution in this commit approaches the problem by using a more
robust method of syncing up, originally proposed by instagibbs:
    1. Generate a block on the node
    2. Try to receive a notification on all subscribers
    3. If all subscribers get a message within the timeout (1 second),
       we are done, otherwise repeat starting from step 1
Speeds up the zmq test roughly by a factor of 2x (~20 sec. instead of
~40 sec.) and also avoids timeouts on the synchronization methods
(sync_mempool() / sync_blocks()) that happened with a slight chance.
This is due to the fact that there is no upper bound on the trickle
relay time, so even the default of 60s is sometimes too low. Fixed by
enabling immediate tx relay on node1.
@theStack theStack force-pushed the 2021-test-zmq-fix_flakiness branch from 9a9453d to ef21fb7 Compare February 9, 2021 22:57
@theStack
Copy link
Contributor Author

theStack commented Feb 9, 2021

Force-pushed with a rebase on master and suggestions by jonatack (#21008 (comment) and #21008 (comment)).

Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Light ACK ef21fb7 with the caveat that I was unable to make the test fail with valgrind both here and on master, so I can't vouch that it actually fixes the CI flakiness. The test does run ~2x faster with this.

Thanks for adding the comment. It would be good for the tests to not be order-dependent.

@maflcko maflcko merged commit 3c9d9d2 into bitcoin:master Feb 16, 2021
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Feb 16, 2021
ef21fb7 zmq test: speedup test by whitelisting peers (immediate tx relay) (Sebastian Falbesoner)
5c65463 zmq test: fix flakiness by using more robust sync method (Sebastian Falbesoner)
8666033 zmq test: accept arbitrary sequence start number in ZMQSubscriber (Sebastian Falbesoner)
6014d6e zmq test: dedup message reception handling in ZMQSubscriber (Sebastian Falbesoner)

Pull request description:

  Fixes bitcoin#20934 by using the "sync up" method described in bitcoin#20538 (comment).

  After improving robustness with this approach (commits 1-3), it turned out that there were still some fails, but those were unrelated to zmq: Out of 500 runs, 3 times `sync_mempool()` or `sync_blocks()` timed out, which can happen because the trickle relay time has no upper bound -- hence in rare cases, it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side-effect also gives us a rough 2x speedup for the test.

  For further details, also see the explanations in the commit messages.

  There is no guarantee that the test is still not flaky, but it would help if potential reviewers would run the following script locally and report how many runs failed (feel free to do less than 1000 runs, as this takes quite a long if ran with `--valgrind`):
  ```
  #!/bin/sh
  OUTPUT_FILE=./zmq_results
  echo ===== repeated zmq test ===== > $OUTPUT_FILE

  for i in `seq 1000`; do
      echo ------------------------
      echo ----- test run $i -----
      echo ------------------------
      echo --- $i --- >> $OUTPUT_FILE
      ./test/functional/interface_zmq.py --valgrind
      if [ $? -ne 0 ]; then
          echo "FAILED. /o\\" >> $OUTPUT_FILE
      else
          echo "PASSED. \\o/" >> $OUTPUT_FILE
      fi
  done

  echo Failed test runs:
  grep FAILED $OUTPUT_FILE | wc -l
  ```

ACKs for top commit:
  jonatack:
    Light ACK ef21fb7 with the caveat that I was unable to make the test fail with valgrind both here and on master, so I can't vouch that it actually fixes the CI flakiness. The test does run ~2x faster with this.

Tree-SHA512: 7a1e7592fbbd98e69e1e1294486b91253e589c72b3c6bbb7f587028ec07cca59b7d984e4ebf256c4bc3e8a529ec77d31842f3dd874038aea0b684abfea50306a
maflcko pushed a commit that referenced this pull request Mar 2, 2021
…ated block

8a8c638 zmq test: fix sync-up by matching notification to generated block (Sebastian Falbesoner)

Pull request description:

  This is a follow-up PR for #21008, fixes #21216.

  In the course of investigating the problem with jnewbery (analyzing the Cirrus log https://cirrus-ci.com/task/4660108304056320), it turned out that the "sync up" procedure of repeatedly generating a block and waiting for a notification with timeout is too brittle in its current form, as the following scenario could happen:

  - generate block A
  - receive notification, timeout happens => repeat procedure
  - generate block B
  - node publishes block A notification
  - receive notification, we receive the one caused by block A (!!!) => sync-up procedure is completed
  - node publishes block B notification
  - the actual test starts
  - on the first notification reception, the one caused by block B is received, rather than the one actually caused by test code => assertion failure

  This change in the PR ensures that after each test block generation, we wait for the notification that is actually caused by that block and ignore others from possibly earlier blocks. The matching is kind of ugly, it assumes that one out of four components in the block is contained in the notification: the block hash, the tx id, the raw block data or the raw transaction data. (Unfortunately we have to support all publisher topics.)

  I'm aware that this is quite a lot of code now only for establishing a robust test setup. OTOH I wouldn't know of a better method right now, suggestions are very welcome.

  Note for potential reviewers: for both reproducing the issue on master branch and verifying on PR branch, one can simply generate two blocks in the sync-up procedure rather than one.

ACKs for top commit:
  MarcoFalke:
    Concept ACK 8a8c638

Tree-SHA512: a2eb78ba06dfd0fda7b1c111b6bbfb5dab4ab08500cc19c7ea02c3239495d5c74cc7d45250a8b3ecc78ca42d97ee6719bf73db8a137839e5e09a3cfcf08ed29e
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Nov 30, 2021
Summary:
PR description:
> Fixes [[bitcoin/bitcoin#20934 | core#20934]] by using the "sync up" method described in [[bitcoin/bitcoin#20538 | core#20538]].
>
> After improving robustness with this approach (commits 1-3), it turned out that there were still some fails, but those were unrelated to zmq: Out of 500 runs, 3 times sync_mempool() or sync_blocks() timed out, which can happen because the trickle relay time has no upper bound -- hence in rare cases, it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side-effect also gives us a rough 2x speedup for the test.
>
> For further details, also see the explanations in the commit messages.

Note that the changes in commit 4, the speedup from 45s to 14s runtime  using `noban` permissions, were already applied in Bitcoin ABC in D10311.

This is a backport of [[bitcoin/bitcoin#21008 | core#21008]] [1/3]
bitcoin/bitcoin@6014d6e

Depends on D10588

Test Plan: `ninja check-functional`

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Differential Revision: https://reviews.bitcoinabc.org/D10589
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Nov 30, 2021
Summary:
The ZMQSubscriber reception methods currently assert that the first
received publisher message has a sequence number of zero. In order to
fix the current test flakiness via "syncing up" to nodes in the setup
phase, we have to cope with the situation that messages get lost and the
first actual received message has a sequence number larger than zero.

This is a backport of [[bitcoin/bitcoin#21008 | core#21008]] [2/3]
bitcoin/bitcoin@8666033

Depends on D10589

Test Plan: `ninja check-functional`

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Differential Revision: https://reviews.bitcoinabc.org/D10590
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Nov 30, 2021
Summary:
After connecting the subscriber sockets to the node, there is no
guarantee that the node's zmq publisher interfaces are ready yet, which
means that potentially the first expected notification messages could
get lost and the test fails. Currently this is handled by just waiting
for a short period of time (200ms), which works most of the time but is
still problematic, as in some rare cases the setup time takes much
longer, even in the range of multiple seconds.

The solution in this commit approaches the problem by using a more
robust method of syncing up, originally proposed by instagibbs:
    1. Generate a block on the node
    2. Try to receive a notification on all subscribers
    3. If all subscribers get a message within the timeout (1 second),
       we are done, otherwise repeat starting from step 1

This is a backport of [[bitcoin/bitcoin#21008 | core#21008]] [3/3]
bitcoin/bitcoin@5c65463

Depends on D10590

Test Plan: `ninja check-functional`

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Differential Revision: https://reviews.bitcoinabc.org/D10591
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix zmq test flakiness
5 participants