Skip to content

Conversation

maflcko
Copy link
Member

@maflcko maflcko commented Jun 9, 2024

Currently the sync in connect_nodes mentions the version and verack message types, but only checks the verack. Neither check is required, as the pong check implies both. In case of failure, the debug log will have to be consulted anyway, so the redundant check doesn't add value.

Also clarify in the comments that the goal is to check the flag fSuccessfullyConnected indirectly.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jun 9, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK furszy, brunoerg, tdb3

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Copy link
Contributor

@tdb3 tdb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK fa4f453
Makes sense. Version handshake occurs before ping/pong.

As a sanity check, ran functional tests to see the tests continue to function properly (all passed). Also viewed node traffic between two regtest nodes, and saw ping/pong occur after version/verack exchange.

@maflcko maflcko force-pushed the 2406-test-connect-sync branch from fa4f453 to fa66277 Compare June 9, 2024 15:17
Copy link
Contributor

@tdb3 tdb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK fa66277

Copy link
Member

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code ACK fa66277 with a comment.

@maflcko maflcko force-pushed the 2406-test-connect-sync branch from fa66277 to 0000276 Compare June 10, 2024 05:58
Copy link
Member

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK 0000276

@DrahtBot DrahtBot requested a review from tdb3 June 10, 2024 12:53
Copy link
Contributor

@brunoerg brunoerg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 0000276

Copy link
Contributor

@tdb3 tdb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 0000276

@fanquake fanquake merged commit 0fbb804 into bitcoin:master Jun 11, 2024
@maflcko maflcko deleted the 2406-test-connect-sync branch June 12, 2024 09:34
panleone pushed a commit to panleone/PIVX that referenced this pull request Nov 11, 2024
0000276 test: Remove redundant verack check (MarcoFalke)

Pull request description:

  Currently the sync in `connect_nodes` mentions the `version` and `verack` message types, but only checks the `verack`. Neither check is required, as the `pong` check implies both. In case of failure, the debug log will have to be consulted anyway, so the redundant check doesn't add value.

  Also clarify in the comments that the goal is to check the flag `fSuccessfullyConnected` indirectly.

ACKs for top commit:
  furszy:
    utACK 0000276
  brunoerg:
    ACK 0000276
  tdb3:
    ACK 0000276

Tree-SHA512: f9ddcb1436d2f70da462a8dd470ecfc90a534dd6507c23877ef7626e7c02326c077001a42ad0171a87fba5c5275d1970d8c5e5d82c56c8412de944856fdfd6db
Fuzzbawls added a commit to PIVX-Project/PIVX that referenced this pull request Nov 15, 2024
5f8c06e test: Add possibility to skip awaiting for the connection (Alessandro Rezzi)
5374229 Merge bitcoin#30252: test: Remove redundant verack check (merge-script)
e7c21c7 Merge bitcoin#30118: test: improve robustness of connect_nodes() (Ava Chow)
9beef8d Merge bitcoin#26854: test: Fix intermittent timeout in p2p_permissions.py (MarcoFalke)
b3a3d78 Merge bitcoin#25443: test: Fail if connect_nodes fails (laanwj)
72709b9 test: Avoid connecting a peer to himself (Alessandro Rezzi)
77697b8 test: Do not connect the nodes in parallel (Alessandro Rezzi)
17c065d test: Avoid race after connect_nodes (MarcoFalke)
d73ac82 test: refactor connect_nodes and disconnect_nodes to take two indices instead of index + node (Alessandro Rezzi)
e19d72f Merge bitcoin#18866: test: Fix verack race to avoid intermittent test failures (MarcoFalke)

Pull request description:

  This PR solves the failures of `wallet_listtransactions.py`, like the one of  https://github.com/PIVX-Project/PIVX/actions/runs/11705208154/job/32601252220?pr=2947.

  They happen due to mempool sync timeout:
  ```
  2024-11-06T14:58:03.8489793Z AssertionError: Mempool sync timed out after 60s:
  2024-11-06T14:58:03.8490785Z   {'7364836e7eae24a75378b920373e303b99b4ff18db758defc4c057d784a43905'}
  ```

  The issue is that the connection between nodes is established after the transaction is sent, as we can see from the logs:
  ```
  2024-11-06T14:58:03.9085473Z 2024-11-06T14:57:02Z (mocktime: 2019-10-31T18:21:20Z) Relaying wtx 7364836e7eae24a75378b920373e303b99b4ff18db758defc4c057d784a43905
  ...
  2024-11-06T14:58:03.9103287Z 2024-11-06T14:57:02Z (mocktime: 2019-10-31T18:21:20Z) New outbound peer connected: version: 70927, blocks=200, peer=0
  ```

  Hence the newly connected node will never receive the transaction and the mempool will never be synced.

  This bug is fixed by ensuring that `connect_nodes` actually wait for the connection to be established. As a consequence of those checks we cannot anymore connect nodes in parallel in `connect_nodes_clique` (which will make tests run slightly slower)

ACKs for top commit: 5f8c06e
  Fuzzbawls:
    utACK 5f8c06e
  Duddino:
    utACK 5f8c06e
  Liquid369:
    tACK 5f8c06e

Tree-SHA512: 88007d7302f3b7c3c5b9d446e7d8acc959cd03b0bb27409b1633cb86c57905a4814be148e2e5f6ccaa1da17eccdd44f68f81d00299696d1f47f52f9b12b32ec7
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Feb 11, 2025
Summary:
```
The 'connect_nodes' function in the test framework relies
on a stable number of peer connections to verify the new
connection between the nodes is successfully established.
This approach is fragile, as any of the peers involved in
the process can drop, lose, or create a connection at any
step, causing subsequent 'wait_until' checks to stall
indefinitely even when the peers in question are connected
successfully.

This commit improves the situation by using the nodes' subversion
and the connection direction (inbound/outbound) to identify the
exact peer connection and perform the checks exclusively on it.
```

Backport of [[bitcoin/bitcoin#30118 | core#30118]] and [[bitcoin/bitcoin#30252 | core#30252]].

Test Plan:
  ninja check-functional

Reviewers: #bitcoin_abc, PiRK

Reviewed By: #bitcoin_abc, PiRK

Differential Revision: https://reviews.bitcoinabc.org/D17655
roqqit pushed a commit to doged-io/doged that referenced this pull request May 2, 2025
Summary:
```
The 'connect_nodes' function in the test framework relies
on a stable number of peer connections to verify the new
connection between the nodes is successfully established.
This approach is fragile, as any of the peers involved in
the process can drop, lose, or create a connection at any
step, causing subsequent 'wait_until' checks to stall
indefinitely even when the peers in question are connected
successfully.

This commit improves the situation by using the nodes' subversion
and the connection direction (inbound/outbound) to identify the
exact peer connection and perform the checks exclusively on it.
```

Backport of [[bitcoin/bitcoin#30118 | core#30118]] and [[bitcoin/bitcoin#30252 | core#30252]].

Test Plan:
  ninja check-functional

Reviewers: #bitcoin_abc, PiRK

Reviewed By: #bitcoin_abc, PiRK

Differential Revision: https://reviews.bitcoinabc.org/D17655
knst pushed a commit to knst/dash that referenced this pull request Jun 3, 2025
0000276 test: Remove redundant verack check (MarcoFalke)

Pull request description:

  Currently the sync in `connect_nodes` mentions the `version` and `verack` message types, but only checks the `verack`. Neither check is required, as the `pong` check implies both. In case of failure, the debug log will have to be consulted anyway, so the redundant check doesn't add value.

  Also clarify in the comments that the goal is to check the flag `fSuccessfullyConnected` indirectly.

ACKs for top commit:
  furszy:
    utACK 0000276
  brunoerg:
    ACK 0000276
  tdb3:
    ACK 0000276

Tree-SHA512: f9ddcb1436d2f70da462a8dd470ecfc90a534dd6507c23877ef7626e7c02326c077001a42ad0171a87fba5c5275d1970d8c5e5d82c56c8412de944856fdfd6db
PastaPastaPasta added a commit to dashpay/dash that referenced this pull request Jun 3, 2025
…rack check

1163aec Merge bitcoin#30252: test: Remove redundant verack check (merge-script)

Pull request description:

  ## Issue being fixed or feature implemented
  Functional test p2p_node_network_limited.py is occasionally fails with error while waiting `pong` message:

      2025-06-03T08:29:40.701000Z TestFramework (INFO): Mine enough blocks to reach the NODE_NETWORK_LIMITED range.
      2025-06-03T08:30:40.761000Z TestFramework.utils (ERROR): wait_until() failed. Predicate: ''''
              self.wait_until(lambda: check_bytesrecv(find_conn(to_connection, from_connection_subver, inbound=True), 'pong', 29), sleep=0.05)
      '''
      2025-06-03T08:30:40.762000Z TestFramework (ERROR): Assertion failed
      Traceback (most recent call last):
        File "DASH/test/functional/test_framework/test_framework.py", line 162, in main
          self.run_test()
        File "DASH/test/functional/p2p_node_network_limited.py", line 69, in run_test
          self.connect_nodes(0, 1)
        File "DASH/test/functional/test_framework/test_framework.py", line 736, in connect_nodes
        File "DASH/test/functional/test_framework/test_framework.py", line 903, in wait_until
          # Private helper methods. These should not be accessed by the subclass test scripts.
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "DASH/test/functional/test_framework/util.py", line 287, in wait_until_helper
          raise AssertionError("Predicate {} not true after {} seconds".format(predicate_source, timeout))
      AssertionError: Predicate ''''
              self.wait_until(lambda: check_bytesrecv(find_conn(to_connection, from_connection_subver, inbound=True), 'pong', 29), sleep=0.05)
      ''' not true after 60.0 seconds

  ot it fails while just waiting connection:

      2025-06-03T08:42:35.518000Z TestFramework.utils (ERROR): wait_until() failed. Predicate: ''''
              self.wait_until(lambda: find_conn(from_connection, to_connection_subver, inbound=False) is not None)
      '''
      2025-06-03T08:42:35.518000Z TestFramework (ERROR): Assertion failed
      Traceback (most recent call last):
        File "DASH/test/functional/test_framework/test_framework.py", line 162, in main
          self.run_test()
        File "DASH/test/functional/p2p_node_network_limited.py", line 83, in run_test
          self.connect_nodes(0, 2)
        File "DASH/test/functional/test_framework/test_framework.py", line 726, in connect_nodes
          self.wait_until(lambda: find_conn(from_connection, to_connection_subver, inbound=False) is not None)
        File "DASH/test/functional/test_framework/test_framework.py", line 905, in wait_until
          return wait_until_helper(test_function, timeout=timeout, lock=lock, timeout_factor=self.options.timeout_factor, sleep=sleep, do_assert=do_assert)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "DASH/test/functional/test_framework/util.py", line 287, in wait_until_helper
          raise AssertionError("Predicate {} not true after {} seconds".format(predicate_source, timeout))
      AssertionError: Predicate ''''
              self.wait_until(lambda: find_conn(from_connection, to_connection_subver, inbound=False) is not None)
      ''' not true after 60.0 seconds

  This failure happens, because node 2, once connected to node 0:
   - requests headers
   - receive headers
   - sync headers
   - request blocks
   - got disconnected

  Timeline is here (some lines are omitted)
  ```
   node2 2025-06-03T07:28:25.210493Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: version (148 bytes) peer=0
   node2 2025-06-03T07:28:25.211692Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: sendaddrv2 (0 bytes) peer=0
   node2 2025-06-03T07:28:25.211715Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: verack (0 bytes) peer=0
   node2 2025-06-03T07:28:25.211868Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: getaddr (0 bytes) peer=0
   node2 2025-06-03T07:28:25.211966Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: sendheaders2 (0 bytes) peer=0
   node2 2025-06-03T07:28:25.211989Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: sendcmpct (9 bytes) peer=0
   node2 2025-06-03T07:28:25.212007Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: senddsq (1 bytes) peer=0
   node2 2025-06-03T07:28:25.212026Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: ping (8 bytes) peer=0
   node2 2025-06-03T07:28:25.212063Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: getheaders2 (677 bytes) peer=0
   node2 2025-06-03T07:28:25.221532Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: pong (8 bytes) peer=0
   node2 2025-06-03T07:28:25.221574Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: headers2 (11441 bytes) peer=0
  ...
   node2 2025-06-03T07:28:25.432813Z (mocktime: 1417713337) [   msghand] [validationinterface.cpp:270] [NotifyHeaderTip] [validation] NotifyHeaderTip: accepted block header hash=585e13971c392a7042a9ad56355a2e3912329ffa95c7d42cdf69929b9424248d initial=0
   node2 2025-06-03T07:28:25.432853Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3032] [HeadersDirectFetchBlocks] [net] Large reorg, won't direct fetch to 585e13971c392a7042a9ad56355a2e3912329ffa95c7d42cdf69929b9424248d (292)
   node2 2025-06-03T07:28:25.432894Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:6283] [SendMessages] [net] Requesting block 3de6b559afe72b8d41caaa6789031825de056b175a607b5facac4357c696bff2 (1) peer=0
   node0 2025-06-03T07:28:25.447535Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:3518] [ProcessMessage] [net] received: getdata (577 bytes) peer=2
   node0 2025-06-03T07:28:25.447556Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:4194] [ProcessMessage] [net] received getdata (16 invsz) peer=2
   node0 2025-06-03T07:28:25.447574Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:4197] [ProcessMessage] [net] received getdata for: block 3de6b559afe72b8d41caaa6789031825de056b175a607b5facac4357c696bff2 peer=2
   node0 2025-06-03T07:28:25.447600Z (mocktime: 1417713337) [   msghand] [net_processing.cpp:2568] [ProcessGetBlockData] [net] Ignore block request below NODE_NETWORK_LIMITED threshold, disconnect peer=2
  ```

  All of that happen while `connect_nodes` in python code still validates if nodes already connected and it that would be eventually faster that python code moves to blocks syncing, the test will fail.

  ## What was done?
  This PR helps `connect_nodes` to work faster to increase chance to complete before nodes will be disconnected.

  The race is still here but it happens less often.

   - backport bitcoin#30252

  See also https://github.com/dashpay/dash/pull/6631/files which has further improvements for `connect_nodes` by reducing default sleeps from 0.5 to 0.05

  ## How Has This Been Tested?
  This PR should increase reliability of other functional tests as well, because it's not the only test that I have seen failed on connect_nodes; but p2p_node_network_limited.py only one that I tested.

  To reproduce this issue I run stress benchmark that occupy all cores on my machine and after that run p2p_node_network_limited.py in 40 jobs.

  This PR decreased amount of failures from 25% error rate to just 10%.

  ## Breaking Changes
  N/A

  ## Checklist:
  - [x] I have performed a self-review of my own code
  - [ ] I have commented my code, particularly in hard-to-understand areas
  - [ ] I have added or updated relevant unit/integration/functional/e2e tests
  - [ ] I have made corresponding changes to the documentation
  - [x] I have assigned this pull request to a milestone

ACKs for top commit:
  UdjinM6:
    utACK 1163aec
  PastaPastaPasta:
    utACK 1163aec

Tree-SHA512: 3f38ab8b0786468048f12c44b2014bfacf4908f55965c3ef554fc7304043d057603feb99c2ad6f9919da24da7ab40d8480467f6fcfd317aff3072236145d71f2
@bitcoin bitcoin locked and limited conversation to collaborators Jun 12, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants