
Conversation

maflcko
Member

@maflcko maflcko commented Sep 20, 2022

Also wait for the other node to notice the closed socket. Otherwise, the other node cannot reliably use the connect helper afterwards.

Fixes #26014
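
For context, here is the shape of the merged change, reconstructed from the context lines of the diffs quoted below (a sketch, not the verbatim commit): disconnect_nodes now waits until both nodes have dropped the peer, not only the node that issued the disconnect.

```python
# Sketch of the merged disconnect_nodes helper, reconstructed from the
# diffs quoted later in this thread; excerpted from a method, hence `self`.
def disconnect_nodes_helper(node_a, node_b):
    def get_peer_ids(from_connection, node_num):
        # Peers are identified by the "testnodeN" tag in their subver string.
        return [peer['id'] for peer in from_connection.getpeerinfo()
                if "testnode{}".format(node_num) in peer['subver']]

    for peer_id in get_peer_ids(node_a, node_b.index):
        node_a.disconnectnode(nodeid=peer_id)

    # The fix: wait until *both* sides have noticed the closed socket.
    self.wait_until(lambda: not get_peer_ids(node_a, node_b.index), timeout=5)
    self.wait_until(lambda: not get_peer_ids(node_b, node_a.index), timeout=5)
```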

@maflcko maflcko added this to the 24.0 milestone Sep 20, 2022
@fanquake fanquake added the Tests label Sep 20, 2022
Contributor

@stickies-v stickies-v left a comment


Makes sense to verify the disconnect has been picked up by both nodes, so this seems like a good improvement to me.

In connect_nodes(), relying on the number of connected/veracked peers instead of on the actual peer seems like a very fragile approach, though; I would expect it to break again in the future. Don't we have better ways to check that 2 nodes are connected?

For example, building on faeea28, I think something like this (an MVP, open to improvements) would work:

git diff
```diff
diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
index b1164b98f..f79b931b6 100755
--- a/test/functional/test_framework/test_framework.py
+++ b/test/functional/test_framework/test_framework.py
@@ -581,8 +581,6 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
     def connect_nodes(self, a, b):
         from_connection = self.nodes[a]
         to_connection = self.nodes[b]
-        from_num_peers = 1 + len(from_connection.getpeerinfo())
-        to_num_peers = 1 + len(to_connection.getpeerinfo())
         ip_port = "127.0.0.1:" + str(p2p_port(b))
         from_connection.addnode(ip_port, "onetry")
         # poll until version handshake complete to avoid race conditions
@@ -590,30 +588,27 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
         # See comments in net_processing:
         # * Must have a version message before anything else
         # * Must have a verack message before anything else
-        self.wait_until(lambda: sum(peer['version'] != 0 for peer in from_connection.getpeerinfo()) == from_num_peers)
-        self.wait_until(lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers)
-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in from_connection.getpeerinfo()) == from_num_peers)
-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in to_connection.getpeerinfo()) == to_num_peers)
+        def is_connection_established(from_node: TestNode, to_node: TestNode) -> bool:
+            connected = from_node.get_connected_peers_info(to_node.index)
+            veracked = [peer for peer in connected if peer['bytesrecv_per_msg'].pop('verack', 0) == 24]
+            return len(veracked) > 0
+
+        self.wait_until(lambda: is_connection_established(from_connection, to_connection))
+        self.wait_until(lambda: is_connection_established(to_connection, from_connection))
+
 
     def disconnect_nodes(self, a, b):
         def disconnect_nodes_helper(node_a, node_b):
-            def get_peer_ids(from_connection, node_num):
-                result = []
-                for peer in from_connection.getpeerinfo():
-                    if "testnode{}".format(node_num) in peer['subver']:
-                        result.append(peer['id'])
-                return result
-
-            peer_ids = get_peer_ids(node_a, node_b.index)
-            if not peer_ids:
+            peers = node_a.get_connected_peers_info(node_b.index)
+            if not peers:
                 self.log.warning("disconnect_nodes: {} and {} were not connected".format(
                     node_a.index,
                     node_b.index,
                 ))
                 return
-            for peer_id in peer_ids:
+            for peer in peers:
                 try:
-                    node_a.disconnectnode(nodeid=peer_id)
+                    node_a.disconnectnode(nodeid=peer["id"])
                 except JSONRPCException as e:
                     # If this node is disconnected between calculating the peer id
                     # and issuing the disconnect, don't worry about it.
@@ -622,8 +617,8 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
                         raise
 
             # wait to disconnect
-            self.wait_until(lambda: not get_peer_ids(node_a, node_b.index), timeout=5)
-            self.wait_until(lambda: not get_peer_ids(node_b, node_a.index), timeout=5)
+            self.wait_until(lambda: not node_a.get_connected_peers_info(node_b.index), timeout=5)
+            self.wait_until(lambda: not node_b.get_connected_peers_info(node_a.index), timeout=5)
 
         disconnect_nodes_helper(self.nodes[a], self.nodes[b])
 
diff --git a/test/functional/test_framework/test_node.py b/test/functional/test_framework/test_node.py
index e35cae006..73ad492a5 100755
--- a/test/functional/test_framework/test_node.py
+++ b/test/functional/test_framework/test_node.py
@@ -21,6 +21,7 @@ import collections
 import shlex
 import sys
 from pathlib import Path
+from typing import Any, Dict, List
 
 from .authproxy import JSONRPCException
 from .descriptors import descsum_create
@@ -645,6 +646,13 @@ class TestNode():
 
         return p2p_conn
 
+    def get_connected_peers_info(self, other_node_index: int) -> List[Dict[str, Any]]:
+        result = []
+        for peer in self.getpeerinfo():
+            if "testnode{}".format(other_node_index) in peer['subver']:
+                result.append(peer)
+        return result
+
     def num_test_p2p_connections(self):
         """Return number of test framework p2p connections to the node."""
         return len([peer for peer in self.getpeerinfo() if peer['subver'] == P2P_SUBVERSION])
```

@maflcko
Member Author

maflcko commented Sep 21, 2022

get_connected_peers_info would only work with Bitcoin Core peers, not mininode peers, so I'd prefer not to make it public
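
To illustrate (the subver strings below are paraphrased, not exact): framework P2P connections ("mininodes") advertise the fixed P2P_SUBVERSION user agent, which carries no testnodeN tag, so matching on subver silently ignores them.

```python
# Illustration only; subver values are paraphrased, not exact.
peers = [
    {'id': 0, 'subver': '/Satoshi:24.0.0(testnode1)/'},  # another bitcoind TestNode
    {'id': 1, 'subver': '/python-p2p-tester:0.0.3/'},    # a mininode (P2PInterface)
]
# The "testnodeN" match finds the TestNode peer but is blind to the mininode:
assert [p['id'] for p in peers if 'testnode1' in p['subver']] == [0]
```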

@maflcko
Member Author

maflcko commented Sep 21, 2022

> I would expect it to break again in the future.

The only way I can see it "breaking" is when you attempt to connect to the same peer twice, in which case the failure is probably wanted. I am not sure whether your approach silently dismisses that failure.
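
A hypothetical sketch of that failure mode (all numbers invented for illustration): the count-based wait times out loudly when a duplicate connection appears, whereas an "at least one veracked peer" check passes silently.

```python
# Hypothetical numbers, for illustration only; not framework code.
peers_before = 1                 # peers node b had before connect_nodes()
expected = peers_before + 1      # the count-based wait expects exactly 2

actual = 3                       # a buggy test connected a -> b twice
assert actual != expected        # the predicate never holds -> loud timeout

# By contrast, a check like len(veracked) > 0 is satisfied by any connection
# and would let the duplicate slip through silently.
```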

@stickies-v
Contributor

> get_connected_peers_info would only work with Bitcoin Core peers, not mininode peers, so I'd prefer not to make it public

Good point, I did not consider that. I've updated the diff (below) to keep it private, which also reduces the diff quite a bit. (Note: get_peer_ids() would benefit from the same list comprehension imo, but that's not a required change for this to work.)

> The only way I can see it "breaking" is when you attempt to connect to the same peer twice, in which case the failure is probably wanted. I am not sure whether your approach silently dismisses that failure.

Updated to return True only if we have exactly 1 connection to b, which I think addresses your concern. I don't have any specific ideas of how it would break otherwise, but since connect_nodes() doesn't own the nodes, I think it's not unreasonable to assume that a node may update its peers in the background. If, without additional complexity, we can check that a is now connected to b and not just to any new node, that seems like a strict improvement to me.

git diff
```diff
diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
index b1164b98f..fa3ddebe3 100755
--- a/test/functional/test_framework/test_framework.py
+++ b/test/functional/test_framework/test_framework.py
@@ -581,8 +581,6 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
     def connect_nodes(self, a, b):
         from_connection = self.nodes[a]
         to_connection = self.nodes[b]
-        from_num_peers = 1 + len(from_connection.getpeerinfo())
-        to_num_peers = 1 + len(to_connection.getpeerinfo())
         ip_port = "127.0.0.1:" + str(p2p_port(b))
         from_connection.addnode(ip_port, "onetry")
         # poll until version handshake complete to avoid race conditions
@@ -590,10 +588,14 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
         # See comments in net_processing:
         # * Must have a version message before anything else
         # * Must have a verack message before anything else
-        self.wait_until(lambda: sum(peer['version'] != 0 for peer in from_connection.getpeerinfo()) == from_num_peers)
-        self.wait_until(lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers)
-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in from_connection.getpeerinfo()) == from_num_peers)
-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in to_connection.getpeerinfo()) == to_num_peers)
+        def is_connection_established(from_node: TestNode, to_node: TestNode) -> bool:
+            connected = [peer for peer in from_node.getpeerinfo() if f"testnode{to_node.index}" in peer["subver"]]
+            veracked = [peer for peer in connected if peer["bytesrecv_per_msg"].pop('verack', 0) == 24]
+            return len(veracked) == 1
+
+        self.wait_until(lambda: is_connection_established(from_connection, to_connection))
+        self.wait_until(lambda: is_connection_established(to_connection, from_connection))
+
 
     def disconnect_nodes(self, a, b):
         def disconnect_nodes_helper(node_a, node_b):
```

Contributor

@stickies-v stickies-v left a comment


ACK faeea28

I would prefer making connect_nodes() more robust too, as per my suggestion, but I think this is a solid improvement on its own; my comment could be implemented in a separate PR if that's preferable or if the proposal is contentious.

A few nits, no blockers.

Comment on lines +610 to +611:

```python
node_a.index,
node_b.index,
```
Contributor


nit: If you're touching this anyway, I would use an f-string instead:

```python
self.log.warning(f"disconnect_nodes: {node_a.index} and {node_b.index} were not connected")
```

Comment on lines +599 to +600:

```python
def disconnect_nodes_helper(node_a, node_b):
    def get_peer_ids(from_connection, node_num):
```
Contributor


nit: Since you're touching this, I think adding type hints here would be helpful. Also, for consistency, I would prefer calling it node instead of connection and index instead of num:

Suggested change:

```diff
-def disconnect_nodes_helper(node_a, node_b):
-    def get_peer_ids(from_connection, node_num):
+def disconnect_nodes(self, a: int, b: int) -> None:
+    def disconnect_nodes_helper(node_a: TestNode, node_b: TestNode) -> None:
+        def get_peer_ids(from_node: TestNode, to_node_index: int) -> List[int]:
```

Member Author


I kept the name as-is to make the diff minimal. If there are additional changes, they can be made in a follow-up.

If there is a follow-up, it would probably be fine to also make to_node_index of type TestNode.
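
One possible shape of that follow-up (hypothetical; nothing like it is part of this PR):

```python
# Hypothetical follow-up only; TestNode is the framework's node wrapper.
from typing import List

def get_peer_ids(from_node: 'TestNode', to_node: 'TestNode') -> List[int]:
    return [peer['id'] for peer in from_node.getpeerinfo()
            if f"testnode{to_node.index}" in peer['subver']]
```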

Member


@stickies-v maybe you can follow up here?

@DrahtBot
Contributor

DrahtBot commented Sep 23, 2022

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

Member

@glozow glozow left a comment


ACK faeea28

Comment on lines -625 to +626:

```diff
-            self.wait_until(lambda: not get_peer_ids(), timeout=5)
+            self.wait_until(lambda: not get_peer_ids(node_a, node_b.index), timeout=5)
+            self.wait_until(lambda: not get_peer_ids(node_b, node_a.index), timeout=5)
```
Member


OK, I interpret this race to have been:

  1. call disconnect(1, 2), wait until 2 no longer shows up in node1.getpeerinfo()
  2. call connect(0, 2). get to_num_peers = 2 when querying node2.getpeerinfo(), because node1 still shows up there
  3. node1 actually disconnects from node2. now there's only 1 peer in node2.getpeerinfo()
  4. lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers == 2 can never become true
  5. hang

Which is why we need to wait for both.

And now the order is:

  1. call disconnect(1, 2), wait until 2 no longer shows up in node1.getpeerinfo() AND wait until 1 no longer shows up in node2.getpeerinfo()
  2. call connect(0, 2). get to_num_peers = 1 when querying node2.getpeerinfo()
  3. lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers == 1, which becomes true once the new connection completes its handshake
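
The same race, restated schematically (hypothetical values matching the numbered steps above):

```python
# Schematic restatement of the steps above; not framework code.
to_num_peers = 1 + 1      # step 2: node1 still shows up, so 1 existing + 1 new = 2
live_veracked_peers = 1   # step 3: the pending disconnect lands; only 1 peer remains

# step 4: the predicate compares against the stale count and can never hold,
# so wait_until() spins until it times out (step 5):
assert not (live_veracked_peers == to_num_peers)
```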

@glozow glozow merged commit b2da6dd into bitcoin:master Sep 28, 2022
@maflcko
Member Author

maflcko commented Sep 28, 2022

> I think it's not unreasonable to assume that a node may update its peers in the background?

The functional test framework runs on a single main thread, so nothing should be happening in the background. If something were happening in the background, it would either be irrelevant to the test, or it could result in a rare intermittent false failure.

sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Sep 28, 2022
fanquake added a commit that referenced this pull request Sep 29, 2022
faeea28 test: Avoid race in disconnect_nodes helper (MacroFake)

Pull request description:

  Backport of #26138

ACKs for top commit:
  fanquake:
    ACK faeea28

Tree-SHA512: f967c38750220bd6c245db953055f8e6d5402b3a24081ca03795a8403c2ed4eab772b2e9c2d3b581c3bc55d191dd4e22711b5f97d39856d676f10799fc64a9c7
@maflcko maflcko deleted the 2209-test-race-🍄 branch September 29, 2022 09:27
@bitcoin bitcoin locked and limited conversation to collaborators Sep 29, 2023

Successfully merging this pull request may close these issues.

test: failure in wallet_reorgsrestore.py