test: Avoid race in disconnect_nodes helper #26138
Makes sense to verify the disconnect has been picked up by both nodes, so this seems like a good improvement to me.
In `connect_nodes()`, relying on the number of connected/veracked peers instead of the actual peer seems like a very fragile approach, though; I would expect it to break again in the future. Don't we have better ways to check that two nodes are connected?
For example, I think building on faeea28 this (MVP - open to improvements) would work:
git diff
```diff
diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
index b1164b98f..f79b931b6 100755
--- a/test/functional/test_framework/test_framework.py
+++ b/test/functional/test_framework/test_framework.py
@@ -581,8 +581,6 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
     def connect_nodes(self, a, b):
         from_connection = self.nodes[a]
         to_connection = self.nodes[b]
-        from_num_peers = 1 + len(from_connection.getpeerinfo())
-        to_num_peers = 1 + len(to_connection.getpeerinfo())
         ip_port = "127.0.0.1:" + str(p2p_port(b))
         from_connection.addnode(ip_port, "onetry")
         # poll until version handshake complete to avoid race conditions
@@ -590,30 +588,27 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
         # See comments in net_processing:
         # * Must have a version message before anything else
         # * Must have a verack message before anything else
-        self.wait_until(lambda: sum(peer['version'] != 0 for peer in from_connection.getpeerinfo()) == from_num_peers)
-        self.wait_until(lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers)
-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in from_connection.getpeerinfo()) == from_num_peers)
-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in to_connection.getpeerinfo()) == to_num_peers)
+        def is_connection_established(from_node: TestNode, to_node: TestNode) -> bool:
+            connected = from_node.get_connected_peers_info(to_node.index)
+            veracked = [peer for peer in connected if peer['bytesrecv_per_msg'].pop('verack', 0) == 24]
+            return len(veracked) > 0
+
+        self.wait_until(lambda: is_connection_established(from_connection, to_connection))
+        self.wait_until(lambda: is_connection_established(to_connection, from_connection))
+
     def disconnect_nodes(self, a, b):
         def disconnect_nodes_helper(node_a, node_b):
-            def get_peer_ids(from_connection, node_num):
-                result = []
-                for peer in from_connection.getpeerinfo():
-                    if "testnode{}".format(node_num) in peer['subver']:
-                        result.append(peer['id'])
-                return result
-
-            peer_ids = get_peer_ids(node_a, node_b.index)
-            if not peer_ids:
+            peers = node_a.get_connected_peers_info(node_b.index)
+            if not peers:
                 self.log.warning("disconnect_nodes: {} and {} were not connected".format(
                     node_a.index,
                     node_b.index,
                 ))
                 return
-            for peer_id in peer_ids:
+            for peer in peers:
                 try:
-                    node_a.disconnectnode(nodeid=peer_id)
+                    node_a.disconnectnode(nodeid=peer["id"])
                 except JSONRPCException as e:
                     # If this node is disconnected between calculating the peer id
                     # and issuing the disconnect, don't worry about it.
@@ -622,8 +617,8 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
                         raise

             # wait to disconnect
-            self.wait_until(lambda: not get_peer_ids(node_a, node_b.index), timeout=5)
-            self.wait_until(lambda: not get_peer_ids(node_b, node_a.index), timeout=5)
+            self.wait_until(lambda: not node_a.get_connected_peers_info(node_b.index), timeout=5)
+            self.wait_until(lambda: not node_b.get_connected_peers_info(node_a.index), timeout=5)

         disconnect_nodes_helper(self.nodes[a], self.nodes[b])
diff --git a/test/functional/test_framework/test_node.py b/test/functional/test_framework/test_node.py
index e35cae006..73ad492a5 100755
--- a/test/functional/test_framework/test_node.py
+++ b/test/functional/test_framework/test_node.py
@@ -21,6 +21,7 @@ import collections
 import shlex
 import sys
 from pathlib import Path
+from typing import Any, Dict, List

 from .authproxy import JSONRPCException
 from .descriptors import descsum_create
@@ -645,6 +646,13 @@ class TestNode():
         return p2p_conn

+    def get_connected_peers_info(self, other_node_index: int) -> List[Dict[str, Any]]:
+        result = []
+        for peer in self.getpeerinfo():
+            if "testnode{}".format(other_node_index) in peer['subver']:
+                result.append(peer)
+        return result
+
     def num_test_p2p_connections(self):
         """Return number of test framework p2p connections to the node."""
         return len([peer for peer in self.getpeerinfo() if peer['subver'] == P2P_SUBVERSION])
```
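A side note on why the `subver` match in `get_connected_peers_info` works: the test framework starts every node with a user-agent comment embedding its index (to my understanding, via a `-uacomment=testnode<i>` default argument), so each peer entry reveals which test node it belongs to. A tiny illustration with a made-up subversion string:

```python
# Made-up getpeerinfo() entry; the "(testnode0)" fragment comes from the
# uacomment the framework passes to node 0 at startup.
peer = {"subver": "/Satoshi:24.0.0(testnode0)/"}
other_node_index = 0
assert "testnode{}".format(other_node_index) in peer["subver"]
```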
The only way I can see it would "break" is when you attempt to connect to the same peer twice, in which case the failure is probably wanted. I am not sure if your approach silently dismisses that failure.
Good point, I did not consider that. I've updated the diff (below) to keep it private, which also reduces the diff quite a bit. (Note: updated to only return …)

git diff

```diff
diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
index b1164b98f..fa3ddebe3 100755
--- a/test/functional/test_framework/test_framework.py
+++ b/test/functional/test_framework/test_framework.py
@@ -581,8 +581,6 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
     def connect_nodes(self, a, b):
         from_connection = self.nodes[a]
         to_connection = self.nodes[b]
-        from_num_peers = 1 + len(from_connection.getpeerinfo())
-        to_num_peers = 1 + len(to_connection.getpeerinfo())
         ip_port = "127.0.0.1:" + str(p2p_port(b))
         from_connection.addnode(ip_port, "onetry")
         # poll until version handshake complete to avoid race conditions
@@ -590,10 +588,14 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
         # See comments in net_processing:
         # * Must have a version message before anything else
         # * Must have a verack message before anything else
-        self.wait_until(lambda: sum(peer['version'] != 0 for peer in from_connection.getpeerinfo()) == from_num_peers)
-        self.wait_until(lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers)
-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in from_connection.getpeerinfo()) == from_num_peers)
-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in to_connection.getpeerinfo()) == to_num_peers)
+        def is_connection_established(from_node: TestNode, to_node: TestNode) -> bool:
+            connected = [peer for peer in from_node.getpeerinfo() if f"testnode{to_node.index}" in peer["subver"]]
+            veracked = [peer for peer in connected if peer["bytesrecv_per_msg"].pop('verack', 0) == 24]
+            return len(veracked) == 1
+
+        self.wait_until(lambda: is_connection_established(from_connection, to_connection))
+        self.wait_until(lambda: is_connection_established(to_connection, from_connection))
+
     def disconnect_nodes(self, a, b):
         def disconnect_nodes_helper(node_a, node_b):
```
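To make the behaviour change concrete: requiring exactly one veracked peer (`len(veracked) == 1` rather than `> 0`) means a duplicate connection to the same peer keeps the predicate false, so that failure mode is surfaced rather than silently dismissed. A stand-alone illustration with made-up peer entries:

```python
# Two made-up peer entries for the *same* test node, as if a test had
# accidentally connected to that peer twice.
connected = [
    {"bytesrecv_per_msg": {"verack": 24}},
    {"bytesrecv_per_msg": {"verack": 24}},
]
veracked = [p for p in connected if p["bytesrecv_per_msg"].pop("verack", 0) == 24]
assert len(veracked) == 2  # duplicate detected: the "== 1" predicate stays
                           # False, so wait_until times out and the test fails loudly
```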
ACK faeea28
I would prefer making `connect_nodes()` more robust too, as per my suggestion, but I think this is a solid improvement on its own, and my comment could be implemented in a separate PR if that's preferable or the proposal is contentious.
A few nits, no blockers.
                    node_a.index,
                    node_b.index,
nit: If you're touching this anyway, I would use an f-string instead:
self.log.warning(f"disconnect_nodes: {node_a.index} and {node_b.index} were not connected")
        def disconnect_nodes_helper(node_a, node_b):
            def get_peer_ids(from_connection, node_num):
nit: Since you're touching this, I think adding type hints here would be helpful. Also, for consistency, would prefer calling it `node` instead of `connection` and `index` instead of `num`?
Suggested change:

```diff
-        def disconnect_nodes_helper(node_a, node_b):
-            def get_peer_ids(from_connection, node_num):
+    def disconnect_nodes(self, a: int, b: int) -> None:
+        def disconnect_nodes_helper(node_a: TestNode, node_b: TestNode) -> None:
+            def get_peer_ids(from_node: TestNode, to_node_index: int) -> List[int]:
```
I kept the name as-is to make the diff minimal. If there are additional changes, they can be made in a follow-up.
If there is a follow-up, it would probably be fine to also make `to_node_index` of type `TestNode`.
@stickies-v maybe you can follow up here?
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts: No conflicts as of last run.
ACK faeea28
```diff
-            self.wait_until(lambda: not get_peer_ids(), timeout=5)
+            self.wait_until(lambda: not get_peer_ids(node_a, node_b.index), timeout=5)
+            self.wait_until(lambda: not get_peer_ids(node_b, node_a.index), timeout=5)
```
OK, I interpret this race to have been:

1. Call `disconnect(1, 2)` and wait until 2 no longer shows up in `node1.getpeerinfo()`.
2. Call `connect(0, 2)`. Get `to_num_peers = 2` when querying `node2.getpeerinfo()`.
3. node1 actually disconnects from node2; now there's only 1 peer in `node2.getpeerinfo()`.
4. `lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers` can never equal 2.
5. Hang.

Which is why we need to wait for both.

And now the order is:

1. Call `disconnect(1, 2)`, wait until 2 no longer shows up in `node1.getpeerinfo()` AND wait until 1 no longer shows up in `node2.getpeerinfo()`.
2. Call `connect(0, 2)`. Get `to_num_peers = 1` when querying `node2.getpeerinfo()`.
3. `lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers` succeeds with 1.

A sketch of the failing timeline is below.
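A minimal, self-contained sketch of that failing timeline, with made-up dictionaries standing in for `getpeerinfo()` results (all names and values here are hypothetical):

```python
# Step 1: disconnect(1, 2) returned once *node1* no longer listed node2,
# but node2 may still list its half-closed connection to node1.
node2_peers = [{"subver": "/Satoshi:24.0.0(testnode1)/", "version": 70016}]

# Step 2: connect(0, 2) snapshots the expected peer count on node2.
to_num_peers = 1 + len(node2_peers)  # == 2, counts the stale node1 entry

# Step 3: node2 notices the closed socket and drops node1, while the new
# connection from node0 completes its handshake.
node2_peers = [{"subver": "/Satoshi:24.0.0(testnode0)/", "version": 70016}]

# Step 4: the old connect_nodes predicate can now never be satisfied.
assert sum(p["version"] != 0 for p in node2_peers) == 1  # but it waits for 2
```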
The functional test's main thread is a single thread, so there should be nothing happening in the background. If something were happening in the background, it would either be irrelevant to the test, or it could result in a rare intermittent false failure.
Also wait for the other node to notice the closed socket. Otherwise, the other node is not able to use the connect helper.
Fixes #26014
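For context on the `self.wait_until(...)` calls used throughout the fix: they poll a predicate until it returns True or a timeout expires. A simplified sketch of that polling pattern (the framework's real helper also scales the timeout by a configurable factor; `poll_interval` is a name chosen here for illustration):

```python
import time

def wait_until(predicate, *, timeout=60, poll_interval=0.05):
    """Poll predicate until it returns True; raise if the timeout expires.

    Simplified sketch of the test framework's helper, not its exact code.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return
        time.sleep(poll_interval)
    raise AssertionError(f"predicate not satisfied within {timeout}s")

# Usage mirroring the fix: wait until *both* sides have dropped the peer.
peers_of_a, peers_of_b = [], []  # hypothetical getpeerinfo() results
wait_until(lambda: not peers_of_a, timeout=5)
wait_until(lambda: not peers_of_b, timeout=5)
```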