Skip to content

Conversation

mzumsande
Copy link
Contributor

@mzumsande mzumsande commented Nov 3, 2022

p2p_sendtxrcncl.py currently fails intermittently in the CI, see e.g. https://cirrus-ci.com/task/5511952184115200?logs=ci#L4024

I believe that this is related to the reuse of the parameter p2p_idx=2 of add_outbound_p2p_connection in this test: When we call peer_disconnect, we don't wait until the node has completed the disconnection. So there is a race between setting up the next connection (next addconnection RPC), and if the old one hasn't been removed and has an identical port like the new one (because we didn't increment p2p_idx), CConnman::OpenNetworkConnection just returns without establishing a connection, and the test fails.

Fix this by using distinct disconnect_p2ps instead of peer_disconnect, which waits for the disconnect to complete. We can then use the same value for p2p_idx everywhere.

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, but I am not really a fan of writing racy tests unless racyness is the goal of the test.

peer = self.nodes[0].add_outbound_p2p_connection(
SendTxrcnclReceiver(), wait_for_verack=True, p2p_idx=1, connection_type="outbound-full-relay")
SendTxrcnclReceiver(), wait_for_verack=True, p2p_idx=ob_peer_id, connection_type="outbound-full-relay")
assert peer.sendtxrcncl_msg_received
assert peer.sendtxrcncl_msg_received.initiator
assert not peer.sendtxrcncl_msg_received.responder
assert_equal(peer.sendtxrcncl_msg_received.version, 1)
peer.peer_disconnect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If there is no guarantee that this will disconnect the peer (due to races), it might be better to remove it, or replace it with disconnect_p2ps, which would also avoid having to modify the index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I used disconnect_p2ps everywhere instead of peer_disconnect, and also changed p2p_idx=0 everywhere for consistency - this should also work.

@mzumsande mzumsande force-pushed the 202211_fix_sendtxrcncl branch from b1cc46d to 1e533dc Compare November 3, 2022 20:35
Using disconnect_p2ps instead of peer_disconnect makes
the node wait for the disconnect to complete. As a result,
we can reuse p2p_idx=0 in the add_outbound_p2p_connection calls.
@mzumsande mzumsande force-pushed the 202211_fix_sendtxrcncl branch from 1e533dc to 74d9753 Compare November 3, 2022 20:42
@DrahtBot
Copy link
Contributor

DrahtBot commented Nov 4, 2022

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, however this fixes one race, but introduces two more


self.log.info('SENDTXRCNCL if block-relay-only triggers a disconnect')
peer = self.nodes[0].add_outbound_p2p_connection(
PeerNoVerack(), wait_for_verack=False, p2p_idx=3, connection_type="block-relay-only")
PeerNoVerack(), wait_for_verack=False, p2p_idx=0, connection_type="block-relay-only")
with self.nodes[0].assert_debug_log(["we indicated no tx relay; disconnecting"]):
peer.send_message(create_sendtxrcncl_msg(initiator=False))
peer.wait_for_disconnect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is racy, so you'd have to call self.nodes[0].disconnect_p2ps() or use a unique index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why? We called disconnect_p2ps() before add_outbound_p2p_connection, so are guaranteed to have no peer when doing so. Within thewith block we call wait_for_disconnect (which waits until the disconnect initiated by the node is completed), so the subsequent subtest also shouldn't be able to run before the disconnect of the current one is completed. What am I missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. As the disconnect happens on the side of the node, there shouldn't be a race where the node is not aware of the disconnect.


p2p_idx must be different for simultaneously connected peers. When reusing it for the next peer
after disconnecting the previous one, it is necessary to wait for the disconnect to finish to avoid
a race condition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mention that disconnect_p2ps avoids the race?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to describe it in more general terms because disconnect_p2ps can only be used if there is no other, unrelated connection that we need to keep.

@maflcko
Copy link
Member

maflcko commented Nov 4, 2022

review ACK 74d9753

@maflcko maflcko merged commit e42ba13 into bitcoin:master Nov 4, 2022
@mzumsande mzumsande deleted the 202211_fix_sendtxrcncl branch November 4, 2022 17:31
@jonatack
Copy link
Member

jonatack commented Nov 4, 2022

Post-merge ACK

sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Nov 5, 2022
…l.py

74d9753 test: fix intermittent failure in p2p_sendtxrcncl.py (Martin Zumsande)

Pull request description:

  `p2p_sendtxrcncl.py` currently fails intermittently in the CI, see e.g. https://cirrus-ci.com/task/5511952184115200?logs=ci#L4024

  I believe that this is related to the reuse of the parameter `p2p_idx=2` of `add_outbound_p2p_connection` in this test: When we call `peer_disconnect`, we don't wait until the node has completed the disconnection. So there is a race between setting up the next connection (next  `addconnection` RPC), and if the old one hasn't been removed and has an identical port like the new one (because we didn't increment `p2p_idx`), `CConnman::OpenNetworkConnection` just [returns](https://github.com/bitcoin/bitcoin/blob/5274f324375fd31cf8507531fbc612765d03092f/src/net.cpp#L1976) without establishing a connection, and the test fails.

  Fix this by using distinct `disconnect_p2ps` instead of `peer_disconnect`, which waits for the disconnect to complete. We can then use the same value for `p2p_idx` everywhere.

ACKs for top commit:
  MarcoFalke:
    review ACK 74d9753

Tree-SHA512: f99f2550b6b320c0a2416a475c1cf189c009fce3a5abf1d4462486e1bfe309e2c3fd4228a4009b0ca38cb77465ce85e3d22298719eb07302fa0a72fbab0e0668
@bitcoin bitcoin locked and limited conversation to collaborators Nov 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants