-
Notifications
You must be signed in to change notification settings - Fork 37.7k
rpc, test: add sendmsgtopeer
rpc and a test for net-level deadlock situation
#28287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
ee730b4
to
f9e8be6
Compare
f9e8be6
to
7b76e09
Compare
Concept ACK |
Concept ACK. Probably should add some basic functional tests to rpc_net.py with a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. Feel free to ignore the nits.
|
Concept ACK this would be useful. |
61fc011
to
d1404b3
Compare
Thanks for the reviews! I pushed an update with changes due to feedback, haven't addressed @ajtowns suggestion yet:
Will add this to
updated with the same comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK d1404b3
This rpc can be used when we want a node to send a message, but cannot use a python P2P object, for example for testing of low-level net transport behavior.
d1404b3
to
f25b13a
Compare
I added a commit for basic coverage in
I added coverage for >4MB. I couldn't test the case >32MB, because already at ~16MB I get a failure passing a string that long via RPC (independent on the actual RPC, so this can't be handled within |
f25b13a
to
b3a93b4
Compare
ACK b3a93b4 LGTM. |
ACK b3a93b4 Two non-blocking comments:
|
ACK b3a93b4 |
…r net-level deadlock situation b3a93b4 test: add functional test for deadlock situation (Martin Zumsande) 3557aa4 test: add basic tests for sendmsgtopeer to rpc_net.py (Martin Zumsande) a9a1d69 rpc: add test-only sendmsgtopeer rpc (Martin Zumsande) Pull request description: This adds a `sendmsgtopeer` rpc (for testing only) that allows a node to send a message (provided in hex) to a peer. While we would usually use a `p2p` object instead of a node for this in the test framework, that isn't possible in situations where this message needs to trigger an actual interaction of multiple nodes. Use this rpc to add test coverage for the bug fixed in bitcoin#27981 (that just got merged): The test lets two nodes (almost) simultaneously send a single large (4MB) p2p message to each other, which would have caused a deadlock previously (making this test fail), but succeeds now. As can be seen from the discussion in bitcoin#27981, it was not easy to reproduce this bug without `sendmsgtopeer`. I would imagine that `sendmsgtopeer` could also be helpful in various other test constellations. ACKs for top commit: ajtowns: ACK b3a93b4 sipa: ACK b3a93b4 achow101: ACK b3a93b4 Tree-SHA512: 6e22e72402f3c4dd70cddb9e96ea988444720f7a164031df159fbdd48056c8ac77ac53def045d9208a3ca07437c7c8e34f8b4ebc7066c0a84d81cd53f2f4fa5f
… deadlock situation `random_bytes()` is introduced in bitcoin#25625 but the function def alone doesn't warrant a full backport, so we'll only implement the section relevant to this PR.
… deadlock situation `random_bytes()` is introduced in bitcoin#25625 but the function def alone doesn't warrant a full backport, so we'll only implement the section relevant to this PR.
… deadlock situation `random_bytes()` is introduced in bitcoin#25625 but the function def alone doesn't warrant a full backport, so we'll only implement the section relevant to this PR.
… deadlock situation `random_bytes()` is introduced in bitcoin#25625 but the function def alone doesn't warrant a full backport, so we'll only implement the section relevant to this PR.
… deadlock situation `random_bytes()` is introduced in bitcoin#25625 but the function def alone doesn't warrant a full backport, so we'll only implement the section relevant to this PR.
The fix introduced in bitcoin#27981 (8c986d6, dash#6067) is not validated to work until bitcoin#28287, an upcoming backport. As the latter has identified the former backport didn't pass validation, let's fix it so that the latter tests succeed when they're backported in an upcoming commit.
… deadlock situation `random_bytes()` is introduced in bitcoin#25625 but the function def alone doesn't warrant a full backport, so we'll only implement the section relevant to this PR.
The fix introduced in bitcoin#27981 (8c986d6, dash#6067) is not validated to work until bitcoin#28287, an upcoming backport. As the latter has identified the former backport didn't pass validation, let's fix it so that the latter tests succeed when they're backported in an upcoming commit.
… deadlock situation `random_bytes()` is introduced in bitcoin#25625 but the function def alone doesn't warrant a full backport, so we'll only implement the section relevant to this PR.
The fix introduced in bitcoin#27981 (8c986d6, dash#6067) is not validated to work until bitcoin#28287, an upcoming backport. As the latter has identified the former backport didn't pass validation, let's fix it so that the latter tests succeed when they're backported in an upcoming commit.
… deadlock situation `random_bytes()` is introduced in bitcoin#25625 but the function def alone doesn't warrant a full backport, so we'll only implement the section relevant to this PR.
… deadlock situation `random_bytes()` is introduced in bitcoin#25625 but the function def alone doesn't warrant a full backport, so we'll only implement the section relevant to this PR.
, bitcoin#23774, bitcoin#25443, bitcoin#26138, bitcoin#26854, bitcoin#27128, bitcoin#27761, bitcoin#27863, bitcoin#28287, bitcoin#30118, partial bitcoin#22778 (auxiliary backports: part 16) e458adb merge bitcoin#30118: improve robustness of connect_nodes() (UdjinM6) ac94de2 merge bitcoin#28287: add `sendmsgtopeer` rpc and a test for net-level deadlock situation (Kittywhiskers Van Gogh) d1fce0b fix: ensure that deadlocks are actually resolved (Kittywhiskers Van Gogh) 19e7bf6 merge bitcoin#27863: do not break when addr is not from a distinct network group (Kittywhiskers Van Gogh) 1adb9a2 merge bitcoin#27761: Log addresses of stalling peers (Kittywhiskers Van Gogh) 2854a6a merge bitcoin#27128: fix intermittent issue in `p2p_disconnect_ban` (Kittywhiskers Van Gogh) d4b0fae merge bitcoin#26854: Fix intermittent timeout in p2p_permissions.py (Kittywhiskers Van Gogh) 892e329 merge bitcoin#26138: Avoid race in disconnect_nodes helper (Kittywhiskers Van Gogh) d6ce037 merge bitcoin#25443: Fail if connect_nodes fails (Kittywhiskers Van Gogh) 60b5392 partial bitcoin#22778: Reduce resource usage for inbound block-relay-only connections (Kittywhiskers Van Gogh) 85c4aef merge bitcoin#23774: Add missing assert_equal import to p2p_add_connections.py (Kittywhiskers Van Gogh) 0354417 merge bitcoin#22777: don't request tx relay on feeler connections (Kittywhiskers Van Gogh) 7229eb0 merge bitcoin#23042: Avoid logging AlreadyHaveTx when disconnecting misbehaving peer (Kittywhiskers Van Gogh) 05395ff merge bitcoin#22817: Avoid race after connect_nodes (Kittywhiskers Van Gogh) Pull request description: ## Additional Information * Depends on #6286 * Depends on #6287 * Depends on #6289 * When backporting [bitcoin#28287](bitcoin#28287), `p2p_net_deadlock.py` relies on the function, `random_bytes()`, that is introduced in [bitcoin#25625](bitcoin#25625). Backporting [bitcoin#25625](bitcoin#25625) would attract changes outside the scope of this PR. In the interest of brevity, the changes that introduce `random_bytes()` have been included in [bitcoin#28287](bitcoin#28287) instead. ## Breaking Changes None expected. ## Checklist: - [x] I have performed a self-review of my own code - [x] I have commented my code, particularly in hard-to-understand areas **(note: N/A)** - [x] I have added or updated relevant unit/integration/functional/e2e tests - [x] I have made corresponding changes to the documentation **(note: N/A)** - [x] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_ ACKs for top commit: UdjinM6: utACK e458adb PastaPastaPasta: utACK e458adb Tree-SHA512: 48494004dddecb31c53f5e19ab0114b92ed7b4381c7977800fd49b7403222badbfdcfe46241e854f5b086c6f54a35f6483f91c6f047b7ac9b1e88e35bb32ad02
Summary: It has been observed that if both sides of a P2P connection have a significant amount of data to send, a stall can occur, where both try to drain their own send queue before trying to receive. The same issue seems to apply to the current Bitcoin Core codebase, though I don't know whether it's a frequent issue for us. The core issue is that whenever our optimistic send fails to fully send a message, we do subsequently not even select() for receiving; if it then turns out that sending is not possible either, no progress is made at all. To address this, the solution used in this PR is to still select() for both sending and receiving when an optimistic send fails, but skip receiving if sending succeeded, and (still) doesn't fully drain the send queue. This is a significant reduction in how aggressive the "receive pushback" mechanism is, because now it will only mildly push back while sending progress is made; if the other side stops receiving entirely, the pushback disappears. I don't think that's a serious problem though: - We still have a pushback mechanism at the application buffer level (when the application receive buffer overflows, receiving is paused until messages in the buffer get processed; waiting on our own net_processing thread, not on the remote party). - There are cases where the existing mechanism is too aggressive; e.g. when the send queue is non-empty, but tiny, and can be sent with a single send() call. In that case, I think we'd prefer to still receive within the same processing loop of the network thread. Co-authored-by: Anthony Towns <aj@erisian.com.au> core#28287: > rpc: add test-only sendmsgtopeer rpc > > This rpc can be used when we want a node to send a message, but > cannot use a python P2P object, for example for testing of low-level > net transport behavior. > test: add basic tests for sendmsgtopeer to rpc_net.py > test: add functional test for deadlock situation This is a backport of [[bitcoin/bitcoin#27981 | core#27981]] and [[bitcoin/bitcoin#28287 | core#28287]] Test Plan: `ninja all check-all` Test that this passes now and the same functional test sometimes fails before this commit (>33% failure rate on my machine) ` for i in {1..100}; do ~/dev/bitcoin-abc/build/test/functional/test_runner.py p2p_net_deadlock; done` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Subscribers: Sintayew4, Fabien Differential Revision: https://reviews.bitcoinabc.org/D17138
Summary: It has been observed that if both sides of a P2P connection have a significant amount of data to send, a stall can occur, where both try to drain their own send queue before trying to receive. The same issue seems to apply to the current Bitcoin Core codebase, though I don't know whether it's a frequent issue for us. The core issue is that whenever our optimistic send fails to fully send a message, we do subsequently not even select() for receiving; if it then turns out that sending is not possible either, no progress is made at all. To address this, the solution used in this PR is to still select() for both sending and receiving when an optimistic send fails, but skip receiving if sending succeeded, and (still) doesn't fully drain the send queue. This is a significant reduction in how aggressive the "receive pushback" mechanism is, because now it will only mildly push back while sending progress is made; if the other side stops receiving entirely, the pushback disappears. I don't think that's a serious problem though: - We still have a pushback mechanism at the application buffer level (when the application receive buffer overflows, receiving is paused until messages in the buffer get processed; waiting on our own net_processing thread, not on the remote party). - There are cases where the existing mechanism is too aggressive; e.g. when the send queue is non-empty, but tiny, and can be sent with a single send() call. In that case, I think we'd prefer to still receive within the same processing loop of the network thread. Co-authored-by: Anthony Towns <aj@erisian.com.au> core#28287: > rpc: add test-only sendmsgtopeer rpc > > This rpc can be used when we want a node to send a message, but > cannot use a python P2P object, for example for testing of low-level > net transport behavior. > test: add basic tests for sendmsgtopeer to rpc_net.py > test: add functional test for deadlock situation This is a backport of [[bitcoin/bitcoin#27981 | core#27981]] and [[bitcoin/bitcoin#28287 | core#28287]] Test Plan: `ninja all check-all` Test that this passes now and the same functional test sometimes fails before this commit (>33% failure rate on my machine) ` for i in {1..100}; do ~/dev/bitcoin-abc/build/test/functional/test_runner.py p2p_net_deadlock; done` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Subscribers: Sintayew4, Fabien Differential Revision: https://reviews.bitcoinabc.org/D17138
This adds a
sendmsgtopeer
rpc (for testing only) that allows a node to send a message (provided in hex) to a peer.While we would usually use a
p2p
object instead of a node for this in the test framework, that isn't possible in situations where this message needs to trigger an actual interaction of multiple nodes.Use this rpc to add test coverage for the bug fixed in #27981 (that just got merged):
The test lets two nodes (almost) simultaneously send a single large (4MB) p2p message to each other, which would have caused a deadlock previously (making this test fail), but succeeds now.
As can be seen from the discussion in #27981, it was not easy to reproduce this bug without
sendmsgtopeer
. I would imagine thatsendmsgtopeer
could also be helpful in various other test constellations.