Conversation

sipa
Member

@sipa sipa commented Jun 26, 2023

See ElementsProject/elements#1233. There, it has been observed that if both sides of a P2P connection have a significant amount of data to send, a stall can occur in which both try to drain their own send queue before trying to receive. The same issue seems to apply to the current Bitcoin Core codebase, though I don't know whether it's a frequent problem for us.

The core issue is that whenever our optimistic send fails to fully send a message, we subsequently don't even select() for receiving; if it then turns out that sending is not possible either, no progress is made at all. To address this, the solution used in this PR is to still select() for both sending and receiving when an optimistic send fails, but to skip receiving if sending succeeded and (still) didn't fully drain the send queue.
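
In rough C++, the new rule amounts to the following (a minimal sketch; `SendResult` and `ShouldReceive` are illustrative names, not the actual interface, which lives in CConnman's socket handling loop):

```cpp
#include <cstddef>

// Minimal sketch of the rule described above. SendResult and
// ShouldReceive are illustrative names, not the actual interface.
struct SendResult {
    std::size_t bytes_sent; // bytes send() managed to push out this pass
    bool data_left;         // send queue still non-empty afterwards
};

// Skip receiving only when a send was attempted, made progress, and
// still left data queued; in every other case receive, so that progress
// is made even if the peer has stopped receiving.
bool ShouldReceive(bool send_attempted, const SendResult& res)
{
    if (!send_attempted) return true;     // nothing queued: receive freely
    if (res.bytes_sent == 0) return true; // could not send: try receiving
    return !res.data_left;                // queue drained: receive as well
}
```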

This is a significant reduction in how aggressive the "receive pushback" mechanism is: it now only mildly pushes back while sending progress is being made, and if the other side stops receiving entirely, the pushback disappears. I don't think that's a serious problem though:

  • We still have a pushback mechanism at the application buffer level (when the application receive buffer overflows, receiving is paused until messages in the buffer get processed; waiting on our own net_processing thread, not on the remote party).
  • There are cases where the existing mechanism is too aggressive; e.g. when the send queue is non-empty, but tiny, and can be sent with a single send() call. In that case, I think we'd prefer to still receive within the same processing loop of the network thread.

@DrahtBot
Contributor

DrahtBot commented Jun 26, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type | Reviewers
ACK | mzumsande, ajtowns, naumenkogs

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #28222 (Use shared_ptr for CNode inside CConnman by willcl-ark)
  • #28196 (BIP324 connection support by sipa)
  • #28165 (net: transport abstraction by sipa)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@sipa
Member Author

sipa commented Jun 26, 2023

cc @psgreco, who pointed out the issue and helped test it.

@psgreco
Contributor

psgreco commented Jun 26, 2023

Concept ACK 5e92378. This potential issue is not easy to trigger on demand in Bitcoin, but it's relatively easy to trigger in Elements when the node is roughly 20-24 hours behind. I tested a similar version of the patch in Elements; it does solve the stalling.

@kristapsk
Contributor

Wouldn't it be possible to trigger and test this with some functional test?

@psgreco
Contributor

psgreco commented Jun 27, 2023

In theory it should be, but in our tests (mostly @lolhill's) a big component of this situation is latency. I've never been able to replicate it between two local hosts, only with a host that's ~100ms away.

@glozow glozow added the P2P label Jun 29, 2023
Contributor

@ajtowns ajtowns left a comment


The core issue is that whenever our optimistic send fails to fully send a message, we subsequently don't even select() for receiving; if it then turns out that sending is not possible either, no progress is made at all. To address this, the solution used in this PR is to still select() for both sending and receiving when an optimistic send fails, but to skip receiving if sending succeeded and (still) didn't fully drain the send queue.

AIUI (correct me if I'm wrong!) the backpressure we do is:

  • fPauseSend -- we won't do ProcessMessage (which would probably cause us to generate new data to send) when this is set, which is when we have more than 1MB of data lined up (see the sketch after this list). (However, we will still do SendMessages, which generates relatively small messages like INV, PING, and GETDATA.)
  • fPauseRecv -- we won't select the socket for reading any more once we've got more than 5MB in parsed messages queued up to process
  • prefer sending over receiving -- if we've got data to send, we'll prioritise sending it, even if we're making no forward progress and could receive messages, to the point where we don't even check (via select/poll) to see if there's any data to receive when we've got data to send
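
A toy model of the first two mechanisms, using the thresholds from the bullets above (the struct and constants are illustrative; in the real code the flags live on CNode and the limits are configurable):

```cpp
#include <atomic>
#include <cstddef>

// Illustrative thresholds matching the description above; the real
// limits are configurable rather than hard-coded.
constexpr std::size_t SEND_PAUSE_BYTES = 1'000'000; // ~1MB queued to send
constexpr std::size_t RECV_PAUSE_BYTES = 5'000'000; // ~5MB parsed, unprocessed

struct NodeBuffers {
    std::atomic<bool> fPauseSend{false};
    std::atomic<bool> fPauseRecv{false};
    std::size_t send_queue_bytes{0};    // bytes queued to send (vSendMsg)
    std::size_t process_queue_bytes{0}; // parsed but unprocessed messages
};

// Recompute the pushback flags from the queue sizes (in the real code the
// flags are updated at the points where the queues grow or shrink).
void UpdatePushback(NodeBuffers& node)
{
    node.fPauseSend = node.send_queue_bytes > SEND_PAUSE_BYTES;
    node.fPauseRecv = node.process_queue_bytes > RECV_PAUSE_BYTES;
}
```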

This patch changes the latter logic to:

  • prefer sending over receiving -- always see if we can send/receive data, but don't actually try to receive data until either (a) we don't have data to send in the first place, (b) we can't send any data, or (c) we've successfully sent all the data we have.

This seems pretty sensible to me: this is a peer to peer protocol, so it seems to me we should be making progress in parallel on sending and receiving whenever possible -- your send is my receive after all.

Approach ACK.

```diff
-    LOCK(pnode->cs_vSend);
-    select_send = !pnode->vSendMsg.empty();
-}
+bool select_send = WITH_LOCK(pnode->cs_vSend, return !pnode->vSendMsg.empty());
```
Contributor


Would it make sense to introduce a method `bool CNode::WantsToSend() const { return !vSendMsg.empty(); }`, and use that both here and above instead of returning a `pair<X, bool>`?

Member Author

@sipa sipa Jul 20, 2023


That'd mean grabbing the lock twice, no? I added it to SocketSendData because the cs_vSend lock is already grabbed to call that.

Contributor


Not if you make it `WantsToSend() EXCLUSIVE_LOCKS_REQUIRED(cs_vSend)` and require the caller to already hold the lock?
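
For illustration, the helper could look roughly like this (a sketch assuming Bitcoin Core's Mutex type and Clang thread-safety annotation macros are in scope; types simplified):

```cpp
#include <deque>
#include <vector>

// Sketch only: assumes Bitcoin Core's Mutex type and the annotation
// macros GUARDED_BY / EXCLUSIVE_LOCKS_REQUIRED are available.
class CNode
{
public:
    Mutex cs_vSend;
    std::deque<std::vector<unsigned char>> vSendMsg GUARDED_BY(cs_vSend);

    // The annotation makes Clang's thread-safety analysis verify that
    // every caller already holds cs_vSend, so no second lock is taken.
    bool WantsToSend() const EXCLUSIVE_LOCKS_REQUIRED(cs_vSend)
    {
        return !vSendMsg.empty();
    }
};

// At a call site that already holds the lock:
//   LOCK(pnode->cs_vSend);
//   if (pnode->WantsToSend()) { /* keep selecting for send */ }
```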

Contributor


For your consideration: ajtowns@87c509c. I expanded the WITH_LOCK out in GenerateWaitSockets because I thought that was clearer than trying to make it an expression. It keeps the signature of SocketSendData the same, doesn't add any additional locking, and avoids the dummy data_left in PushMessage.

@naumenkogs
Member

Approach ACK

Co-authored-by: Anthony Towns <aj@erisian.com.au>
@sipa sipa force-pushed the 202306_pushback branch from 5e92378 to 3388e52 on July 20, 2023 14:36
@sipa
Member Author

sipa commented Jul 20, 2023

@psgreco See above; it turned out that what I intended to do here wasn't actually what was implemented (it was instead unconditionally preferring send over receive). Would you mind testing again to see whether this fixes the issue for you?

@mzumsande
Contributor

mzumsande commented Jul 21, 2023

I wrote a functional test, see https://github.com/mzumsande/bitcoin/tree/test_sipa_netstalling (because of the 1st commit it's obviously not intended for merge, but it makes it possible to reproduce the problem).
It works by mining a few large blocks, having two nodes exchange these blocks in both directions via repeated getblockfrompeer calls, and then checking whether the deadlock happened.

Unfortunately, the current branch doesn't appear to fix the problem completely; the test fails for me both here and on master:
When the situation is reached where we now select for both sending and receiving (because our peer doesn't receive any data), we try to resolve the deadlock by now also receiving.
This works for a little while. However, if our send buffer is full, fPauseSend will be set, and because of that we bail out early in ProcessMessages() and never call PollMessage(). Therefore the received data piles up without being cleared by net_processing. When pnode->m_msg_process_queue_size becomes too large (5MB), fPauseRecv is also set, and after that we are again in a deadlock situation where both peers are sending and neither is receiving. I could observe this with the added logging in the 3rd commit of my branch.
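
A toy model of that feedback loop (field and function names mirror the ones above; the logic is heavily simplified):

```cpp
#include <cstddef>

// Toy model of the second deadlock. Field names mirror the real ones;
// everything else is simplified for illustration.
struct Peer {
    bool fPauseSend{false};  // set once >1MB is queued to send
    bool fPauseRecv{false};  // set once >5MB is parsed but unprocessed
    std::size_t m_msg_process_queue_size{0};
};

// While fPauseSend is set, message processing bails out early, so
// nothing ever drains the process queue...
bool ProcessMessages(Peer& peer)
{
    if (peer.fPauseSend) return false; // the early return described above
    // ... PollMessage() and actual processing would happen here ...
    return true;
}

// ...and each newly received message grows the queue until the receive
// side pauses too, at which point both peers are trying to send and
// neither is receiving.
void OnMessageParsed(Peer& peer, std::size_t msg_bytes)
{
    peer.m_msg_process_queue_size += msg_bytes;
    if (peer.m_msg_process_queue_size > 5'000'000) peer.fPauseRecv = true;
}
```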

Not sure how to best fix this...

@ajtowns
Contributor

ajtowns commented Jul 21, 2023

I think fPauseSend getting set on both sides and causing a deadlock should probably be out of scope for this PR -- at least as I understand it, this fixes an issue where we get a deadlock even without fPauseSend triggering.

I think the scenario here is:

  • peer A sends a 2MB message to peer B. This fills up B's socket receive buffer (200kB?) and A's socket send buffer (200kB?) without completing. A still has 1.6MB to send to B, so stops reading from the socket.
  • peer B does the same thing at almost exactly the same time, with the same result.
  • A/B are deadlocked.

Maybe adding a debug-only sendp2pmsg rpc would be the easiest way to simulate this and be useful for debugging p2p things in general?

If we do want to address fPauseSend deadlocking, a few approaches come to mind:

  1. Easy: make fPauseSend a timestamp rather than a boolean, and if it's been set for >5 minutes, disconnect (see the sketch after this list). This doesn't prevent the deadlock, but at least frees up the connection slot and makes it possible to try again.
  2. Hard: rework net_processing so that we keep making as much progress as we can -- e.g., change fPauseSend to continue processing incoming block or tx messages, but to skip GETDATA messages and to defer sending out INV messages and the like, so that we're draining as much traffic as we can while limiting how much gets added.
  3. Impossible? Add more dynamic traffic throttling: if you're bandwidth-limited and getting too much TX traffic, perhaps you should be raising your feefilter level even if your mempool isn't full? I don't see how to generalise that if it's blocks or header messages or something else that causes the problem, though.
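
A sketch of option 1 (the names PeerSendPause, m_pause_send_since, and ShouldDisconnect are invented for illustration; the five-minute threshold is the one suggested above):

```cpp
#include <chrono>
#include <optional>

using namespace std::chrono_literals;

// Sketch of option 1: remember *when* send-side pushback started
// instead of just *whether* it is active.
struct PeerSendPause {
    std::optional<std::chrono::steady_clock::time_point> m_pause_send_since;

    // Called wherever fPauseSend would be toggled today.
    void SetPaused(bool paused)
    {
        if (paused && !m_pause_send_since) {
            m_pause_send_since = std::chrono::steady_clock::now();
        } else if (!paused) {
            m_pause_send_since.reset();
        }
    }

    // Checked periodically (e.g. alongside the existing inactivity
    // checks): a peer stuck in send pushback for over five minutes is
    // disconnected. This doesn't prevent the deadlock, but it frees
    // the connection slot so a fresh attempt can be made.
    bool ShouldDisconnect() const
    {
        return m_pause_send_since &&
               std::chrono::steady_clock::now() - *m_pause_send_since > 5min;
    }
};
```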

@psgreco
Contributor

psgreco commented Jul 21, 2023

@psgreco See above; it turned out that what I intended to do here wasn't actually what was implemented (it was instead unconditionally preferring send over receive). Would you mind trying again if this fixes the issue for you?

The new changes still seem to fix the issue for me, but the refactor I had to do to run it on Elements 22 (the equivalent of Bitcoin 22) doesn't let me make a hard confirmation.

@sipa
Member Author

sipa commented Jul 24, 2023

@ajtowns @mzumsande Thanks, so it appears there are actually two mostly-unrelated network buffer deadlock issues, and Martin's test is likely triggering both of them.

I agree with AJ that we should still fix the network-side one, even if we can't (or don't want to) address the application buffer side one. Fixing the application buffer side one indeed seems a lot harder, and probably needs discussion beyond this PR.

It would be good to have a test for the network-side one, without it also triggering the application-side one, to verify this PR actually fixes something; especially as I don't understand @psgreco's earlier observation (where an older version of this PR unconditionally preferred sending, which shouldn't have improved the situation at all) as a means to validate it.

@mzumsande
Contributor

Thanks, so it appears there are actually two mostly-unrelated network buffer deadlock issues, and Martin's test is likely triggering both of them.

So far, I haven't been able to trigger the original deadlock issue in my test when I run it on master - only the other one described above.

Contributor

@mzumsande mzumsande left a comment


Tested ACK 3388e52

I now managed to reproduce the deadlock described in the OP by

1) adding a debug-only sendmsgtopeer rpc as suggested by @ajtowns above (mzumsande@9c90e5d)

2) creating a functional test that uses this rpc to have two nodes simultaneously send a large message (4MB) to each other (mzumsande@70e6752)

The added test creates the situation described above; it fails on master and succeeds on this branch.

If you want, feel free to include the commits from 202208_test_sendmsg - alternatively, if you'd prefer not to deal with test / rpc feedback here, I'd also be happy to open a separate PR that builds on this branch.

@ajtowns
Contributor

ajtowns commented Aug 17, 2023

ACK 3388e52

Test case looks good; seems reasonable to do the sendmsgtopeer change in a separate PR though.

@naumenkogs
Member

ACK 3388e52

@Sjors
Member

Sjors commented Aug 18, 2023

Post-merge concept ACK. From my (very) limited understanding of sockets, this makes sense. Thanks @ajtowns for the description #27981 (comment).

achow101 added a commit that referenced this pull request Aug 24, 2023
…evel deadlock situation

b3a93b4 test: add functional test for deadlock situation (Martin Zumsande)
3557aa4 test: add basic tests for sendmsgtopeer to rpc_net.py (Martin Zumsande)
a9a1d69 rpc: add test-only sendmsgtopeer rpc (Martin Zumsande)

Pull request description:

  This adds a `sendmsgtopeer` rpc (for testing only) that allows a node to send a message (provided in hex) to a peer.
  While we would usually use a `p2p` object instead of a node for this in the test framework, that isn't possible in situations where this message needs to trigger an actual interaction of multiple nodes.

  Use this rpc to add test coverage for the bug fixed in #27981 (that just got merged):
  The test lets two nodes (almost) simultaneously send a single large (4MB) p2p message to each other, which would have caused a deadlock previously (making this test fail), but succeeds now.

  As can be seen from the discussion in #27981, it was not easy to reproduce this bug without `sendmsgtopeer`. I would imagine that `sendmsgtopeer` could also be helpful in various other test constellations.

ACKs for top commit:
  ajtowns:
    ACK b3a93b4
  sipa:
    ACK b3a93b4
  achow101:
    ACK b3a93b4

Tree-SHA512: 6e22e72402f3c4dd70cddb9e96ea988444720f7a164031df159fbdd48056c8ac77ac53def045d9208a3ca07437c7c8e34f8b4ebc7066c0a84d81cd53f2f4fa5f
luke-jr pushed a commit to luke-jr/bitcoin that referenced this pull request Aug 29, 2023
Co-authored-by: Anthony Towns <aj@erisian.com.au>

Github-Pull: bitcoin#27981
Rebased-From: 3388e52
Frank-GER pushed a commit to syscoin/syscoin that referenced this pull request Sep 8, 2023
…r net-level deadlock situation

kwvg added a commit to kwvg/dash that referenced this pull request Aug 4, 2024
marking as partial as it should be revisited when bitcoin#24356 is
backported
kwvg added a commit to kwvg/dash that referenced this pull request Aug 6, 2024
Also make `v`{`Receivable`, `Sendable`, `Error`}`Nodes` `std::set`s so
that bitcoin#27981 can remove a node from `vReceivableNodes` if there is
data left to send (we already do this by checking against `vSendMsg`
through `nSendMsgSize` but this doesn't account for leftover data
reported by `SocketSendData`, which the backport does).
@bitcoin bitcoin locked and limited conversation to collaborators Aug 17, 2024