Skip to content

Conversation

whitslack
Copy link
Contributor

N.B.: This is my second attempt at introducing this optimization. #12519 (2018) was closed in deference to switching to doing gathering socket writes using sendmsg(2), which I agree would have superior performance due to fewer syscalls, but that work was apparently abandoned in late 2018. Ever since, Bitcoin Core has continued writing tons of runt packets to the wire. Can we proceed with my halfway solution for now?


Since Nagle's algorithm is disabled, each and every call to send(2) can potentially generate a separate TCP segment on the wire. This is especially inefficient when sending the tiny header preceding each message payload.

Linux implements a MSG_MORE flag that tells the kernel not to push the passed data immediately to the connected peer but rather to collect it in the socket's internal transmit buffer where it can be combined with data from successive calls to send(2). Where available, specify this flag when calling send(2) in CConnman::SocketSendData(CNode &) if the data buffer being sent is not the last one in node.vSendMsg.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jan 7, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK vasild, sipa

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

@DrahtBot DrahtBot added the P2P label Jan 7, 2023
Since Nagle's algorithm is disabled, each and every call to send(2) can potentially generate a separate TCP segment on the wire. This is especially inefficient when sending the tiny header preceding each message payload.

Linux implements a MSG_MORE flag that tells the kernel not to push the passed data immediately to the connected peer but rather to collect it in the socket's internal transmit buffer where it can be combined with data from successive calls to send(2). Where available, specify this flag when calling send(2) in CConnman::SocketSendData(CNode &) if the data buffer being sent is not the last one in node.vSendMsg.
@dergoegge
Copy link
Member

At a first glance I would ask why we can't do what was suggested here: #12519 (comment), if that is preferable? At least back then, it sounded like we could just pick 77cee16 without the more involved changes in #11227. (Not advocating against your change, just wondering why you have chosen to go with the less preferable version when both seem technically possible)

Also pinging @vasild who would likely be a good reviewer for this.

@whitslack
Copy link
Contributor Author

just wondering why you have chosen to go with the less preferable version when both seem technically possible

Because my change is tiny, and that other change is involved. I want the change to actually be merged, not bike-shedded indefinitely. Smaller changes are easier to validate.

Copy link
Contributor

@vasild vasild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to demonstrate that this actually brings some benefit and is not just code clutter.
Maybe compare tcpdump output with and without this patch?

Comment on lines +807 to +811
if (it + 1 != node.vSendMsg.end()) {
flags |= MSG_MORE;
}
#endif
nBytes = node.m_sock->Send(reinterpret_cast<const char*>(data.data()) + node.nSendOffset, data.size() - node.nSendOffset, flags);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this can't send the full message one of the two breaks below will be executed which may result in this function returning with all Send() calls using MSG_MORE. For example: let there be 3 messages, 10 bytes each. Send() of the first one will use MSG_MORE, let it return 10 (full message sent). Send() of the second one will use MSG_MORE, let it return any number in [0..9]. Then this function will return, without ever calling Send() without MSG_MORE and the first message may remain buffered/unsent somewhere in the OS.

I think this is unwanted effect, defeating TCP_NODELAY and the check if (it + 1 != node.vSendMsg.end()).

I wonder if it is possible to tell the OS that we are done, without sending further data? Maybe Send(nullptr, 0, MSG_NOSIGNAL | MSG_DONTWAIT); // without MSG_MORE would do that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not all of our pending outgoing messages can fit in the socket's transmit buffer, then we want all of our calls to send to specify MSG_MORE. The kernel will immediately (without delay) transmit as many full segments as it can assemble and that will fit in the congestion window. It will not immediately transmit a partial segment but instead will call back to us (via returning a writable status from select, poll, epoll, etc.) to fetch more data to transmit. If we have no more data to transmit, then yes, calling send with a zero length and without MSG_MORE will tell the kernel that we have nothing left to send, and it will transmit a partial segment as soon as such segment will fit in the congestion window.

In your example, if the second of the three 10-byte messages can only partially be copied to the socket transmit buffer, then it must be the case that the transmit buffer is now full. The transmit buffer is always larger than MSS, the maximum segment size, meaning the kernel can now assemble a full segment, which it will transmit as soon as the window allows. It will then indicate the socket is writable again, and we will be called again to write more of the second message.

So, in short, your concern that we will unnecessarily delay transmitting data is unwarranted. The kernel never delays transmitting a full segment if it fits in the congestion window. The kernel does delay transmitting a partial segment that fits in the congestion window if the last call to send on that socket specified MSG_MORE, but that's exactly what we want, as we want the opportunity to write more data into the transmit buffer so the kernel will potentially transmit fewer and larger segments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If send(2) writes less bytes than requested, does it follow that the transmit buffer is full? I am not sure about that. Is it documented somewhere? What about a signal interrupting the send midway (or even before it started)? Is that not a case where send(2) writes less bytes than requested, but the transmit buffer is not full?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a signal interrupting the send midway (or even before it started)?

I'm pretty sure that's only possible if the socket is in blocking mode. But regardless, if send returns a short write due to invoking a signal handler, then the socket is still writable, and the user code will call send again without delay (i.e., on the next iteration of the I/O loop).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be after accepting new connections, sending, receiving and parsing messages from all other (possibly 100s) connections. I guess this would be rare and even when it happens would probably have no adverse effects on the communication, but I would feel more comfortable if this PR does not introduce that. If a dummy Send() without MSG_MORE before the two breaks is a way to avoid that, then I think it should be added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is documented

Even better, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I understand now.

FWIW, I switched my (public, well-connected) node to (a rebased version of) this PR, together with @vasild's proposed patch above + logging whenever these empty non-MSG_MORE message would be sent.

On this node with ~250 connections, in 16 hours time, having uploaded 7.2 GB and downloaded 1.2 GB, 221 times a MSG_MORE was interrupted (always after sending some bytes, usually at least a few kB; never with having sent nothing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

221 times a MSG_MORE was interrupted

I think what you may have meant was that a send call indicated that fewer bytes were copied than were requested, not that the call was interrupted (by a signal handler).

The real test would be querying the SIOCOUTQ ioctl on the socket the next time the network I/O loop gets around to servicing it. That's the number of untransmitted bytes sitting in the kernel's transmit buffer. If that number is smaller than the connection's MSS, then the kernel exhausted its transmit buffer and started to delay sending the remaining bytes due to MSG_MORE.

Copy link
Contributor Author

@whitslack whitslack Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never with having sent nothing

N.B.: It would be impossible for zero bytes to have been copied to a socket in non-blocking mode if that socket had indicated that it was writable and no other thread had written to it since such indication was made. Again, I'm pretty sure that even signal handlers won't interrupt a non-blocking write since the kernel only enters an interruptible wait state during a blocking write.

(Edit: Actually, there may be an edge case where a socket indicates writable despite having no free space in its transmit buffer: specifically the case where the next write call would immediately return an error without copying any bytes. That would be a fatal error on the socket, though, such as "connection reset.")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you may have meant was that a send call indicated that fewer bytes were copied than were requested, not that the call was interrupted (by a signal handler).

Yes, exactly - I tried to be succinct, but that may have been confusing. Specifically, it refers to fewer bytes being accepted by send() than handed to it, while MSG_MORE was set.

N.B.: It would be impossible for zero bytes to have been copied to a socket in non-blocking mode if that socket had indicated that it was writable and no other thread had written to it since such indication was made.

Indeed, so that sounds very much expected that 0 never happened (except when the socket errorred and the connected closed, but in that case, there is obviously no point in adding a non-MSG_MORE padding either).

@whitslack
Copy link
Contributor Author

Would be nice to demonstrate that this actually brings some benefit and is not just code clutter. Maybe compare tcpdump output with and without this patch?

tcpdump is what originally alerted me to the presence of this anti-pattern in Bitcoin Core years ago. It was the reason I looked into the code and opened my original PR to address the problem.

I'd like someone else with experience in low-level network I/O programming to comment before I'd go to the effort to produce "proof" of the merits of using MSG_MORE. That flag does exist for good reason.

@sipa
Copy link
Member

sipa commented Jan 30, 2023

Concept ACK.

Unless someone is actively pursuing a PR to switch to gathering writes, I don't see any reason not to start using MSG_MORE if it's known to be beneficial.

I also find the justification that this will never do something undesirable fairly convincing.

@vasild
Copy link
Contributor

vasild commented Jan 31, 2023

Concept ACK

Below is some basic demonstration of MSG_MORE (tldr - it works!). The logs show how send(2) is called and there is the relevant tcpdump(1) output. This is during the initial handshake up to and including verack.

master

calling Send(24) without MSG_MORE
calling Send(103) without MSG_MORE
calling Send(24) without MSG_MORE
calling Send(24) without MSG_MORE
calling Send(24) without MSG_MORE

tcpdump:

IP master.17807 > random_node.8333: Flags [S], seq 950983330, win 64240, options [mss 1420,sackOK,TS val 4155972266 ecr 0,nop,wscale 7], length 0
IP master.17807 > random_node.8333: Flags [.], ack 3334216285, win 502, options [nop,nop,TS val 4155972306 ecr 1548157682], length 0
* IP master.17807 > random_node.8333: Flags [P.], seq 0:24, ack 1, win 502, options [nop,nop,TS val 4155972306 ecr 1548157682], length 24
* IP master.17807 > random_node.8333: Flags [P.], seq 24:127, ack 1, win 502, options [nop,nop,TS val 4155972306 ecr 1548157682], length 103
IP master.17807 > random_node.8333: Flags [.], ack 25, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 0
IP master.17807 > random_node.8333: Flags [.], ack 128, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 0
IP master.17807 > random_node.8333: Flags [.], ack 152, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 0
IP master.17807 > random_node.8333: Flags [.], ack 176, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 0
IP master.17807 > random_node.8333: Flags [P.], seq 127:151, ack 176, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 24
IP master.17807 > random_node.8333: Flags [P.], seq 151:175, ack 176, win 502, options [nop,nop,TS val 4155972356 ecr 1548157732], length 24
IP master.17807 > random_node.8333: Flags [P.], seq 175:199, ack 176, win 502, options [nop,nop,TS val 4155972357 ecr 1548157732], length 24

PR

>>>> calling Send(24) with MSG_MORE
>>>> calling Send(103) without MSG_MORE
>>>> calling Send(24) without MSG_MORE
>>>> calling Send(24) without MSG_MORE
>>>> calling Send(24) without MSG_MORE

tcpdump:

IP pr.17827 > random_node.8333: Flags [S], seq 2038199236, win 64240, options [mss 1420,sackOK,TS val 4155323939 ecr 0,nop,wscale 7], length 0
IP pr.17827 > random_node.8333: Flags [.], ack 3514299724, win 502, options [nop,nop,TS val 4155323980 ecr 4260101434], length 0
* IP pr.17827 > random_node.8333: Flags [P.], seq 0:127, ack 1, win 502, options [nop,nop,TS val 4155324474 ecr 4260101434], length 127
IP pr.17827 > random_node.8333: Flags [.], ack 128, win 502, options [nop,nop,TS val 4155324509 ecr 4260101964], length 0
IP pr.17827 > random_node.8333: Flags [.], ack 152, win 502, options [nop,nop,TS val 4155324509 ecr 4260101964], length 0
IP pr.17827 > random_node.8333: Flags [P.], seq 127:151, ack 152, win 502, options [nop,nop,TS val 4155324510 ecr 4260101964], length 24
IP pr.17827 > random_node.8333: Flags [P.], seq 151:175, ack 176, win 502, options [nop,nop,TS val 4155324510 ecr 4260101964], length 24
IP pr.17827 > random_node.8333: Flags [.], ack 200, win 502, options [nop,nop,TS val 4155324554 ecr 4260101974], length 0
IP pr.17827 > random_node.8333: Flags [P.], seq 175:199, ack 200, win 502, options [nop,nop,TS val 4155324558 ecr 4260102014], length 24

(notice the length of the packets marked with *, at the end of each line).

Copy link
Contributor

@vasild vasild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 691eaf8

I think this is good to go as it is and the above discussion is not a blocker. It would be nice to get input from other reviewers on it.

Thanks!

@sipa
Copy link
Member

sipa commented Feb 10, 2023

ACK 691eaf8

Post-BIP324 (#24545 etc), this is also a privacy issue (as hiding packet boundaries against passive observers is a design goal).

@fanquake fanquake added this to the 25.0 milestone Feb 13, 2023
@fanquake fanquake merged commit 5ecd14a into bitcoin:master Feb 15, 2023
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Feb 16, 2023
… network messages (round 2)

691eaf8 Pass MSG_MORE flag when sending non-final network messages (Matt Whitlock)

Pull request description:

  **N.B.:** This is my second attempt at introducing this optimization. bitcoin#12519 (2018) was closed in deference to switching to doing gathering socket writes using `sendmsg(2)`, which I agree would have superior performance due to fewer syscalls, but that work was apparently abandoned in late 2018. Ever since, Bitcoin Core has continued writing tons of runt packets to the wire. Can we proceed with my halfway solution for now?

  ----

  Since Nagle's algorithm is disabled, each and every call to `send(2)` can potentially generate a separate TCP segment on the wire. This is especially inefficient when sending the tiny header preceding each message payload.

  Linux implements a `MSG_MORE` flag that tells the kernel not to push the passed data immediately to the connected peer but rather to collect it in the socket's internal transmit buffer where it can be combined with data from successive calls to `send(2)`. Where available, specify this flag when calling `send(2)` in `CConnman::SocketSendData(CNode &)` if the data buffer being sent is not the last one in `node.vSendMsg`.

ACKs for top commit:
  sipa:
    ACK 691eaf8
  vasild:
    ACK 691eaf8

Tree-SHA512: 9a7f46bc12edbf78d488f05d1c46760110a24c95af74b627d2604fcd198fa3f511c5956bac36d0034e88c632d432f7d394147e667a11b027af0a30f70a546d70
@bitcoin bitcoin locked and limited conversation to collaborators Feb 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants