-
Notifications
You must be signed in to change notification settings - Fork 37.7k
Net: Pass MSG_MORE
flag when sending non-final network messages (round 2)
#26844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. |
Since Nagle's algorithm is disabled, each and every call to send(2) can potentially generate a separate TCP segment on the wire. This is especially inefficient when sending the tiny header preceding each message payload. Linux implements a MSG_MORE flag that tells the kernel not to push the passed data immediately to the connected peer but rather to collect it in the socket's internal transmit buffer where it can be combined with data from successive calls to send(2). Where available, specify this flag when calling send(2) in CConnman::SocketSendData(CNode &) if the data buffer being sent is not the last one in node.vSendMsg.
At a first glance I would ask why we can't do what was suggested here: #12519 (comment), if that is preferable? At least back then, it sounded like we could just pick 77cee16 without the more involved changes in #11227. (Not advocating against your change, just wondering why you have chosen to go with the less preferable version when both seem technically possible) Also pinging @vasild who would likely be a good reviewer for this. |
Because my change is tiny, and that other change is involved. I want the change to actually be merged, not bike-shedded indefinitely. Smaller changes are easier to validate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to demonstrate that this actually brings some benefit and is not just code clutter.
Maybe compare tcpdump
output with and without this patch?
if (it + 1 != node.vSendMsg.end()) { | ||
flags |= MSG_MORE; | ||
} | ||
#endif | ||
nBytes = node.m_sock->Send(reinterpret_cast<const char*>(data.data()) + node.nSendOffset, data.size() - node.nSendOffset, flags); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this can't send the full message one of the two break
s below will be executed which may result in this function returning with all Send()
calls using MSG_MORE
. For example: let there be 3 messages, 10 bytes each. Send()
of the first one will use MSG_MORE
, let it return 10 (full message sent). Send()
of the second one will use MSG_MORE
, let it return any number in [0..9]. Then this function will return, without ever calling Send()
without MSG_MORE
and the first message may remain buffered/unsent somewhere in the OS.
I think this is unwanted effect, defeating TCP_NODELAY
and the check if (it + 1 != node.vSendMsg.end())
.
I wonder if it is possible to tell the OS that we are done, without sending further data? Maybe Send(nullptr, 0, MSG_NOSIGNAL | MSG_DONTWAIT); // without MSG_MORE
would do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not all of our pending outgoing messages can fit in the socket's transmit buffer, then we want all of our calls to send
to specify MSG_MORE
. The kernel will immediately (without delay) transmit as many full segments as it can assemble and that will fit in the congestion window. It will not immediately transmit a partial segment but instead will call back to us (via returning a writable status from select
, poll
, epoll
, etc.) to fetch more data to transmit. If we have no more data to transmit, then yes, calling send
with a zero length and without MSG_MORE
will tell the kernel that we have nothing left to send, and it will transmit a partial segment as soon as such segment will fit in the congestion window.
In your example, if the second of the three 10-byte messages can only partially be copied to the socket transmit buffer, then it must be the case that the transmit buffer is now full. The transmit buffer is always larger than MSS, the maximum segment size, meaning the kernel can now assemble a full segment, which it will transmit as soon as the window allows. It will then indicate the socket is writable again, and we will be called again to write more of the second message.
So, in short, your concern that we will unnecessarily delay transmitting data is unwarranted. The kernel never delays transmitting a full segment if it fits in the congestion window. The kernel does delay transmitting a partial segment that fits in the congestion window if the last call to send
on that socket specified MSG_MORE
, but that's exactly what we want, as we want the opportunity to write more data into the transmit buffer so the kernel will potentially transmit fewer and larger segments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If send(2)
writes less bytes than requested, does it follow that the transmit buffer is full? I am not sure about that. Is it documented somewhere? What about a signal interrupting the send midway (or even before it started)? Is that not a case where send(2)
writes less bytes than requested, but the transmit buffer is not full?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a signal interrupting the send midway (or even before it started)?
I'm pretty sure that's only possible if the socket is in blocking mode. But regardless, if send
returns a short write due to invoking a signal handler, then the socket is still writable, and the user code will call send
again without delay (i.e., on the next iteration of the I/O loop).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be after accepting new connections, sending, receiving and parsing messages from all other (possibly 100s) connections. I guess this would be rare and even when it happens would probably have no adverse effects on the communication, but I would feel more comfortable if this PR does not introduce that. If a dummy Send()
without MSG_MORE
before the two break
s is a way to avoid that, then I think it should be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is documented
Even better, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I understand now.
FWIW, I switched my (public, well-connected) node to (a rebased version of) this PR, together with @vasild's proposed patch above + logging whenever these empty non-MSG_MORE
message would be sent.
On this node with ~250 connections, in 16 hours time, having uploaded 7.2 GB and downloaded 1.2 GB, 221 times a MSG_MORE
was interrupted (always after sending some bytes, usually at least a few kB; never with having sent nothing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
221 times a
MSG_MORE
was interrupted
I think what you may have meant was that a send
call indicated that fewer bytes were copied than were requested, not that the call was interrupted (by a signal handler).
The real test would be querying the SIOCOUTQ
ioctl on the socket the next time the network I/O loop gets around to servicing it. That's the number of untransmitted bytes sitting in the kernel's transmit buffer. If that number is smaller than the connection's MSS, then the kernel exhausted its transmit buffer and started to delay sending the remaining bytes due to MSG_MORE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never with having sent nothing
N.B.: It would be impossible for zero bytes to have been copied to a socket in non-blocking mode if that socket had indicated that it was writable and no other thread had written to it since such indication was made. Again, I'm pretty sure that even signal handlers won't interrupt a non-blocking write since the kernel only enters an interruptible wait state during a blocking write.
(Edit: Actually, there may be an edge case where a socket indicates writable despite having no free space in its transmit buffer: specifically the case where the next write call would immediately return an error without copying any bytes. That would be a fatal error on the socket, though, such as "connection reset.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you may have meant was that a send call indicated that fewer bytes were copied than were requested, not that the call was interrupted (by a signal handler).
Yes, exactly - I tried to be succinct, but that may have been confusing. Specifically, it refers to fewer bytes being accepted by send()
than handed to it, while MSG_MORE
was set.
N.B.: It would be impossible for zero bytes to have been copied to a socket in non-blocking mode if that socket had indicated that it was writable and no other thread had written to it since such indication was made.
Indeed, so that sounds very much expected that 0 never happened (except when the socket errorred and the connected closed, but in that case, there is obviously no point in adding a non-MSG_MORE
padding either).
I'd like someone else with experience in low-level network I/O programming to comment before I'd go to the effort to produce "proof" of the merits of using |
Concept ACK. Unless someone is actively pursuing a PR to switch to gathering writes, I don't see any reason not to start using I also find the justification that this will never do something undesirable fairly convincing. |
Concept ACK Below is some basic demonstration of master
tcpdump:
PR
tcpdump:
(notice the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 691eaf8
I think this is good to go as it is and the above discussion is not a blocker. It would be nice to get input from other reviewers on it.
Thanks!
… network messages (round 2) 691eaf8 Pass MSG_MORE flag when sending non-final network messages (Matt Whitlock) Pull request description: **N.B.:** This is my second attempt at introducing this optimization. bitcoin#12519 (2018) was closed in deference to switching to doing gathering socket writes using `sendmsg(2)`, which I agree would have superior performance due to fewer syscalls, but that work was apparently abandoned in late 2018. Ever since, Bitcoin Core has continued writing tons of runt packets to the wire. Can we proceed with my halfway solution for now? ---- Since Nagle's algorithm is disabled, each and every call to `send(2)` can potentially generate a separate TCP segment on the wire. This is especially inefficient when sending the tiny header preceding each message payload. Linux implements a `MSG_MORE` flag that tells the kernel not to push the passed data immediately to the connected peer but rather to collect it in the socket's internal transmit buffer where it can be combined with data from successive calls to `send(2)`. Where available, specify this flag when calling `send(2)` in `CConnman::SocketSendData(CNode &)` if the data buffer being sent is not the last one in `node.vSendMsg`. ACKs for top commit: sipa: ACK 691eaf8 vasild: ACK 691eaf8 Tree-SHA512: 9a7f46bc12edbf78d488f05d1c46760110a24c95af74b627d2604fcd198fa3f511c5956bac36d0034e88c632d432f7d394147e667a11b027af0a30f70a546d70
N.B.: This is my second attempt at introducing this optimization. #12519 (2018) was closed in deference to switching to doing gathering socket writes using
sendmsg(2)
, which I agree would have superior performance due to fewer syscalls, but that work was apparently abandoned in late 2018. Ever since, Bitcoin Core has continued writing tons of runt packets to the wire. Can we proceed with my halfway solution for now?Since Nagle's algorithm is disabled, each and every call to
send(2)
can potentially generate a separate TCP segment on the wire. This is especially inefficient when sending the tiny header preceding each message payload.Linux implements a
MSG_MORE
flag that tells the kernel not to push the passed data immediately to the connected peer but rather to collect it in the socket's internal transmit buffer where it can be combined with data from successive calls tosend(2)
. Where available, specify this flag when callingsend(2)
inCConnman::SocketSendData(CNode &)
if the data buffer being sent is not the last one innode.vSendMsg
.