Problem
Windows has 3 incompatible families of event notification APIs: IOCP, `select`/`WSAPoll`, and `WaitForMultipleObjects`-and-variants. They each have unique capabilities. This means: if you want to be able to react to all the different possible events that Windows can signal, then you must use all 3 of these. Needless to say, this creates a challenge for event loop design. There are a number of potentially viable ways to arrange these pieces; the question is which one we should use.
(Actually, all 3 together still aren't sufficient, because there are some things that still require threads – like console IO – and I'm ignoring GUI events entirely because Trio isn't a GUI library. But never mind. Just remember that when someone tells you that Windows' I/O subsystem is great, they're not wrong, but it does require taking a certain narrow perspective...)
Considerations
The WaitFor*Event family
The `Event`-related APIs are necessary to, for example, wait for a notification that a child process has exited. (The job object API provides a way to request IOCP notifications about process death, but the docs warn that the notifications are lossy and therefore useless...) Otherwise, though, they're very limited – in particular they have both O(n) behavior and a maximum of 64 objects in an interest set – so you definitely don't want to use these as your primary blocking call. We're going to be calling these in a background thread of some kind. The two natural architectures are to use `WaitForSingleObject(Ex)` and allocate one thread per event, or else use `WaitForMultipleObjects(Ex)` and try to coalesce up to 64 events into each thread (substantially more complicated to implement, but with 64x less memory overhead for thread stacks, if that matters). This is orthogonal to the rest of this issue, so it gets its own thread: #233
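For concreteness, here's a minimal sketch of the one-thread-per-event flavor, calling `WaitForSingleObject` via ctypes (the `callback` hook is a made-up placeholder, not a Trio API):

```python
import ctypes
import threading

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.WaitForSingleObject.argtypes = [ctypes.c_void_p, ctypes.c_uint32]
kernel32.WaitForSingleObject.restype = ctypes.c_uint32

INFINITE = 0xFFFFFFFF
WAIT_OBJECT_0 = 0

def wait_for_handle_in_thread(handle, callback):
    """Dedicate one thread to blocking on a single kernel handle
    (e.g. a process handle), then run `callback` when it's signaled."""
    def waiter():
        # Blocks only this helper thread; the event loop keeps running.
        result = kernel32.WaitForSingleObject(handle, INFINITE)
        if result != WAIT_OBJECT_0:
            raise ctypes.WinError(ctypes.get_last_error())
        callback()

    thread = threading.Thread(target=waiter, daemon=True)
    thread.start()
    return thread
```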
IOCP
IOCP is the crown jewel of the Windows I/O subsystem, and what you generally hear recommended. It follows a natively asynchronous model where you just go ahead and issue a read or write or whatever, and it runs in the background until eventually the kernel tells you it's done. It provides an O(1) notification mechanism. It's pretty slick. But... it's not as obvious a choice as everyone makes it sound. (Did you know the Chrome team has mostly given up on trying to make it work?)
Issues:
- When doing a UDP send, the send is only notified as complete once the packet hits the wire; i.e., using IOCP for UDP totally removes in-kernel buffering/flow-control. So to get decent throughput you must implement your own buffering system allowing multiple UDP sends to be in flight at once (but not too many, because you don't want to introduce arbitrary latency). Or you could just use the non-blocking API and let the kernel worry about this for you. (This hit Chrome hard; they switched to using non-blocking IO for UDP on Windows. ref1, ref2.)
- When doing a TCP receive with a large buffer, apparently the kernel does a Nagle-like thing where it tries to hang onto the data for a while before delivering it to the application, thus introducing pointless latency. (This also bit Chrome hard; they switched to using non-blocking IO for TCP receive on Windows. ref1, ref2)
- Sometimes you really do want to check whether a socket is readable before issuing a read: in particular, apparently outstanding IOCP receive buffers get pinned into kernel memory or some such nonsense, so it's possible to exhaust system resources by trying to listen to a large number of mostly-idle sockets.
- Sometimes you really do want to check whether a socket is writable before issuing a write: in particular, because it allows adaptive protocols to provide lower latency if they can delay deciding what bytes to write until the last moment.
- Python provides a complete non-blocking API out-of-the-box, and we use this API on other platforms, so using non-blocking IO on Windows as well is much MUCH simpler for us to implement than IOCP, which requires us to pretty much build our own wrappers from scratch.
On the other hand, IOCP is the only way to do a number of things like: non-blocking IO to the filesystem, or monitoring the filesystem for changes, or non-blocking IO on named pipes. (Named pipes are popular for talking to subprocesses – though it's also possible to use a socket if you set it up right.)
select/WSAPoll
You can also use `select`/`WSAPoll`. This is the only documented way to check if a socket is readable/writable. However:
- As is well known, these are O(n) APIs, which sucks if you have lots of sockets. It's not clear how much it sucks exactly -- just copying the buffer into kernel-space probably isn't a big deal for realistic interest set sizes -- but clearly it's not as nice as O(1). On my laptop, `select.select` on 3 sets of 512 idle sockets takes <200 microseconds, so I don't think this will, like, immediately kill us. Especially since people mostly don't run big servers on Windows? OTOH an empty epoll on the same laptop returns in ~0.6 microseconds, so there is some difference... (There's a rough timing sketch after this list.)
- `select.select` is limited to 512 sockets, but this is trivially overcome; the Windows `fd_set` structure is just an array of SOCKETs + a length field, which you can allocate in any size you like (Windows: wait_{read,writ}able limited to 512 sockets #3). (This is a nice side-effect of Windows never having had a dense fd space. This also means `WSAPoll` doesn't have much reason to exist. Unlike other platforms where `poll` beats `select` because `poll` uses an array and `select` uses a bitmap, `WSAPoll` is not really any more efficient than `select`. Its only advantage is that it's similar to how poll works on other platforms... but it's gratuitously incompatible. The one other interesting feature is that you can do an alertable wait with it, which gives a way to cancel it from another thread without using an explicit wakeup socket, via `QueueUserAPC`.)
- Non-blocking IO on Windows is apparently a bit inefficient because it adds an extra copy. (I guess they don't have zero-copy enqueueing of data to receive buffers? And on send I guess it makes sense that you can do that legitimately zero-copy with IOCP but not with non-blocking, which is nice.) Again I'm not sure how much this matters given that we don't have zero-copy byte buffers in Python to start with, but it's a thing.
- `select` only works for sockets; you still need IOCP etc. for responding to other kinds of notifications.
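For reference, the sort of micro-benchmark behind the numbers above looks roughly like this (absolute timings will of course vary by machine; this only probes the read set, which is enough to see the O(n) scanning cost):

```python
import select
import socket
import time

# A pile of idle UDP sockets: registered interest, no traffic.
socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(512)]
for s in socks:
    s.bind(("127.0.0.1", 0))

N = 1000
start = time.perf_counter()
for _ in range(N):
    # Zero timeout: every socket is idle, so this measures pure
    # interest-set scanning overhead.
    select.select(socks, [], [], 0)
elapsed = time.perf_counter() - start
print(f"select.select over {len(socks)} idle sockets: "
      f"{elapsed / N * 1e6:.1f} us/call")

for s in socks:
    s.close()
```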
Options
Given all of the above, our current design is a hybrid that uses `select` and non-blocking IO for sockets, with IOCP available when needed. We run `select` in the main thread, and IOCP in a worker thread, with a wakeup socket to notify when IOCP events occur. This is vastly simpler than doing it the other way around, because you can trivially queue work to an IOCP from any thread, while modifying `select`'s interest set from another thread is a mess. As an initial design, this makes a lot of sense, because it allows us to provide full features (including e.g. `wait_writable` for adaptive protocols), avoids the tricky issues that IOCP creates for sockets, and requires a minimum of special code.
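A stripped-down sketch of that arrangement, using CPython's private `_overlapped` module purely for illustration (Trio's real code wraps the Win32 calls itself, and `handle_iocp_completion` is a made-up placeholder):

```python
import select
import socket
import threading
import _overlapped  # CPython's private IOCP wrapper (what asyncio's proactor uses)

# Wakeup pair: the IOCP thread pokes write_sock, the main loop selects on read_sock.
read_sock, write_sock = socket.socketpair()
read_sock.setblocking(False)

iocp = _overlapped.CreateIoCompletionPort(_overlapped.INVALID_HANDLE_VALUE, 0, 0, 0)
completions = []  # appended by the worker, drained by the main thread (GIL-protected)

def iocp_worker():
    while True:
        # Blocks until some overlapped operation posted to `iocp` completes.
        status = _overlapped.GetQueuedCompletionStatus(iocp, 0xFFFFFFFF)
        if status is not None:
            completions.append(status)
            write_sock.send(b"\x00")  # wake up the select loop

threading.Thread(target=iocp_worker, daemon=True).start()

def handle_iocp_completion(status):
    err, transferred, key, address = status
    print("IOCP completion:", err, transferred, key, address)

def event_loop_tick(interest_read, interest_write, timeout):
    r, w, _ = select.select(interest_read + [read_sock], interest_write, [], timeout)
    if read_sock in r:
        read_sock.recv(4096)  # drain wakeup bytes
        r.remove(read_sock)
        while completions:
            handle_iocp_completion(completions.pop(0))
    return r, w
```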
The other attractive option would be if we could solve the issues with IOCP and switch to using it alone – this would be simpler and would get rid of the O(n) `select`. However, as we can see above, there is a whole list of challenges that would need to be overcome first.
Working around IOCP's limitations
UDP sends
I'm not really sure what the best approach here is. One option is just to limit the amount of outstanding UDP data to some fixed budget (maybe tunable through a "virtual" (i.e. implemented by us) sockopt), and drop packets or return errors if we exceed that. This is clearly solvable in principle, it's just a bit annoying to figure out the details.
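As a purely illustrative sketch (the 64 KiB budget, the drop policy, and the `issue_overlapped_send` hook are all made-up placeholders, not a decided design):

```python
class BoundedUDPSender:
    """Cap how many bytes of UDP data may be in flight via IOCP at once.

    Sends beyond the budget are dropped here; they could instead raise an
    error or block until earlier sends complete - that's the open policy
    question.
    """

    def __init__(self, issue_overlapped_send, max_outstanding_bytes=64 * 1024):
        self._issue_overlapped_send = issue_overlapped_send
        self._max_outstanding = max_outstanding_bytes
        self._outstanding = 0

    def try_send(self, data):
        if self._outstanding + len(data) > self._max_outstanding:
            return False  # over budget: drop the packet (or report an error)
        self._outstanding += len(data)
        self._issue_overlapped_send(data, on_complete=self._send_finished)
        return True

    def _send_finished(self, nbytes):
        # Called when the kernel reports the overlapped send as complete.
        self._outstanding -= nbytes
```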
Spurious extra latency in TCP receives
I think that using the `MSG_PUSH_IMMEDIATE` flag should solve this.
Checking readability / writability
It turns out that IOCP actually can check readability! It's not mentioned on MSDN at all, but there's a well-known bit of folklore about the "zero-byte read". If you issue a zero-byte read, it won't complete until there's data ready to read. ref1 (← official MS docs! also note this is ch. 6 of "NPfMW", referenced below), ref2, ref3.
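A hedged sketch of what issuing that zero-byte read might look like, borrowing CPython's private `_overlapped` module for brevity (the completion would then be picked up via `GetQueuedCompletionStatus` as usual; this is untested and exactly the sort of thing that needs verifying):

```python
import _overlapped

def request_readable_notification(iocp, sock):
    """Issue a zero-byte overlapped receive on a SOCK_STREAM socket.

    The idea is that the completion packet is only delivered once the socket
    actually has data waiting, so it doubles as a "readable" notification
    without pinning a real receive buffer in the kernel.
    """
    # Associate the socket with the completion port (do this once per socket).
    _overlapped.CreateIoCompletionPort(sock.fileno(), iocp, 0, 0)
    ov = _overlapped.Overlapped(0)      # no event handle; completion goes to the port
    ov.WSARecv(sock.fileno(), 0, 0)     # nbytes=0, flags=0: the zero-byte read
    return ov
```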
That's for SOCK_STREAM sockets. What about SOCK_DGRAM? libuv does zero-byte reads with `MSG_PEEK` set (to avoid consuming the packet, truncating it to zero bytes in the process). MSDN explicitly says that this doesn't work (`MSG_PEEK` and overlapped IO supposedly don't work together), but I guess I trust libuv more than MSDN? I don't 100% trust either – this would need to be verified.
What about writability? Empirically, if you have a non-blocking socket on Windows with a full send buffer and you do a zero-byte send, it returns `EWOULDBLOCK`. (This is weird; other platforms don't do this.) If this behavior also translates to IOCP sends, then this zero-byte send trick would give us a way to use IOCP to check writability on SOCK_STREAM sockets.
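Expressed with the ordinary non-blocking API, that check looks like the snippet below; whether the same semantics carry over to an overlapped zero-byte `WSASend` is exactly the open question:

```python
import socket

def probably_writable(sock: socket.socket) -> bool:
    """Zero-byte send probe on a non-blocking socket.

    Based on the empirical Windows behavior described above: with a full send
    buffer this raises BlockingIOError (WSAEWOULDBLOCK); otherwise it succeeds
    without putting anything on the wire.
    """
    try:
        sock.send(b"")
        return True
    except BlockingIOError:
        return False
```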
For writability of SOCK_DGRAM I don't think there's any trick, but it's not clear how meaningful SOCK_DGRAM writability is anyway. If we do our own buffering then presumably we can implement it there.
Alternatively, there is a remarkable piece of undocumented sorcery, where you reach down directly to make syscalls, bypassing the Winsock userland, and apparently can get OVERLAPPED notifications when a socket is readable/writable: ref1, ref2, ref3, ref4, ref5. I guess this is how `select` is implemented? The problem with this is that it only works if your sockets are implemented directly in the kernel, which is apparently not always the case (because of like... antivirus tools and other horrible things that can interpose themselves into your socket API). So I'm inclined to discount this as unreliable. [Edit: or maybe not, see below]
Implementing all this junk
I actually got a ways into this. Then I ripped it out when I realized how many nasty issues there were beyond just typing in long and annoying API calls. But it could easily be resurrected; see 7e7a809 and its parent.
TODO
If we do want to switch to using IOCP in general, then the sequence would go something like:
- check whether zero-byte sends give a way to check TCP writability via IOCP – this is probably the biggest determinant of whether going to IOCP-only is even possible (might be worth checking what doing UDP sends with `MSG_PARTIAL` does too while we're at it); there's a rough test sketch after this list
- check whether you really can do zero-byte reads on UDP sockets like libuv claims
- figure out what kind of UDP send buffering strategy makes sense (or if we decide that UDP sends can just drop packets instead of blocking, then I guess the non-blocking APIs remain viable even if we can't do `wait_socket_writable` on UDP sockets)
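A rough sketch of that first experiment (untested; it leans on CPython's private `_overlapped` module, and assumes the peer of `sock` isn't reading, so the send buffer can actually fill up):

```python
import socket
import _overlapped

def zero_byte_send_probe(sock: socket.socket, iocp) -> None:
    """Fill a TCP socket's send buffer, then issue a zero-byte overlapped send.

    If the zero-byte WSASend stays pending until the buffer drains, it can
    serve as an IOCP-based writability check; if it completes immediately,
    the trick doesn't work.
    """
    sock.setblocking(False)
    try:
        while True:
            sock.send(b"x" * 65536)       # stuff the send buffer
    except BlockingIOError:
        pass                              # send buffer is now full

    _overlapped.CreateIoCompletionPort(sock.fileno(), iocp, 0, 0)
    ov = _overlapped.Overlapped(0)
    ov.WSASend(sock.fileno(), b"", 0)     # the zero-byte send probe

    # None here means a timeout, i.e. the probe is still pending while the
    # buffer is full - which is the behavior we're hoping for.
    print(_overlapped.GetQueuedCompletionStatus(iocp, 1000))
```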
At this point we'd have the information to decide whether we can/should go ahead. If so, then the plan would look something like:
- migrate away from `select` for the cases that can't use IOCP readable/writable checking [Not necessary, AFD-based `select` should work for these too]:
  - connect
  - accept
- implement `wait_socket_readable` and `wait_socket_writable` on top of IOCP and get rid of `select` (but at this point we're still doing non-blocking I/O on sockets, just using IOCP as a `select` replacement)
- (optional / someday) switch to using IOCP for everything instead of non-blocking I/O
New plan:
- Use the tricks from the thread below to reimplement `wait_socket_{readable,writable}` using AFD, and confirm it works
- Add LSP testing to our Windows CI
- Consider whether we want to switch to using IOCP in more cases, e.g. send/recv. Not sure it's worth bothering.