Windows event notification #52

@njsmith

Problem

Windows has three incompatible families of event notification APIs: IOCP, select/WSAPoll, and WaitForMultipleObjects-and-variants. They each have unique capabilities. This means: if you want to be able to react to all the different possible events that Windows can signal, then you must use all three of these. Needless to say, this creates a challenge for event loop design. There are a number of potentially viable ways to arrange these pieces; the question is which one we should use.

(Actually, all three together still aren't sufficient, because there are some things that still require threads – like console IO – and I'm ignoring GUI events entirely because Trio isn't a GUI library. But never mind. Just remember that when someone tells you that Windows' I/O subsystem is great, their statement isn't wrong, but it does require taking a certain narrow perspective...)

Considerations

The WaitFor*Object family

The Event-related APIs are necessary to, for example, wait for a notification that a child process has exited. (The job object API provides a way to request IOCP notifications about process death, but the docs warn that the notifications are lossy and therefore useless...) Otherwise, though, they're very limited – in particular they have both O(n) behavior and a 64-object cap on the interest set – so you definitely don't want to use these as your primary blocking call. We're going to be calling these in a background thread of some kind. The two natural architectures are to use WaitForSingleObject(Ex) and allocate one thread per event, or else to use WaitForMultipleObjects(Ex) and coalesce up to 64 events into each thread (substantially more complicated to implement, but with 64x less memory overhead for thread stacks, if that matters). This is orthogonal to the rest of this issue, so it gets its own thread: #233
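
For concreteness, here's a rough sketch of the thread-per-handle variant, using ctypes to call WaitForSingleObject directly (the coalescing variant would instead pass an array of up to 64 handles to WaitForMultipleObjects). The notify_main_loop callback is a placeholder for however the wakeup actually gets delivered back to the main loop:

```python
import ctypes
import threading

kernel32 = ctypes.windll.kernel32

WAIT_OBJECT_0 = 0x00000000
INFINITE = 0xFFFFFFFF

def wait_for_handle_in_thread(handle, notify_main_loop):
    """Spawn a daemon thread that blocks in WaitForSingleObject until
    `handle` (e.g. a child process handle) is signaled, then invokes
    `notify_main_loop()` to wake the real event loop."""
    def waiter():
        # Only this worker thread blocks; the main loop keeps running.
        result = kernel32.WaitForSingleObject(handle, INFINITE)
        if result == WAIT_OBJECT_0:
            notify_main_loop()
        # (WAIT_FAILED / WAIT_ABANDONED would need real error handling.)
    thread = threading.Thread(target=waiter, daemon=True)
    thread.start()
    return thread
```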

IOCP

IOCP is the crown jewel of the Windows I/O subsystem, and what you generally hear recommended. It follows a natively asynchronous model: you just go ahead and issue a read or write or whatever, and it runs in the background until eventually the kernel tells you it's done. It provides an O(1) notification mechanism. It's pretty slick. But... it's not as obvious a choice as everyone makes it sound. (Did you know the Chrome team has mostly given up on trying to make it work?)

Issues:

  • When doing a UDP send, the send is only notified as complete once the packet hits the wire; i.e., using IOCP for UDP totally removes in-kernel buffering/flow-control. So to get decent throughput you must implement your own buffering system allowing multiple UDP sends to be in flight at once (but not too many, because you don't want to introduce arbitrary latency). Or you could just use the non-blocking API and let the kernel worry about this for you. (This hit Chrome hard; they switched to using non-blocking IO for UDP on Windows. ref1, ref2.)

  • When doing a TCP receive with a large buffer, apparently the kernel does a Nagle-like thing where it tries to hang onto the data for a while before delivering it to the application, thus introducing pointless latency. (This also bit Chrome hard; they switched to using non-blocking IO for TCP receive on Windows. ref1, ref2)

  • Sometimes you really do want to check whether a socket is readable before issuing a read: in particular, apparently outstanding IOCP receive buffers get pinned into kernel memory or some such nonsense, so it's possible to exhaust system resources by trying to listen to a large number of mostly-idle sockets.

  • Sometimes you really do want to check whether a socket is writable before issuing a write: in particular, because it allows adaptive protocols to provide lower latency if they can delay deciding what bytes to write until the last moment.

  • Python provides a complete non-blocking API out-of-the-box, and we use this API on other platforms, so using non-blocking IO on Windows as well is much MUCH simpler for us to implement than IOCP, which requires us to pretty much build our own wrappers from scratch. (A tiny example of what this buys us follows this list.)
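
To make that comparison concrete, here is roughly what the stdlib's non-blocking socket style looks like – the same style we already use on Unix. The peer address is just a stand-in for illustration:

```python
import select
import socket

sock = socket.create_connection(("example.com", 80))  # stand-in peer
sock.setblocking(False)
sock.send(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")

try:
    data = sock.recv(4096)
except BlockingIOError:
    # Nothing buffered yet: wait until select() says the socket is
    # readable, then retry -- no OVERLAPPED structures or completion
    # ports involved.
    select.select([sock], [], [])
    data = sock.recv(4096)
```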

On the other hand, IOCP is the only way to do a number of things like: non-blocking IO to the filesystem, or monitoring the filesystem for changes, or non-blocking IO on named pipes. (Named pipes are popular for talking to subprocesses – though it's also possible to use a socket if you set it up right.)

select/WSAPoll

You can also use select/WSAPoll. This is the only documented way to check if a socket is readable/writable. However:

  • As is well known, these are O(n) APIs, which sucks if you have lots of sockets. It's not clear exactly how much it sucks – just copying the interest set into kernel space probably isn't a big deal for realistic interest-set sizes – but clearly it's not as nice as O(1). On my laptop, select.select on 3 sets of 512 idle sockets takes <200 microseconds, so I don't think this will, like, immediately kill us. Especially since people mostly don't run big servers on Windows? OTOH an empty epoll on the same laptop returns in ~0.6 microseconds, so there is some difference...

  • select.select is limited to 512 sockets, but this is trivially overcome; the Windows fd_set structure is just an array of SOCKETs plus a length field, which you can allocate in any size you like (Windows: wait_{read,writ}able limited to 512 sockets #3); see the ctypes sketch after this list. (This is a nice side-effect of Windows never having had a dense fd space. It also means WSAPoll doesn't have much reason to exist: unlike on other platforms, where poll beats select because poll uses an array and select uses a bitmap, WSAPoll is not really any more efficient than select. Its only advantage is that it's similar to how poll works on other platforms... but it's gratuitously incompatible. The one other interesting feature is that you can do an alertable wait with it, which gives a way to cancel it from another thread without using an explicit wakeup socket, via QueueUserAPC.)

  • Non-blocking IO on Windows is apparently a bit inefficient, because it adds an extra copy. (I guess they don't have zero-copy enqueueing of data to receive buffers? And on the send side it makes sense that you can do a legitimately zero-copy send with IOCP but not with non-blocking IO, which is nice.) Again, I'm not sure how much this matters given that we don't have zero-copy byte buffers in Python to start with, but it's a thing.

  • select only works for sockets; you still need IOCP etc. for responding to other kinds of notifications.
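
As promised above, here is a minimal ctypes sketch of the oversized-fd_set trick: the 512-socket cap is just CPython's compiled-in FD_SETSIZE, not anything Winsock itself cares about. Windows-only, and error handling is omitted:

```python
import ctypes

def select_readable(socks, timeout_seconds):
    ws2_32 = ctypes.windll.ws2_32

    class FDSET(ctypes.Structure):
        # struct fd_set { u_int fd_count; SOCKET fd_array[...]; } -- SOCKET is
        # pointer-sized, and the array can be declared as large as we like.
        _fields_ = [("fd_count", ctypes.c_uint),
                    ("fd_array", ctypes.c_size_t * len(socks))]

    class TIMEVAL(ctypes.Structure):
        _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]

    readfds = FDSET(fd_count=len(socks))
    for i, sock in enumerate(socks):
        readfds.fd_array[i] = sock.fileno()
    tv = TIMEVAL(tv_sec=int(timeout_seconds),
                 tv_usec=int((timeout_seconds % 1) * 1_000_000))

    # The first argument (nfds) is ignored on Windows. select() rewrites the
    # fd_set in place so that only the ready sockets remain.
    ws2_32.select(0, ctypes.byref(readfds), None, None, ctypes.byref(tv))
    ready_fds = set(readfds.fd_array[:readfds.fd_count])
    return [sock for sock in socks if sock.fileno() in ready_fds]
```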

Options

Given all of the above, our current design is a hybrid that uses select and non-blocking IO for sockets, with IOCP available when needed. We run select in the main thread, and IOCP in a worker thread, with a wakeup socket to notify the main thread when IOCP events occur. This is vastly simpler than doing it the other way around, because you can trivially queue work to an IOCP from any thread, while modifying select's interest set from another thread is a mess. As an initial design this makes a lot of sense: it provides full features (including e.g. wait_writable for adaptive protocols), avoids the tricky issues that IOCP creates for sockets, and requires a minimum of special code.
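
Schematically, the hybrid looks something like the following. This leans on CPython's private _overlapped module (the one asyncio's proactor is built on), so treat the exact calls as an assumption; completion dispatch and error handling are omitted:

```python
import socket
import threading
import _overlapped

NULL = 0
INFINITE = 0xFFFFFFFF

wakeup_write, wakeup_read = socket.socketpair()
iocp = _overlapped.CreateIoCompletionPort(
    _overlapped.INVALID_HANDLE_VALUE, NULL, 0, 0)
completions = []  # (err, transferred, key, overlapped-address) tuples

def iocp_worker():
    while True:
        status = _overlapped.GetQueuedCompletionStatus(iocp, INFINITE)
        if status is not None:
            completions.append(status)
            # Poke the main loop's select() so it processes the completion.
            wakeup_write.send(b"\x00")

threading.Thread(target=iocp_worker, daemon=True).start()

# Main loop (schematic): select() on the regular sockets *plus* wakeup_read;
# when wakeup_read becomes readable, drain it and handle `completions`.
```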

The other attractive option would be if we could solve the issues with IOCP and switch to using it alone – this would be simpler and would get rid of the O(n) select. However, as we can see above, there is a whole list of challenges that would need to be overcome first.

Working around IOCP's limitations

UDP sends

I'm not really sure what the best approach here is. One option is just to limit the amount of outstanding UDP data to some fixed budget (maybe tunable through a "virtual" (i.e., implemented by us) sockopt), and drop packets or return errors if we exceed that. This is clearly solvable in principle; it's just a bit annoying to figure out the details.
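
Just to make the idea concrete, the bookkeeping might look something like this. All names here are hypothetical: the "virtual sockopt" would simply adjust max_outstanding, and issue_overlapped_send stands in for whatever actually queues the IOCP send:

```python
class BoundedUDPSender:
    """Sketch of a per-socket byte budget that IOCP send completions pay
    back into; packets that would exceed the budget are dropped (or we
    could return an error / apply backpressure instead)."""

    def __init__(self, max_outstanding=262_144):  # made-up default budget
        self.max_outstanding = max_outstanding
        self.outstanding = 0

    def try_send(self, issue_overlapped_send, packet):
        if self.outstanding + len(packet) > self.max_outstanding:
            return False  # dropped
        self.outstanding += len(packet)
        issue_overlapped_send(packet, on_complete=self._send_completed)
        return True

    def _send_completed(self, packet):
        # Called when IOCP reports that this send has hit the wire.
        self.outstanding -= len(packet)
```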

Spurious extra latency in TCP receives

I think that using the MSG_PUSH_IMMEDIATE flag should solve this.

Checking readability / writability

It turns out that IOCP actually can check readability! It's not mentioned on MSDN at all, but there's a well-known bit of folklore about the "zero-byte read": if you issue a zero-byte read, it won't complete until there's data ready to read. ref1 (← official MS docs! also note this is ch. 6 of "NPfMW", referenced below), ref2, ref3.
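
A rough sketch of how the zero-byte-read trick would look, again leaning on CPython's private _overlapped module (whose API I'm taking on faith from asyncio's proactor):

```python
import _overlapped

NULL = 0

def register_with_iocp(iocp, sock):
    # Associate the socket with the completion port; Windows only allows
    # this to happen once per socket.
    _overlapped.CreateIoCompletionPort(sock.fileno(), iocp, 0, 0)

def wait_readable_via_iocp(sock):
    ov = _overlapped.Overlapped(NULL)
    # A zero-byte WSARecv consumes no data, but (for SOCK_STREAM) its
    # completion isn't posted until the socket actually has data to read.
    ov.WSARecv(sock.fileno(), 0, 0)
    # The caller then waits for GetQueuedCompletionStatus() to report this
    # overlapped op (matching ov.address), and only then issues the real recv().
    return ov
```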

That's for SOCK_STREAM sockets. What about SOCK_DGRAM? libuv does zero-byte reads with MSG_PEEK set (so that the zero-byte read doesn't consume the packet and truncate it to zero bytes in the process). MSDN explicitly says that this doesn't work (MSG_PEEK and overlapped IO supposedly don't work together), but I guess I trust libuv more than MSDN? I don't 100% trust either – this would need to be verified.
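
The experiment that would settle it is small – something along these lines, reusing the _overlapped-based setup from the previous sketch (MSG_PEEK is 0x2 in winsock2.h):

```python
import _overlapped

NULL = 0
MSG_PEEK = 0x2  # from winsock2.h

def wait_udp_readable_via_iocp(udp_sock):
    ov = _overlapped.Overlapped(NULL)
    # Zero bytes + MSG_PEEK: per libuv this signals "a datagram is waiting"
    # without consuming (or truncating) it -- MSDN says otherwise, hence
    # the need to actually test it.
    ov.WSARecv(udp_sock.fileno(), 0, MSG_PEEK)
    return ov
```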

What about writability? Empirically, if you have a non-blocking socket on Windows with a full send buffer and you do a zero-byte send, it returns EWOULDBLOCK. (This is weird; other platforms don't do this.) If this behavior also carries over to IOCP sends, then this zero-byte send trick would give us a way to use IOCP to check writability on SOCK_STREAM sockets.
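
That claim is easy to (re)check with plain non-blocking sockets, no IOCP involved:

```python
import socket

left, right = socket.socketpair()
left.setblocking(False)
try:
    while True:
        left.send(b"x" * 65536)   # keep stuffing bytes in...
except BlockingIOError:
    pass                          # ...until the send buffer is full

try:
    left.send(b"")
    print("zero-byte send succeeded")
except BlockingIOError:
    print("zero-byte send raised EWOULDBLOCK")  # the behavior described above
```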

For writability of SOCK_DGRAM I don't think there's any trick, but it's not clear how meaningful SOCK_DGRAM writability is anyway. If we do our own buffering, then presumably we can implement it there.

Alternatively, there is a remarkable piece of undocumented sorcery, where you reach down directly to make syscalls, bypassing the Winsock userland, and apparently can get OVERLAPPED notifications when a socket is readable/writable: ref1, ref2, ref3, ref4, ref5. I guess this is how select is implemented? The problem with this is that it only works if your sockets are implemented directly in the kernel, which is apparently not always the case (because of like... antivirus tools and other horrible things that can interpose themselves into your socket API). So I'm inclined to discount this as unreliable. [Edit: or maybe not, see below]

Implementing all this junk

I actually got a ways into this. Then I ripped it out when I realized how many nasty issues there were beyond just typing in long and annoying API calls. But it could easily be resurrected; see 7e7a809 and its parent.

TODO

If we do want to switch to using IOCP in general, then the sequence would go something like:

  • check whether zero-byte sends give a way to check TCP writability via IOCP – this is probably the biggest determinant of whether going IOCP-only is even possible (might be worth checking what UDP sends with MSG_PARTIAL do too while we're at it)
  • check whether you really can do zero-byte reads on UDP sockets like libuv claims
  • figure out what kind of UDP send buffering strategy makes sense (or if we decide that UDP sends can just drop packets instead of blocking then I guess the non-blocking APIs remain viable even if we can't do wait_socket_writable on UDP sockets)

At this point we'd have the information to decide whether we can/should go ahead. If so, then the plan would look something like:

  • migrate away from select for the cases that can't use IOCP readable/writable checking: [Not necessary, AFD-based select should work for these too]
    • connect
    • accept
  • implement wait_socket_readable and wait_socket_writable on top of IOCP and get rid of select (but at this point we're still doing non-blocking I/O on sockets, just using IOCP as a select replacement)
  • (optional / someday) switch to using IOCP for everything instead of non-blocking I/O

New plan:

  • Use the tricks from the thread below to reimplement wait_socket_{readable,writable} using AFD, and confirm it works
  • Add LSP testing to our Windows CI
  • Consider whether we want to switch to using IOCP in more cases, e.g. send/recv. Not sure it's worth bothering.
