Windows event notification #52

@njsmith

Problem

Windows has three incompatible families of event notification APIs: IOCP, select/WSAPoll, and WaitForMultipleObjects-and-variants. They each have unique capabilities. This means: if you want to be able to react to all the different possible events that Windows can signal, then you must use all three of these. Needless to say, this creates a challenge for event loop design. There are a number of potentially viable ways to arrange these pieces; the question is which one we should use.

(Actually, all three together still aren't sufficient, because there are some things that still require threads – like console IO – and I'm ignoring GUI events entirely because Trio isn't a GUI library. But never mind. Just remember that when someone tells you that Windows' I/O subsystem is great, their statement isn't wrong, but it does require taking a certain narrow perspective...)

Considerations

The WaitFor*Object family

The Event-related APIs are necessary to, for example, wait for a notification that a child process has exited. (The job object API provides a way to request IOCP notifications about process death, but the docs warn that the notifications are lossy and therefore useless...) Otherwise, though, they're very limited – in particular they have both O(n) behavior and a 64-object cap on the interest set – so you definitely don't want to use these as your primary blocking call. We're going to be calling these in a background thread of some kind. The two natural architectures are to use WaitForSingleObject(Ex) and allocate one thread per event, or else to use WaitForMultipleObjects(Ex) and coalesce up to 64 events into each thread (substantially more complicated to implement, but with 64x less memory overhead for thread stacks, if that matters). This is orthogonal to the rest of this issue, so it gets its own thread: #233
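
For concreteness, here's a rough sketch of the thread-per-handle variant, using ctypes to call WaitForSingleObject directly (the coalescing variant would instead pass an array of up to 64 handles to WaitForMultipleObjects). The notify_main_loop callback is a placeholder for however the wakeup actually gets delivered back to the main loop:

```python
import ctypes
import threading

kernel32 = ctypes.windll.kernel32

WAIT_OBJECT_0 = 0x00000000
INFINITE = 0xFFFFFFFF

def wait_for_handle_in_thread(handle, notify_main_loop):
    """Spawn a daemon thread that blocks in WaitForSingleObject until
    `handle` (e.g. a child process handle) is signaled, then invokes
    `notify_main_loop()` to wake the real event loop."""
    def waiter():
        # Only this worker thread blocks; the main loop keeps running.
        result = kernel32.WaitForSingleObject(handle, INFINITE)
        if result == WAIT_OBJECT_0:
            notify_main_loop()
        # (WAIT_FAILED / WAIT_ABANDONED would need real error handling.)
    thread = threading.Thread(target=waiter, daemon=True)
    thread.start()
    return thread
```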

IOCP

IOCP is the crown jewel of the Windows I/O subsystem, and what you generally hear recommended. It follows a natively asynchronous model: you just go ahead and issue a read or write or whatever, and it runs in the background until eventually the kernel tells you it's done. It provides an O(1) notification mechanism. It's pretty slick. But... it's not as obvious a choice as everyone makes it sound. (Did you know the Chrome team has mostly given up on trying to make it work?)

Issues:

  • When doing a UDP send, the send is only notified as complete once the packet hits the wire; i.e., using IOCP for UDP totally removes in-kernel buffering/flow-control. So to get decent throughput you must implement your own buffering system allowing multiple UDP sends to be in flight at once (but not too many, because you don't want to introduce arbitrary latency). Or you could just use the non-blocking API and let the kernel worry about this for you. (This hit Chrome hard; they switched to using non-blocking IO for UDP on Windows. ref1, ref2.)

  • When doing a TCP receive with a large buffer, apparently the kernel does a Nagle-like thing where it tries to hang onto the data for a while before delivering it to the application, thus introducing pointless latency. (This also bit Chrome hard; they switched to using non-blocking IO for TCP receive on Windows. ref1, ref2)

  • Sometimes you really do want to check whether a socket is readable before issuing a read: in particular, apparently outstanding IOCP receive buffers get pinned into kernel memory or some such nonsense, so it's possible to exhaust system resources by trying to listen to a large number of mostly-idle sockets.

  • Sometimes you really do want to check whether a socket is writable before issuing a write: in particular, because it allows adaptive protocols to provide lower latency if they can delay deciding what bytes to write until the last moment.

  • Python provides a complete non-blocking API out-of-the-box, and we use this API on other platforms, so using non-blocking IO on Windows as well is much MUCH simpler for us to implement than IOCP, which requires us to pretty much build our own wrappers from scratch. (A tiny example of what this buys us follows this list.)
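
To make that comparison concrete, here is roughly what the stdlib's non-blocking socket style looks like – the same style we already use on Unix. The peer address is just a stand-in for illustration:

```python
import select
import socket

sock = socket.create_connection(("example.com", 80))  # stand-in peer
sock.setblocking(False)
sock.send(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")

try:
    data = sock.recv(4096)
except BlockingIOError:
    # Nothing buffered yet: wait until select() says the socket is
    # readable, then retry -- no OVERLAPPED structures or completion
    # ports involved.
    select.select([sock], [], [])
    data = sock.recv(4096)
```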

On the other hand, IOCP is the only way to do a number of things like: non-blocking IO to the filesystem, or monitoring the filesystem for changes, or non-blocking IO on named pipes. (Named pipes are popular for talking to subprocesses – though it's also possible to use a socket if you set it up right.)

select/WSAPoll

You can also use select/WSAPoll. This is the only documented way to check if a socket is readable/writable. However:

  • As is well known, these are O(n) APIs, which sucks if you have lots of sockets. It's not clear exactly how much it sucks – just copying the interest set into kernel space probably isn't a big deal for realistic interest-set sizes – but clearly it's not as nice as O(1). On my laptop, select.select on 3 sets of 512 idle sockets takes <200 microseconds, so I don't think this will, like, immediately kill us. Especially since people mostly don't run big servers on Windows? OTOH an empty epoll on the same laptop returns in ~0.6 microseconds, so there is some difference...

  • select.select is limited to 512 sockets, but this is trivially overcome; the Windows fd_set structure is just an array of SOCKETs plus a length field, which you can allocate in any size you like (Windows: wait_{read,writ}able limited to 512 sockets #3); see the ctypes sketch after this list. (This is a nice side-effect of Windows never having had a dense fd space. It also means WSAPoll doesn't have much reason to exist: unlike on other platforms, where poll beats select because poll uses an array and select uses a bitmap, WSAPoll is not really any more efficient than select. Its only advantage is that it's similar to how poll works on other platforms... but it's gratuitously incompatible. The one other interesting feature is that you can do an alertable wait with it, which gives a way to cancel it from another thread without using an explicit wakeup socket, via QueueUserAPC.)

  • Non-blocking IO on Windows is apparently a bit inefficient, because it adds an extra copy. (I guess they don't have zero-copy enqueueing of data to receive buffers? And on the send side it makes sense that you can do a legitimately zero-copy send with IOCP but not with non-blocking IO, which is nice.) Again, I'm not sure how much this matters given that we don't have zero-copy byte buffers in Python to start with, but it's a thing.

  • select only works for sockets; you still need IOCP etc. for responding to other kinds of notifications.
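
As promised above, here is a minimal ctypes sketch of the oversized-fd_set trick: the 512-socket cap is just CPython's compiled-in FD_SETSIZE, not anything Winsock itself cares about. Windows-only, and error handling is omitted:

```python
import ctypes

def select_readable(socks, timeout_seconds):
    ws2_32 = ctypes.windll.ws2_32

    class FDSET(ctypes.Structure):
        # struct fd_set { u_int fd_count; SOCKET fd_array[...]; } -- SOCKET is
        # pointer-sized, and the array can be declared as large as we like.
        _fields_ = [("fd_count", ctypes.c_uint),
                    ("fd_array", ctypes.c_size_t * len(socks))]

    class TIMEVAL(ctypes.Structure):
        _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]

    readfds = FDSET(fd_count=len(socks))
    for i, sock in enumerate(socks):
        readfds.fd_array[i] = sock.fileno()
    tv = TIMEVAL(tv_sec=int(timeout_seconds),
                 tv_usec=int((timeout_seconds % 1) * 1_000_000))

    # The first argument (nfds) is ignored on Windows. select() rewrites the
    # fd_set in place so that only the ready sockets remain.
    ws2_32.select(0, ctypes.byref(readfds), None, None, ctypes.byref(tv))
    ready_fds = set(readfds.fd_array[:readfds.fd_count])
    return [sock for sock in socks if sock.fileno() in ready_fds]
```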

Options

Given all of the above, our current design is a hybrid that uses select and non-blocking IO for sockets, with IOCP available when needed. We run select in the main thread, and IOCP in a worker thread, with a wakeup socket to notify the main thread when IOCP events occur. This is vastly simpler than doing it the other way around, because you can trivially queue work to an IOCP from any thread, while modifying select's interest set from another thread is a mess. As an initial design this makes a lot of sense: it provides full features (including e.g. wait_writable for adaptive protocols), avoids the tricky issues that IOCP creates for sockets, and requires a minimum of special code.
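
Schematically, the hybrid looks something like the following. This leans on CPython's private _overlapped module (the one asyncio's proactor is built on), so treat the exact calls as an assumption; completion dispatch and error handling are omitted:

```python
import socket
import threading
import _overlapped

NULL = 0
INFINITE = 0xFFFFFFFF

wakeup_write, wakeup_read = socket.socketpair()
iocp = _overlapped.CreateIoCompletionPort(
    _overlapped.INVALID_HANDLE_VALUE, NULL, 0, 0)
completions = []  # (err, transferred, key, overlapped-address) tuples

def iocp_worker():
    while True:
        status = _overlapped.GetQueuedCompletionStatus(iocp, INFINITE)
        if status is not None:
            completions.append(status)
            # Poke the main loop's select() so it processes the completion.
            wakeup_write.send(b"\x00")

threading.Thread(target=iocp_worker, daemon=True).start()

# Main loop (schematic): select() on the regular sockets *plus* wakeup_read;
# when wakeup_read becomes readable, drain it and handle `completions`.
```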

The other attractive option would be if we could solve the issues with IOCP and switch to using it alone – this would be simpler and would get rid of the O(n) select. However, as we can see above, there is a whole list of challenges that would need to be overcome first.

Working around IOCP's limitations

UDP sends

I'm not really sure what the best approach here is. One option is just to limit the amount of outstanding UDP data to some fixed budget (maybe tunable through a "virtual" (i.e., implemented by us) sockopt), and drop packets or return errors if we exceed that. This is clearly solvable in principle; it's just a bit annoying to figure out the details.
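
Just to make the idea concrete, the bookkeeping might look something like this. All names here are hypothetical: the "virtual sockopt" would simply adjust max_outstanding, and issue_overlapped_send stands in for whatever actually queues the IOCP send:

```python
class BoundedUDPSender:
    """Sketch of a per-socket byte budget that IOCP send completions pay
    back into; packets that would exceed the budget are dropped (or we
    could return an error / apply backpressure instead)."""

    def __init__(self, max_outstanding=262_144):  # made-up default budget
        self.max_outstanding = max_outstanding
        self.outstanding = 0

    def try_send(self, issue_overlapped_send, packet):
        if self.outstanding + len(packet) > self.max_outstanding:
            return False  # dropped
        self.outstanding += len(packet)
        issue_overlapped_send(packet, on_complete=self._send_completed)
        return True

    def _send_completed(self, packet):
        # Called when IOCP reports that this send has hit the wire.
        self.outstanding -= len(packet)
```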

Spurious extra latency in TCP receives

I think that using the MSG_PUSH_IMMEDIATE flag should solve this.

Checking readability / writability

It turns out that IOCP actually can check readability! It's not mentioned on MSDN at all, but there's a well-known bit of folklore about the "zero-byte read": if you issue a zero-byte read, it won't complete until there's data ready to read. ref1 (← official MS docs! also note this is ch. 6 of "NPfMW", referenced below), ref2, ref3.
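
A rough sketch of how the zero-byte-read trick would look, again leaning on CPython's private _overlapped module (whose API I'm taking on faith from asyncio's proactor):

```python
import _overlapped

NULL = 0

def register_with_iocp(iocp, sock):
    # Associate the socket with the completion port; Windows only allows
    # this to happen once per socket.
    _overlapped.CreateIoCompletionPort(sock.fileno(), iocp, 0, 0)

def wait_readable_via_iocp(sock):
    ov = _overlapped.Overlapped(NULL)
    # A zero-byte WSARecv consumes no data, but (for SOCK_STREAM) its
    # completion isn't posted until the socket actually has data to read.
    ov.WSARecv(sock.fileno(), 0, 0)
    # The caller then waits for GetQueuedCompletionStatus() to report this
    # overlapped op (matching ov.address), and only then issues the real recv().
    return ov
```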

That's for SOCK_STREAM sockets. What about SOCK_DGRAM? libuv does zero-byte reads with MSG_PEEK set (so that the zero-byte read doesn't consume the packet and truncate it to zero bytes in the process). MSDN explicitly says that this doesn't work (MSG_PEEK and overlapped IO supposedly don't work together), but I guess I trust libuv more than MSDN? I don't 100% trust either – this would need to be verified.
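
The experiment that would settle it is small – something along these lines, reusing the _overlapped-based setup from the previous sketch (MSG_PEEK is 0x2 in winsock2.h):

```python
import _overlapped

NULL = 0
MSG_PEEK = 0x2  # from winsock2.h

def wait_udp_readable_via_iocp(udp_sock):
    ov = _overlapped.Overlapped(NULL)
    # Zero bytes + MSG_PEEK: per libuv this signals "a datagram is waiting"
    # without consuming (or truncating) it -- MSDN says otherwise, hence
    # the need to actually test it.
    ov.WSARecv(udp_sock.fileno(), 0, MSG_PEEK)
    return ov
```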

What about writability? Empirically, if you have a non-blocking socket on Windows with a full send buffer and you do a zero-byte send, it returns EWOULDBLOCK. (This is weird; other platforms don't do this.) If this behavior also carries over to IOCP sends, then this zero-byte send trick would give us a way to use IOCP to check writability on SOCK_STREAM sockets.
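
That claim is easy to (re)check with plain non-blocking sockets, no IOCP involved:

```python
import socket

left, right = socket.socketpair()
left.setblocking(False)
try:
    while True:
        left.send(b"x" * 65536)   # keep stuffing bytes in...
except BlockingIOError:
    pass                          # ...until the send buffer is full

try:
    left.send(b"")
    print("zero-byte send succeeded")
except BlockingIOError:
    print("zero-byte send raised EWOULDBLOCK")  # the behavior described above
```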

For writability of SOCK_DGRAM I don't think there's any trick, but it's not clear how meaningful SOCK_DGRAM writability is anyway. If we do our own buffering, then presumably we can implement it there.

Alternatively, there is a remarkable piece of undocumented sorcery, where you reach down directly to make syscalls, bypassing the Winsock userland, and apparently can get OVERLAPPED notifications when a socket is readable/writable: ref1, ref2, ref3, ref4, ref5. I guess this is how select is implemented? The problem with this is that it only works if your sockets are implemented directly in the kernel, which is apparently not always the case (because of like... antivirus tools and other horrible things that can interpose themselves into your socket API). So I'm inclined to discount this as unreliable. [Edit: or maybe not, see below]

Implementing all this junk

I actually got a ways into this. Then I ripped it out when I realized how many nasty issues there were beyond just typing in long and annoying API calls. But it could easily be resurrected; see 7e7a809 and its parent.

TODO

If we do want to switch to using IOCP in general, then the sequence would go something like:

  • check whether zero-byte sends give a way to check TCP writability via IOCP – this is probably the biggest determinant of whether going IOCP-only is even possible (might be worth checking what UDP sends with MSG_PARTIAL do too while we're at it)
  • check whether you really can do zero-byte reads on UDP sockets like libuv claims
  • figure out what kind of UDP send buffering strategy makes sense (or if we decide that UDP sends can just drop packets instead of blocking then I guess the non-blocking APIs remain viable even if we can't do wait_socket_writable on UDP sockets)

At this point we'd have the information to decide whether we can/should go ahead. If so, then the plan would look something like:

  • migrate away from select for the cases that can't use IOCP readable/writable checking: [Not necessary, AFD-based select should work for these too]
    • connect
    • accept
  • implement wait_socket_readable and wait_socket_writable on top of IOCP and get rid of select (but at this point we're still doing non-blocking I/O on sockets, just using IOCP as a select replacement)
  • (optional / someday) switch to using IOCP for everything instead of non-blocking I/O

New plan:

  • Use the tricks from the thread below to reimplement wait_socket_{readable,writable} using AFD, and confirm it works
  • Add LSP testing to our Windows CI
  • Consider whether we want to switch to using IOCP in more cases, e.g. send/recv. Not sure it's worth bothering.
