large initialsync may never complete #11611
Description
For a sufficiently large account, it's possible that we will never be able to serialise the response before the client times out. This has recently been made a lot worse by #11478 which makes the serialisation process much slower, but it was a potential problem even before that.
The intention is that it shouldn't matter too much if it takes a long time to generate a sync response, because once we finally manage it, we will cache the response, which we can then serve up quickly the next time the client makes a request. However, this oversimplifies the situation.
We can basically consider there to be four steps in serving a /sync response:

1. Figure out which rooms and events should be in the response. The end result of this step is a `SyncResult` object, which is cached.
2. "Serialise" each event in the response. This is where we turn the events from the internal python objects (`FrozenEvent`) into plain JSON dicts which match the format specified by the C-S API; in particular we add the `unsigned.age` field (which is the main reason this is done after caching); as of Include bundled aggregations in /sync and related fixes #11478 it is also the point at which we add bundled aggregations. The result of this step is a (potentially huge) `JsonDict` (ie, a `dict` containing other simple json-serializable values).
3. JSON-serialise the response dict to a `bytes`.
4. Finally, we start streaming the response to the client.
Since the result of step 1 is cached (at least on matrix.org), we can effectively neglect it here (though it is important to remember that it is only cached for 2 minutes... more on that later); so the steps of concern to us are steps 2 and 3, which must complete in less than the client's request timeout.
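The four steps above can be sketched in miniature as follows. This is purely illustrative (the function and field names here are hypothetical, not Synapse's actual API); the point is that steps 2 and 3 are plain synchronous Python that runs after the cached `SyncResult` is produced:

```python
import json

def serialise_event(event, now_ms):
    """Step 2 (sketch): turn an internal event object into a C-S API JsonDict.

    `content_template` and the flat-dict event shape are invented for this
    illustration; the real FrozenEvent serialisation is more involved.
    """
    d = dict(event["content_template"])
    # unsigned.age depends on "now", which is the main reason this step
    # happens after caching rather than inside the cached SyncResult
    d.setdefault("unsigned", {})["age"] = now_ms - event["origin_server_ts"]
    return d

def encode_sync_response(sync_result, now_ms):
    # Step 2: serialise every event in the (cached) sync result
    body = {"rooms": [serialise_event(e, now_ms) for e in sync_result["events"]]}
    # Step 3: JSON-serialise the whole dict to bytes
    return json.dumps(body).encode("utf-8")

# Step 4, streaming the bytes to the client, is handled by the HTTP layer.
```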
Some notes on timeouts:
- Cloudflare times out requests if the server doesn't start sending its response within 100 seconds of the request.
- However, haproxy doesn't notice this until we start trying to send back the response - so Synapse still sees the request as active until our haproxy times out the request, which happens after 180 seconds.
- Before we start each of steps 2 and 3, we check that the client connection is still up, and skip that step if not. (Step 4 is aborted as soon as the connection drops.)
In short: we need to serialise the `FrozenEvent`s to `dict`s (step 2) and JSON-serialise the `dict`s to `bytes` (step 3) within 100 seconds.
That should be plenty though, right? Wrong.
The first problem is that #11478 makes step 2 take 224 seconds (assuming all the relevant aggregations are already cached from the database; the first time through, when we have to do database activity, it takes over 800 seconds):
https://jaeger.int.matrix.org/trace/1a723cd9877a1cc9
(Note that in the above trace, the client connection drops at 180 seconds, so we skip the JSON-serialisation step.)
Still, even with that reverted, we have problems. Consider this example:
https://jaeger.int.matrix.org/trace/2600e970dda6bc65
This request reached haproxy at 12:37:09, but we didn't start processing it until 12:37:56 (a delay of 47 seconds). We then manage to encode the response in 58 seconds, and start streaming it at 12:38:52, but it doesn't reach haproxy until 12:39:37 (a further 45 seconds). By that time, a total of 148 seconds has elapsed, and Cloudflare has long since given up.
It's hard to trace those 47s and 45s delays precisely, but event serialisation is a synchronous process which blocks the reactor for up to a minute (eg, the preceding request has an `encode_response` call which blocks the reactor between 12:36:33 and 12:37:29), after which we have a bunch of other work to catch up on. Prior experience suggests that all bets are off once the reactor starts ticking that slowly.
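The effect of blocking the reactor is easy to reproduce in miniature. Synapse runs on Twisted, but the same behaviour can be sketched with asyncio (everything here is illustrative, not Synapse code): a callback that is already due cannot fire until the synchronous encode releases the event loop.

```python
import asyncio
import json

async def main():
    loop = asyncio.get_running_loop()
    order = []
    # schedule a callback that is due immediately...
    loop.call_later(0, lambda: order.append("timer"))

    # ...but a synchronous JSON encode of a large dict blocks the loop,
    # so nothing else (timers, other requests) can run meanwhile
    big = {"events": [{"i": i, "body": "x" * 50} for i in range(100_000)]}
    json.dumps(big)
    order.append("encoded")

    await asyncio.sleep(0.01)  # yield; only now does the overdue timer run
    return order

print(asyncio.run(main()))  # ['encoded', 'timer']
```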
Ideas on things which might make this better:
- Counter-intuitively, having a long timeout in haproxy might be making things worse, because it means we're doing work serving responses that will be thrown away at Cloudflare. Possibly we should decrease the timeout in haproxy to (say) 120 seconds.
- We certainly shouldn't be blocking the reactor for 45 seconds at a time; at the very least, we should stick some `sleep(0)` calls in `sync.encode_response`.
- It would be great to find a way to put most of the event serialisation work inside the `SyncResponse` cache.
- Alternatively, maybe we can do it more lazily, so that we can start JSON-encoding without having to event-serialise the entire response.
- Encode JSON responses on a thread in C, mk2 #10905 is also part of the problem here: see discussion at #10905 (comment).
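The `sleep(0)` idea amounts to yielding control back to the event loop every few events, so that serialising a huge response no longer monopolises the reactor for its full duration. A minimal sketch (names are illustrative, and asyncio stands in for Twisted here):

```python
import asyncio

async def serialise_events(events, serialise_one, yield_every=100):
    """Serialise events, yielding to the event loop every `yield_every` items."""
    out = []
    for i, event in enumerate(events):
        out.append(serialise_one(event))
        if (i + 1) % yield_every == 0:
            # the moral equivalent of a `sleep(0)` in sync.encode_response:
            # give timers and other requests a chance to run between batches
            await asyncio.sleep(0)
    return out

# e.g. serialise 250 dummy "events" without ever blocking for the whole batch
result = asyncio.run(serialise_events(list(range(250)), lambda e: {"id": e}))
```

This doesn't make the total work any smaller, but it caps how long any single reactor tick can be, which is what matters for the 47s/45s queueing delays above.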