Skip to content
This repository was archived by the owner on Apr 26, 2024. It is now read-only.
This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Events that take >60s to send have a high chance of being duplicated when using an event_creator #3365

@turt2live

Description

@turt2live

Overview

When clients make a request to send an event, the event creator forwards that to the main synapse process. The request to the main process has a 60s timeout, which can be reached under extremely high load. The event creator then returns an error to the client, which then happily retries the request. When the client retries, the steps start all over again. While this is going on, the main process is still trying to send the first event.

Logs

In this scenario, an event was repeated 4 times due to high system load (not uncommon, sadly). The 'client' in this case is the Discord bridge (which uses the js-sdk under the hood, similar to Riot).

Here's what the event_creator sees:

event_creator - 2018-06-06 11:46:23,718 - synapse.access.http.8123 - 97 - INFO - PUT-12029- 111.222.333.85 - 8123 - Received request: PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io
event_creator - 2018-06-06 11:46:23,719 - synapse.util.async - 291 - INFO - PUT-12029- Acquired uncontended limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:46:23,918 - synapse.http.client - 100 - INFO - PUT-12029- Sending request PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...dle:t2bot.io
event_creator - 2018-06-06 11:47:24,326 - synapse.http.client - 122 - INFO - PUT-12029- Error sending request to  PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...dle:t2bot.io: ResponseNeverReceived [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
event_creator - 2018-06-06 11:47:24,327 - synapse.util.async - 300 - INFO - PUT-12029- Releasing limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:47:24,328 - synapse.http.server - 102 - ERROR - PUT-12029- Failed handle request via <function _async_render at 0x7f70d91c36e0>: <XForwardedForRequest at 0x7f70ada379e0 method=PUT uri=/_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io clientproto=HTTP/1.0 site=8123>: Traceback (most recent call last):
event_creator - 2018-06-06 11:47:24,328 - synapse.access.http.8123 - 131 - INFO - PUT-12029- 111.222.333.85 - 8123 - {@_discord_324...619:t2bot.io} Processed request: 60610ms (3ms, 0ms) (1ms/193ms/3) 55B 500 "PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io HTTP/1.0" "None"

event_creator - 2018-06-06 11:47:26,416 - synapse.access.http.8123 - 97 - INFO - PUT-12044- 111.222.333.85 - 8123 - Received request: PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io
event_creator - 2018-06-06 11:47:26,417 - synapse.util.async - 291 - INFO - PUT-12044- Acquired uncontended limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:47:26,615 - synapse.http.client - 100 - INFO - PUT-12044- Sending request PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...MLc:t2bot.io
event_creator - 2018-06-06 11:48:26,616 - synapse.http.client - 122 - INFO - PUT-12044- Error sending request to  PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...MLc:t2bot.io: ResponseNeverReceived [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
event_creator - 2018-06-06 11:48:26,619 - synapse.util.async - 300 - INFO - PUT-12044- Releasing limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:48:26,620 - synapse.http.server - 102 - ERROR - PUT-12044- Failed handle request via <function _async_render at 0x7f70d91c36e0>: <XForwardedForRequest at 0x7f70ab91b638 method=PUT uri=/_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io clientproto=HTTP/1.0 site=8123>: Traceback (most recent call last):
event_creator - 2018-06-06 11:48:26,622 - synapse.access.http.8123 - 131 - INFO - PUT-12044- 111.222.333.85 - 8123 - {@_discord_324...619:t2bot.io} Processed request: 60206ms (8ms, 0ms) (1ms/193ms/2) 55B 500 "PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io HTTP/1.0" "None"

event_creator - 2018-06-06 11:48:30,710 - synapse.access.http.8123 - 97 - INFO - PUT-12060- 111.222.333.85 - 8123 - Received request: PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io
event_creator - 2018-06-06 11:48:30,711 - synapse.util.async - 291 - INFO - PUT-12060- Acquired uncontended limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:48:30,972 - synapse.http.client - 100 - INFO - PUT-12060- Sending request PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...yFb:t2bot.io
event_creator - 2018-06-06 11:49:30,974 - synapse.http.client - 122 - INFO - PUT-12060- Error sending request to  PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...yFb:t2bot.io: ResponseNeverReceived [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
event_creator - 2018-06-06 11:49:30,977 - synapse.util.async - 300 - INFO - PUT-12060- Releasing limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:49:30,981 - synapse.http.server - 102 - ERROR - PUT-12060- Failed handle request via <function _async_render at 0x7f70d91c36e0>: <XForwardedForRequest at 0x7f70a9f67cf8 method=PUT uri=/_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io clientproto=HTTP/1.0 site=8123>: Traceback (most recent call last):
event_creator - 2018-06-06 11:49:30,982 - synapse.access.http.8123 - 131 - INFO - PUT-12060- 111.222.333.85 - 8123 - {@_discord_324...619:t2bot.io} Processed request: 60272ms (12ms, 0ms) (2ms/255ms/2) 55B 500 "PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io HTTP/1.0" "None"

event_creator - 2018-06-06 11:49:39,072 - synapse.access.http.8123 - 97 - INFO - PUT-12070- 111.222.333.85 - 8123 - Received request: PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io
event_creator - 2018-06-06 11:49:39,073 - synapse.util.async - 291 - INFO - PUT-12070- Acquired uncontended limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:49:39,131 - synapse.http.client - 100 - INFO - PUT-12070- Sending request PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...mFi:t2bot.io
event_creator - 2018-06-06 11:50:38,135 - synapse.http.client - 115 - INFO - PUT-12070- Received response to  PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...mFi:t2bot.io: 200
event_creator - 2018-06-06 11:50:38,135 - synapse.util.async - 300 - INFO - PUT-12070- Releasing limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:50:38,136 - synapse.access.http.8123 - 131 - INFO - PUT-12070- 111.222.333.85 - 8123 - {@_discord_324...619:t2bot.io} Processed request: 59064ms (4ms, 3ms) (1ms/49ms/2) 46B 200 "PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io HTTP/1.0" "None"

Most importantly, the same transaction ID is used to send all 4 messages (m1528285583650.0).

The main process sees the following (requests correlated by event ID):

homeserver - 2018-06-06 11:46:23,921 - synapse.access.http.9091 - 97 - INFO - 127.0.0.1 - 9091 - Received request: PUT /_synapse/replication/send_event/$152...dle:t2bot.io
homeserver - 2018-06-06 11:46:23,921 - synapse.util.caches.response_cache - 150 - INFO - [send_event]: no cached result for [$152...dle:t2bot.io], calculating new one
homeserver - 2018-06-06 11:46:23,921 - synapse.replication.http.send_event - 147 - INFO - Got event to send with ID: $152...dle:t2bot.io into room: !pXkVSJ...:server.com
homeserver - 2018-06-06 11:50:37,687 - synapse.http.server - 404 - WARNING - Not sending response to request <SynapseRequest at 0x7f5d86c94200 method=PUT uri=/_synapse/replication/send_event/$152...dle:t2bot.io clientproto=HTTP/1.1 site=9091>, already disconnected.
homeserver - 2018-06-06 11:50:37,687 - synapse.access.http.9091 - 131 - INFO - 127.0.0.1 - 9091 - {@_discord_324...619:t2bot.io} Processed request: 253767ms (0ms, 0ms) (0ms/0ms/0) 0B 200 "PUT /_synapse/replication/send_event/$152...dle:t2bot.io HTTP/1.1" "Synapse/0.30.0"

homeserver - 2018-06-06 11:47:26,618 - synapse.access.http.9091 - 97 - INFO - 127.0.0.1 - 9091 - Received request: PUT /_synapse/replication/send_event/$152...MLc:t2bot.io
homeserver - 2018-06-06 11:47:26,618 - synapse.util.caches.response_cache - 150 - INFO - [send_event]: no cached result for [$152...MLc:t2bot.io], calculating new one
homeserver - 2018-06-06 11:47:26,619 - synapse.replication.http.send_event - 147 - INFO - Got event to send with ID: $152...MLc:t2bot.io into room: !pXkVSJ...:server.com
homeserver - 2018-06-06 11:50:38,086 - synapse.http.server - 404 - WARNING - Not sending response to request <SynapseRequest at 0x7f5e7a806950 method=PUT uri=/_synapse/replication/send_event/$152...MLc:t2bot.io clientproto=HTTP/1.1 site=9091>, already disconnected.
homeserver - 2018-06-06 11:50:38,086 - synapse.access.http.9091 - 131 - INFO - 127.0.0.1 - 9091 - {@_discord_324...619:t2bot.io} Processed request: 191468ms (0ms, 0ms) (0ms/0ms/0) 0B 200 "PUT /_synapse/replication/send_event/$152...MLc:t2bot.io HTTP/1.1" "Synapse/0.30.0"

homeserver - 2018-06-06 11:48:31,021 - synapse.access.http.9091 - 97 - INFO - 127.0.0.1 - 9091 - Received request: PUT /_synapse/replication/send_event/$152...yFb:t2bot.io
homeserver - 2018-06-06 11:48:31,022 - synapse.util.caches.response_cache - 150 - INFO - [send_event]: no cached result for [$152...yFb:t2bot.io], calculating new one
homeserver - 2018-06-06 11:48:31,023 - synapse.replication.http.send_event - 147 - INFO - Got event to send with ID: $152...yFb:t2bot.io into room: !pXkVSJ...:server.com
homeserver - 2018-06-06 11:50:38,087 - synapse.http.server - 404 - WARNING - Not sending response to request <SynapseRequest at 0x7f5e6d868cb0 method=PUT uri=/_synapse/replication/send_event/$152...yFb:t2bot.io clientproto=HTTP/1.1 site=9091>, already disconnected.
homeserver - 2018-06-06 11:50:38,087 - synapse.access.http.9091 - 131 - INFO - 127.0.0.1 - 9091 - {@_discord_324...619:t2bot.io} Processed request: 127066ms (0ms, 0ms) (0ms/0ms/0) 0B 200 "PUT /_synapse/replication/send_event/$152...yFb:t2bot.io HTTP/1.1" "Synapse/0.30.0"

homeserver - 2018-06-06 11:49:39,136 - synapse.access.http.9091 - 97 - INFO - 127.0.0.1 - 9091 - Received request: PUT /_synapse/replication/send_event/$152...mFi:t2bot.io
homeserver - 2018-06-06 11:49:39,136 - synapse.util.caches.response_cache - 150 - INFO - [send_event]: no cached result for [$152...mFi:t2bot.io], calculating new one
homeserver - 2018-06-06 11:49:39,138 - synapse.replication.http.send_event - 147 - INFO - Got event to send with ID: $152...mFi:t2bot.io into room: !pXkVSJ...:server.com
homeserver - 2018-06-06 11:50:38,088 - synapse.access.http.9091 - 131 - INFO - 127.0.0.1 - 9091 - {@_discord_324...619:t2bot.io} Processed request: 58952ms (0ms, 0ms) (0ms/0ms/0) 2B 200 "PUT /_synapse/replication/send_event/$152...mFi:t2bot.io HTTP/1.1" "Synapse/0.30.0"

As demonstrated, synapse did eventually finish sending each of the 4 events (with the last one completing fast enough that the client didn't retry).

Possible solution

The timeout between the event creator and main process could be removed, although that has the consequence of other network failures causing potentially duplicate events. Regardless of the timeout being removed, the event creator could send the transaction ID alongside the send_event call to the main process. The main process can then dedupe requests it has already seen, and return the 'real' event ID it ended up sending. The event creator may need to be made aware that the event ID can change as part of the request.

The client retrying the request should already be deduped, although the event creator may need to be smarter about what requests it considers as 'in flight'. For instance, timeouts from the main process should not fail the whole request and instead invoke the event creator to retry on behalf of the client, even though the client may time out and try again independently.

Metadata

Metadata

Assignees

Labels

z-minor(Deprecated Label)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions