Synapse workers Complement image results in flaky tests due to inconsistent worker process init #10065
Description
The worker-version of Synapse running in Complement (as described here) currently uses Supervisor as an init system to start all worker processes in the container. See the current config template we're using.
This works, and eventually all processes start up. However, Complement decides that a homeserver is ready for testing once it responds successfully to a GET /_matrix/client/versions call. That endpoint may be answered by a worker that has already started while other workers are still starting up. This inconsistency can lead to test failures: Complement calls an endpoint routed to a worker that hasn't started yet, nginx responds with a 502, and the test fails.
The result is flaky Complement tests, which nobody wants.
I believe the solution is to start groups of processes in the container through a priority system, where the next group is only started once the previous one has successfully responded to health checks (indicating its processes are ready to receive connections):
- Redis
- the main Synapse process
  - thus all database migrations are handled before any workers start up.
- all worker Synapse processes
- nginx
(Note that caddy [just used for custom CA stuff] and Postgres are started even before Supervisor is.) nginx is the reverse proxy that actually routes Matrix requests to the appropriate Synapse process, so by starting it at the very end, Complement will not receive a successful response to /_matrix/client/versions until everything else has started.
Initially I had hoped to replace Supervisor with systemd as the init system, but systemd apparently doesn't work in Docker containers. Additionally, we need each process to output its logs to stdout, as otherwise Complement won't be able to display the homeserver logs after a test failure. systemd would make this a bit tricky, as it tries to capture logs itself. Currently this works by having Supervisor simply redirect all process logs to stdout, which the ENTRYPOINT of the Docker container, configure_workers_and_start.py, then relays.
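For reference, a minimal sketch of what that relay can look like, assuming supervisord is installed at /usr/bin/supervisord and the generated config lives at /etc/supervisor/supervisord.conf (both paths are illustrative, not necessarily what the image uses):

```python
# Sketch only: the entrypoint execs supervisord in the foreground, so the
# process logs that Supervisor redirects to stdout flow straight to the
# container's stdout for Complement to capture. Paths are assumptions.
import os

os.execv(
    "/usr/bin/supervisord",
    ["supervisord", "--configuration", "/etc/supervisor/supervisord.conf", "--nodaemon"],
)
```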
I don't believe we want to use synctl here, as the team has been trying to phase that out for a while now. We could simply do all of this via subprocess in configure_workers_and_start.py (a rough sketch of that approach follows the questions below), but I'm hoping there's a better, less manual way. Any ideas? I'd also love to be proven wrong about whether Supervisor can actually do the following:
- Wait for one process to start up before starting the next.
- Perform health checks over HTTP.
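For concreteness, here is a rough sketch of what the subprocess approach might look like, following the group ordering above (Redis, then the main process, then the workers, then nginx). The commands, config paths, ports, and health URLs are illustrative placeholders rather than what the image actually uses:

```python
# Sketch only: start each group of processes, wait for it to become healthy,
# then move on to the next group, leaving nginx until last. Commands, config
# paths and ports below are placeholders, not the image's real values.
import socket
import subprocess
import time
import urllib.request


def wait_for_tcp(host: str, port: int, timeout: float = 60.0) -> None:
    """Poll a TCP port (e.g. Redis) until it accepts connections."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return
        except OSError:
            time.sleep(0.5)
    raise RuntimeError(f"{host}:{port} did not come up within {timeout}s")


def wait_for_http(url: str, timeout: float = 60.0) -> None:
    """Poll an HTTP endpoint (e.g. Synapse's /health) until it returns 2xx."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=1):
                return  # urlopen raises on error responses
        except OSError:
            time.sleep(0.5)
    raise RuntimeError(f"{url} did not become healthy within {timeout}s")


# (commands, readiness check) for each group, in the order proposed above.
STARTUP_GROUPS = [
    ([["redis-server"]], lambda: wait_for_tcp("localhost", 6379)),
    (
        [["python", "-m", "synapse.app.homeserver", "--config-path", "/conf/homeserver.yaml"]],
        lambda: wait_for_http("http://localhost:8080/health"),
    ),
    (
        # One entry per worker in practice; a single placeholder worker here.
        [["python", "-m", "synapse.app.generic_worker", "--config-path", "/conf/worker1.yaml"]],
        lambda: wait_for_http("http://localhost:8081/health"),
    ),
    # nginx goes last, so /_matrix/client/versions only answers once
    # everything behind the proxy is up.
    ([["nginx", "-g", "daemon off;"]], lambda: None),
]


def main() -> None:
    children = []
    for commands, wait_until_ready in STARTUP_GROUPS:
        for cmd in commands:
            # stdout/stderr are inherited, so logs still reach the container's stdout.
            children.append(subprocess.Popen(cmd))
        wait_until_ready()
    # Keep running, and bail out if any child dies so the container fails loudly.
    while True:
        for child in children:
            if child.poll() is not None:
                raise SystemExit(child.returncode or 1)
        time.sleep(1)


if __name__ == "__main__":
    main()
```

It's not much code, but it is exactly the kind of hand-rolled process management I'd rather an existing init system handled for us.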
@richvdh and @erikjohnston also mentioned that Synapse has a way of signalling that it's ready to receive connections (which may be better than just polling the /health endpoint), and that may be useful for this discussion. Edit: I've just had a look, and it looks like we use a systemd-specific method called sdnotify, which unfortunately won't be useful here.