Synapse workers Complement image results in flaky tests due to inconsistent worker process init #10065
Description
The worker-version of Synapse running in Complement (as described here) currently uses Supervisor as an init system to start all worker processes in the container. See the current config template we're using.
This works, and eventually all processes start up. However, Complement decides that a homeserver is ready for testing once it responds successfully to a GET /_matrix/client/versions call. That endpoint may be answered by a worker that has already started while other workers are still starting up. This inconsistency can lead to test failures: Complement calls an endpoint routed to a worker that hasn't started yet, nginx responds with a 502, and the test fails.
The result is flaky Complement tests, which nobody wants.
I believe the solution is to start groups of processes in the container through a priority system, where the next group is only started once the previous one has successfully responded to health checks (indicating its processes are ready to receive connections):
- Redis
- the main Synapse process
  - thus all database migrations are handled before any workers start up.
- all worker Synapse processes
- nginx
(Note that caddy [just used for custom CA stuff] and Postgres are started even before Supervisor is.) nginx is the reverse proxy that actually routes Matrix requests to the appropriate Synapse process, so by starting it at the very end, Complement will not receive a successful response to /_matrix/client/versions until everything else has started.
Initially I had hoped to replace Supervisor with systemd as the init system, but systemd apparently doesn't work in Docker containers. Additionally, we need each process to output its logs to stdout, as otherwise Complement won't be able to display the homeserver logs after a test failure. systemd would make this a bit tricky, as it tries to capture logs itself. Currently this works by having Supervisor simply redirect all process logs to stdout, which the ENTRYPOINT of the Docker container, configure_workers_and_start.py, then relays.
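For reference, a minimal sketch of what that relay can look like, assuming supervisord is installed at /usr/bin/supervisord and the generated config lives at /etc/supervisor/supervisord.conf (both paths are illustrative, not necessarily what the image uses):

```python
# Sketch only: the entrypoint execs supervisord in the foreground, so the
# process logs that Supervisor redirects to stdout flow straight to the
# container's stdout for Complement to capture. Paths are assumptions.
import os

os.execv(
    "/usr/bin/supervisord",
    ["supervisord", "--configuration", "/etc/supervisor/supervisord.conf", "--nodaemon"],
)
```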
I don't believe we want to use synctl here, as the team has been trying to phase that out for a while now. We could simply do all of this via subprocess in configure_workers_and_start.py (a rough sketch of that approach follows the questions below), but I'm hoping there's a better, less manual way. Any ideas? I'd also love to be proven wrong about whether Supervisor can actually do the following:
- Wait for one process to start up before starting the next.
- Perform health checks over HTTP.
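For concreteness, here is a rough sketch of what the subprocess approach might look like, following the group ordering above (Redis, then the main process, then the workers, then nginx). The commands, config paths, ports, and health URLs are illustrative placeholders rather than what the image actually uses:

```python
# Sketch only: start each group of processes, wait for it to become healthy,
# then move on to the next group, leaving nginx until last. Commands, config
# paths and ports below are placeholders, not the image's real values.
import socket
import subprocess
import time
import urllib.request


def wait_for_tcp(host: str, port: int, timeout: float = 60.0) -> None:
    """Poll a TCP port (e.g. Redis) until it accepts connections."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return
        except OSError:
            time.sleep(0.5)
    raise RuntimeError(f"{host}:{port} did not come up within {timeout}s")


def wait_for_http(url: str, timeout: float = 60.0) -> None:
    """Poll an HTTP endpoint (e.g. Synapse's /health) until it returns 2xx."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=1):
                return  # urlopen raises on error responses
        except OSError:
            time.sleep(0.5)
    raise RuntimeError(f"{url} did not become healthy within {timeout}s")


# (commands, readiness check) for each group, in the order proposed above.
STARTUP_GROUPS = [
    ([["redis-server"]], lambda: wait_for_tcp("localhost", 6379)),
    (
        [["python", "-m", "synapse.app.homeserver", "--config-path", "/conf/homeserver.yaml"]],
        lambda: wait_for_http("http://localhost:8080/health"),
    ),
    (
        # One entry per worker in practice; a single placeholder worker here.
        [["python", "-m", "synapse.app.generic_worker", "--config-path", "/conf/worker1.yaml"]],
        lambda: wait_for_http("http://localhost:8081/health"),
    ),
    # nginx goes last, so /_matrix/client/versions only answers once
    # everything behind the proxy is up.
    ([["nginx", "-g", "daemon off;"]], lambda: None),
]


def main() -> None:
    children = []
    for commands, wait_until_ready in STARTUP_GROUPS:
        for cmd in commands:
            # stdout/stderr are inherited, so logs still reach the container's stdout.
            children.append(subprocess.Popen(cmd))
        wait_until_ready()
    # Keep running, and bail out if any child dies so the container fails loudly.
    while True:
        for child in children:
            if child.poll() is not None:
                raise SystemExit(child.returncode or 1)
        time.sleep(1)


if __name__ == "__main__":
    main()
```

It's not much code, but it is exactly the kind of hand-rolled process management I'd rather an existing init system handled for us.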
@richvdh and @erikjohnston also mentioned that Synapse has a way of signalling that it's ready to receive connections (which may be better than just polling the /health endpoint), and that may be useful for this discussion. Edit: I've just had a look, and it looks like we use a systemd-specific method called sdnotify, which unfortunately won't be useful here.